Proceedings of IC2IT2012
Table of Contents
Message from KMUTNB President ………… ii
Message from General Chair ………… iii
Conference Organizers ………… iv
Conference Organization Committee ………… v
Technical Program Committee ………… vi
Keynote Speaker ………… vii
Technical Program Contents ………… x
Invited Papers ………… 1
Regular Papers ………… 8
Author Index ………… 186
Message from KMUTNB President
Nowadays, it is generally accepted that a nation's development stems from technical advancement, which has become the key factor dictating the development of any country. Many issues can affect development, such as international economics, highly competitive markets, social and cultural differences, and global environmental problems. Strengthening a country's capacity for knowledge, both practical and intellectual, provides the ability to deal with critical issues, not only in urban society but in the surrounding countryside as well. To ensure the long-term stability of a country, technology is considered one of the most important mechanisms for supporting the management of educational quality and enabling its continual improvement. This can be seen in the many developed countries that have invested resources and fundamental infrastructure in deploying Information Technology in their education systems. The development of innovative teaching and learning that brings Information Technology to the forefront of education benefits the population as a whole. The main goal is to improve the quality of life and to progress evenly and equally toward a better future. I would like to say a special thank you to everybody involved in this conference, from partners to stakeholders; without you this would not be possible. I hope this conference provides a good opportunity for all your voices to be heard.

(Professor Dr. Teravuti Boonyasopon)
President, King Mongkut's University of Technology North Bangkok
Message from General Chair
Some of the most dramatic changes in the world are caused by the development of technology, which continually changes our daily lives. Education, research, and development are necessary to understand the modern world we are now a part of, particularly in the computer and information technology field. The Faculty of Information Technology, KMUTNB, will hold the 8th International Conference on Computing and Information Technology on 9-10 May 2012 at the Dusit Thani Hotel, Pattaya City, with the goal of serving as a platform to publish the findings of academic research in the field of Computers and Information Technology from students, professors, researchers, and the general public. The conference is held in cooperation with local and international institutions, including Fern University in Hagen (Germany), Oklahoma State University (USA), Chemnitz University of Technology (Germany), Edith Cowan University (Australia), National Taiwan University (Taiwan), Hanoi National University of Education (Vietnam), Nakhon Pathom Rajabhat University, Kanchanaburi Rajabhat University, Siam University, and Ubon Ratchathani University. Thank you to the President and CEO of King Mongkut's University of Technology North Bangkok, and to all involved organizations and committees who support and drive this conference to be successful.

(Associate Professor Dr. Monchai Tiantong)
General Chair
Conference Organizers
King Mongkut’s University of Technology North Bangkok, Thailand
Fern University in Hagen, Germany
Oklahoma State University, USA
Edith Cowan University, Australia
National Taiwan University, Taiwan
Hanoi National University of Education, Vietnam
Mahasarakham University, Thailand
Kanchanaburi Rajabhat University, Thailand
Siam University, Thailand
Nakhon Pathom Rajabhat University, Thailand
Ubon Ratchathani University, Thailand
Chemnitz University of Technology, Germany
Conference Organization Committee
General Chair: Assoc. Prof. Dr. Monchai Tiantong, King Mongkut's University of Technology North Bangkok
Technical Program Chair: Prof. Dr. Herwig Unger, Fern University in Hagen, Germany
Conference Treasurer: Assist. Prof. Dr. Supot Nitsuwat, King Mongkut's University of Technology North Bangkok
Secretary and Publication Chair: Assist. Prof. Dr. Phayung Meesad, King Mongkut's University of Technology North Bangkok
Technical Program Committee
Alain Bui, Uni Paris 8, France
Alisa Kongthon, NECTEC, Thailand
Anirach Mingkhwan, KMUTNB, Thailand
Apiruck Preechayasomboon, TOT, Thailand
Armin Mikler, University of North Texas, USA
Atchara Masaweerawat, UBU, Thailand
Banatus Soiraya, Thailand
Bogdan Lent, Lent AG, Switzerland
Chatchawin Namman, UBU, Thailand
Chayakorn Netramai, KMUTNB, Thailand
Cholatip Yawut, KMUTNB, Thailand
Choochart Haruechaiyasak, NECTEC, Thailand
Claudio Ramirez, USL, Mexico
Craig Valli, ECU, Australia
Dietmar Tutsch, Wuppertal, Germany
Doy Sundarasaradula, TOT, Thailand
Dursun Delen, OSU, USA
Gerald Eichler, Telecom, Germany
Gerald Quirchmayr, UNIVIE, Austria
Hsin-mu Tsai, NTU, Taiwan
Ho Cam Ha, HNUE, Vietnam
Jamornkul Laokietkul, CRU, Thailand
Janusz Kacprzyk, Polish Academy of Science, Poland
Jie Lu, Univ. of Technology, Sydney, Australia
Kairung Hengpraphrom, NPRU, Thailand
Kamol Limtunyakul, KMUTNB, Thailand
Kriengsak Treeprapin, UBU, Thailand
Kunpong Voraratpunya, KMITL, Thailand
Kyandoghere Kyamakya, Klagenfurt, Austria
Maleerat Sodanil, KMUTNB, Thailand
Marco Aiello, Groningen, The Netherlands
Mark Weiser, OSU, USA
Martin Hagan, OSU, USA
Mirko Caspar, Chemnitz, Germany
Nadh Ditcharoen, UBU, Thailand
Nawaporn Visitpongpun, KMUTNB, Thailand
Nattavee Utakrit, KMUTNB, Thailand
Nguyen The Loc, HNUE, Vietnam
Nalinpat Porrawatpreyakorn, UNIVIE, Austria
Padej Phomsakha Na Sakonnakorn, Thailand
Parinya Sanguansat, PIM, Thailand
Passakon Prathombutr, NECTEC, Thailand
Peter Kropf, Neuchatel, Switzerland
Phayung Meesad, KMUTNB, Thailand
Prasong Praneetpolgrang, SPU, Thailand
Roman Gumzej, University of Maribor, Slovenia
Saowaphak Sasanus, TOT, Thailand
Sirapat Boonkrong, KMUTNB, Thailand
Somchai Prakarncharoen, KMUTNB, Thailand
Soradech Krootjohn, KMUTNB, Thailand
Suksaeng Kukanok, Thailand
Surapan Yimman, KMUTNB, Thailand
Sumitra Nuanmeesri, SSRU, Thailand
Sunantha Sodsee, KMUTNB, Thailand
Supot Nitsuwat, KMUTNB, Thailand
Taweesak Ganjanasuwan, Thailand
Tong Srikhacha, TOT, Thailand
Tossaporn Joochim, UBU, Thailand
Thibault Bernard, Uni Reims, France
Thippaya Chintakovid, KMUTNB, Thailand
Tobias Eggendorfer, Hamburg, Germany
Thomas Böhme, TU Ilmenau, Germany
Thomas Tilli, Telecom, Germany
Dang Hung Tran, HNUE, Vietnam
Ulrike Lechner, UniBw, Germany
Uraiwan Inyaem, RMUTT, Thailand
Wallace Tang, CityU, Hong Kong
Wolfram Hardt, Chemnitz, Germany
Winai Bodhisuwan, KU, Thailand
Wongot Sriurai, UBU, Thailand
Woraniti Limpakorn, TOT, Thailand
Keynote Speaker
Professor Dr. Martin Hagan School of Electrical and Computer Engineering
Oklahoma State University, USA
Topic: Dynamic Neural Networks: What Are They, and How Can We Use Them?
Abstract: Neural networks can be classified into static and dynamic categories. In static networks, which are more commonly used, the output of the network is computed uniquely from the current inputs to the network. In dynamic networks, the output is also a function of past inputs, outputs or states of the network. This talk will address the theory and applications of this interesting class of neural network. Dynamic networks have memory, and therefore they can be trained to learn sequential or time-varying patterns. This has applications in such disparate areas as control systems, prediction in financial markets, channel equalization in communication systems, phase detection in power systems, sorting, fault detection, speech recognition, and even the prediction of protein structure in genetics. These dynamic networks are generally trained using gradient-based (steepest descent, conjugate gradient, etc.) or Jacobian-based (Gauss-Newton, Levenberg-Marquardt, Extended Kalman filter, etc.) optimization algorithms. The methods for computing the gradients and Jacobians fall generally into two categories: real time recurrent learning (RTRL) or backpropagation through time (BPTT). In this talk we will present a unified view of the training of dynamic networks. We will begin with a very general framework for representing dynamic networks and will demonstrate how BPTT and RTRL algorithms can be efficiently developed using this framework. While dynamic networks are more powerful than static networks, it has been known for some time that they are more difficult to train. In this talk, we will also investigate the error surfaces for these dynamic networks, which will provide interesting insights into the difficulty of dynamic network training.
Keynote Speaker
Prof. Dr. rer. nat. Ulrike Lechner
Institut für Angewandte Informatik, Fakultät für Informatik
Universität der Bundeswehr München, Germany
Topic: Innovation Management and the IT-Industry - Enabler of innovations or truly innovative?
Abstract: Who wants to be innovative? Who needs to be innovative? Everybody? Innovation seems to be paramount in today's economy, and IT is an important driver of innovation. Think of E-Business and all of consumer electronics. Can it be safely assumed that this industry is innovative and masters the art and science of innovation management? What about important business model trends of today, outsourcing and cloud computing, or the bread-and-butter business of the many IT-consulting companies? How important is innovation to them, and how do they master innovations and practice innovation management? Empirical data is rather inconclusive about business model innovations in the IT-industry and the need to be innovative. The talk reports on experiences in innovation management in the IT-industry and discusses awareness of innovation, innovation in services vs. product innovations, and the various options for designing innovation management. It provides an overview of the innovation landscape and ecosystems in the IT-industry as well as the theoretical background for analyzing innovation networks. The talk discusses scientific approaches and open questions in this field.
Keynote Speaker
Dr. Hsin-Mu Tsai Department of Computer Science and Information Engineering
National Taiwan University, Taiwan
Topic: Extend the Safety Shield - Building the Next Generation Vehicle Safety System
Abstract: For the past decade, various safety systems have been deployed by car manufacturers to reduce the number of accidents drastically. However, the conventional approach limits the classes of risks that can be detected and handled by the safety systems to those with a line-of-sight path to the sensors installed on vehicles. In this talk, I will propose a next-generation vehicle safety system, which utilizes two fundamental technologies. The system gives warnings so that the vehicle or the driver can react to potential risks in a timely manner, and extends the classes of risks detectable by vehicles from only "risks which have appeared" to also "risks which have not yet appeared". Conceptually, this increases the size of the "safety shield" of the vehicle, since most accidents caused by detectable risks could be avoided. I will also present the related research challenges in implementing such a system and some preliminary results from measurements we carried out at National Taiwan University.
Technical Program Contents
Wednesday May 9, 2012
8:00-9:00 Registration
9:00-9:30 Opening Ceremony by Prof.Dr. Teravuti Boonyasopon, President of King Mongkut’s University of Technology North Bangkok
9:30-10:30
Invited Keynote Speech by Prof. Dr. Martin Hagan, Oklahoma State University, USA
Topic: Dynamic Neural Networks: What Are They, and How Can We Use Them?
10:30-11:00 Coffee Break
11:00-12:00
Invited Keynote Speech by Prof. Dr. rer. nat. Ulrike Lechner, Universität der Bundeswehr München, Germany
Topic: Innovation Management and the IT-Industry - Enabler of innovations or truly innovative?
12:00-13:00 Lunch
13:00-18:00 Parallel Session Presentation
18:00-22:00 Welcome Dinner
IC2IT 2012 Session I
Network & Security and Fuzzy Logic
Session Chair: Dr. Nawaporn Wisitpongphan
Time/Paper-ID Title/Author Page
13:00-13:20 IC2IT2012-71
Improving VPN Security Performance Based on One-Time Password Technique Using Quantum Keys Montida Pattaranantakul, Paramin Sangwongngam, and Keattisak Sripimanwat
8
13:20-13:40 IC2IT2012-33
Experimental Results on the Reloading Wave Mechanism for Randomized Token Circulation Boukary Ouedraogo, Thibault Bernard, and Alain Bui
14
13:40-14:00 IC2IT2012-107
Statistical-Based Car Following Model for Realistic Simulation of Wireless Vehicular Networks Kitipong Tansriwong and Phongsak Keeratiwintakorn
19
14:00-14:20 IC2IT2012-34
Rainfall Prediction in the Northeast Region of Thailand Using Cooperative Neuro-Fuzzy Technique Jesada Kajornrit, Kok Wai Wong, and Chun Che Fung
24
14:20-14:40 IC2IT2012-46
Interval-Valued Intuitionistic Fuzzy ELECTRE Method Ming-Che Wu and Ting-Yu Chen 30
14:40-15:00 Coffee Break
IC2IT 2012 Session II
Fuzzy Logic, Neural Network, and Recommendation Systems
Session Chair: Dr. Maleerat Sodanil
Time/Paper-ID Title/Author Page
15:00-15:20 IC2IT2012-81
Optimizing of Interval Type-2 Fuzzy Logic Systems Using Hybrid Heuristic Algorithm Evaluated by Classification Adisak Sangsongfa and Phayung Meesad
36
15:20-15:40 IC2IT2012-60
Neural Network Modeling for an Intelligent Recommendation System Supporting SRM for Universities in Thailand Kanokwan Kongsakun, Jesada Kajornrit, and Chun Che Fung
42
15:40-16:00 IC2IT2012-44
Recommendation and Application of Fault Tolerance Patterns to Services Tunyathorn Leelawatcharamas and Twittie Senivongse
48
16:00-16:20 IC2IT2012-43
Development of Experience Base Ontology to Increase Competency of Semi-Automated ICD-10-TM Coding System Wansa Paoin and Supot Nitsuwat
54
16:20-16:30 Break
IC2IT 2012 Session III
Natural Language Processing and Machine Translation
Session Chair: Dr. Maleerat Sodanil
16:30-16:50 IC2IT2012-110
Collocation-Based Term Prediction for Academic Writing Narisara Nakmaetee, Maleerat Sodanil, and Choochart Haruechaiyasak
58
16:50-17:10 IC2IT2012-65
Thai Poetry in Machine Translation Sajjaporn Waijanya and Anirach Mingkhwan 64
17:10-17:30 IC2IT2012-45
Keyword Recommendation for Academic Publication Using Flexible N-gram Rugpong Grachangpun, Maleerat Sodanil, and Choochart Haruechaiyasak
70
17:30-17:50 IC2IT2012-70
Using Example-Based Machine Translation for English – Vietnamese Translation Minh Quang Nguyen, Dang Hung Tran, and Thi Anh Le Pham
75
18:00-22:00 Welcome Dinner
Thursday May 10, 2012
8:00-9:00 Registration
9:00-10:00
Invited Keynote Speech by Dr. Hsin-Mu Tsai, National Taiwan University, Taiwan Topic: Extend the Safety Shield - Building the Next Generation Vehicle Safety System
10:00-10:20 Coffee Break
10:20-12:00 Parallel Session Presentation
12:00-13:00 Lunch
13:00-18:00 Parallel Session Presentation
IC2IT 2012 Session IV
Image Processing, Web Mining, Clustering, and e-Business
Session Chair: Prof. Dr. Herwig Unger
Time/Paper-ID Title/Author Page
10:20-10:40 IC2IT2012-57
Cross-Ratio Analysis for Building up the Robustness of Document Image Watermark Wiyada Yawai and Nualsawat Hiransakolwong
81
10:40-11:00 IC2IT2012-73
PCA Based Handwritten Character Recognition System Using Support Vector Machine & Neural Network Ravi Sheth and Kinjal Mehta
87
11:00-11:20 IC2IT2012-68
Web Mining Using Concept-Based Pattern Taxonomy Model Sheng-Tang Wu, Yuefeng Li, and Yung-Chang Lin 92
11:20-11:40 IC2IT2012-59
A New Approach to Cluster Visualization Methods Based on Self-Organizing Maps Marcin Zimniak, Johannes Fliege, and Wolfgang Benn
98
11:40-12:00 IC2IT2012-74
Detecting Source Topics Using Extended HITS Mario Kubek and Herwig Unger 104
12:00-13:00 Lunch
IC2IT 2012 Session V
Evolutionary Algorithm, Heuristic Search, and Graphics Processing & Representation
Session Chair: Dr. Sunantha Sodsee
Time/Paper-ID Title/Author Page
13:00-13:20 IC2IT2012-91
Blended Value Based e-Business Modeling Approach: A Sustainable Approach Using QFD Mohammed Dewan and Mohammed Quaddus
109
13:20-13:40 IC2IT2012-94
Protein Structure Prediction in 2D Triangular Lattice Model Using Differential Evolution Algorithm Aditya Narayan Hati, Nanda Dulal Jana, Sayantan Mandal, and Jaya Sil
116
13:40-14:00 IC2IT2012-48
Elimination of Materializations from Left/Right Deep Data Integration Plans Janusz Getta
121
14:00-14:20 IC2IT2012-24
A Variable Neighbourhood Search Heuristic for the Design of Codes Roberto Montemanni, Matteo Salani, Derek H. Smith, and Francis Hunt
127
14:20-14:40 IC2IT2012-63
Spatial Join with R-Tree on Graphics Processing Units Tongjai Yampaka and Prabhas Chongstitvatana 133
14:40-15:00 Coffee Break
IC2IT 2012 Session VI
Web Services, Ontology, and Agents
Session Chair: Dr. Sucha Smanchat
15:00-15:20 IC2IT2012-41
Ontology Driven Conceptual Graph Representation of Natural Language Supriyo Ghosh, Prajna Devi Upadhyay, and Animesh Dutta
138
15:20-15:40 IC2IT2012-88
Web Services Privacy Measurement Based on Privacy Policy and Sensitivity Level of Personal Information Punyaphat Chaiwongsa and Twittie Senivongse
145
15:40-16:00 IC2IT2012-64
Measuring Granularity of Web Services with Semantic Annotation Nuttida Muchalintamolee and Twittie Senivongse
151
16:00-16:20 IC2IT2012-83
Decomposing Ontology in Description Logics by Graph Partitioning Pham Thi Anh Le, Le Thanh Nhan, and Nguyen Minh Quang
157
16:20-16:30 Break
Time/Paper-ID Title/Author Page
16:30-16:50 IC2IT2012-49
An Ontological Analysis of Common Research Interest for Researchers Nawarat Kamsiang and Twittie Senivongse
163
16:50-17:10 IC2IT2012-36
Automated Software Development Methodology: An Agent Oriented Approach Prajna Devi Upadhyay, Sudipta Acharya, and Animesh Dutta
169
17:10-17:30 IC2IT2012-53
Agent Based Computing Environment for Accessing Privileged Services Navin Agarwal and Animesh Dutta
176
17:30-17:50 IC2IT2012-52
An Interactive Multi-touch Teaching Innovation for Preschool Mathematical Skills Suparawadee Trongtortam, Peraphon Sophatsathit, and Achara Chandrachai
181
Dynamic Neural Networks: What Are They, and How Can We Use Them?
Martin Hagan School of Electrical and Computer Engineering,
Oklahoma State University, Stillwater, Oklahoma, 74078
Abstract—Neural networks can be classified into static and dynamic categories. In static networks, which are more commonly used, the output of the network is computed uniquely from the current inputs to the network. In dynamic networks, the output is also a function of past inputs, outputs or states of the network. This paper will address the theory and applications of this interesting class of neural network.
Dynamic networks have memory, and therefore they can be trained to learn sequential or time-varying patterns. This has applications in such disparate areas as control systems, prediction in financial markets, channel equalization in communication systems, phase detection in power systems, sorting, fault detection, speech recognition, and even the prediction of protein structure in genetics.
While dynamic networks are more powerful than static networks, it has been known for some time that they are more difficult to train. In this paper, we will also investigate the error surfaces for these dynamic networks, which will provide interesting insights into the difficulty of dynamic network training.
I. INTRODUCTION

Dynamic networks are networks that contain delays (or
integrators, for continuous-time networks). These dynamic networks can have purely feedforward connections, or they can also have some feedback (recurrent) connections. Dynamic networks have memory. Their response at any given time will depend not only on the current input, but also on the history of the input sequence.
Because dynamic networks have memory, they can be trained to learn sequential or time-varying patterns. This has applications in such diverse areas as control of dynamic systems [1], prediction in financial markets [2], channel equalization in communication systems [3], phase detection in power systems [4], sorting [5], fault detection [6], speech recognition [7], learning of grammars in natural languages [8], and even the prediction of protein structure in genetics [9].
Dynamic networks can be trained using standard gradient-based or Jacobian-based optimization methods. However, the gradients and Jacobians that are required for these methods cannot be computed using the standard backpropagation algorithm. In this paper we will discuss a general dynamic network framework, in which dynamic backpropagation algorithms can be efficiently developed.
There are two general approaches (with many variations) to gradient and Jacobian calculations in dynamic networks: backpropagation-through-time (BPTT) [10] and real-time recurrent learning (RTRL) [11]. In the BPTT algorithm, the network response is computed for all time points, and then the gradient is computed by starting at the last time point and working backwards in time. This algorithm is computationally efficient for the gradient calculation, but it is difficult to implement on-line, because the algorithm works backward in time from the last time step.
In the RTRL algorithm, the gradient can be computed at the same time as the network response, since it is computed by starting at the first time point, and then working forward through time. RTRL requires more calculations than BPTT for calculating the gradient, but RTRL allows a convenient framework for on-line implementation. For Jacobian calculations, the RTRL algorithm is generally more efficient than the BPTT algorithm [12,13].
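To make the forward-in-time character of RTRL concrete, here is a minimal Python sketch (the function name and interface are ours, chosen for illustration) that accumulates the sum-squared-error gradient for a single linear recurrent neuron, a(t) = w1 p(t) + w2 a(t-1), the same architecture revisited in Section V:

import numpy as np

def rtrl_gradient(p, targets, w1, w2):
    """RTRL for a(t) = w1*p(t) + w2*a(t-1): the sensitivities da/dw are
    carried forward in time alongside the network response, so the
    gradient is available as soon as the last sample has been processed."""
    a = 0.0                     # network state, zero initial condition
    da_dw1 = da_dw2 = 0.0       # running derivatives of a(t) w.r.t. the weights
    grad = np.zeros(2)
    for t in range(len(p)):
        a_prev = a
        a = w1 * p[t] + w2 * a_prev
        # chain rule through the recurrence: a(t-1) also depends on the weights
        da_dw1 = p[t] + w2 * da_dw1
        da_dw2 = a_prev + w2 * da_dw2
        e = a - targets[t]      # sum-squared-error contribution at time t
        grad += 2.0 * e * np.array([da_dw1, da_dw2])
    return grad

A BPTT version of the same calculation would instead store the whole response sequence and run a second loop backward from the final time step, which is why it is harder to implement on-line.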
In order to more easily present general BPTT [10, 15] and RTRL [11, 14] algorithms, it will be helpful to introduce modified notation for networks that can have recurrent connections. In Section II, we will introduce this notation and will develop a general dynamic network framework. As a general rule, there have been two major approaches to using dynamic training. The first approach has been to use the general RTRL or BPTT concepts to derive algorithms for particular network architectures. The second approach has been to put a given network architecture into a particular canonical form (e.g., [16-18]), and then to use the dynamic training algorithm which has been previously designed for the canonical form. Our approach is to develop a very general framework in which to conveniently represent a large class of dynamic networks, and then to derive the RTRL and BPTT algorithms for the general framework.
In Section III, we will demonstrate how this general dynamic framework can be applied to solve many real-world problems. Section IV will present procedures for computing gradients for the general framework. In this way, one computer code can be used to train arbitrarily constructed network architectures, without requiring that each architecture be first converted to a particular canonical form. Finally, Section V describes some complexities in the error surfaces of dynamic
networks, and shows how we can mitigate these complexities to achieve successful training for dynamic networks.
II. A GENERAL CLASS OF DYNAMIC NETWORK

Our general dynamic network framework is called the
Layered Digital Dynamic Network (LDDN) [12]. The fundamental building block for the LDDN is the layer. A layer contains the following components:
• a set of weight matrices (input weights from external inputs, and layer weights from the outputs of other layers),
• tapped delay lines that appear at the input of a weight matrix,
• bias vector,
• summing junction,
• transfer function.
A prototype layer is shown in Fig. 1. The equations that define a layer response are
$$\mathbf{n}^m(t) = \sum_{l \in I_m} \sum_{d \in DI_{m,l}} \mathbf{IW}^{m,l}(d)\,\mathbf{p}^l(t-d) + \sum_{l \in L^f_m} \sum_{d \in DL_{m,l}} \mathbf{LW}^{m,l}(d)\,\mathbf{a}^l(t-d) + \mathbf{b}^m \quad (1)$$

$$\mathbf{a}^m(t) = \mathbf{f}^m\big(\mathbf{n}^m(t)\big) \quad (2)$$
where $I_m$ is the set of indices of all inputs that connect to layer m, $L^f_m$ is the set of indices of all layers that connect forward to layer m, $\mathbf{p}^l(t)$ is the lth input to the network, $\mathbf{IW}^{m,l}$ is the input weight between input l and layer m, $\mathbf{LW}^{m,l}$ is the layer weight between layer l and layer m, $\mathbf{b}^m$ is the bias vector for layer m, $DL_{m,l}$ is the set of all delays in the tapped delay line between Layer l and Layer m, and $DI_{m,l}$ is the set of all delays in the tapped delay line between Input l and Layer m. For the LDDN class of networks, we can have multiple weight matrices associated with each layer - some coming from external inputs, and others coming from other layers. An example of a dynamic network in the LDDN framework is shown in Fig. 2.
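As a concrete reading of Eqs. (1)-(2), the sketch below computes one layer response in Python. All container names (IW, LW, DI, DL, I_m, Lf_m, p_hist, a_hist) are assumptions made for the example; they simply mirror the index sets defined above, and enough input/output history is assumed to be available.

import numpy as np

def layer_response(t, m, IW, LW, b, f, DI, DL, I_m, Lf_m, p_hist, a_hist):
    """One LDDN layer update per Eqs. (1)-(2): sum the tapped-delay-line
    contributions from external inputs and from other layers' outputs,
    add the bias, and apply the transfer function."""
    n = b[m].copy()
    for l in I_m[m]:                 # external inputs that connect to layer m
        for d in DI[(m, l)]:         # delays in that input's TDL
            n += IW[(m, l, d)] @ p_hist[l][t - d]
    for l in Lf_m[m]:                # layers that connect forward to layer m
        for d in DL[(m, l)]:
            n += LW[(m, l, d)] @ a_hist[l][t - d]
    return f[m](n)                   # Eq. (2)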
The LDDN framework is quite general. It is equivalent to the class of general ordered networks discussed in [10] and [19]. It is also equivalent to the signal flow graph class of networks used in [15] and [20]. However, we can increase the generality of the LDDN further. In LDDNs, the weight matrix multiplies the corresponding vector coming into the layer (from an external input in the case of IW, and from another layer in the case of LW). This means that a dot product is formed between each row of the weight matrix and the input vector.
Figure 1. Example Layer
Figure 2. Example Dynamic Network in the LDDN Framework
We can consider more general weight functions than simply the dot product. For example, radial basis layers compute the distances between the input vector and the rows of the weight matrix. We can allow weight functions with arbitrary (but differentiable) operations between the weight matrix and the input vector. This enables us to include higher-order networks as part of our framework.
Another generality we can introduce is for the net input function. This is the function that combines the results of the weight function operations with the bias vector. In LDDNs, the net input function has been a simple summation. We can allow arbitrary, differentiable net input functions to be used.
The resulting network framework is the Generalized LDDN (GLDDN). A block diagram for a simple GLDDN (without delays) is shown in Fig. 3. The equations of operation for a GLDDN are
Weight Functions:
$$\mathbf{iz}^{m,l}(t,d) = \mathbf{ih}^{m,l}\big(\mathbf{IW}^{m,l}(d),\, \mathbf{p}^l(t-d)\big) \quad (3)$$

$$\mathbf{lz}^{m,l}(t,d) = \mathbf{lh}^{m,l}\big(\mathbf{LW}^{m,l}(d),\, \mathbf{a}^l(t-d)\big) \quad (4)$$
Net Input Functions:
$$\mathbf{n}^m(t) = \mathbf{o}^m\Big(\big\{\mathbf{iz}^{m,l}(t,d)\big\}_{l \in I_m,\, d \in DI_{m,l}},\ \big\{\mathbf{lz}^{m,l}(t,d)\big\}_{l \in L^f_m,\, d \in DL_{m,l}},\ \mathbf{b}^m\Big) \quad (5)$$
Transfer Functions:
$$\mathbf{a}^m(t) = \mathbf{f}^m\big(\mathbf{n}^m(t)\big) \quad (6)$$
Figure 3. Example Network with General Weight and Net Input Functions
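To illustrate how the GLDDN generalizes Eqs. (3)-(6), here is a hedged Python sketch in which the weight function and the net input function are pluggable. The function names are invented for the example; the radial-basis distance is one of the alternative weight functions mentioned above.

import numpy as np

# Two candidate weight functions h(W, v): the standard dot product of the
# LDDN, and the radial-basis distance discussed in the text.
def dot_weight(W, v):
    return W @ v

def dist_weight(W, v):
    return np.linalg.norm(W - v, axis=1)   # distance from v to each row of W

# A net input function o(...): the standard summation of the weighted
# inputs with the bias; any differentiable combination could be swapped in.
def sum_net_input(z_list, b):
    return b + sum(z_list)

def glddn_layer(weight_fns, weights, inputs, net_input_fn, b, f):
    """Eqs. (3)-(6) for one delay-free layer: apply each weight function,
    combine the results with the net input function, then apply the
    transfer function."""
    z = [h(W, v) for h, W, v in zip(weight_fns, weights, inputs)]
    return f(net_input_fn(z, b))

An LDDN-style layer would pass dot_weight and sum_net_input; a radial basis layer would pass dist_weight instead, which is exactly the kind of substitution the GLDDN permits.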
III. APPLICATIONS OF DYNAMIC NETWORKS

Dynamic networks have been applied to a wide variety of
application areas. In this section, we would like to give just a brief overview of some of these.
A. Phase Detection in Power Systems

Voltage phase and local frequency deviation are used in
disturbance monitoring and control for power systems. Modern power electronic devices introduce complex interharmonics, which make it difficult to extract the phase. The dynamic neural network shown in Fig. 4 has been used [4] to detect phase in power systems. The input to the network is the line voltage:
$$p(t) = A(t)\sin\big(2\pi f_c t + \phi(t)\big) + v(t).$$
The target output is the phase $\phi(t)$. The equations of operation for the network are

$$\mathbf{n}^1(t) = \sum_{d \in DI_{1,1}} \mathbf{IW}^{1,1}(d)\, p(t-d) + \mathbf{LW}^{1,1}(1)\,\mathbf{a}^1(t-1) + \mathbf{b}^1$$

$$\mathbf{a}^1(t) = \mathbf{f}^1\big(\mathbf{n}^1(t)\big)$$
Figure 4. Phase Detection Network for Power Systems
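A minimal simulation of the recurrence above (single neuron, a few input delays, one-step output feedback) might look as follows; the tanh transfer function and the delay set are assumptions for illustration, since training would determine the actual weights.

import numpy as np

def phase_network(p, IW, LW11, b, delays=(0, 1, 2), f=np.tanh):
    """Recurrent filter of Fig. 4: a tapped delay line on the line
    voltage p plus one-step feedback of the layer output."""
    a = np.zeros(len(p))
    for t in range(len(p)):
        n = b + (LW11 * a[t - 1] if t > 0 else 0.0)
        for d in delays:
            if t - d >= 0:
                n += IW[d] * p[t - d]   # tapped-delay-line contribution
        a[t] = f(n)                     # after training, tracks phi(t)
    return a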
B. Speech Prediction

Predictive coding of speech signals is commonly used for
data compression. The standard method has used Linear Predictive Coding (LPC). Neural networks allow the use of nonlinear predictive coding. Fig. 5 shows a pipeline recurrent neural network [21], which can be used for speech prediction. The target output would be the next value of the input sequence.
Figure 5. Speech Prediction Network
C. Channel Equalization

The performance of a communication system can be
seriously impaired by channel effects and noise. These may cause the transmitted signal of one symbol to spread out and overlap successive symbol intervals - commonly termed Intersymbol Interference. Dynamic neural networks, like the one in Fig. 4, can be used to perform channel equalization, to compensate for the effects of the channel [3]. Fig. 6 shows the block diagram of such a system.
Figure 6. Channel Equalization System
D. Model Reference Control

Dynamic networks are suitable for many types of control
systems. Fig. 7 shows the architecture of a model reference control system [14].
Figure 7. Model Reference Control System
E. Grammatical Inference

Grammars are a way to define languages. They consist of
rules that describe how to construct valid strings. Dynamic neural networks can be trained to recognize which strings belong to a language and which don’t. Dynamic networks can also perform grammatical inference - learning a grammar from example strings. Fig. 8 shows a dynamic network that can be used for grammatical inference [8]. The error function is defined by a single output neuron. At the end of each string presentation it should be 1 if the string is valid and 0 if not.
Figure 8. Grammar Inference Network
F. Protein Folding

Each gene within the DNA molecule codes for a protein.
The nucleotide sequence (A, T, G, C) determines the protein structure (e.g., secondary structure = helix, strand, coil). However, the relationship between the sequence and the structure is very complex. In the network in Fig. 9, the sequence is provided at the input to the network, and the output of the network indicates the secondary structure [9].
Figure 9. Protein Structure Identification Network
IV. GRADIENT CALCULATION FOR THE GLDDN

Dynamic networks are generally trained with a gradient or
Jacobian-based algorithm. In this section we describe an algorithm for computing the gradient for the GLDDN. This can be done using the BPTT or the RTRL approaches. Because of limited space, we will describe only the RTRL algorithm in this paper. (Both approaches are described for the LDDN framework in [12].)
To explain the gradient calculation for the GLDDN, we must create certain definitions. We do that in the following paragraphs.
A. Preliminary Definitions

First, as we stated earlier, a layer consists of a set of
weights, associated weight functions, associated tapped delay lines, a net input function, and a transfer function. The network has inputs that are connected to special weights, called input weights. The weights connecting one layer to another are called layer weights. In order to calculate the network response in stages, layer by layer, we need to proceed in the proper layer order, so that the necessary inputs at each layer will be available. This ordering of layers is called the simulation order. In order to backpropagate the derivatives for the gradient calculations, we must proceed in the opposite order, which is called the backpropagation order.
In order to simplify the description of the gradient calculation, some layers of the GLDDN will be assigned as network outputs, and some will be assigned as network inputs. A layer is an input layer if it has an input weight, or if it contains any delays with any of its weight matrices. A layer is an output layer if its output will be compared to a target during training, or if it is connected to an input layer through a matrix that has any delays associated with it.
For example, the LDDN shown in Fig. 2 has two output layers (1 and 3) and two input layers (1 and 2). For this network the simulation order is 1-2-3, and the backpropagation order is 3-2-1. As an aid in later derivations, we will define U as the set of all output layer numbers and X as the set of all input layer numbers. For the LDDN in Fig. 2, U = {1, 3} and X = {1, 2}.
B. Gradient Calculation

The objective of training is to optimize the network
performance, quantified in the performance index F(x), where x is a vector containing all of the weights and biases in the network. In this paper we will consider gradient-based algorithms for optimizing the performance (e.g., steepest descent, conjugate gradient, quasi-Newton, etc.). For the RTRL approach, the gradient is computed using
$$\frac{\partial F}{\partial \mathbf{x}} = \sum_{t} \sum_{u \in U} \left[\frac{\partial \mathbf{a}^u(t)}{\partial \mathbf{x}^T}\right]^T \frac{\partial^e F}{\partial \mathbf{a}^u(t)}, \quad (7)$$
where
$$\frac{\partial \mathbf{a}^u(t)}{\partial \mathbf{x}^T} = \frac{\partial^e \mathbf{a}^u(t)}{\partial \mathbf{x}^T} + \sum_{x \in X} \sum_{u' \in U} \sum_{d \in DL_{x,u'}} \frac{\partial^e \mathbf{a}^u(t)}{\partial \mathbf{n}^x(t)^T}\, \frac{\partial^e \mathbf{n}^x(t)}{\partial \mathbf{a}^{u'}(t-d)^T}\, \frac{\partial \mathbf{a}^{u'}(t-d)}{\partial \mathbf{x}^T} \quad (8)$$
The superscript e in these expressions indicates an explicit derivative, not accounting for indirect effects through time.
Many of the terms in Eq. 8 will be zero and need not be included. To take advantage of these efficiencies, we introduce the following definitions
$$E^U_{LW}(x) = \big\{u \in U \ni \exists\, \mathbf{LW}^{x,u} \neq 0\big\}, \quad (9)$$

$$E^X_S(u) = \big\{x \in X \ni \exists\, \mathbf{S}^{u,x} \neq 0\big\}, \quad (10)$$

where

$$\mathbf{S}^{u,x}(t) \equiv \frac{\partial^e \mathbf{a}^u(t)}{\partial \mathbf{n}^x(t)^T} \quad (11)$$

is the sensitivity matrix.
Using these definitions, we can rewrite Eq. 8 as
$$\frac{\partial \mathbf{a}^u(t)}{\partial \mathbf{x}^T} = \frac{\partial^e \mathbf{a}^u(t)}{\partial \mathbf{x}^T} + \sum_{x \in E^X_S(u)} \mathbf{S}^{u,x}(t) \sum_{u' \in E^U_{LW}(x)} \sum_{d \in DL_{x,u'}} \frac{\partial^e \mathbf{n}^x(t)}{\partial \mathbf{a}^{u'}(t-d)^T}\, \frac{\partial \mathbf{a}^{u'}(t-d)}{\partial \mathbf{x}^T} \quad (12)$$
The sensitivity matrix can be computed using static backpropagation, since it describes derivatives through a static portion of the network. The static backpropagation equation is
$$\mathbf{S}^{u,m}(t) = \left[\sum_{l \in L^m_b \cap E_S(u)} \mathbf{S}^{u,l}(t)\, \frac{\partial^e \mathbf{n}^l(t)}{\partial \mathbf{lz}^{l,m}(t,0)^T}\, \frac{\partial^e \mathbf{lz}^{l,m}(t,0)}{\partial \mathbf{a}^m(t)^T}\right] \dot{\mathbf{F}}^m\big(\mathbf{n}^m(t)\big), \quad u \in U, \quad (13)$$

where m is decremented from u through the backpropagation order, $L^m_b$ is the set of indices of layers that are directly connected backwards to layer m (or to which layer m connects forward) and that contain no delays in the connection, and

$$\dot{\mathbf{F}}^m\big(\mathbf{n}^m(t)\big) = \frac{\partial \mathbf{f}^m\big(\mathbf{n}^m(t)\big)}{\partial \mathbf{n}^m(t)^T}. \quad (14)$$
There are four terms in Eqs. 12 and 13 that need to be computed:
$$\frac{\partial^e \mathbf{n}^x(t)}{\partial \mathbf{a}^{u'}(t-d)^T}, \quad \frac{\partial^e \mathbf{n}^l(t)}{\partial \mathbf{lz}^{l,m}(t,0)^T}, \quad \frac{\partial^e \mathbf{lz}^{l,m}(t,0)}{\partial \mathbf{a}^m(t)^T}, \quad \text{and} \quad \frac{\partial^e \mathbf{a}^u(t)}{\partial \mathbf{x}^T}. \quad (15)$$
The first term can be expanded as follows:
$$\frac{\partial^e \mathbf{n}^x(t)}{\partial \mathbf{a}^{u'}(t-d)^T} = \frac{\partial^e \mathbf{n}^x(t)}{\partial \mathbf{lz}^{x,u'}(t,d)^T}\, \frac{\partial^e \mathbf{lz}^{x,u'}(t,d)}{\partial \mathbf{a}^{u'}(t-d)^T} \quad (16)$$
The first term on the right of Eq. 16 is the derivative of the net input function, which is the identity matrix if the net input is the standard summation. The second term is the derivative of the weight function, which is the corresponding weight matrix if the weight function is the standard dot product. Therefore, the right side of Eq. 16 becomes simply a weight matrix for LDDN networks.
The second term in Eq. 15 is the same as the first term on the right of Eq. 16. It is the derivative of the net input function. The third term in Eq. 15 is the same as the second term on the right of Eq. 16. It is the derivative of the weight function.
The final term that we need to compute is the last term in Eq. 15, which is the explicit derivative of the network outputs with respect to the weights and biases in the network. One element of that matrix can be written
$$\frac{\partial^e a^u_k(t)}{\partial iw^{m,l}_{i,j}(d)} = s^{u,m}_{k,i}(t)\, \frac{\partial^e n^m_i(t)}{\partial iz^{m,l}_i(t,d)}\, \frac{\partial^e iz^{m,l}_i(t,d)}{\partial iw^{m,l}_{i,j}(d)} \quad (17)$$
The first term in this summation is an element of the sensitivity matrix, which is computed using Eq. 13. The second term is the derivative of the net input, and the third term is the derivative of the weight function. (We have made the assumption here that the net input function operates on each element individually.) Eq. 17 is the equation for an input weight. Layer weights and biases would have similar equations.
This completes the RTRL algorithm for networks that can be represented in the GLDDN framework. The main steps of the algorithm are Eqs. 7 and 12, where the components of Eq. 12 are computed using Eqs. 16 and 17. Computer code can be written from these equations, with modules for weight functions, net input functions and transfer functions added as needed. Each module should define the function response, as well as its derivative. The overall framework is independent of the particular form of these modules.
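As one hedged illustration of such modules (not the authors' code; the class names and interfaces are invented for the sketch), each object below pairs a response with its derivatives, which is all a generic training routine built from Eqs. (7), (12), (16) and (17) needs:

import numpy as np

class DotWeight:
    def response(self, W, v):
        return W @ v
    def d_input(self, W, v):           # dz/dv for the dot product is W
        return W
    def d_weight_row(self, W, v, i):   # element i of z depends on row i of W via v
        return v

class SumNetInput:
    def response(self, z_list, b):
        return b + sum(z_list)
    def derivative(self, z_list, b):   # identity for the standard summation
        return np.eye(len(b))

class TanhTransfer:
    def response(self, n):
        return np.tanh(n)
    def derivative(self, n):           # diagonal Jacobian, as in Eq. (14)
        return np.diag(1.0 - np.tanh(n) ** 2)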
V. TRAINING DIFFICULTIES FOR DYNAMIC NETWORKS

From the previous section on dynamic network
applications, it is clear that these types of networks are very powerful and have many uses. However, they have not yet been adopted comprehensively. The main reason for this is the difficulty in training these types of networks. The reasons for these difficulties are not completely understood, but it has been shown that one of the reasons is the existence of spurious valleys in the error surfaces of these networks. In this section, we will provide a quick overview of the causes of these spurious valleys and suggestions for mitigating their effects.
Fig. 10 shows an example of spurious valleys in the error surface of a neural network model reference controller (as shown in Fig. 7). In this particular example, the network had 65 weights. The plot shows the error surface along the direction of search in a particular iteration of a quasi-Newton optimization algorithm. It is clear from this profile that any standard line search, using a combination of interpolation and sectioning, will have great difficulty in locating the minimum along the search direction. There are many local minima contained in
very narrow valleys. In addition, the bottoms of the valleys are often cusps. Even if our line search were to locate the minimum, it is not clear that the minimum represents an optimal weight location. In fact, in the remainder of this section, we will demonstrate that spurious minima are introduced into the error surface due to characteristics of the input sequence.
Figure 10. Example of Spurious Valleys
In order to understand the spurious valleys in the error surfaces of dynamic networks it is best to start with the simplest network for which such valleys will appear. We have found that these valleys even appear in a linear network with one neuron, as shown in Fig. 11.
Figure 11. Single Neuron Recurrent Network
In order to generate an error surface, we first develop training data using the network of Fig. 11, where both weights are set to 0.5. We use a Gaussian white noise input sequence with mean zero and variance one for p(t), and then use the network to generate a sequence of outputs. In Fig. 12 we see a typical error surface, as the two weights are varied. Although this network architecture is simple, the error surfaces generated by these networks have spurious valleys similar to those encountered in more complicated networks.
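The experiment behind Fig. 12 is easy to reproduce in outline. The sketch below (sequence length and grid resolution are arbitrary choices of ours) generates targets from the network of Fig. 11 with both weights at 0.5 and evaluates the sum-squared error over a weight grid:

import numpy as np

rng = np.random.default_rng(0)
p = rng.standard_normal(50)          # zero-mean, unit-variance input sequence

def simulate(w1, w2, p):
    a, out = 0.0, []
    for pt in p:
        a = w1 * pt + w2 * a         # single linear recurrent neuron
        out.append(a)
    return np.array(out)

targets = simulate(0.5, 0.5, p)      # the "true" network generates the targets

w1_vals = np.linspace(-10, 10, 201)
w2_vals = np.linspace(-10, 10, 201)
sse = np.array([[np.sum((simulate(w1, w2, p) - targets) ** 2)
                 for w1 in w1_vals] for w2 in w2_vals])
# plotting log10(sse) over (w1, w2) reproduces the valleys of Fig. 12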
The two valleys in the error surface occur for two different reasons. One valley occurs along the line w1=0. If this weight is zero, and the initial condition is zero, the output of the network will remain zero. Therefore, our mean squared error will be
constant and equal to the mean square value of the target outputs.
Figure 12. Single Neuron Network Error Surface
To understand where the second valley comes from, consider the network response equation:
$$a(t+1) = w_1 p(t) + w_2 a(t)$$
If we iterate this equation from the initial condition a(0), we get
$$a(t) = w_1\big[p(t-1) + w_2 p(t-2) + w_2^2 p(t-3) + \cdots + w_2^{t-1} p(0)\big] + w_2^t a(0)$$
Here we can see that the response at time t is a polynomial in the parameter $w_2$. (It will be a polynomial of degree t-1, if the initial condition is zero.) The coefficients of the polynomial involve the input sequence and the initial condition. We obtain the second valley because this polynomial contains a root outside the unit circle. There is some value of $w_2$ that is larger than 1 in magnitude for which the output is almost zero.
Of course, having a single output close to zero would not produce a valley in the error surface. However, we discovered that once the polynomial shown above has a root outside the unit circle at time t, that same root also appears in the next polynomial at time t+1, and therefore, the output will remain small for all future times for the same weight value.
Fig. 13 shows a cross section of the error surface presented in Fig. 12 for w1=0.5 using different sequence lengths. The error falls abruptly near w2=-3.8239. That is the root of the polynomial described above. The root maintains its location as the sequence increases in length. This causes the valley in the error surface.
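Under the polynomial view above (zero initial condition), the valley locations can be found numerically. This small sketch, with an invented function name, collects the roots of the input-sequence polynomial that lie outside the unit circle:

import numpy as np

def valley_locations(p, t):
    """Roots, in w2, of p(0)*w2^(t-1) + p(1)*w2^(t-2) + ... + p(t-1),
    the polynomial formed by the first t inputs (zero initial condition).
    Roots outside the unit circle mark weight values where the network
    output stays near zero -- the spurious valleys."""
    coeffs = p[:t]                       # highest power of w2 multiplies p(0)
    roots = np.roots(coeffs)
    return roots[np.abs(roots) > 1.0]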
We have since studied more complex networks, with nonlinear transfer functions and multiple layers. The number of spurious valleys increases in these cases, and they become more complex. However the causes of the valleys remain similar. They are affected by initial conditions and roots of the
input sequence (or subsequence). This leads to several procedures for improving the training for these networks.
The first training modification is to switch training
sequences often during training. If training is becoming trapped in a spurious valley, the valley will move when the training sequence is changed. Also, since some of the valleys are affected by the choice of initial condition, a second modification is to use small random initial conditions for neuron outputs and change them periodically during training. A further modification is to use a regularized performance index to force weights into the stable region. Since the deep valleys occur in regions where the network is unstable, we can avoid the valleys by maintaining a stable network. We generally decay the regularization factor during training, so that the final weights will not be biased.
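A hedged sketch of the third modification: a sum-squared-error performance index with a weight-decay penalty whose factor is reduced as training proceeds (the initial factor and decay rate below are arbitrary illustrative values).

import numpy as np

def regularized_sse(errors, weights, iteration, rho0=0.01, decay=0.99):
    """Sum-squared error plus a decaying penalty on the weights: the
    penalty pulls the network toward its stable region early in training
    (away from the spurious valleys) and fades so that the final weights
    are not biased by the regularization."""
    rho = rho0 * decay ** iteration
    return np.sum(np.asarray(errors) ** 2) + rho * np.sum(np.asarray(weights) ** 2)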
VI. CONCLUSIONS

Dynamic neural networks represent a very powerful
paradigm, and, as we have shown in this paper, they have a very wide variety of applications. However, they have not been as widely implemented as their power would suggest. The reason for this discrepancy is related to the difficulties in training these networks. The first obstacle in dynamic network training is the calculation of training gradients. In most cases, the gradient algorithm is custom designed for a specific network architecture, based on the general concepts of BPTT or RTRL. This creates a barrier to using dynamic networks. We propose a general dynamic network framework, the GLDDN, which encompasses almost all dynamic networks that have been proposed. This enables us to have a single code to calculate gradients for arbitrary networks, and reduces the initial barrier to using dynamic networks. The second obstacle to dynamic network training relates to the complexities of their error surfaces. We have described some of the mechanisms that cause these complexities – spurious valleys. We have also shown how to modify training algorithms to avoid these spurious valleys. We hope that these new developments will encourage the increased adoption of dynamic neural networks.
REFERENCES

[1] Hagan, M., Demuth, H., De Jesús, O., "An Introduction to the Use of
Neural Networks in Control Systems,” invited paper, International
Journal of Robust and Nonlinear Control, Vol. 12, No. 11 (2002) pp. 959-985.
[2] Roman, J. and Jameel, A., “Backpropagation and recurrent neural networks in financial analysis of multiple stock market returns,” Proceedings of the Twenty-Ninth Hawaii International Conference on System Sciences, vol. 2, (1996) pp. 454-460.
[3] Feng, J., Tse, C.K., Lau, F.C.M., “A neural-network-based channel-equalization strategy for chaos-based communication systems,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 50, no. 7 ( 2003) pp. 954-957.
[4] Kamwa, I., Grondin, R., Sood, V.K., Gagnon, C., Nguyen, V. T., Mereb, J., “Recurrent neural networks for phasor detection and adaptive identification in power system control and protection,” IEEE Transactions on Instrumentation and Measurement, vol. 45, no. 2, (1996) pp. 657-664.
[5] Jayadeva and Rahman, S.A., “A neural network with O(N) neurons for ranking N numbers in O(1/N) time,” in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 10, (2004) pp. 2044-2051.
[6] Chengyu, G. and Danai, K., “Fault diagnosis of the IFAC Benchmark Problem with a model-based recurrent neural network,” in Proceedings of the 1999 IEEE International Conference on Control Applications, vol. 2, (1999) pp. 1755-1760.
[7] Robinson, A.J., “An application of recurrent nets to phone probability estimation,” in IEEE Transactions on Neural Networks, vol. 5, no. 2 (1994).
[8] Medsker, L.R. and Jain, L.C., Recurrent neural networks: design and applications, Boca Raton, FL: CRC Press (2000).
[9] Gianluca, P., Przybylski, D., Rost, B., Baldi, P., “Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles,” in Proteins: Structure, Function, and Genetics, vol. 47, no. 2 , (2002) pp. 228-235.
[10] Werbos, P. J., “Backpropagation through time: What it is and how to do it,” Proceedings of the IEEE, vol. 78, (1990) pp. 1550–1560.
[11] Williams, R. J. and Zipser, D., “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, (1989) pp. 270–280.
[12] De Jesús, O., and Hagan, M., “Backpropagation Algorithms for a Broad Class of Dynamic Networks,” IEEE Transactions on Neural Networks, Vol. 18, No. 1 (2007) pp. 14 -27.
[13] De Jesús, O., Training General Dynamic Neural Networks, Doctoral Dissertation, Oklahoma State University, Stillwater OK, (2002).
[14] Narendra, K. S. and Parthasarathy, K., "Identification and control of dynamical systems using neural networks," IEEE Transactions on Neural Networks, Vol. 1, No. 1 (1990) pp. 4-27.
[15] Wan, E. and Beaufays, F., “Diagrammatic Methods for Deriving and Relating Temporal Neural Networks Algorithms,” in Adaptive Processing of Sequences and Data Structures, Lecture Notes in Artificial Intelligence, Gori, M., and Giles, C.L., eds., Springer Verlag (1998).
[16] Dreyfus, G., Idan, Y., “The Canonical Form of Nonlinear Discrete-Time Models,” Neural Computation 10, 133–164 (1998).
[17] Tsoi, A. C., Back, A., “Discrete time recurrent neural network architectures: A unifying review,” Neurocomputing 15 (1997) 183-223.
[18] Personnaz, L. Dreyfus, G., “Comment on ‘Discrete-time recurrent neural network architectures: A unifying review’,” Neurocomputing 20 (1998) 325-331.
[19] Feldkamp, L.A. and Puskorius, G.V., “A signal processing framework based on dynamic neural networks with application to problems in adaptation, filtering, and classification,” Proceedings of the IEEE, vol. 86, no. 11 (1998) pp. 2259 - 2277.
[20] Campolucci, P., Marchegiani, A., Uncini, A., and Piazza, F., “Signal-Flow-Graph Derivation of On-line Gradient Learning Algorithms,” Proceedings of International Conference on Neural Networks ICNN'97 (1997) pp.1884-1889.
[21] Haykin, S. and Li, L., "Nonlinear adaptive prediction of nonstationary signals," IEEE Trans. Signal Process., vol. 43, no. 2 (1995) pp. 526-535.
Figure 13. Cross sections of the error surface of Fig. 12 at w1 = 0.5 (log sum squared error vs. w2)
Improving VPN Security Performance Based on One-Time Password Technique Using Quantum Keys
Montida Pattaranantakul, Paramin Sangwongngam, and Keattisak Sripimanwat
Optical and Quantum Communications Laboratory
National Electronics and Computer Technology Center, Pathumthani, Thailand
[email protected], [email protected], [email protected]
Abstract—Network encryption technology has become an essential factor in organizational security. Virtual Private Network (VPN) encryption is the most popular technique used to prevent unauthorized users from accessing a private network. This technique normally relies on mathematical functions to generate periodic keys. As a result, security performance may degrade and the system may become vulnerable if rapid progress in high-performance computing makes it possible to reverse the mathematical calculation and find the next secret key pattern. The main contribution of this paper is to improve VPN performance by adopting quantum keys as a seed value in a one-time password technique that encompasses the whole process of authentication, data confidentiality, and security key management, in order to protect against eavesdroppers during data transmission over an insecure network.
Keywords- Quantum Keys, One-Time Password, Virtual Private Network
I. INTRODUCTION
Information technologies have been evolving rapidly to meet today's human communication needs, and the security of data transmission has always been a concern when transferring information from sender to receiver over the Internet. Addressing network security issues is the main priority in protecting against unauthorized users, since the security technique should also cover data integrity, confidentiality, authorization, and non-repudiation services.
A lack of adequate knowledge and understanding of software architecture and security engineering leads to security vulnerabilities: eavesdroppers may gain information by monitoring the transmission for patterns of communication, may capture data packets during transmission over the Internet, or may access information in private data storage, which can lead to data loss and data corruption. This is a critical factor that gives rise to new threats and may force business objectives to change. In the worst case, it will affect organizational stability and business opportunities, and may even become a national security threat. For this reason, many organizations must pay attention to finding ways to protect their information from eavesdroppers, based on security technology solutions that are agile enough to adapt to and combat existing threats arising from security breaches.
Therefore, data reliability and security protection are primary concerns for information exchange through unprotected network connections: users must be verified, since only an authorized user should be able to enter the system and govern resource access, while encryption technology is also required for further data protection.
Presently, several types of cryptography [1] are used to achieve comprehensive data protection based on proven standard technology; cryptography is the most important aspect of network and communication security and provides a basic building block for computer security. End-to-end encryption typically operates at the application layer, closest to the end user, so only the data is encrypted. In network-layer encryption, IPsec comes into play to provide confidentiality by encapsulating the security payload in both transport mode and tunnel mode; in tunnel mode, the entire IP packet, including headers and payload, is encrypted. IPsec encryption based on Virtual Private Network technology [2] presents an alternative approach for network encryption, since it fully provides a trusted collaboration framework in which parties can communicate with each other over a private network. Nevertheless, the user authentication mechanism, cryptographic algorithms, key exchange procedure, and traffic selector information need to be configured and maintained between the two endpoints in order to establish a trusted VPN tunnel before data transmission begins.
Although the widespread use of classical VPNs can improve data transfer rates with maximum throughput, minimum delay, and a good guarantee against bottlenecks, since every communication route is built as the shortest-path connection with independent IPsec to improve elastic traffic performance, the key exchange procedure during VPN setup remains a major point of vulnerability if either a secret key is intercepted or the key pattern is broken. In addition, most of the random numbers used as secret keys in cryptographic algorithms are derived from mathematical functions. This manner of key generation is one of the potential security vulnerabilities for data communication as high-performance computing makes rapid progress toward reversing the mathematical calculation to find the secret key value.
A one-time password mechanism [3] using quantum keys as a seed value in a hash function can solve this traditional VPN security problem, in that it can eliminate the
spoofing attack, in which an eavesdropper successfully masquerades as another party by falsifying data and thereby gains an illegitimate advantage. The main contribution of this paper is to improve VPN security performance by adopting a one-time password technique that generates a fresh symmetric key each time a VPN tunnel is established, using quantum keys as the seed value. Thus, the two endpoints authenticate themselves in a secure manner that relies on confidential protection. Quantum keys are proposed to avoid repeating the same password several times, since traditional password creation derived from mathematical calculation may lead to system vulnerabilities. Using quantum keys brings a strong security enhancement to password generation, because quantum key distribution (QKD) [4] promises to revolutionize secure communication by providing security based on the fundamental laws of physics [5] instead of the current state of mathematical algorithms or computing technology [6].
This paper is organized as follows. Section II gives an overview of the VPN architecture and mechanism, where the theory is applied to the design process, technical solution, and implementation approach. Section III gives a detailed view of the design of a VPN security architecture for VPN tunnel establishment, since all information is transferred through the corresponding tunnel according to the authorization control. Section IV discusses a comparison and analysis of an existing VPN security method and the proposed idea. Finally, some concluding remarks and future work are given in Section V.
II. VPN ARCHITECTURE AND MECHANISM
Several network encryption technologies have been used to protect private information from eavesdroppers over insecure networks. Currently, VPN encryption technology has become an attractive choice and is widely used for protection against network security attacks.
The VPN encryption mechanism normally operates as a client/server process that establishes a direct tunnel between a source address and a destination address while the virtual private network is built up. All data packets are then passed over the VPN tunnel. Because VPN technology can reduce the network costs incurred by physical leased lines, users can exchange private information with a high degree of protection and trust. In addition, the VPN architecture encompasses the functional areas of authentication [7], confidentiality, and key management. The authentication service is typically used to control users entering the system; only an authorized user is able to proceed to encrypt a tunnel during VPN connection start-up. As a result, the authentication header is inserted between the original IP header and the new IP header, as shown in Figure 1. Next, the confidentiality service provides message encryption to prevent eavesdropping by third parties. Finally, the key management service handles a model of the secure key exchange protocol.
III. DESIGNING A NEW VPN SECURITY ARCHITECTURE
VPN tunnel encryption can be classified into two main methods: public key encryption and symmetric key encryption. This paper addresses only symmetric key encryption, adopting a one-time password mechanism in which a one-time key is used for a single session when the VPN connection starts up and is destroyed upon disconnection. The one-time keys originate from quantum keys used as the seed value of a hash function [8][9]. The mechanism covers both the user authentication process and tunnel establishment in order to protect data integrity. The overall VPN security architecture is divided into three major modules.
A. User Registration Module
To improve the security of a VPN connection, the user registration module is required either on first entry or when a password has expired. The module is activated when a new user enrolls in the system to request a legitimate password. Figure 2 shows the user registration procedure; each individual step is explained below, and a hedged code sketch follows the list. The result of this stage is the password issued to the user, which is essential in the subsequent user authentication and negotiation steps.
1) New user login / password expired: This case occurs for two reasons: a new user registers with the server to obtain a legitimate password, or an existing password has exceeded its lifetime and expired. In either case, the registration phase is activated to generate a new password.
2) Request for the password: The user transfers his or her identity information, including official name, identification or passport number, date of birth, address, and so on, to the server in order to request a legitimate password. The corresponding user information is stored in the user rights database for further reference.
Figure 1. The scenario of VPN encryption techniques
3) Generate a unique 10-digit code: A legitimate password is generated by random selection from a Quantum Key Distribution (QKD) device.
4) Store username and password: The username and the legitimate password created by the QKD device, called quantum keys, are fed into a basic hash function. Only the username and the hash value of the password are stored in the user rights database, so even the server does not know the exact password value.
5) Transfer the legitimate password: The legitimate password is transferred back to the corresponding user over a trusted channel to avoid password attacks.
6) Treated as confidential information: The legitimate password is then used in the authentication phase to verify whether the user is authorized to perform VPN establishment.
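As a hedged illustration of step 4, the sketch below stores only a hash of the username and password, so the server never keeps the plaintext quantum-derived password. The use of std::hash as a stand-in for a cryptographic hash, and the map modeling the user rights database, are illustrative assumptions; the paper does not name a specific hash function or storage layout.

    #include <functional>
    #include <map>
    #include <string>

    // Minimal sketch of registration step 4: persist only the hash of
    // (username, password), never the plaintext quantum-derived password.
    // std::hash is a placeholder for a real cryptographic hash function.
    std::map<std::string, std::size_t> userRightsDb;

    void storeCredential(const std::string& username, const std::string& qkdPassword) {
        std::size_t digest = std::hash<std::string>{}(username + ":" + qkdPassword);
        userRightsDb[username] = digest;  // server cannot recover the password
    }

    bool verifyCredential(const std::string& username, const std::string& password) {
        auto it = userRightsDb.find(username);
        return it != userRightsDb.end()
            && it->second == std::hash<std::string>{}(username + ":" + password);
    }

Under this scheme, the later authentication phase reduces to recomputing the hash and comparing it with the stored value, which matches the comparison described in steps 5 and 6 of the next module.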
B. User Authentication Mechanism Module
The user authentication procedure maintains a high level of security through one-site checking. Before a VPN tunnel is created, users must verify themselves to the server by logging into the system with the username and password obtained during the registration phase. The stages of the process, shown in Figure 3, are similar to the S/Key authentication mechanism [10]. Only an authorized user may proceed to create a secure VPN tunnel. The user authentication procedure is explained as follows.
1) Logging into the system: After the registration phase finishes, a user who wishes to create a secure VPN tunnel proceeds to the authentication phase. The username and password must be submitted to the server to determine whether the user is authorized to perform the task.
2) Receive username and password: Once the service starts, the server waits for user calls. When the server receives a user authentication request, the username and password are temporarily stored as input to the hash function.
3) Password expiration checking: This function examines the password life cycle, since a password whose lifetime exceeds the permitted allowance may weaken security. The password expiration check was therefore introduced to guard against password attacks.
4) Alert that the password is expired: The expiration result is sent back to the corresponding user to indicate whether the password is valid. If it is invalid, the user is returned to the registration phase for re-enrollment; otherwise, the procedure continues.
5) Compute the password hash value: The password hash value is computed from the username and password obtained in the previous step.
6) Compare with the existing password hash value: The calculated hash value is compared with the hash value held in the user rights database.
7) Alert that the username and password are invalid: The comparison result is acknowledged back to the corresponding user; only an authorized user may continue to VPN tunnel establishment.
C. VPN Tunnel Establishment Based On One-Time Password Mechanism Module
The proposed technique adopts two distinctive features: the one-time password mechanism and quantum key exchange. In a one-time password mechanism, each password is used only once and is renewed whenever a new connection is established, eliminating replay, spoofing, and birthday attacks. Moreover, one-time passwords based on a hash chain have an elegant design and attractive properties that achieve high security performance.
Figure 2. User registration phase
Figure 3. User authentication phase
Applying quantum keys as the seed value of the hash function further improves the efficiency and security of the system. Quantum technology uses the polarization property of photons to protect the transmitted keys: the keys cannot be intercepted by an eavesdropper without raising the key error rate above a certain threshold.
The VPN tunnel establishment based on the one-time password mechanism, using quantum keys as the seed input to the hash function, is illustrated in Figure 4. A hash chain value, called the response value, is computed from the password from the registration phase, the quantum seed, and the sequence number in order to establish a highly secure VPN tunnel. The process starts in reverse from the Nth element of the hash chain, where the sequence number identifies the current position of the response value to be used; the value is destroyed after the VPN tunnel is disconnected. The procedure for creating a highly secure VPN tunnel is explained as follows, and a hedged code sketch of the response-value computation is given after the list.
1) Manually copy the quantum key and sequence number: When the VPN tunnel is first set up, the quantum key seed (QKS) is generated from a Quantum Random Number Generator (QRNG) [11], while the sequence number (SN) indicates the hashing order. Both values are distributed manually to the user so that they cannot be attacked in transit. These values become inputs to the hash function. For any further re-establishment of a VPN tunnel, this step is not repeated until the sequence number reaches zero.
2) Generate the response value at the user site: The user combines the legitimate password acquired in the registration phase, the quantum key seed, and the sequence number allocated by the server as the inputs of the hash function. An intermediate response value is generated when the computation finishes.
3) Transfer the response value: The response value from the user site is transmitted to the server for identity comparison.
4) Generate the response value at the server site: The server also generates its response value from the relevant user information stored in the user rights database.
5) Compare the two response values: The two response values are compared with each other. A matching result is used as the symmetric key for VPN tunnel establishment. This step is very attractive because it offers strong protection without performing a key exchange over the insecure network.
6) Establish the VPN tunnel at the server site: If the two response values match, tunnel establishment proceeds. To create its side of the secure connection, the server assigns the user a virtual IP address from the local virtual subnet.
7) Establish the VPN tunnel at the user site: The procedure for setting up the tunnel at the user site mirrors the one at the server site, via the external network interface.
8) Site-to-site VPN tunnel: The user and the server use virtual network interfaces to maintain an encrypted virtual tunnel. All information, including the actual user data and the ultimate source and destination addresses, is carried as payload with an authentication header. Lastly, the virtual IP address is inserted into the packet before transmission.
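The sketch below gives a hedged, minimal reading of the response-value computation described above: an S/Key-style hash chain seeded with the password, quantum seed, and sequence number. The chain-walking direction shown, the concatenation format, and the use of std::hash in place of a cryptographic hash are all illustrative assumptions, not details fixed by the paper.

    #include <cstddef>
    #include <functional>
    #include <string>

    // Hedged sketch: compute the response value as the SN-th element of a
    // hash chain seeded by (password, quantum seed). std::hash stands in
    // for a cryptographic one-way function.
    std::size_t responseValue(const std::string& password,
                              const std::string& quantumSeed,
                              unsigned sequenceNumber) {
        std::hash<std::string> h;
        std::size_t v = h(password + "|" + quantumSeed);  // chain element 0
        for (unsigned i = 0; i < sequenceNumber; ++i) {
            v = h(std::to_string(v));  // walk the chain SN times
        }
        return v;  // used once as the symmetric key, then destroyed
    }

Both sides can compute the same value without ever sending it over the network; after each tunnel, the sequence number moves along the chain, so an observed response value cannot be used to derive the next one, thanks to the one-way property of the hash.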
IV. COMPARISON AND ANALYSIS
This paper compares and analyzes the qualities of VPN connections using different encryption techniques, namely symmetric key encryption and public key encryption. The proposed system is distinguished in that it extends a classical encryption technique with the added features of a one-time password mechanism and quantum keys to improve security performance.
A. Key Pattern Generator and Its Properties
Most highly secure communications depend on two key factors: the quality of the random number used as the cryptographic key, and the complexity of the encryption algorithm. An algorithm produces different output depending on the specific key in use at the time. Key generators come in two kinds. Pseudo-random number generators are algorithms that use mathematical functions, or simply precalculated tables, to produce sequences that appear random but follow a periodic pattern; this weakens key security, since it is feasible to predict the next key value from the existing pattern.
Figure 4. VPN tunnel establishment phase
As a result, using pseudo-random numbers to produce cryptographic keys carries real risk. The proposed technique instead applies quantum keys to both password generation and VPN tunnel encryption. Quantum keys come from a true random number generator based on quantum physics: subatomic particles behave randomly in ways that make the next key value practically impossible to predict. Key generation with an aperiodic pattern and a uniform random distribution raises the quality of the keys and hence the level of data security.
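As a hedged illustration of the distinction drawn above, the sketch contrasts a deterministic pseudo-random generator, whose entire output is fixed by its seed, with std::random_device, which on most platforms draws on a nondeterministic entropy source; it is only a software stand-in for the quantum generator the paper uses.

    #include <iostream>
    #include <random>

    int main() {
        // Pseudo-random: the same seed always reproduces the same key stream,
        // so an attacker who learns the seed can regenerate every key.
        std::mt19937 prng(42);
        std::cout << "PRNG key:   " << prng() << '\n';

        // Nondeterministic source: no seed to recover (hardware entropy on
        // most platforms, though some implementations fall back to a PRNG);
        // the paper's QRNG plays this role with physical quantum randomness.
        std::random_device rd;
        std::cout << "Device key: " << rd() << '\n';
    }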
B. Security Key Protection and Performance
One of the best characteristics of QKD technology is that it offers a promising, in principle unbreakable, way to secure communications: an eavesdropper attempting to intercept the quantum keys during the key exchange stage is detectable, because the attempt introduces an abnormal Quantum Bit Error Rate (QBER). The error rate may also exceed the threshold because of unavoidable disturbances, including imperfect system configuration, noise on the quantum channel, and variation in secret key generation over time. The exploitation of quantum mechanics thus offers near-perfectly secure communications.
C. Key Exchange Protocol
In general, QKD describes the process of using quantum communication to establish a shared secret key between two parties, analogous to a secret key exchange architecture. The proposed technique takes quantum keys from the QKD system and distributes them to each corresponding user in a secure mode. When the server receives a password request, a partial key is dedicated to that user for identification, and some of the quantum keys are assigned as the seed value of the hash function to generate the particular secret key used to establish a VPN tunnel for secure data communications.
D. Mechanism Used
VPN encryption technology promises client/server confidentiality. The proposed technique focuses on establishing a highly secure VPN tunnel by applying the one-time password mechanism with quantum keys, the sequence number, and the user's secret password to produce the response value as a specific symmetric key. Each symmetric key is used once, at a particular time, and destroyed after the VPN tunnel is disconnected. In addition, a new symmetric key is generated periodically, following the properties of the one-way hash function, which enhances data confidentiality and network security protection.
V. CONCLUSIONS AND FUTURE WORK
Improving VPN security by a one-time password technique using quantum keys offers a new mechanism to protect against data snooping by eavesdroppers when data is transmitted over an insecure network. The proposed technique comprises three main stages, user registration, user authentication, and VPN tunnel establishment, which together address a range of vulnerabilities and attacks. The user registration stage performs key generation to create the secret password for the corresponding user for later authentication. The secret password is a truly random number based on QKD technology, which relies on the physics of the key exchange method to defeat eavesdroppers, rather than a pseudo-random number derived from mathematical functions, which is exposed to brute-force attack if the password can be guessed. The user authentication stage provides legitimate users with transparent authentication while managing and monitoring access to private resources; password life-cycle management and hash functions are also applied to close security vulnerabilities. Finally, the VPN tunnel establishment stage, based on the one-time password mechanism, is an attractive approach to building a highly secure virtual private network, in which each key is used only once and then destroyed.
TABLE I. PERFORMANCE COMPARISON OF DIFFERENT ENCRYPTION TECHNIQUES

Features / Properties of VPN Connections | Symmetric Key Encryption | Public Key Encryption | Proposed Mechanism
Key Pattern Generator | Periodic pattern based on mathematical functions | Periodic pattern based on mathematical functions | Random pattern based on quantum phenomena
Key Properties | Pseudo-random number based on mathematical functions | Pseudo-random number based on mathematical functions | True random number based on quantum physics laws
Security Key Protection and Performance | No key protection mechanism provided | No key protection mechanism provided | Quantum Bit Error Rate (QBER) ratio
Key Exchange Protocol | Secret key exchange over one classical link | Public key exchange protocol over one classical link | Secret key exchange over two links (quantum channel and classical channel)
Mechanism Used | Secret key encryption | Public and private key encryption | Secret key encryption based on one-time password mechanism using quantum keys
The proposed mechanism is part of a project on a high-efficiency key management methodology for advanced communication services (a pilot study for a video conferencing system), covering the user authentication and VPN establishment phases, in order to prevent unauthorized access to the restricted network and its resources before quantum keys are distributed for secure video conferencing and other data communication services. Data protection and network security remain the main priorities that IT organizations need to address.
ACKNOWLEDGMENT
The authors would like to thank Dr. Weetit Wanalertlak for invaluable feedback and technical support toward the quality of this research paper. The authors also thank the NECTEC steering committee for research funding and for the valuable opportunity to introduce new approaches to data protection and to improve data reliability over insecure networks. Finally, the authors thank Mr. Sakdinan Jantarachote and all staff of the Optical and Quantum Communications Laboratory (OQC) for their support and encouragement.
REFERENCES
[1] William Stallings, "Cryptography and Network Security: Principles and Practices", Fourth Edition, November 2005.
[2] Kazunori Ishimura, Toshihiko Tamura, Shiro Mizuno, Haruki Sato and Tomoharu Motono, "Dynamic IP-VPN architecture with secure IP tunnels", Information and Telecommunication Technologies, June 2010, pp. 1-5.
[3] Young Sil Lee, Hyo Taek Lim and Hoon Jae Lee, "A Study on Efficient OTP Generation using Stream with Random Digit", International Conference on Advanced Communication Technology 2010, volume 2, pp. 1670-1675.
[4] W. Heisenberg, "Über den anschaulichen Inhalt der quantentheoretischen Kinematik und Mechanik", Zeitschrift für Physik, 43, 1927, pp. 172-198.
[5] W.K. Wootters and W.H. Zurek, "A Single Quantum Cannot be Cloned", Nature, 299, pp. 802-803, 1982.
[6] Erica Klarreich, "Quantum Cryptography: Can You Keep a Secret?", Nature, 418, pp. 270-272, July 18, 2002.
[7] Hyun Chul Kim, Hong Woo Lee, Kyung Seok Lee, Moon Seong Jun, "A Design of One-Time Password Mechanism using Public Key Infrastructure", Fourth International Conference on Networked Computing and Advanced Information Management, September 2008, pp. 18-24.
[8] Harshvardhan Tiwari, "Cryptographic Hash Function: An Elevated View", European Journal of Scientific Research, ISSN 1450-216X, Vol. 43, No. 4 (2010), pp. 452-465.
[9] Peiyue Li, Yongxin Sui and Huaijiang Yang, "The Parallel Computation in One-Way Hash Function Designing", International Conference on Computer, Mechatronics, Control and Electronic Engineering, Aug 2010, pp. 189-192.
[10] C.J. Mitchell and L. Chen, "Comments on the S/KEY user authentication scheme", ACM Operating Systems Review, Vol. 30, No. 4, 1996, pp. 12-16.
[11] ID Quantique White Paper, "Random Number Generation Using Quantum Physics", Version 3.0, April 2010.
Experimental Results on the Reloading Wave Mechanism for Randomized Token Circulation

Boukary Ouedraogo
PRiSM - CARO, UVSQ
45, avenue des Etats-Unis, F-78035 Versailles Cedex, France
Email: [email protected]

Thibault Bernard
CRESTIC - Syscom, URCA
Moulin de la Housse BP-1039, F-51687 Reims Cedex 2, France
Email: [email protected]

Alain Bui
PRiSM - CARO, UVSQ
45, avenue des Etats-Unis, F-78035 Versailles Cedex, France
Email: [email protected]
Abstract—In this paper, we experimentally evaluate the gain of a distributed mechanism called the reloading wave, which accelerates the recovery of a randomised token circulation algorithm. Experimentation is carried out in different contexts: static networks and dynamic networks. The impact of different parameters such as connectivity or frequency of failures is investigated.
I. INTRODUCTION
Concurrency control is one of the most important requirements in distributed systems. The emergence of wireless mobile networks has renewed the challenge of designing concurrency control solutions. These networks require new models and new solutions that take into account their intrinsic dynamicity. In [Ray91], the author classifies concurrency control solutions into two types: quorum-based solutions and token-circulation-based solutions.
Numerous papers deal with token-circulation-based solutions because they are easier to implement: a single circulating token represents the privilege to access the shared resource (unicity of the token guarantees safety, and perpetual circulation among all nodes guarantees liveness). In the context of dynamic networks, random-walk-based solutions have been designed (see [Coo11]).
Properties of random walks allow the design of a traversal scheme using only local information [AKL+79]: such a scheme is not designed for one particular topology and needs no adaptation to other ones. Moreover, random walks offer the interesting property of adapting to the insertion or deletion of nodes or links in the network without modifying any of the functioning rules. With the increasing dynamicity of networks, these features are becoming crucial: redesigning a new browsing scheme at each modification of the topology is impossible.
An important result of this paradigm is that the token will eventually visit (with probability 1) all the nodes of a system. However, it is impossible to capture an upper bound on the time required to visit all the nodes of the system. Only average quantities for the cover time, defined as the average time to visit all the nodes, are available.
The token circulation can suffer different kinds of failures: in particular, (i) situations with no token and (ii) situations with multiple tokens may occur. Both of them have to be managed to guarantee the liveness and safety properties of concurrency control solutions.
The concept of self-stabilization introduced in [Dij74] is the most general technique for designing a system that tolerates arbitrary transient faults. A self-stabilizing system is guaranteed to converge to a legitimate state in finite time, no matter what initial state it starts from. This makes a self-stabilizing system able to recover from transient faults automatically, without any intervention.
To design self-stabilizing token circulation, numerous authors build and maintain spanning structures like trees or rings (cf. [CW05], [HV01]) and use the "counter flushing" mechanism ([Var00]) to guarantee the presence of a single token. In the case of a random-walk-based token circulation, counter flushing cannot be used. In [DSW06], the authors use randomly circulating tokens (which they call agents) to broadcast information in a communication group. To cope with the situation where no agent exists in the system, the authors use a timer based on the cover time of an agent (k × n³). They state as a concluding remark: "The requirements will hold with higher probability if we enlarge the parameter k for ensuring the cover time [...]". In the case of a concurrency control mechanism, obtaining a single token is a strong requirement, and a parameter k that merely increases the probability of reaching a legitimate configuration cannot be used.
We introduced the reloading wave mechanism in [BBS11]. This mechanism ensures the obtention of a single token and hence the safety property of concurrency control solutions.
In this paper we propose an experimental evaluation of this mechanism under different parameters: timeout initialization, connectivity of the network, dynamicity of the network, and failure frequency.
In order to test or validate a solution, the authors of [GJQ09] proposed four classes of methodologies: (i) in-situ, where one executes a real application (program, set of services, communications, etc.) on a real environment (set of machines, OS, middleware, etc.); (ii) emulation, where one executes a real application on a model of the environment; (iii) benchmarking, where one executes a model of an application on a real environment; and (iv) simulation, where one executes a model of an application on a model of the environment. To each of these methodologies corresponds a class of tools: real-scale environments, emulators, benchmarks, and simulators.
In this paper, we adopt the simulation class of methodologies and use simulators, because simulation allows highly reproducible experiments over a large set of platforms and experimental conditions. Simulation tools support the creation of repeatable and controllable environments for feasibility studies and performance evaluation [GJQ09], [SYB04].
Simulation tools for parallel and distributed systems can be classified into three main categories:
(i) Network simulation tools. Network Simulator NS-2 supports several levels of abstraction to simulate a wide range of network protocols via numerous simulation interfaces; it simulates network protocols over wired and wireless networks. SimJava [HM98] provides a core set of foundation classes for simulating discrete events; it simulates distributed hardware systems, communication protocols and computer architectures.
(ii) Simulation tools for grids. The most common tools include: GridSim [BM02], which supports simulation of space-based and time-based, large-scale resources in a Grid environment; SimGrid [CLA+08], which simulates single or multiple scheduling entities and timeshared systems operating in a Grid computing environment, targeting distributed Grid applications and resource scheduling; and Dasor [Rab09], a C++ library for discrete event simulation of distributed algorithms (management of networks with topologies, failure models, mobility models, communication models, and structures such as trees and matrices), based on a multi-layer model (Application, Grid Middleware, Network).
(iii) Simulation tools for peer-to-peer networks. PeerSim [MJ09] supports extreme scalability and dynamicity. It is composed of two simulation engines, a simplified (cycle-based) one and an event-driven one.
In [GJQ09], the comparison between different simulators for networking and large-scale distributed systems shows that any such tool provides very high control of the experimental conditions (limited only by tool scalability) and perfect reproducibility by design. The main differences between the tools are (i) the abstraction level (moderate for network simulators, high for grid ones, and very high for P2P ones), and (ii) the achieved scale, which also varies greatly from tool to tool.
In the second section we present the model of the token circulation algorithm that uses the reloading wave mechanism. In the third section, we propose an experimental evaluation of the reloading wave mechanism. Finally, we conclude the paper by presenting perspectives.
II. THE RELOADING WAVE MECHANISM
The reloading wave mechanism is fully described in [BBS11]. It has been designed to satisfy the specifications of a single token circulation in the presence of faults:
(i) there is exactly one token in the system,
(ii) each component of the system is infinitely often visited by the token.
The random walk token moving scheme ensures that the second part (ii) of the specification is verified (as long as no adversary plays with the mobility of the components against the random moves of the token). Starting from an arbitrary configuration, the first part of the specification can be violated by the following situations:
• absence of token,
• multiple tokens.
To manage the absence of a token, each node sets up a timeout mechanism. Upon a timeout triggering, a node creates a new token, and the absence-of-token situation no longer occurs. The multiple-token situation is managed as in [IJ90]: when several tokens meet on a node, they are merged into one by a merging mechanism. Unfortunately, the combination of the two mechanisms does not guarantee the presence of exactly one token: if a subset of nodes is not visited by the token during a sufficiently long period, token creation can still occur even if a token already exists. The goal of the reloading wave mechanism is to prevent these unnecessary creations of tokens.
This prevention is realized by the token itself: it periodically propagates information meaning that it is still alive. The reloading wave uses several tools for its operation:
• A timeout mechanism: every node in the network runs a timeout procedure, a timer whose value decrements at each clock tick. At the expiration of a node's timer, the corresponding node creates a new token and sends it to one of its neighbors following the random walk moving scheme. Note that several tokens can circulate in the network.
• An adaptive spanning structure of the network topology that is stored in the token. The spanning structure is stored as a circulating word that represents a spanning tree. This tree is used to propagate the reloading wave. Every node that receives a reloading wave message resets its local timer and then propagates the reloading wave message to all its sons according to the spanning tree maintained in the word of the token.
• A hop counter stored in the structure of the token: initialized to zero when the token is created, this hop counter is incremented at every step of the random walk. It is reset to zero each time the node that owns the token triggers a reloading wave propagation.
The different phases of the reloading wave mechanism are the following:
1) Phase of reloading wave triggering
• At the reception of a token on a node, the word content and the hop counter of the token are updated.
• The reloading wave mechanism begins as soon as a node, at the reception of a token, becomes aware that the triggering condition is satisfied. The triggering condition is: the received token's hop counter (NbHop) equals the difference between the initialization value of the timer (Tmax) and the network size (N). In other words, the reloading wave is triggered every (Tmax − N) steps of the token's random walk. During this phase, the hop counter of the token is reset to zero.
• Reloading wave messages are created by nodes at the initiative of a token, more precisely of its hop counter (NbHop).
• Several reloading waves can be created (simultaneously or not) and propagated through the tree maintained in the token word.
2) Phase of reloading wave propagation
The propagation of the wave takes place along an adaptive tree contained in the word of the circulating token. Every node that receives a reloading wave message resets its local timeout and then propagates the wave in turn to all its sons according to the adaptive tree maintained in the token word.
3) Phase of reloading wave termination
The reloading wave mechanism terminates when the reloading waves have reached all nodes of the virtual tree maintained in the token word, or when transient faults have obstructed its diffusion. A hedged sketch of the triggering logic is given below.
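The sketch renders the triggering condition just described in code. The Token structure and the callback name are illustrative assumptions; only the NbHop = Tmax − N test and the counter reset come from the paper.

    #include <vector>

    // Hedged sketch of the reloading wave trigger (names are illustrative).
    struct Token {
        int nbHop = 0;              // incremented at each random-walk step
        std::vector<int> word;      // circulating word encoding a spanning tree
    };

    // Called when a node receives the token. Tmax is the timer initialization
    // value and n the network size, as in the paper.
    void onTokenReceived(Token& t, int Tmax, int n) {
        ++t.nbHop;                          // update hop counter on arrival
        if (t.nbHop == Tmax - n) {          // triggering condition from the paper
            t.nbHop = 0;                    // reset before propagating
            // Propagate a reloading wave along the tree stored in t.word:
            // each receiver resets its local timeout and forwards to its sons.
        }
    }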
The complete implementation of the mechanism can be found in [BBS11]. In the next section, we experiment with the reloading wave mechanism to evaluate its relevance.
III. EXPERIMENTAL RESULTS
Our simulation model is written in C++ using DASOR [Rab09], a C++ library for discrete event simulation of distributed algorithms. The DASOR library provides many useful structures and tools that make it easy to write simulators.
We investigate experimental results of the reloading wave mechanism under three contexts: static network (no node connection/disconnection, no failure), dynamic network (node connection/disconnection, no failure) and network subject to failure (node connection/disconnection, token creation/deletion).
A. Experimental protocol
For each parameter investigated, we measure the time elapsed in a satisfying configuration for two solutions:
1 A solution where a token circulates according to a random walk scheme. A timeout is initialized on each node, to eventually create new tokens upon expiry. The merger mechanism is triggered when several tokens are present on the same node.
2 The same protocol, but with the addition of the reloading wave mechanism as described in the previous section.
A satisfying configuration is a configuration where exactly one token is present in the system. For each set of parameters, we present the result as the difference between the time elapsed in a satisfying configuration with solution 2 and the time elapsed in a satisfying configuration with solution 1.
We evaluate the impact of several important parameters (a sketch of the measured metric follows the list):
• Size of the network
• Timeout initialization
• Mobility range of the nodes
• Frequency of failures
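As a hedged sketch of the metric defined above, the fragment counts the fraction of simulation ticks during which exactly one token is present; the tokenCount callback and the tick loop are illustrative assumptions about the simulator's interface, not DASOR's actual API.

    #include <functional>

    // Hedged sketch: percentage of time spent in a satisfying configuration,
    // i.e. ticks where exactly one token exists in the system.
    double satisfyingRatio(int totalTicks, const std::function<int(int)>& tokenCount) {
        int satisfying = 0;
        for (int t = 0; t < totalTicks; ++t) {
            if (tokenCount(t) == 1) {   // satisfying configuration
                ++satisfying;
            }
        }
        return 100.0 * satisfying / totalTicks;  // reported as a percentage
    }
    // The tables report Delta = T1 - T2, the ratio with the reloading wave
    // minus the ratio without it.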
B. Experimentation
Each experiment is repeated 100 times; all reported results are means over all runs. The standard deviation has been computed and is negligible.
1) Static networks, impact of size and of the timeout initialization: We set the timeout values as a function of the size of the network (n), taking 2n, 3n, 4n, 5n as timeout initializations. Intuitively, the greater the timeout, the smaller the difference between the solutions with and without the reloading wave, since token creation occurs on a timeout trigger (an unnecessary token creation compromises the satisfying configuration). On the other hand, the greater the network size, the better the solution with the reloading wave works, since the mechanism avoids all unnecessary token creations. The solution without the reloading wave mechanism has to ensure the visit of all nodes during a timeout period to avoid token creation, and the greater the size of the network, the more difficult it is to visit all nodes with a random moving policy.
The results are given in Table I in the form T1 − T2 = ∆, where T1 is the percentage of time elapsed in a satisfying configuration with the reloading wave solution, T2 the percentage without the reloading wave mechanism, and ∆ the difference.
Timeout initialization T = f(n)
Size n    2n            3n            4n            5n
50        99-22 = 77%   99-46 = 53%   99-67 = 32%   99-82 = 17%
100       99-16 = 83%   99-38 = 60%   99-58 = 41%   99-74 = 25%
200       99-11 = 88%   99-32 = 67%   99-50 = 49%   99-66 = 33%
300       99-10 = 89%   99-27 = 72%   99-46 = 53%   99-62 = 37%

Table I
DIFFERENCE BETWEEN THE SOLUTION WITH RELOADING WAVE AND THE SOLUTION WITHOUT RELOADING WAVE FOR STATIC NETWORKS
Our intuition is verified: the reloading wave avoids all unnecessary token creations (the system is in a satisfying configuration 99% of the time; the remaining 1% corresponds to the initialization phase, where not enough data has been collected to propagate the reloading wave). The network size decreases the performance of the solution without the reloading wave, while a larger timeout improves it.
2) Dynamic networks, impact of dynamicity and failures: A dynamic network is subject to topological reconfigurations and failures; we investigate the impact of these two parameters on the behavior of the reloading wave mechanism. The two solutions (with and without the reloading wave) have been experimented on a random graph of 300 nodes with a density of 60% (i.e., a link between two nodes exists with probability 0.6 at the initialization of the network).
We have used a mobility pattern where: (i) the movements of nodes are independent of each other, (ii) at any time
there is a fixed number of randomly chosen nodes that are disconnected, and (iii) the duration of a disconnection is set arbitrarily to 1 time unit. This model can be assimilated to the random walk mobility model (cf. [CBD02]).
We set this parameter to get:
• A low mobility pattern: at a given time, 1% of nodes
are disconnected. This value is reasonable for evaluating the performance of the algorithm under the conditions of a slowly moving network.
• An average mobility pattern: at a given time, 5% of nodes are disconnected. This value is reasonable for evaluating the performance of the algorithm under the conditions of a medium-speed moving network.
• A high mobility pattern: at a given time, 10% of nodes are disconnected. This value is reasonable for evaluating the performance of the algorithm under the conditions of a fast-moving network.
In the same way, a failure model has been applied: every token message has the same probability p of failing at every time interval t. We set p to 0.05%, since this seems a realistic value for message loss in a network, and t to (a sketch of this failure model follows the list):
• A low failure pattern: every 1000 turns, every token has a probability of 0.05% of being lost.
• An average failure pattern: every 100 turns, every token has a probability of 0.05% of being lost.
• A high failure pattern: every 10 turns, every token has a probability of 0.05% of being lost.
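The following fragment is a hedged sketch of this failure model; the token container and the turn counter are illustrative assumptions about the simulator's state.

    #include <random>
    #include <vector>

    // Hedged sketch of the failure model: every `interval` turns, each token
    // is lost independently with probability p = 0.05% (0.0005).
    void applyTokenLoss(std::vector<bool>& tokenAlive, long turn, int interval,
                        std::mt19937& rng) {
        if (turn % interval != 0) return;          // only every `interval` turns
        std::bernoulli_distribution lost(0.0005);  // p = 0.05%
        for (std::size_t i = 0; i < tokenAlive.size(); ++i) {
            if (tokenAlive[i] && lost(rng)) {
                tokenAlive[i] = false;             // token i is lost
            }
        }
    }
    // interval = 1000, 100 and 10 give the low, average and high failure patterns.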
Results are given in Table II in the form T1 − T2 = ∆, where T1 is the percentage of time elapsed in a satisfying configuration with the reloading wave solution, T2 the percentage without the reloading wave mechanism, and ∆ the difference.
Token loss frequency
Mob. freq.   None          Low          Average      High
None         99-10 = 89%   9-12 = -3%   0-0 = 0%     0-0 = 0%
Low          34-31 = 3%    30-29 = 1%   26-26 = 0%   25-25 = 0%
Average      34-31 = 3%    29-29 = 0%   26-26 = 0%   25-25 = 0%
High         34-31 = 3%    30-29 = 1%   26-26 = 0%   25-25 = 0%

Table II
DIFFERENCE BETWEEN THE SOLUTION WITH RELOADING WAVE AND THE SOLUTION WITHOUT RELOADING WAVE FOR DYNAMIC NETWORKS
Frequent token loss greatly decreases the performance of both solutions (for the given token loss frequency parameter, between 30% and 25% of satisfying configurations). The gain of the reloading wave seen in the static context becomes marginal when tokens can be lost (less than 1% for the low, average, and high token loss frequencies). This is not surprising: the reloading wave mechanism relies on the persistence of the token. As soon as the token can be lost, the spanning tree stored inside the token cannot be built, and several nodes can then create new tokens.
The impact of mobility is different. Of course, a too frequent mobility pattern also sharply decreases the performance of both solutions, but the reloading wave solution retains a marginal gain over the solution without the reloading wave when there is no token loss (about 3%). We think the reloading wave could be used for networks with a very slow mobility pattern: if the frequency of node movement is low, the spanning tree stored inside the token has enough time to be updated, and the reloading wave mechanism could work correctly.
IV. CONCLUSION
In this paper, we have investigated experimental results on the reloading wave, a mechanism for avoiding unnecessary token creations, in static networks, dynamic networks, and networks subject to failure.
In a static environment, the reloading wave works perfectly (about 99% of satisfying configurations; the remaining 1% most likely corresponds to the initialization of the spanning structure on which the reloading wave is broadcast). Consistent with Table I, the difference between the two solutions (with/without reloading wave) decreases as the timeout value increases and increases with the size of the network.
The mobility of nodes has an impact on the functioning of the reloading wave: node mobility can break the spanning structure used to broadcast the reloading wave. In [BBS11] we exhibited a mobility pattern under which the reloading wave works correctly. In our experimentation this mobility pattern has not been implemented; we believe the mobility used in the experimentation is too strong to fit the criterion of the mobility pattern under which the reloading wave works. A new set of experiments on the mobility pattern is under investigation.
The occurrence of failures has an impact on the reloading wave mechanism. As the mechanism is initiated by the token, frequent token loss sharply decreases the performance of the reloading wave (about 25% of satisfying configurations with the given parameters). In most token circulation algorithms, token loss is considered an improbable event, and recovery has to be managed carefully. Our solution is no exception to the rule: a recovery takes a long time (depending on the timeout value) spent in a set of non-satisfying configurations.
REFERENCES
[AKL+79] R. Aleliunas, R. Karp, R. Lipton, L. Lovasz, and C. Rackoff. Random walks, universal traversal sequences and the complexity of maze problems. In 20th Annual Symposium on Foundations of Computer Science, pages 218-223, 1979.
[BBS11] Thibault Bernard, Alain Bui, and Devan Sohier. Universal adaptive self-stabilizing traversal scheme: random walk and reloading wave. CoRR, abs/1109.3561, 2011.
[BM02] Rajkumar Buyya and Manzur Murshed. GridSim: A toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing. Concurrency and Computation: Practice and Experience (CCPE), 14(13-15):1175-1220, December 2002.
[CBD02] T. Camp, J. Boleng, and V. Davies. A survey of mobility models for ad hoc network research. Wireless Communications and Mobile Computing (WCMC): Special issue on Mobile Ad Hoc Networking: Research, Trends and Applications, 2(5):483-502, 2002.
[CLA+08] Henri Casanova, Arnaud Legrand, and Martin Quinson. SimGrid: a Generic Framework for Large-Scale Distributed Experiments. In 10th IEEE International Conference on Computer Modeling and Simulation, March 2008.
[Coo11] Colin Cooper. Random walks, interacting particles, dynamic networks: Randomness can be helpful. In 18th International Colloquium on Structural Information and Communication Complexity, Gdansk, Poland, June 2011, volume 6796 of Lecture Notes in Computer Science, pages 1-14. Springer, 2011.
[CW05] Yu Chen and Jennifer L. Welch. Self-stabilizing dynamic mutual exclusion for mobile ad hoc networks. J. Parallel Distrib. Comput., 65(9):1072-1089, 2005.
[Dij74] Edsger W. Dijkstra. Self-stabilizing systems in spite of distributed control. Commun. ACM, 17(11):643-644, 1974.
[DSW06] S. Dolev, E. Schiller, and J. L. Welch. Random walk for self-stabilizing group communication in ad hoc networks. IEEE Trans. Mob. Comput., 5(7):893-905, 2006.
[GJQ09] Jens Gustedt, Emmanuel Jeannot, and Martin Quinson. Experimental methodologies for large-scale systems: a survey. Parallel Processing Letters, 19(3):399-418, 2009.
[HM98] Fred Howell and Ross McNab. SimJava: A discrete event simulation package for Java with applications in computer systems modelling. In Proceedings of the First International Conference on Web-based Modelling and Simulation, San Diego, CA, January 1998. Society for Computer Simulation.
[HV01] Rachid Hadid and Vincent Villain. A new efficient tool for the design of self-stabilizing l-exclusion algorithms: The controller. In Ajoy Kumar Datta and Ted Herman, editors, WSS, volume 2194 of Lecture Notes in Computer Science, pages 136-151. Springer, 2001.
[IJ90] Amos Israeli and Marc Jalfon. Token management schemes and random walks yield self-stabilizing mutual exclusion. In PODC, ACM, pages 119-131, 1990.
[MJ09] Alberto Montresor and Mark Jelasity. PeerSim: A scalable P2P simulator. In Proc. of the 9th Int. Conference on Peer-to-Peer (P2P'09), pages 99-100, Seattle, WA, September 2009.
[Rab09] C. Rabat. Dasor, a Discrete Events Simulation Library for Grid and Peer-to-peer Simulators. Studia Informatica Universalis, 7(1), 2009.
[Ray91] Michel Raynal. A simple taxonomy for distributed mutual exclusion algorithms. Operating Systems Review, 25(2):47-50, 1991.
[SYB04] Anthony Sulistio, Chee Shin Yeo, and Rajkumar Buyya. A taxonomy of computer-based simulations and its mapping to parallel and distributed systems simulation tools. Softw. Pract. Exper., 34:653-673, June 2004.
[Var00] George Varghese. Self-stabilization by counter flushing. SIAM J. Comput., 30(2):486-510, 2000.
Statistical-based Car Following Model for Realistic Simulation of Wireless Vehicular Networks
Kitipong Tansriwong and Phongsak Keeratiwintakorn
Department of Electrical Engineering, King Mongkut’s University of Technology North Bangkok,
Bangkok, THAILAND [email protected] and [email protected]
Abstract— At present, research on mobile and wireless vehicular networks has focused on the communication technologies. However, the behavioral study of vehicular networks is also important, and can be costly. Simulation software has therefore been developed and used for research on vehicle movement and the variability of network performance, and the realism of such simulation software is an ongoing research topic. The major problem of simulation software is that the mobility pattern or model of the vehicles under study is unrealistic, owing to the complexity introduced by the variation of driver behavior, and building a realistic model can be so costly that it is impractical for the study. In this research, we propose a realistic mobility model that integrates statistical analysis with the car following model. By using real collected data to create a mobility model based on a probability distribution, integrated with the well-known car following model, vehicular network simulation studies can be made more realistic. The results of this study show the opportunity to incorporate the proposed model into a network simulator such as NCTU-ns or ns-2. Keywords- vehicular network, realistic mobility model, car following models, statistical model
I. INTRODUCTION
Vehicles are part of our business logistics and everyday living. The number of vehicles has increased every year, but road capacity cannot increase at the same rate. This causes many problems, such as accidents, traffic jams, and economic loss in terms of transportation expenses. Another issue is the lack of real-time traffic data that could help mitigate such problems. Wireless vehicular technology has emerged with the goal of enabling communication between vehicles (V2V) or between vehicles and infrastructure (V2I). Such technology allows information exchange from road devices such as detectors to the infrastructure, or to vehicles directly, for faster response to events such as accidents or emergency rescues in the V2I case. In addition, data collected on site can be analyzed into traffic information usable in several ways, for applications in traffic engineering or for a traveler information center (TIC). In the V2V case, traffic information can be broadcast or distributed over a group of vehicles to inform drivers of such events immediately.
V2V communication involves moving vehicles with a variety of movement patterns that can be uncertain, depending on many factors such as driver behavior, road conditions, and traffic conditions. Thus, the mobility model used to simulate vehicle communication scenarios in conjunction with V2V communication can be erroneous compared to the real system. This paper proposes a solution to reduce the error due to the theoretical mobility model of vehicle movement in the simulation. In order to keep the mobility as close to reality as possible, we propose using statistical analysis techniques to analyze the collected data samples and form a statistical model to be integrated with the car following model. The car following model is a movement model proposed in transportation engineering that incorporates driver behavior on multi-lane road infrastructure. The outcome of our proposed model can be used in available simulation software such as NCTU-ns [1], ns-2 [2], or ONE [3].
II. RELATED WORK
Many researchers have proposed studies involving mobility models of vehicles on the road that can be applied to V2V communication, in both the traffic engineering and computer engineering approaches. It is therefore necessary to study the mobility models in their different forms and the different simulation software packages.
A. VANET mobility models
Several mobility models have been proposed for use in vehicular network (VANET) simulation studies, such as the Freeway model, the Manhattan model, and the City Section model.
Freeway is a map-generated model as defined in [4]. This model was tested on highways or streets without traffic-light junctions. The movement of vehicles in the traffic is forced to follow the vehicle in front; there is no overtaking or lane changing. The speed of a vehicle in this model is defined by a history-based speed and a random acceleration. When the distance to the vehicle ahead in the same lane is less than the model specifies, the acceleration becomes negative. The movement pattern of this model is not realistic. An example of vehicle movement in the Freeway model is shown in Figure 1.
Figure 1: The vehicle movement in the Freeway model
Manhattan is also a map-generated model, introduced in [5] to simulate an urban environment. Each road has crossroads. Using the same speed model as the Freeway model with an increased number of traffic lanes, the model allows lane changes at a traffic intersection. The direction of movement at an intersection is chosen with a probability, and a vehicle that has been moved cannot be stopped. An example of vehicle movement in the Manhattan model is shown in Figure 2.

Figure 2: The vehicle movement in the Manhattan model

The City Section Mobility model [6] is created by combining the principles of the Random Way Point model and the Manhattan model. By adding a pause time and a random selection of the destination, the model uses a map based on the Manhattan model, with a random number of vehicles on the road, and movement to a given destination using the shortest-path algorithm. The speed of movement depends on the distance between a moving vehicle and the vehicle ahead, and on the maximum speed of the road. The downside of this model is the use of a grid-like map that is not as complex as realistic road networks. An example of the City Section Mobility model is shown in Figure 3.

Figure 3: The pattern of the vehicle location in the City Section Mobility model

B. Simulation Software Package

Along with the mobility models come the simulation software packages for vehicular networks. Several simulation software packages offer the use of several mobility models, as described in this section.

MOVE [7] is a software package that generates a traffic trace based on a realistic mobility model for VANETs and works with the SUMO simulation software. SUMO [8] is a vehicular traffic simulator that includes the management of traffic elements, such as traffic light signals, and the model of traffic lanes for vehicular communication. The SUMO program can import a real map to analyze traffic and to generate a mobility model based on such traffic. The MOVE program is built with Java and can import a real map file format such as the TIGER Line file format or the Google Earth KML file. It can specify the properties of each car, such as speed and acceleration. MOVE operates and interfaces with SUMO to generate a trace file to be used in a network simulation such as ns-2. Figure 4 shows a snapshot of the SUMO and MOVE software.

Figure 4: SUMO and MOVE

VanetMobiSim [9] is a modifier of the CANU Mobility Simulation Environment, which is based on flexible frameworks for mobility modeling. CANU MobiSim is written in the Java language and can generate movement trace files in different formats, supporting different simulation tools for mobile networks. CANU MobiSim originally includes parsers for maps in the Geographical Data Files (GDF) format and provides implementations of several random mobility models, as well as models from physics and vehicular dynamics. VanetMobiSim is designed for vehicular mobility modeling and features realistic automotive motion models at both the macroscopic and microscopic levels. At the macroscopic level, VanetMobiSim
20
can import maps from the TIGER Line database, or randomly generate them using Voronoi tessellation. VanetmobiSim can be support for multi-lane roads, separate directional flows, differentiated speed constraints and traffic signs at intersections. At the microscopic level, VanetMobiSim implements mobility models, providing realistic V2V and V2I interaction. According to these models, vehicles regulate their speed depending on nearby cars, overtake each other and act according to traffic signs in presence of intersections. Figure 5 shows snapshot of the the VanetMobiSim software.
Figure 5: The snapshot of the VanetMobiSim software
C. The Car Following Model
In car following models, the behavior of each driver is described in relation to the vehicle ahead. Since each single car is regarded as an independent entity, the car following model falls into the category of microscopic-level mobility models. Each vehicle in the car following model computes its speed or its acceleration as a function of factors such as the distance to the front car, the current speed of both vehicles, and the availability of the side lane. Figure 6 shows an example of the vehicle movement calculation in the car following model. Each vehicle is assigned its lane, i or j, and its location on the road segment. At each time slot, each vehicle's speed, acceleration, and lane change probability are calculated based on the microscopic view of the road network and traffic. As a result, each vehicle may increase or decrease its speed and/or acceleration. In addition, a vehicle may overtake or change lanes with the assigned probability when the side lanes are available. The speed and acceleration of a vehicle keep changing based on conditions arising during the simulation, such as a car stopping, congestion, or a traffic light at a crossroads. The change is based entirely on the random model, without any integration of road structure properties, such as lane narrowness, that can affect the speed of vehicles. In addition, as roads are connected in a network, they differ in types, structures, and sometimes driving culture, depending on the country or city; these can be classified as localized parameters. Therefore, the car following model is a realistic model suitable for vehicular network simulation, but it lacks an adaptive method to change the driving behavior according to those localized parameters. A hedged sketch of a per-time-slot car-following update is given after Figure 6.
Figure 6 The example of the car following model
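As a hedged illustration of the per-time-slot update just described, the sketch below adjusts a vehicle's speed from the gap to the front car. The safe-gap rule and all parameter values are illustrative assumptions, not the specific car following equations used by the authors, and lane changing is omitted for brevity.

    // Hedged sketch of one microscopic car-following update step.
    // All constants and the safe-gap rule are illustrative assumptions.
    struct Vehicle {
        double position;  // metres along the road segment
        double speed;     // m/s
    };

    void followStep(Vehicle& car, const Vehicle& front, double dt) {
        const double accel   = 1.5;                      // m/s^2, gentle acceleration
        const double decel   = 3.0;                      // m/s^2, braking
        const double safeGap = 2.0 * car.speed + 5.0;    // speed-dependent gap

        double gap = front.position - car.position;
        if (gap < safeGap) {
            // too close: brake, never below a standstill
            car.speed = (car.speed > decel * dt) ? car.speed - decel * dt : 0.0;
        } else {
            car.speed += accel * dt;                     // free road: speed up
        }
        car.position += car.speed * dt;
    }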
III. THE PROPOSED STATISTICAL-BASED CAR FOLLOWING MODEL
The proposed design technique for a realistic mobility model for use in wireless vehicular network simulation is described in this section. First, the statistical model is presented. Owing to the lack of integration of localized parameters in the calculation of each vehicle's speed and acceleration, we propose a statistical model of each road, created from data collected on that road, as a representative for the calculation in the car following model. Figure 7 shows the block process of our proposed integration with the car following model.
Based on the variety of vehicle speed data collected from different road types, we analyze the data and find a representative of each data set as a probability function. The probability function is used to generate vehicle speed data for a specific period based on the ID of the road on the map (Map ID). The outcome of the function, which is the speed, is used as an input to the car following model to calculate the speed, acceleration, and lane change probability during that period. The length of the period is a tradeoff between the additional workload on the simulation and the realism of the model. A hedged sketch of this sampling step follows.
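The sketch illustrates this integration under stated assumptions: each road ID maps to a fitted distribution (Weibull here, matching the fits reported in Section IV for the expressway and major road), and a sampled speed feeds the car-following calculation for the next period. The shape and scale values are placeholders, not the authors' fitted parameters.

    #include <map>
    #include <random>

    // Hedged sketch: per-road statistical speed generator feeding the
    // car following model. Shape/scale values are placeholders, not the
    // fitted parameters reported in the paper.
    struct RoadSpeedModel {
        double shape;   // Weibull shape parameter
        double scale;   // Weibull scale parameter (km/h)
    };

    double sampleSpeed(int mapId, std::mt19937& rng,
                       const std::map<int, RoadSpeedModel>& models) {
        const RoadSpeedModel& m = models.at(mapId);
        std::weibull_distribution<double> dist(m.shape, m.scale);
        return dist(rng);  // speed used as input to the car following model
    }

    // Example (hypothetical parameters): road 1 as an expressway-like model,
    // road 2 as a major road.
    // std::map<int, RoadSpeedModel> models = {{1, {4.0, 110.0}}, {2, {3.0, 65.0}}};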
Figure 8 shows the process of data collection on each specific road for our study. We use a Nokia mobile phone running our own software on Symbian OS to collect data such as the current location of the vehicle and the current vehicle speed. The locating device used in our data collection is a Global Positioning System (GPS) receiver connected to the mobile phone via a Bluetooth connection.
Several research studies of the movement patterns of vehicles examine the effect of road types by collecting real data for road traffic verification and validation. The road types currently in use can be divided into the expressway or freeway road system, the major or arterial road, the collector road and the local road.
Figure 7 The proposed statistical-based car following model
Figure 8 The data collection process for road-specific information
After we collect the data on different road types, we use a statistical method to fit curves to the collected data. The distribution function resulting from the curve fitting is then verified and validated. Once we have the distribution function, we implement it into the calculation of the car following model specific to those road types.
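A minimal sketch of such a curve-fitting step, assuming SciPy and using a Kolmogorov-Smirnov test as one possible goodness-of-fit check (the paper does not state which verification test was used):

```python
import numpy as np
from scipy import stats

def fit_speed_distribution(speeds):
    """Fit Weibull and Gamma candidates to the collected speed samples and
    keep the family with the higher Kolmogorov-Smirnov p-value."""
    speeds = np.asarray(speeds, dtype=float)
    best = None
    for family in (stats.weibull_min, stats.gamma):
        params = family.fit(speeds, floc=0.0)       # pin the location at zero
        p_value = stats.kstest(speeds, family.name, args=params).pvalue
        if best is None or p_value > best[2]:
            best = (family.name, params, p_value)
    return best   # e.g. ("weibull_min", (shape, loc, scale), p_value)
```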
IV. MEASUREMENTS AND RESULTS
We collect traffic data (the vehicle location and speed) on four types of roads: the expressway, the major road, the collector road and the local road. These road types have different properties that can affect the behavior of drivers. We then propose a statistical model for each road type based on the curve fitting method.
A. The Model of Expressway
The sample of the expressway that we used to collect data in Bangkok is the Ngamwongwan-Chiang Rak road. We collect the speed data at a sampling interval of 2 seconds per sample. The location samples of the vehicle running on the Ngamwongwan road are shown as yellow dots in Figure 9.
Figure 9 The location of vehicles on the Expressway road
In total, we collect about 786 samples of speed data. The result of curve fitting the collected speed data on the expressway road is shown in Figure 10. The statistical model that fits the distribution of the speed data is the Weibull function, where most vehicle speeds are high due to the nature of the expressway, which has no traffic lights and wide lanes. The peak speed from this collection is around 110 km/hr.
Figure 10 The PDF function of the speed on the Expressway.
B. The Model of Major Road
For the major road data collection, we chose the Vibhavadi road, which is a straight and long road with two or three traffic lanes in each direction. Although the road has no traffic lights, temporary stops may occur due to the density of cars or the movement of trucks. We collect a speed sample every 2 seconds, and in total we have 1227 values in the collection. Figure 11 shows the result of the curve fitting method over the collected data. It shows that the speed distribution on the major road is also a Weibull distribution. However, the parameters of the function are different. For example, the peak speed of the major road is around 65 km/hr, which is much less than that of the expressway.
Figure 11 The PDF function of the speed on the major road.
C. The Model of Collector Road
In this collection, we chose the Charansanitwong road as our example of the collector road. The road has two lanes in each direction. It has a few traffic lights with short stop times. The samples are collected every 2 seconds, and in total we have 511 values. Figure 12 shows the result of the curve fitting method on the collected data. It shows that the result of the curve fitting is the Gamma distribution function. Due to the traffic lights, vehicles may stop and slow down. As a result,
most of the speed samples are in the lower range, which results in the Gamma function.
Figure 12 The PDF function of the speed on the collector road.
D. The Model of Local Road
In Bangkok, many local roads are used to connect major and collector roads. Vehicles on local roads tend to be stopped at traffic lights for longer periods due to their low priority in traffic light management. In addition, local roads have many traffic lights and many junctions without traffic lights, as well as cars parked alongside the road, all of which tend to slow down the traffic. For our data collection, we chose the Prachacheun road since the road is straight but has many traffic lights. It also crosses many major roads such as Tivanon Road and Vibhavadi Road. We collect a data sample every 2 seconds, and in total we have 382 values. The result of curve fitting the data collected on the local road is shown in Figure 13. The distribution of the collected data can be represented as a Gamma distribution, similar to that of the collector road. This is due to the nature of a road with traffic lights, where vehicles may be stopped or slowed down. However, the average vehicle speed of the local road from the distribution is much lower than that of the collector road.
From the results, it is shown that the distribution functions of the speed of vehicles on different types of roads can be different. The speed distribution function is not uniformly random as assumed in several mobility models for network simulation. The distributions of roads tend to be either Weibull or Gamma distribution functions. The speed distribution of a road with no traffic lights is probably a Weibull distribution, while that of a road with traffic lights is probably a Gamma distribution. The average speed of each road type tends to be affected by the number of lanes and other properties, such as the narrowness of the available lanes, which can be narrowed by cars parked along the road. In addition, the average speed of a road can be unique to that road. Further studies are required to find the "identity" or the "fingerprint" of such a road.
Figure 13 The PDF function of the speed on the local road.
V. CONCLUSION
In this paper, we proposed a statistical-based car following model concept that integrates the uniqueness of each road type into the speed calculation of the car following model. The uniqueness of a road can arise from the different road structures that are very specific to each area of the road network. The behavior of drivers on each road type can be different. We have found that road types such as the expressway or the major road tend to have a Weibull speed distribution function, but with different average speeds. However, for the collector road type and the local road type, the speed distribution function tends to be a Gamma distribution, again with different average speeds. It may be concluded that the speed on a road without traffic lights may be modeled as a Weibull distribution, and that on a road with traffic lights may be modeled as a Gamma distribution. However, the average speed of the distribution may vary based on the nature of each road. Further research is necessary to investigate the details of the traffic data.
REFERENCES
[1] NCTU-ns simulation software, available at http://nsl.csie.nctu.edu.tw/nctuns.html, last accessed: 30/1/2012.
[2] The ns-2 simulation software, available at http://www.isi.edu/nsnam/ns/, last accessed: 30/1/2012.
[3] The ONE simulation software, available at http://www.netlab.tkk.fi/tutkimus/dtn/theone/, last accessed: 30/1/2012.
[4] F. Bai, N. Sadagopan, A. Helmy, "Important: a framework to systematically analyze the impact of mobility on performance of routing protocols for ad hoc networks," in Proc. 22nd IEEE Annual Joint Conference on Computer Communications and Networking INFOCOM'03, 2003, pp. 825-835.
[5] V. Davies, "Evaluating mobility models within an ad hoc network," Colorado School of Mines, Colorado, USA, Tech. Rep., 2000.
[6] F. K. Karnadi, Z. H. Mo, K. Chan Lan, "Rapid generation of realistic mobility models for VANET," in Proc. IEEE Wireless Communications and Networking Conference, 2007, pp. 2506-2511.
[7] MOVE mobility model, available at http://lens.csie.ncku.edu.tw/Joomla_version/index.php/research-projects/past/18-rapid-vanet, last accessed: 30/1/2012.
[8] "SUMO Simulation of Urban Mobility," available at http://sumo.sourceforge.net/, last accessed: 30/1/2012.
[9] J. Härri, F. Filali, C. Bonnet, and M. Fiore, "VanetMobiSim: Generating Realistic Mobility Patterns for VANETs," in Proc. VANET '06: 3rd International Workshop on Vehicular Ad-hoc Networks, ACM Press, 2006, pp. 96-97.
Rainfall Prediction in the Northeast Region of Thailand
using Cooperative Neuro-Fuzzy Technique
Jesada Kajornrit1, Kok Wai Wong2, Chun Che Fung3
School of Information Technology, Murdoch University
South Street, Murdoch, Western Australia, 6150
Email: [email protected], [email protected], [email protected]
Abstract—Accurate rainfall forecasting is a crucial task for reservoir operation and flood prevention because it can provide an extension of lead-time for flow forecasting. This study proposes two rainfall time series prediction models, the Single Fuzzy Inference System and the Modular Fuzzy Inference System, which use the concept of the cooperative neuro-fuzzy technique. The case study is located in the northeast region of Thailand and the proposed models are evaluated on four monthly rainfall time series datasets. The experimental results showed that the proposed models could be a good alternative method that provides both accurate results and a human-understandable prediction mechanism. Furthermore, this study found that when the number of training data was small, the proposed models provided better prediction accuracy than artificial neural networks.
Keywords-Rainfall Prediction; Seasonal Time Series; Artificial Neural Networks; Fuzzy Inference System; Average-Based Interval.
I. INTRODUCTION
Rainfall forecasting is indispensable for water management because it can provide an extension of lead-time for the flow forecasting used in water strategic planning. This is especially important when it is used in reservoir operation and flood prevention. Usually, rainfall time series prediction has used conventional statistical models and Artificial Neural Networks (ANN) [8]. However, such models are difficult for human analysts to interpret because the prediction mechanism is in parametric form. From a hydrologist's point of view, the accuracy of prediction and an understanding of the prediction mechanism are equally important.
A Fuzzy Inference System (FIS) uses the process of mapping from a given set of input variables to outputs based on a set of human-understandable fuzzy rules [19]. In the last decades, FIS has been successfully applied to various problems [3], [4]. An advantage of FIS is that its decision mechanism is interpretable. As fuzzy rules are closer to human reasoning, an analyst can understand how the model performs the prediction. If necessary, the analyst can also make use of his/her knowledge to modify the prediction model [5]. However, the disadvantage of FIS is its lack of ability to learn from the given data. In contrast, an ANN is capable of adapting itself from training data. In many cases where human understanding of the physical process is not clear, ANN has been used to learn the relationship between the observed data [6]. However, the disadvantage of ANN is its black-box nature, which is difficult to interpret. In order to combine the advantages of both models, this paper proposes two rainfall time series prediction models, the Single Fuzzy Inference System (S-FIS) and the Modular Fuzzy Inference System (M-FIS), which use the concept of the cooperative neuro-fuzzy technique.
This paper is organized as follows: Section 2 discusses the related works and Section 3 describes the case study area. Input identification and the proposed models are presented in Sections 4 and 5 respectively. Section 6 shows the experimental results. Finally, Section 7 provides the conclusion of this paper.
II. SOFT COMPUTING TECHNIQUES IN HYDROLOGICAL TIME SERIES PREDICTION
In the hydrological discipline, rainfall is relatively more difficult to predict than other climate variables such as temperature. This is due to the highly stochastic nature of rainfall, which shows a high degree of spatial and temporal variability. To address this challenge, ANN has been adopted in the past decades. For example, Coulibaly and Evora [7] compared six different ANNs to predict daily rainfall data. Among the different types of ANN, they suggested that the Multilayer Perceptron, the Time-lagged Feedforward Network, and the Counter-propagation Fuzzy-Neural Network provided higher accuracy than the Generalized Radial Basis Function Network, the Recurrent Neural Network and the Time Delay Recurrent Neural Network. Another work is that of Wu et al. [8], who proposed the use of data-driven models with data preprocessing techniques to predict precipitation data on daily and monthly scales. They proposed three preprocessing techniques, namely Moving Average, Principal Component Analysis and Singular Spectrum Analysis, to smooth the time series data. Somvanshi et al. [1] confirmed in their work that ANN provided better accuracy than an ARIMA model for daily rainfall time series prediction.
Time series prediction is used not only for rainfall data but also for streamflow and rainfall-runoff modeling. Wang et al. [9] compared several computational models, namely Auto-Regressive Moving Average (ARMA), ANN, Adaptive Neural-Fuzzy Inference System (ANFIS), Genetic Programming (GP) and Support Vector Machine (SVM), to predict monthly discharge time series. Their results indicated that ANFIS, GP and SVM provided the best performance. Lohani [10] compared ANN, FIS and a linear transfer model for daily rainfall-runoff modeling under different input domains. The results showed that FIS outperformed the linear model and ANN. Nayak et al. [11] and Kermani et al. [12] proposed the use of the ANFIS model for river flow time series. In addition, Jain and Kumar [13] applied conventional preprocessing approaches (de-trending and de-seasonalizing) to ANN for streamflow time series data.
Figure 1. The case study area is located in the northeast region of Thailand. The positions of four rainfall stations are illustrated by star marks.
Up to this point, among all the works mentioned, FIS itself has not been used as widely as ANN for time series prediction. Especially for rainfall time series prediction, reports on applications of FIS are limited. Thus, the primary aim of this study is to investigate an appropriate way to use FIS for the rainfall time series prediction problem.
III. CASE STUDY AREA AND DATA
The case study described in this study is located in the northeast region of Thailand (Fig 1). The four selected rainfall time series are depicted in Fig 2. Table 1 shows the statistics of the datasets used. The data from 1981 to 1998 were used to calibrate the models and the data from 1999 to 2001 were used to validate the developed models. This study used the models to predict one step ahead, that is, one month. To validate the models, the Mean Absolute Error (MAE) is adopted as given in equation (1). The Coefficient of Fit (R) is also used to confirm the results. The performance of the proposed models is compared with conventional Box-Jenkins (BJ) models: Autoregressive (AR), Autoregressive Integrated Moving Average (ARIMA) and Seasonal Autoregressive Integrated Moving Average (SARIMA) [1], [8], [10], [13], [15].
$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\lvert O_i - P_i \rvert$ (1)
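For reference, equation (1) and the coefficient of fit can be computed directly; here R is assumed to be the linear (Pearson) correlation between observed and predicted values, which the paper does not state explicitly.

```python
import numpy as np

def mae(observed, predicted):
    """Mean Absolute Error, equation (1)."""
    o, p = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(o - p)))

def coefficient_of_fit(observed, predicted):
    """Coefficient of fit R, taken here as the linear correlation."""
    o, p = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.corrcoef(o, p)[0, 1])
```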
TABLE I. DATASETS’ STATISTICS
Statistics TS356010 TS381010 TS388002 TS407005
Mean 1303.34 889.04 1286.28 1319.70
SD 1382.98 922.99 1425.88 1346.80
Kurtosis -0.10 0.808 0.532 -0.224
Skewness 0.95 1.080 1.131 0.825
Minimum 0 0 0 0
Maximum 5099 4704 6117 5519
Longitude 104.13E 102.88E 104.05E 104.75E
Latitude 17.15N 16.66N 16.65N 15.50N
Altitude (m) 176 164 155 129
Figure 2. The four selected monthly rainfall time series used in this study (TS356010, TS381010, TS388002, TS407005).
IV. INPUT IDENTIFICATION
In general, the inputs of a time series model are normally based on previous data points (lags). For BJ models, the analysis of the autocorrelation function (ACF) and the partial autocorrelation function (PACF) is used as a guide to identify the appropriate input. However, in the case of ANN or other related non-linear models, there is no theory to support the use of these functions [14]. Although some literature addressed the applicability of ACF and PACF to non-linear models [15], other literature preferred to conduct experiments to identify the appropriate input [11].
This study conducted an experiment to find an appropriate input based on data from five rainfall stations. Data from 1981 to 1995 were used for calibration and data from 1996 to 1998 were used for validation. By increasing the number of lags fed to the ANNs, six different input models were prepared and tested. To predict x(t), the first input model is x(t-1), the second input model is x(t-1), x(t-2), and so on. Fig 3 shows the results of the experiment. In this figure, average normalized MAEs from the five time series are illustrated by the bold line. The results show that the MAE is lowest at lag 5, so the five-previous-lags model is expected to be an appropriate input. Since increasing the number of input lags does not significantly improve the prediction performance, additional methods may be needed.
In the case of seasonal data, there are other methods to identify an appropriate input to improve the prediction accuracy, for example, using Phase Space Reconstruction (PSR) [16] or adding a time coefficient as a supplementary feature [2]. However, the first method needs a large number of training data. According to "the curse of dimensionality", when the number of input dimensions increases, the number of training data must increase as well [17]. In this case study, the number of records is limited to 15 years, which could be considered relatively small. Therefore it is more appropriate to add the time coefficient.
The time coefficient (Ct) is used to help the model scope the prediction to a specific period. It may be Ct = 2 (wet and dry periods), Ct = 4 (winter, spring, summer and fall), or Ct = 12 (calendar months). This study adopted Ct = 12 as the supplementary feature. In Fig 3, Ct is added to the original input data and tested with ANNs (light line). The results show that using Ct with 2 previous lags provides the lowest average MAE and can improve the prediction performance by up to 26% (dash line). So, the appropriate input used in this study should be rainfall from lag 1, lag 2 and Ct.
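A minimal sketch of this input construction, assuming the monthly series is stored oldest-first and starts in January (the function and variable names are illustrative):

```python
import numpy as np

def build_inputs(rainfall, ct_period=12):
    """Build pairs ([x(t-1), x(t-2), Ct], x(t)) from a monthly series.
    Assumes the series is ordered oldest-first and starts in January,
    so Ct is the calendar-month index 1..12."""
    X, y = [], []
    for t in range(2, len(rainfall)):
        ct = (t % ct_period) + 1
        X.append([rainfall[t - 1], rainfall[t - 2], ct])
        y.append(rainfall[t])
    return np.array(X), np.array(y)
```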
This experimental result is related to the work of Raman and Sunilkumar [18], who studied monthly inflow time series. In the hydrological process, inflow is directly affected by rainfall; consequently, the characteristics of the flow graph and the rainfall graph are rather similar. They suggested using data from 2 previous lags for the ANN models; however, instead of using a single ANN, they created twelve ANN models, one for each specific month, and used the month to select the associated model to feed data into. If one considers this model as a black box, one can see that their input is inflow from 2 previous lags plus Ct, which is relatively similar to this study.
Figure 3. Average MAE measure of ANN models among different inputs.
V. THE PROPOSED MODELS
This paper adopted the Mamdani fuzzy inference system [20] since such a model is more intuitive than the Sugeno approach [21]. To reduce the computational cost, triangular Membership Functions (MF) are used. This study proposes two FIS models, namely the Single Fuzzy Inference System (S-FIS) and the Modular Fuzzy Inference System (M-FIS), which use the concept of the cooperative neuro-fuzzy technique. In the S-FIS model, there is one single FIS model; rainfall data from lag 1, lag 2 and Ct are fed directly into the model. In the M-FIS model, there are twelve FIS models associated with the calendar months; Ct is used to select the associated model to feed in the rainfall data from lag 1 and lag 2. The architectural overview of these two models is shown in Fig 4.
Fig 5 shows the general steps to create these FIS models. The first step is to calculate the appropriate interval length between two consecutive MFs and then generate the Mamdani FIS rule base model. At this step, the Average-Based Interval is adopted. The second step is to create the fuzzy rules. In this study, a Back-Propagation Neural Network (BPNN) is used to generalize from the training data and is then used to extract the fuzzy rules.
Figure 4. The architectural overview of the S-FIS (top) and M-FIS (bottom) models
Figure 5. General steps to create the S-FIS and M-FIS models
In the S-FIS model, the MFs of Ct are simply depicted in Fig 6 (a). For the rainfall input, it is very important to define the interval length between two consecutive MFs. When the length of the interval is too large, it may not be able to represent the fluctuation in the time series. On the other hand, when it is too small, the objective of the FIS will be diminished.
Huarng [22] proposed the Average-Based Interval to define the appropriate interval length of MFs for fuzzy time series data, based on the concept that "at least half of the fluctuations in the time series are reflected by the effective length of interval". The fluctuation in time series data is the absolute value of the first difference of any two consecutive data points. In this method, half of the average value of all fluctuations in the time series is defined as the interval length between two consecutive MFs. This method was successfully applied in the work reported in [23]. In this paper, the method is adapted slightly to fit the nature of rainfall time series for this application.
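A short sketch of the average-based interval calculation as described above, including the paper's two-length adaptation around the 50th percentile (function names are illustrative):

```python
import numpy as np

def average_based_interval(series):
    """Half of the mean absolute first difference: the length between
    two consecutive MFs."""
    fluctuation = np.abs(np.diff(np.asarray(series, dtype=float)))
    return fluctuation.mean() / 2.0

def two_area_intervals(series):
    """The paper's adaptation: one interval length below and one above
    the 50th percentile of the data."""
    s = np.asarray(series, dtype=float)
    boundary = np.percentile(s, 50)
    return (average_based_interval(s[s <= boundary]),
            average_based_interval(s[s > boundary]),
            boundary)
```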
Figure 6. An example of membership functions in TS356010’s S-FIS model, Ct (a) and Rainfall (b)
Fig 6 (b) shows the rainfall MFs of the S-FIS from station TS356010. One can see that there are two interval lengths. The point where the interval length changes is around the 50th percentile of all the data. The data are separated into the lower area and the upper area using the 50th percentile as the boundary. Average-based intervals are calculated for both areas. Since the beginning and ending rainfall periods have smaller fluctuation than the middle period, using a smaller interval length there is more appropriate [2]. In the M-FIS model, using two interval lengths is not necessary since each sub-model is created according to the specific month.
As mentioned before, the drawback of FIS is its lack of ability to learn from data. Such a model needs experts or other supplementary procedures to help create the fuzzy rules. In this study, the proposed methodology uses a BPNN to learn the generalization features from the training data [5]; the network is then used to extract fuzzy rules. Once the BPNN has been used to extract the fuzzy rules, it is not used anymore. The steps to create fuzzy rules are as follows:
Step 1: Train the BPNN with the training data. At this step, the BPNN learns and generalizes from the training data.
Step 2: Prepare the set of input data. The set of input data, in this case, consists of all the points in the input space where the membership degree of the FIS's input MF is 1 in all dimensions. These input data form the premise part of the fuzzy rules.
Step 3: Feed the input data into the BPNN; the outputs of the BPNN are mapped to the nearest MF of the FIS's output. These output data form the consequence part of the fuzzy rules.
For example, considering the MFs in Figure 6, the input-output [3, 500, 750:1700] is replaced with the fuzzy rule "IF Ct=Mar and Lag1=A3 and Lag2=A4 THEN Predicted=A6". This step uses a 1-hidden-layer BPNN. The number of hidden nodes and input nodes is 3 for S-FIS and 2 for M-FIS.
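Steps 2 and 3 can be sketched as follows, with the trained BPNN abstracted as a generic `predict` callable; the function and argument names are hypothetical, not the authors' implementation:

```python
import itertools
import numpy as np

def extract_rules(predict, input_mf_peaks, output_mf_peaks, output_labels):
    """Steps 2-3: evaluate the trained network at every point where all
    input MFs peak (membership degree 1), then map each network output
    to the nearest output MF to form a rule's consequence.

    predict        -- trained BPNN wrapped as a callable, vector -> scalar
    input_mf_peaks -- one list of MF peak values per input dimension
    """
    rules = []
    for premise in itertools.product(*input_mf_peaks):
        out = predict(np.array(premise, dtype=float))
        nearest = int(np.argmin(np.abs(np.asarray(output_mf_peaks) - out)))
        rules.append((premise, output_labels[nearest]))
    return rules
```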
VI. EXPERIMENTAL RESULTS
The experimental results are shown in Table 2 and Table 3. In the tables, S-ANN and M-ANN are the neural networks used to create the fuzzy rules for S-FIS and M-FIS respectively. In fact, the S-ANN and M-ANN are themselves also prediction models. The performance of S-ANN and S-FIS is quite similar. It can be noted that the conversion from ANN-based to FIS-based does not reduce the prediction performance of the ANN. However, this conversion improves the S-ANN model from a qualitative point of view, since S-FIS is interpretable with a set of human-understandable fuzzy rules. The interesting point is the performance of M-ANN versus M-FIS: this conversion can improve the performance of M-ANN.
Next, the proposed models have been compared with three conventional BJ models. The comparison results are depicted in Fig 7. Since the results from the MAE and R measures are consolidated, these experimental results are rather consistent. Similar to the work by Raman and Sunilkumar [18], the AR model uses degree 2 because it then uses the same input as the proposed models. The ARIMA and SARIMA models used in the study were automatically generated and optimized by statistical software. However, these generated models were also rechecked to ensure that they provided the best accuracy.
[Figure 6 panels: (a) membership functions of Ct, one per calendar month (jan-dec), with degree of membership from 0 to 1; (b) rainfall membership functions labeled A1-A13 over the range 0-5000.]
[Figure 5 flowchart: training data feeds "Train BPNN" and "Calculate Average-Based Interval Length"; the interval length drives "Generate FIS Rule Base and its MFs"; the trained BPNN drives "Generate Fuzzy Rules"; the result is the FIS model.]
TABLE II. MAE MEASURE OF VALIDATION PERIOD
Datasets S-ANN S-FIS M-ANN M-FIS AR ARIMA SARIMA
TS356010 450.99 447.56 560.44 496.35 747.37 747.01 538.99
TS381010 332.71 343.88 439.91 442.32 534.32 402.42 503.99
TS388002 736.70 725.39 811.99 639.29 912.64 856.88 714.74
TS407005 636.37 634.65 776.63 661.30 901.76 672.35 799.34
TABLE III. R MEASURE OF VALIDATION PERIOD
Datasets S-ANN S-FIS M-ANN M-FIS AR ARIMA SARIMA
TS356010 0.884 0.887 0.755 0.850 0.650 0.759 0.837
TS381010 0.719 0.709 0.606 0.668 0.464 0.733 0.575
TS388002 0.760 0.773 0.712 0.871 0.606 0.685 0.769
TS407005 0.768 0.770 0.633 0.736 0.594 0.755 0.681
In terms of MAE, among the three BJ models, the AR model provided the lowest accuracy in all datasets. ARIMA shows higher accuracy than SARIMA in two of the datasets. In stations TS356010 and TS407005, the proposed models show higher performance than all BJ models, especially the S-FIS model. In station TS381010, the ARIMA model is better than M-FIS but its performance is lower than that of S-FIS. In station TS388002, the SARIMA model showed better performance than S-FIS but lower than M-FIS. The average normalized MAE and average R measure over all datasets are shown in Fig 8. It can be seen from the figure that, overall, the proposed models performed better than the results generated by the AR, ARIMA and SARIMA models.
All the aforementioned results are from a quantitative point of view, in order to validate the experimental results. From a qualitative point of view, the proposed models are easier to interpret than other models because their decision mechanism is in the form of fuzzy rules, which is close to human reasoning [5]. Furthermore, when the models are in rule-base form, further enhancement and optimization by a human expert is easier. The advantage of the S-FIS model is that the time coefficient is expressed in terms of MFs, so it is possible to apply optimization methods to this feature. However, a large number of fuzzy rules is needed for the single model. On the other hand, the M-FIS model has a smaller number of fuzzy rules compared to S-FIS, but such a model does not use any time feature.
VII. CONCLUSION
Accurate rainfall forecasting is crucial for reservoir operation and flood prevention because it can provide an extension of lead-time for flow forecasting, and many time series prediction models have been applied. However, the prediction mechanisms of those models may be difficult for human analysts to interpret. This study proposed the Single Fuzzy Inference System and the Modular Fuzzy Inference System, which use the concept of the cooperative neuro-fuzzy technique, to predict monthly rainfall time series in the northeast region of Thailand. The reported models used the average-based interval method to determine the fuzzy intervals and a BPNN to extract the fuzzy rules. The prediction performance of the proposed models was compared with conventional Box-Jenkins models. The experimental results showed that the proposed models could be a good alternative. Furthermore, the prediction mechanism can be interpreted through human-understandable fuzzy rules.
Figure 7. The comparison of performance between the proposed models and conventional Box-Jenkins models: MAE (a) and R (b).
Figure 8. The average normalized MAE (a) and average R (b) of all datasets
REFERENCES
[1] V. K. Somvanshi, et al., "Modeling and prediction of rainfall using artificial neural network and ARIMA techniques," J. Ind. Geophys. Union, vol. 10, no. 2, pp. 141-151, 2006.
[2] Z. F. Toprak, et al., "Modeling monthly mean flow in a poorly gauged basin by fuzzy logic," Clean, vol. 37, no. 7, pp. 555-567, 2009.
[3] S. Kato and K. W. Wong, "Intelligent Automated Guided Vehicle with Reverse Strategy: A Comparison Study," in M. Köppen, N. K. Kasabov, G. G. Coghill (Eds.), Advances in Neuro-Information Processing, Lecture Notes in Computer Science, Springer-Verlag, Berlin Heidelberg, pp. 638-646, 2009.
[4] K. W. Wong and T. D. Gedeon, "Petrophysical Properties Prediction Using Self-generating Fuzzy Rules Inference System with Modified Alpha-cut Based Fuzzy Interpolation," in Proc. Seventh International Conference on Neural Information Processing ICONIP, pp. 1088-1092, November 2000, Korea.
[5] K. W. Wong, P. M. Wong, T. D. Gedeon, C. C. Fung, "Rainfall Prediction Model Using Soft Computing Technique," Soft Computing, vol. 7, no. 6, pp. 434-438, 2003.
[6] C. C. Fung, K. W. Wong, H. Eren, R. Charlebois, and H. Crocker, "Modular Artificial Neural Network for Prediction of Petrophysical Properties from Well Log Data," IEEE Transactions on Instrumentation & Measurement, vol. 46, no. 6, pp. 1259-1263, December 1997.
[7] P. Coulibaly and N. D. Evora, "Comparison of neural network methods for infilling missing daily weather records," Journal of Hydrology, vol. 341, pp. 27-41, 2007.
[8] C. L. Wu, K. W. Chau, and C. Fan, "Prediction of rainfall time series using modular artificial neural networks coupled with data-preprocessing techniques," Journal of Hydrology, vol. 389, pp. 146-167, 2010.
[9] W. Wang, K. Chau, C. Cheng and L. Qiu, "A comparison of performance of several artificial intelligence methods for forecasting monthly discharge time series," Journal of Hydrology, vol. 374, pp. 294-306, 2009.
[10] A. K. Lohani, N. K. Goel and K. K. S. Bhatia, "Comparative study of neural network, fuzzy logic and linear transfer function techniques in daily rainfall-runoff modeling under different input domains," Hydrological Processes, vol. 25, pp. 175-193, 2011.
[11] P. C. Nayak, et al., "A neuro-fuzzy computing technique for modeling hydrological time series," Journal of Hydrology, vol. 291, pp. 52-66, 2004.
[12] M. Z. Kermani and M. Teshnehlab, "Using adaptive neuro-fuzzy inference system for hydrological time series prediction," Applied Soft Computing, vol. 8, pp. 928-936, 2008.
[13] A. Jain and A. M. Kumar, "Hybrid neural network models for hydrologic time series forecasting," Applied Soft Computing, vol. 7, pp. 585-592, 2007.
[14] M. Khashei, M. Bijari, G. A. R. Ardali, "Improvement of Auto-Regressive Integrated Moving Average models using Fuzzy logic and Artificial Neural Networks (ANNs)," Neurocomputing, vol. 72, pp. 956-967, 2009.
[15] K. P. Sudheer, "A data-driven algorithm for constructing artificial neural network rainfall-runoff models," Hydrological Processes, vol. 16, pp. 1325-1330, 2002.
[16] C. L. Wu and K. W. Chau, "Data-driven models for monthly streamflow time series prediction," Engineering Applications of Artificial Intelligence, vol. 23, pp. 1350-1367, 2010.
[17] S. Marsland, Machine Learning: An Algorithmic Perspective, CRC Press, 2009.
[18] H. Raman and N. Sunilkumar, "Multivariate modeling of water resources time series using artificial neural network," Hydrological Sciences Journal, vol. 40, pp. 145-163, 1995.
[19] L. A. Zadeh, "Fuzzy Sets," Information and Control, vol. 8, pp. 338-353, 1965.
[20] E. H. Mamdani and S. Assilian, "An experiment in linguistic synthesis with fuzzy logic controller," International Journal of Man-Machine Studies, vol. 7, no. 1, pp. 1-13, 1975.
[21] M. Sugeno, Industrial Application of Fuzzy Control, North-Holland, Amsterdam, 1985.
[22] K. Huarng, "Effective lengths of intervals to improve forecasting in fuzzy time series," Fuzzy Sets and Systems, vol. 123, pp. 387-394, 2001.
[23] H. Liu and M. Wei, "An improved fuzzy forecasting method for seasonal time series," Expert Systems with Applications, vol. 37, pp. 6310-6318, 2010.
Interval-valued Intuitionistic Fuzzy ELECTRE Method
Ming-Che Wu
Graduate Institute of Business and Management, College of Management, Chang Gung University
Taoyuan 333, Taiwan
[email protected]
Ting-Yu Chen
Department of Industrial and Business Management, College of Management, Chang Gung University
Taoyuan 333, Taiwan
Abstract—In this study, the proposed method replaces the evaluation data from crisp values with vague values, i.e. interval-valued intuitionistic fuzzy (IVIF) data, and develops the IVIF Elimination and Choice Translating Reality (ELECTRE) method for solving multiple criteria decision making problems. The analyst can use IVIFS characteristics to classify different kinds of concordance (discordance) sets using the score and accuracy functions, the membership uncertainty degree and the hesitation uncertainty index, and then apply the proposed method to select the better alternatives.
Keywords-interval-valued intuitionistic fuzzy; ELECTRE; multiple criteria decision making; score function; accuracy function
I. INTRODUCTION
The Elimination and Choice Translating Reality (ELECTRE) method is one of the outranking relation methods; it was first introduced by Roy [3]. The threshold values in the classical ELECTRE method play an important role in filtering alternatives, and different threshold values produce different filtering results. The evaluation data in the classical ELECTRE method are almost always exact values, which affect the threshold values. Moreover, in real world cases, exact values can be difficult to determine precisely since analysts' judgments are often vague; for these reasons, some studies [4,5,8] developed the ELECTRE method with type-2 fuzzy data. Vahdani and Hadipour [4] presented a fuzzy ELECTRE method using the concept of the interval-valued fuzzy set (IVFS) with unequal criteria weights, in which the criteria values are considered as triangular interval-valued fuzzy numbers, which are also used to distinguish the concordance and discordance sets, in order to solve multi-criteria decision-making (MCDM) problems. Vahdani et al. [5] proposed an ELECTRE method using the concepts of interval weights and data to distinguish the concordance and discordance sets, then evaluated a set of alternatives and applied it to the problem of supplier selection. Wu and Chen [8] proposed an intuitionistic fuzzy (IF) ELECTRE method that uses the concept of the score and accuracy functions, i.e. calculates different combinations of the membership and non-membership functions and the hesitancy degree, to distinguish different kinds of concordance and discordance sets, and then uses the result to rank all alternatives for solving MCDM problems.
The intuitionistic fuzzy set (IFS) was first introduced by Atanassov [1]; the IFS generalizes the fuzzy set, which was introduced by Zadeh [11]. The interval-valued intuitionistic fuzzy set (IVIFS), which combines the IFS concept with the interval-valued fuzzy set concept, was introduced by Atanassov and Gargov [2]. Each IVIFS is characterized by a membership function and a non-membership function whose values are intervals rather than exact numbers, which makes IVIFSs a very useful means of describing the decision information in the process of decision making.
As the literature review shows, few studies have applied the ELECTRE method with IVIFS to real life cases. The main purpose of this paper is to further extend the ELECTRE method into a new method that solves MCDM problems in interval-valued intuitionistic fuzzy (IVIF) environments. The major difference between the current study and other available papers is the proposed method, whose logic is simple but which is suitable for the vagueness of real life situations. The proposed method also uses the score and accuracy functions, and adds two more factors, the membership uncertainty index and the hesitation uncertainty index, i.e. it applies the factors of the membership and non-membership functions and the hesitancy degree, to distinguish different kinds of concordance and discordance sets, and then finally to select the best alternatives. The remainder of this paper is organized as follows. Section 2 introduces the decision environment with IVIF data, the score and accuracy functions and some indices, and the construction of the IVIF decision matrix. Section 3 introduces the IVIF ELECTRE method and its algorithm. Section 4 illustrates the proposed method with a numerical example. Section 5 presents the discussion.
II. DECISION ENVIRONMENT WITH IVIF DATA
A. Interval-valued intuitionistic fuzzy sets
Based on the definition of IVIFS in the Atanassov and Gargov study [2], we have:

Definition 1: Let $X$ be a non-empty set of the universe, and $D[0,1]$ be the set of all closed subintervals of $[0,1]$. An IVIFS $\tilde{A}$ in $X$ is an expression defined by

$\tilde{A} = \{\langle x, M_{\tilde{A}}(x), N_{\tilde{A}}(x)\rangle \mid x \in X\} = \{\langle x, [M_{\tilde{A}}^{L}(x), M_{\tilde{A}}^{U}(x)], [N_{\tilde{A}}^{L}(x), N_{\tilde{A}}^{U}(x)]\rangle \mid x \in X\},$ (1)

where $M_{\tilde{A}}(x): X \to D[0,1]$ and $N_{\tilde{A}}(x): X \to D[0,1]$ denote the membership degree and the non-membership degree for any $x \in X$, respectively. $M_{\tilde{A}}(x)$ and $N_{\tilde{A}}(x)$ are closed intervals rather than real numbers; their lower and upper boundaries are denoted by $M_{\tilde{A}}^{L}(x)$, $M_{\tilde{A}}^{U}(x)$, $N_{\tilde{A}}^{L}(x)$ and $N_{\tilde{A}}^{U}(x)$, respectively, and $0 \le M_{\tilde{A}}^{U}(x) + N_{\tilde{A}}^{U}(x) \le 1$.

Definition 2: [2] For each element $x \in X$, the hesitancy degree of an intuitionistic fuzzy interval of $x$ in $\tilde{A}$ is defined as follows:

$\pi_{\tilde{A}}(x) = 1 - M_{\tilde{A}}(x) - N_{\tilde{A}}(x) = [1 - M_{\tilde{A}}^{U}(x) - N_{\tilde{A}}^{U}(x),\ 1 - M_{\tilde{A}}^{L}(x) - N_{\tilde{A}}^{L}(x)] = [\pi_{\tilde{A}}^{L}(x), \pi_{\tilde{A}}^{U}(x)].$ (2)

(This research is supported by the National Science Council, No. NSC 99-2410-H-182-022-MY3.)
Definition 3: The operations on IVIFSs [2,9] are defined as follows: for two $A, B \in \mathrm{IVIFS}(X)$,

(a) $A \subset B$ iff $M_{A}^{L}(x) \le M_{B}^{L}(x)$, $M_{A}^{U}(x) \le M_{B}^{U}(x)$ and $N_{A}^{L}(x) \ge N_{B}^{L}(x)$, $N_{A}^{U}(x) \ge N_{B}^{U}(x)$;

(b) $A = B$ iff $A \subset B$ and $B \subset A$;

(c) $d_1(A,B) = \frac{1}{4}\sum_{j=1}^{n}\big[\,|M_{A}^{L}(x_j) - M_{B}^{L}(x_j)| + |M_{A}^{U}(x_j) - M_{B}^{U}(x_j)| + |N_{A}^{L}(x_j) - N_{B}^{L}(x_j)| + |N_{A}^{U}(x_j) - N_{B}^{U}(x_j)|\,\big]$;

(d) $d_2(A,B) = \frac{1}{4n}\sum_{j=1}^{n}\big[\,|M_{A}^{L}(x_j) - M_{B}^{L}(x_j)| + |M_{A}^{U}(x_j) - M_{B}^{U}(x_j)| + |N_{A}^{L}(x_j) - N_{B}^{L}(x_j)| + |N_{A}^{U}(x_j) - N_{B}^{U}(x_j)|\,\big]$;

(e) $d_3(A,B) = \frac{1}{4}\sum_{j=1}^{n} w_j\big[\,|M_{A}^{L}(x_j) - M_{B}^{L}(x_j)| + |M_{A}^{U}(x_j) - M_{B}^{U}(x_j)| + |N_{A}^{L}(x_j) - N_{B}^{L}(x_j)| + |N_{A}^{U}(x_j) - N_{B}^{U}(x_j)|\,\big]$, (3)

where $w = (w_1, w_2, \ldots, w_n)$ is the weight vector of the elements $x_j\ (j = 1, 2, \ldots, n)$. The $d_1(A,B)$, $d_2(A,B)$ and $d_3(A,B)$ are the Hamming distance, the normalized Hamming distance, and the weighted Hamming distance, respectively.
B. The score and accuracy functions and some indices
A review of score and accuracy functions used to handle multi-criteria fuzzy decision-making problems follows. By Definition 1, an IVIFS $\tilde{A}$ in $X$ is defined as $\tilde{A} = \{\langle x, [M_{\tilde{A}}^{L}(x), M_{\tilde{A}}^{U}(x)], [N_{\tilde{A}}^{L}(x), N_{\tilde{A}}^{U}(x)]\rangle \mid x \in X\}$; for convenience, we call $\tilde{A}_n = \langle [M_{\tilde{A}_n}^{L}(x), M_{\tilde{A}_n}^{U}(x)], [N_{\tilde{A}_n}^{L}(x), N_{\tilde{A}_n}^{U}(x)]\rangle$ an interval-valued intuitionistic fuzzy number (IVIFN) [10], where $[M_{\tilde{A}_n}^{L}(x), M_{\tilde{A}_n}^{U}(x)] \subset [0,1]$, $[N_{\tilde{A}_n}^{L}(x), N_{\tilde{A}_n}^{U}(x)] \subset [0,1]$, and $M_{\tilde{A}_n}^{U}(x) + N_{\tilde{A}_n}^{U}(x) \le 1$.

Xu [10] defined a score function $s$ to measure the degree of suitability of an IVIFN $\tilde{A}_n$ as follows: $s(\tilde{A}_n) = \frac{1}{2}\big(M_{\tilde{A}_n}^{L}(x) - N_{\tilde{A}_n}^{L}(x) + M_{\tilde{A}_n}^{U}(x) - N_{\tilde{A}_n}^{U}(x)\big)$, where $s(\tilde{A}_n) \in [-1,1]$. The larger the value of $s(\tilde{A}_n)$, the higher the degree of the IVIFN $\tilde{A}_n$.

Wei and Wang [7] defined an accuracy function $h$ to evaluate the accuracy degree of an IVIFN $\tilde{A}_n$ as follows: $h(\tilde{A}_n) = \frac{1}{2}\big(M_{\tilde{A}_n}^{L}(x) + M_{\tilde{A}_n}^{U}(x) + N_{\tilde{A}_n}^{L}(x) + N_{\tilde{A}_n}^{U}(x)\big)$, where $h(\tilde{A}_n) \in [0,1]$. The larger the value of $h(\tilde{A}_n)$, the higher the degree of the IVIFN $\tilde{A}_n$.

The membership uncertainty index $T$ was proposed [6] to evaluate the membership uncertainty degree of an IVIFN $\tilde{A}_n$ as follows: $T(\tilde{A}_n) = M_{\tilde{A}_n}^{U}(x) + N_{\tilde{A}_n}^{L}(x) - M_{\tilde{A}_n}^{L}(x) - N_{\tilde{A}_n}^{U}(x)$, where $-1 \le T(\tilde{A}_n) \le 1$. The larger the value of $T(\tilde{A}_n)$, the smaller the IVIFN $\tilde{A}_n$.

The hesitation uncertainty index $G$ of an IVIFN $\tilde{A}_n$ is defined as follows: $G(\tilde{A}_n) = M_{\tilde{A}_n}^{U}(x) + N_{\tilde{A}_n}^{U}(x) - M_{\tilde{A}_n}^{L}(x) - N_{\tilde{A}_n}^{L}(x)$, and the larger the value of $G(\tilde{A}_n)$, the smaller the IVIFN $\tilde{A}_n$.
In this study, we classify different types of concordance and discordance sets in the proposed method using the concepts of the score function, the accuracy function, the membership uncertainty degree and the hesitation uncertainty index.
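For concreteness, the four indices above can be written directly in code; the following is a minimal sketch that represents an IVIFN as the quadruple $(M^{L}, M^{U}, N^{L}, N^{U})$:

```python
def score(ml, mu, nl, nu):
    """Score function s, in [-1, 1]."""
    return 0.5 * (ml - nl + mu - nu)

def accuracy(ml, mu, nl, nu):
    """Accuracy function h, in [0, 1]."""
    return 0.5 * (ml + mu + nl + nu)

def membership_uncertainty(ml, mu, nl, nu):
    """Membership uncertainty index T, in [-1, 1]."""
    return mu + nl - ml - nu

def hesitation_uncertainty(ml, mu, nl, nu):
    """Hesitation uncertainty index G."""
    return mu + nu - ml - nl
```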
C. Construction of the IVIF decision matrix
We extend the canonical matrix format to an IVIF decision matrix $\tilde{M}$. An IVIFS $\tilde{A}_i$ of the $i$th alternative on $X$ is given by

$\tilde{A}_i = \{\langle x_j, \tilde{X}_{ij}\rangle \mid x_j \in X\}$,

where $\tilde{X}_{ij} = ([M_{ij}^{L}, M_{ij}^{U}],[N_{ij}^{L}, N_{ij}^{U}])$.

The $\tilde{X}_{ij}$ indicate the membership and non-membership intervals of the $i$th alternative with respect to the $j$th criterion. The IVIF decision matrix $\tilde{M}$ can be expressed as follows:
$\tilde{M} = \begin{bmatrix} \tilde{X}_{11} & \cdots & \tilde{X}_{1n} \\ \vdots & \ddots & \vdots \\ \tilde{X}_{m1} & \cdots & \tilde{X}_{mn} \end{bmatrix} = \begin{bmatrix} ([M_{11}^{L}, M_{11}^{U}],[N_{11}^{L}, N_{11}^{U}]) & \cdots & ([M_{1n}^{L}, M_{1n}^{U}],[N_{1n}^{L}, N_{1n}^{U}]) \\ \vdots & \ddots & \vdots \\ ([M_{m1}^{L}, M_{m1}^{U}],[N_{m1}^{L}, N_{m1}^{U}]) & \cdots & ([M_{mn}^{L}, M_{mn}^{U}],[N_{mn}^{L}, N_{mn}^{U}]) \end{bmatrix},$ (4)

with rows corresponding to the alternatives $A_1, \ldots, A_m$.
An IVIFS $W$, a set of grades of importance, in $X$ is defined as follows:

$W = \{\langle x_j, w_j(x_j)\rangle \mid x_j \in X\}$, (5)

where $0 \le w_j(x_j) \le 1$, $\sum_{j=1}^{n} w_j(x_j) = 1$, and $w_j(x_j)$ is the degree of importance assigned to each criterion.
III. ELECTRE METHOD WITH IVIF DATA
The proposed method utilizes the concept of the score and accuracy functions to distinguish the concordance set and the discordance set from the evaluation information with IVIFS data, then constructs the concordance, discordance, and concordance (discordance, aggregate) dominance matrices, respectively, and finally selects the best alternative from the aggregate dominance matrix. In this section, the IVIF ELECTRE method and its algorithm are introduced and used throughout this paper.
A. The IVIF ELECTRE method
The concordance and discordance sets with IVIF data and their definitions are as follows.

Definition 4: The concordance set $C_{kl}$ is defined as

$C_{kl}^{1} = \{j \mid M_{kj}^{L} - N_{kj}^{L} + M_{kj}^{U} - N_{kj}^{U} > M_{lj}^{L} - N_{lj}^{L} + M_{lj}^{U} - N_{lj}^{U}\}$, (6)

$C_{kl}^{2} = \{j \mid M_{kj}^{L} + M_{kj}^{U} + N_{kj}^{L} + N_{kj}^{U} > M_{lj}^{L} + M_{lj}^{U} + N_{lj}^{L} + N_{lj}^{U}\}$ when $s(\tilde{X}_{kj}) = s(\tilde{X}_{lj})$, (7)

$C_{kl}^{3} = \{j \mid M_{kj}^{U} + N_{kj}^{L} - M_{kj}^{L} - N_{kj}^{U} < M_{lj}^{U} + N_{lj}^{L} - M_{lj}^{L} - N_{lj}^{U}\}$ when $h(\tilde{X}_{kj}) = h(\tilde{X}_{lj})$, (8)

$C_{kl}^{4} = \{j \mid M_{kj}^{U} + N_{kj}^{U} - M_{kj}^{L} - N_{kj}^{L} \le M_{lj}^{U} + N_{lj}^{U} - M_{lj}^{L} - N_{lj}^{L}\}$ when $T(\tilde{X}_{kj}) = T(\tilde{X}_{lj})$, (9)

where $C_{kl} = \{C_{kl}^{1}, C_{kl}^{2}, C_{kl}^{3}, C_{kl}^{4}\}$, $J = \{j \mid j = 1, 2, \ldots, n\}$, and $\tilde{X}_{kj}, \tilde{X}_{lj}$ stand for the evaluations of alternatives $k$ and $l$ on criterion $j$, respectively. The $s(\tilde{X}_{kj})$, $h(\tilde{X}_{kj})$ and $T(\tilde{X}_{kj})$ are the score function, the accuracy function and the membership uncertainty index, respectively, as defined in Section II.B.
Definition 5: The discordance set $D_{kl}$ is defined as

$D_{kl}^{1} = \{j \mid M_{kj}^{L} - N_{kj}^{L} + M_{kj}^{U} - N_{kj}^{U} < M_{lj}^{L} - N_{lj}^{L} + M_{lj}^{U} - N_{lj}^{U}\}$, (10)

$D_{kl}^{2} = \{j \mid M_{kj}^{L} + M_{kj}^{U} + N_{kj}^{L} + N_{kj}^{U} < M_{lj}^{L} + M_{lj}^{U} + N_{lj}^{L} + N_{lj}^{U}\}$ when $s(\tilde{X}_{kj}) = s(\tilde{X}_{lj})$, (11)

$D_{kl}^{3} = \{j \mid M_{kj}^{U} + N_{kj}^{L} - M_{kj}^{L} - N_{kj}^{U} > M_{lj}^{U} + N_{lj}^{L} - M_{lj}^{L} - N_{lj}^{U}\}$ when $h(\tilde{X}_{kj}) = h(\tilde{X}_{lj})$, (12)

$D_{kl}^{4} = \{j \mid M_{kj}^{U} + N_{kj}^{U} - M_{kj}^{L} - N_{kj}^{L} > M_{lj}^{U} + N_{lj}^{U} - M_{lj}^{L} - N_{lj}^{L}\}$ when $T(\tilde{X}_{kj}) = T(\tilde{X}_{lj})$, (13)

where $D_{kl} = \{D_{kl}^{1}, D_{kl}^{2}, D_{kl}^{3}, D_{kl}^{4}\}$.
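The following is a sketch of how a criterion $j$ might be assigned to a concordance or discordance subset under one reading of Definitions 4 and 5, reusing the index functions sketched in Section II.B; the function name and the nesting of the tie-breaking conditions are our interpretation, not the authors' code:

```python
def classify_pair(kj, lj):
    """Assign criterion j to a concordance ("C") or discordance ("D")
    subset for the ordered pair (k, l); kj and lj are IVIFNs given as
    (ml, mu, nl, nu) tuples. Returns (kind, subset index)."""
    if score(*kj) != score(*lj):
        return ("C", 1) if score(*kj) > score(*lj) else ("D", 1)
    if accuracy(*kj) != accuracy(*lj):                      # scores tied
        return ("C", 2) if accuracy(*kj) > accuracy(*lj) else ("D", 2)
    t_k, t_l = membership_uncertainty(*kj), membership_uncertainty(*lj)
    if t_k != t_l:                                          # accuracies tied
        return ("C", 3) if t_k < t_l else ("D", 3)
    g_k, g_l = hesitation_uncertainty(*kj), hesitation_uncertainty(*lj)
    return ("C", 4) if g_k <= g_l else ("D", 4)             # T tied
```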
The relative value of the concordance set of the IVIF ELECTRE method is measured through the concordance index. The concordance index $g_{kl}$ between $A_k$ and $A_l$ is defined as:

$g_{kl} = \omega_C \times \sum_{j \in C_{kl}} w_j(x_j)$, (14)

where $\omega_C$ is the weight of the concordance set, and $w_j(x_j)$ is defined in (5).

The concordance matrix $G$ is defined as follows:

$G = \begin{bmatrix} - & g_{12} & \cdots & g_{1m} \\ g_{21} & - & \cdots & g_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ g_{m1} & g_{m2} & \cdots & - \end{bmatrix}$, (15)

where the maximum value of $g_{kl}$ is denoted by $g^{*}$.
The discordance index measures the degree to which the evaluation of a certain $A_k$ is worse than the evaluation of the competing $A_l$. It is defined as follows:
$h_{kl} = \dfrac{\max_{j \in D_{kl}} \omega_D \times d(\tilde{X}_{kj}, \tilde{X}_{lj})}{\max_{j \in J} d(\tilde{X}_{kj}, \tilde{X}_{lj})}$, (16)

where $d(\tilde{X}_{kj}, \tilde{X}_{lj})$ is defined in (3), and $\omega_D$ is the weight of the discordance set in the IVIF ELECTRE method.
The discordance matrix $H$ is defined as follows:

$H = \begin{bmatrix} - & h_{12} & \cdots & h_{1m} \\ h_{21} & - & \cdots & h_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ h_{m1} & h_{m2} & \cdots & - \end{bmatrix}$, (17)

where the maximum value of $h_{kl}$ is denoted by $h^{*}$, which is more discordant than the other cases.
The concordance dominance matrix $K$ is defined as follows:

$K = \begin{bmatrix} - & k_{12} & \cdots & k_{1m} \\ k_{21} & - & \cdots & k_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ k_{m1} & k_{m2} & \cdots & - \end{bmatrix}$, (18)

where $k_{kl} = g^{*} - g_{kl}$, and a higher value of $k_{kl}$ indicates that $A_k$ is less favorable than $A_l$.
The discordance dominance matrix $L$ is defined as follows:

$L = \begin{bmatrix} - & l_{12} & \cdots & l_{1m} \\ l_{21} & - & \cdots & l_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ l_{m1} & l_{m2} & \cdots & - \end{bmatrix}$, (19)

where $l_{kl} = h^{*} - h_{kl}$, and a higher value of $l_{kl}$ indicates that $A_k$ is preferred over $A_l$.
The aggregate dominance matrix $R$ is defined as follows:

$R = \begin{bmatrix} - & r_{12} & \cdots & r_{1m} \\ r_{21} & - & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & - \end{bmatrix}$, (20)

where

$r_{kl} = \dfrac{l_{kl}}{k_{kl} + l_{kl}}$, (21)

$k_{kl}$ and $l_{kl}$ are defined in (18) and (19), and $r_{kl}$ is in the range from 0 to 1. A higher value of $r_{kl}$ indicates that the alternative $A_k$ is more concordant than the alternative $A_l$; thus, it is a better alternative.
In the best alternative selection process,

$T_k = \dfrac{1}{m-1}\sum_{l=1,\, l \ne k}^{m} r_{kl}$, $\quad k = 1, 2, \ldots, m$, (22)

and $T_k$ is the final value of the evaluation. All alternatives can be ranked according to the value of $T_k$. The best alternative $A^{*}$ with $T^{*}$ can be generated and defined as follows:

$T(A^{*}) = \max_{k} T_k$, (23)

where $T^{*}$ is the final value of the best alternative and $A^{*}$ is the best alternative.
B. Algorithm
The algorithm and decision process of the IVIF ELECTRE method can be summarized in the following four steps; Step 3 calculates the concordance and discordance matrices, constructs the concordance dominance and discordance dominance matrices, and determines the aggregate dominance matrix. Figure 1 illustrates a conceptual model of the proposed method.
Step 1: Construct the decision matrix, using (4) and (5).
Step 2: Identify the concordance and discordance sets, using (6)-(13).
Step 3: Calculate the matrices, using (14)-(21).
Step 4: Choose the best alternative, using (22) and (23).

Figure 1. The process of the IVIF ELECTRE method algorithm.
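A compact sketch of Steps 3 and 4, reusing `classify_pair` from above; it follows equations (14)-(22) under the stated reading of the definitions and is an illustration only, so it may not reproduce the paper's numerical tables exactly:

```python
import numpy as np

def ivif_electre(X, w, wc=1.0, wd=1.0):
    """Steps 3-4: X[k][j] is the IVIFN (ml, mu, nl, nu) of alternative k
    on criterion j; w holds the criterion weights. Returns the final
    evaluation values T_k of equation (22)."""
    m, n = len(X), len(w)
    def d(a, b):                       # Hamming distance, equation (3)(c)
        return 0.25 * sum(abs(p - q) for p, q in zip(a, b))
    G, H = np.zeros((m, m)), np.zeros((m, m))
    for k in range(m):
        for l in range(m):
            if k == l:
                continue
            C = [j for j in range(n) if classify_pair(X[k][j], X[l][j])[0] == "C"]
            D = [j for j in range(n) if j not in C]
            G[k, l] = wc * sum(w[j] for j in C)                  # (14)
            d_all = max(d(X[k][j], X[l][j]) for j in range(n))
            if D and d_all > 0:                                  # (16)
                H[k, l] = wd * max(d(X[k][j], X[l][j]) for j in D) / d_all
    K = G.max() - G                    # concordance dominance    (18)
    L = H.max() - H                    # discordance dominance    (19)
    with np.errstate(invalid="ignore", divide="ignore"):
        R = np.where(K + L > 0, L / (K + L), 0.0)    # aggregate (20)-(21)
    np.fill_diagonal(R, 0.0)
    return R.sum(axis=1) / (m - 1)     # T_k, equation (22)
```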
IV. NUMERICAL EXAMPLE
In this section, we present an example of a decision-making problem requiring the selection of the best alternative. Suppose a potential banker intends to invest money in one of four possible alternatives (companies), named A1, A2, A3, and A4. The criteria for a company are $x_1$ (risk analysis), $x_2$ (growth analysis), and $x_3$ (environmental impact analysis). The subjective importance levels of the different criteria $W$ are given by the decision makers:
$W = [w_1, w_2, w_3] = [0.35, 0.25, 0.4]$. The decision makers also give the relative weights as follows: $W' = [\omega_C, \omega_D] = [1, 1]$. The IVIF decision matrix $\tilde{M}$ is given with cardinal information:

$\tilde{M} = \begin{bmatrix} ([0.4,0.5],[0.3,0.4]) & ([0.4,0.6],[0.2,0.4]) & ([0.1,0.3],[0.5,0.6]) \\ ([0.4,0.6],[0.2,0.3]) & ([0.6,0.7],[0.2,0.3]) & ([0.4,0.7],[0.1,0.2]) \\ ([0.3,0.6],[0.3,0.4]) & ([0.5,0.6],[0.3,0.4]) & ([0.5,0.6],[0.1,0.3]) \\ ([0.7,0.8],[0.1,0.2]) & ([0.6,0.7],[0.1,0.3]) & ([0.3,0.4],[0.1,0.2]) \end{bmatrix}$

(Step 1 is now complete.)
Applying Step 2, the concordance and discordance sets are identified using the result of Step 1.
The concordance set, applying (6)-(9), is:

$C_{kl} = \begin{bmatrix} - & \{1,3\} & \{1,3\} & \{1,3\} \\ \{1,2,3\} & - & \{1,2,3\} & \{2,3\} \\ \{2,3\} & \{1,2,3\} & - & \{2,3\} \\ \{1,2,3\} & \{1,2,3\} & \{1,2,3\} & - \end{bmatrix}$.
For example, $C_{24}$, the entry in the 2nd row and 4th column of the concordance set, is $\{2,3\}$.
The discordance set, obtained by applying (10)-(13), is as follows:

$D_{kl} = \begin{bmatrix} - & \{2\} & \{2\} & \{2\} \\ - & - & - & \{1\} \\ \{1\} & - & - & \{1\} \\ - & - & - & - \end{bmatrix}$.
Applying Step 3, the concordance matrix is calculated:

$G = \begin{bmatrix} - & 0.8 & 0.8 & 0.8 \\ 1 & - & 1 & 0.5 \\ 0.5 & 1 & - & 0.5 \\ 1 & 1 & 1 & - \end{bmatrix}$.

For example, $g_{21} = \omega_C \times w_1 + \omega_C \times w_2 + \omega_C \times w_3 = 1 \times 0.35 + 1 \times 0.25 + 1 \times 0.40 = 1.0$.
The discordance matrix is calculated:

$H = \begin{bmatrix} - & 0.267 & 0.143 & 0.357 \\ 0 & - & 0 & 1 \\ 0.143 & 0 & - & 1 \\ 0 & 0 & 0 & - \end{bmatrix}$.

For example:

$h_{12} = \dfrac{\max_{j \in D_{12}} \omega_D \times d(\tilde{X}_{1j}, \tilde{X}_{2j})}{\max_{j \in J} d(\tilde{X}_{1j}, \tilde{X}_{2j})} = \dfrac{0.100}{0.375} = 0.267$,

where

$d(\tilde{X}_{13}, \tilde{X}_{23}) = \frac{1}{4}\big(|0.1-0.4| + |0.3-0.7| + |0.5-0.1| + |0.6-0.2|\big) = 0.375$,

and

$\omega_D \times d(\tilde{X}_{12}, \tilde{X}_{22}) = 1 \times \frac{1}{4}\big(|0.4-0.6| + |0.6-0.7| + |0.2-0.2| + |0.4-0.3|\big) = 0.100$.
The concordance dominance matrix is constructed as follows:

$K = \begin{bmatrix} - & 0.2 & 0.2 & 0.2 \\ 0 & - & 0 & 0.5 \\ 0.5 & 0 & - & 0.5 \\ 0 & 0 & 0 & - \end{bmatrix}$.
The discordance dominance matrix is constructed as follows:

$L = \begin{bmatrix} - & 0.733 & 0.857 & 0.643 \\ 1 & - & 1 & 0 \\ 0.857 & 1 & - & 0 \\ 1 & 1 & 1 & - \end{bmatrix}$.
The aggregate dominance matrix is determined:

$R = \begin{bmatrix} - & 0.786 & 0.811 & 0.763 \\ 1 & - & 1 & 0 \\ 0.632 & 1 & - & 0 \\ 1 & 1 & 1 & - \end{bmatrix}$.
Applying Step 4, the best alternative is chosen:

$T_1 = 0.786$, $T_2 = 0.667$, $T_3 = 0.544$, $T_4 = 1.000$.

The optimal ranking order of the alternatives is given by $A_4 \succ A_1 \succ A_2 \succ A_3$. The best alternative is $A_4$.
V. DISCUSSION
In this study, we provide a new method, the IVIF ELECTRE method, for solving MCDM problems with IVIF information. A decision maker can use the proposed method to gain valuable information from the evaluation data provided by users, who do not usually provide preference data. Decision makers utilize IVIF data instead of single values in the evaluation process of the ELECTRE method and use those data to classify different kinds of concordance and discordance sets
to fit a real decision environment. This new approach integrates the concept of the outranking relationship of the ELECTRE method. In the proposed method, we can classify different types of concordance and discordance sets using the concepts of the score function, the accuracy function, the membership uncertainty degree and the hesitation uncertainty index, and use the concordance and discordance sets to construct the concordance and discordance matrices. Furthermore, decision makers can choose the best alternative using the concepts of positive and negative ideal points. We used the proposed method to rank all alternatives and determine the best alternative. This paper is a first step in using the IVIF ELECTRE method to solve MCDM problems. In a future study, we will apply the proposed method to predict consumer decision making using a questionnaire in an empirical study of the service provider selection issue.
REFERENCES
[1] K. T. Atanassov, “Intuitionistic fuzzy sets,” Fuzzy sets and Systems, vol. 20, pp. 87-96, 1986.
[2] K. Atanassov and G. Gargov, “Interval valued intuitionistic fuzzy sets,” Fuzzy sets and Systems, vol. 31, pp. 343-349, 1989.
[3] B. Roy, “Classement et choix en présence de points de vue multiples (la méthode ELECTRE),” RIRO, vol. 8, pp. 57-75, 1968.
[4] B. Vahdani and H. Hadipour, “Extension of the ELECTRE method based on interval-valued fuzzy sets,” Soft Computing, vol. 15, pp. 569-579, 2011.
[5] B. Vahdani, A. H. K. Jabbari, V. Roshanaei, and M. Zandieh, “Extension of the ELECTRE method for decision-making problems with interval weights and data,” International Journal of Advanced Manufacturing Technology, vol. 50, pp. 793-800, 2010.
[6] Z. Wang, K. W. Li, and W. Wang, “An approach to multiattribute decision making with interval-valued intuitionistic fuzzy assessments and incomplete weights,” Information Sciences, vol. 179, pp. 3026-3040, 2009.
[7] G. W. Wei, and X. R. Wang, “Some geometric aggregation operators on interval-valued intuitionistic fuzzy sets and their application to group decision making,” International conference on computational intelligence and security, pp. 495-499, December 2007.
[8] M.-C. Wu and T.-Y. Chen, “The ELECTRE multicriteria analysis approach based on Atanassov's intuitionistic fuzzy sets,” Expert Systems with Applications, vol. 38, pp. 12318-12327, 2011.
[9] Z. S. Xu, "On similarity measures of interval-valued intuitionistic fuzzy sets and their application to pattern recognitions," Journal of Southeast University, vol. 23, pp. 139-143, 2007a.
[10] Z. S. Xu, "Methods for aggregating interval-valued intuitionistic fuzzy information and their application to decision making," Control and Decision, vol. 22, pp. 215-219, 2007b.
[11] L. A. Zadeh, “Fuzzy Sets,” Information and Control, vol. 8, pp. 338-353, 1965.
Optimization of Interval Type-2 Fuzzy Logic Systems Using a Hybrid Heuristic Algorithm Evaluated by Classification
Adisak Sangsongfa and Phayung Meesad
Department of Information Technology, Faculty of Information Technology
King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand
Email: adisak [email protected], [email protected]
Abstract— In this research, an optimization of the rule base and the parameters of interval type-2 fuzzy set generation by a hybrid heuristic algorithm using particle swarm and genetic algorithms is proposed for classification applications. For the Iris data set, 90 records were selected randomly for training, and the rest, 60 records, were used for testing. For the Wisconsin Breast Cancer data set, the 16 records with missing attribute values were deleted and 500 records were randomly selected for training; the rest, 183 records, were used for testing. The proposed method was able to minimize the rule base, minimize the number of linguistic variables and produce an accurate classification of 95% with the first dataset and 98.71% with the second dataset.
Keywords-Interval Type-2 Fuzzy Logic Systems; GA; PSO;
I. INTRODUCTION
In 1965, Lotfi A. Zadeh, professor of computer science at the University of California, Berkeley, developed a fuzzy logic system which has been widely used in many areas such as decision making, classification, control, prediction, optimization and so on. However, this fuzzy logic system comes from the original system that is called the type-1 fuzzy set. Sometimes it cannot solve certain problems, especially problems that are very large, complex and/or uncertain. Therefore, in 1975 Zadeh developed and formulated a type-2 fuzzy set to meet the needs of data sets which are complex and uncertain. Thus, the type-2 fuzzy set has been used widely and continuously in many cases [1].
Recently, there has been growing interest in the interval type-2 fuzzy set, which is a special case of the type-2 fuzzy set. This is because Mendel and John [2] reformulated all set operations in both the vertical-slice and wavy-slice manner. They concluded that general type-2 fuzzy set operations are too complex to understand and implement, but operations on the interval type-2 fuzzy set involve only simple interval arithmetic, which means computation costs are reduced. The interval type-2 fuzzy system consists of four parts: fuzzification, fuzzy rule base, inference engine, and defuzzification. Moreover, the fuzzy rule base and interval type-2 fuzzy sets are complicated to construct when determining the exact membership functions and a complete fuzzy rule base. So, the
optimization of the interval type-2 fuzzy set and fuzzy rule base must be used to estimate the values that would otherwise require an expert. Many researchers have proposed and introduced optimization of the interval type-2 fuzzy set and fuzzy rule base: Zhao [3] proposed an adaptive interval type-2 fuzzy set using gradient descent algorithms to optimize the inference engine and fuzzy rule base, and Hidalgo [4] proposed optimization of interval type-2 fuzzy sets applied to modular neural networks using a genetic algorithm. Moreover, many researchers apply the interval type-2 fuzzy logic system to uncertain datasets, and the creation of an optimized interval type-2 fuzzy logic system yields the most accurate outputs. Many optimization techniques have been proposed for building interval type-2 fuzzy systems: some traditional techniques are based on mathematics and some are based on heuristic algorithms. Heuristic optimization techniques are often difficult and time consuming, but improving the heuristic algorithms can provide good performance, as with hybrid heuristic algorithms [5]. Moreover, the hybrid heuristic is a much younger algorithm candidate than the genetic algorithm and particle swarm optimization in the domain of meta-heuristic-based optimization.
In this paper, a new algorithm called the hybrid heuristic algorithm, which combines a genetic algorithm with particle swarm optimization, is proposed, along with an optimization of the interval type-2 fuzzy set and fuzzy rule base using the proposed algorithm. The algorithm is used to optimize a model by minimizing the number of fuzzy rules, minimizing the number of linguistic variables, and maximizing the accuracy of the output. The framework and the corresponding algorithms are then tested and evaluated to prove the concept by applying them to the Iris dataset [6] and the Wisconsin Breast Cancer dataset as examples of classification [7].
II. RELATED WORK
A. Particle Swarm Optimization (PSO)
The PSO initializes a swarm of particles at random, with each particle deciding its new velocity and position based on its own past optimal position $P_i$ and the past optimal position of the swarm $P_g$. Let $x_i = (x_{i1}, x_{i2}, \dots, x_{in})$ represent the current position of particle $i$, $v_i = (v_{i1}, v_{i2}, \dots, v_{in})$ its current velocity, and $P_i = (p_{i1}, p_{i2}, \dots, p_{in})$ its past optimal position; the particle then uses the following equations to adjust its velocity and position:
$v_{i,(t+1)} = w v_{i,(t)} + c_1 r_1 (P_i - x_{i,(t)}) + c_2 r_2 (P_g - x_{i,(t)})$    (1)
$x_{i,(t+1)} = x_{i,(t)} + v_{i,(t+1)}$    (2)
where $c_1$ and $c_2$ are acceleration constants in the range [0, 2], $r_1$ and $r_2$ are random numbers in [0, 1], and $w$ is the inertia weight, used to maintain the momentum of the particle. The first term on the right-hand side of (1) is the particle's velocity at time $t$. The second term represents "self learning" by the particle based on its own history. The last term reflects "social learning" through information sharing among individual particles in the swarm. All three parts contribute to the particle's search ability in the analyzed space, which simulates the swarm behavior mathematically [8].
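To make the update rule concrete, the following Python sketch applies equations (1) and (2) to a toy minimization problem; the inertia weight, acceleration constants, swarm dimensions, and sphere objective are illustrative assumptions, not settings taken from this paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def sphere(x):
        # Toy objective to minimize; stands in for the real fitness function.
        return float(np.sum(x ** 2))

    n_particles, dim, iters = 5, 2, 50         # assumed sizes
    w, c1, c2 = 0.7, 1.5, 1.5                  # inertia and acceleration constants

    x = rng.uniform(-5, 5, (n_particles, dim))   # current positions x_i
    v = np.zeros((n_particles, dim))             # current velocities v_i
    p_best = x.copy()                            # past optimal positions P_i
    p_best_f = np.array([sphere(xi) for xi in x])
    g_best = p_best[p_best_f.argmin()].copy()    # swarm's past optimum P_g

    for _ in range(iters):
        r1 = rng.random((n_particles, 1))
        r2 = rng.random((n_particles, 1))
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # eq. (1)
        x = x + v                                                    # eq. (2)
        f = np.array([sphere(xi) for xi in x])
        improved = f < p_best_f
        p_best[improved], p_best_f[improved] = x[improved], f[improved]
        g_best = p_best[p_best_f.argmin()].copy()

    print(g_best, sphere(g_best))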
B. Genetic Algorithm (GA)
A GA generally has four components: 1) a population of individuals, where each individual represents a possible solution; 2) a fitness function, an evaluation function by which we can tell whether an individual is a good solution or not; 3) a selection function, which decides how to pick good individuals from the current population for creating the next generation; and 4) genetic operators such as crossover and mutation, which explore new regions of the search space while keeping some of the current information at the same time.
GAs are based on genetics, especially on Darwin's theory of the survival of the fittest. This states that the weaker members of a species tend to die away, leaving the stronger and fitter, and the surviving members create offspring and ensure the continuing survival of the species. This concept, together with the concept of natural selection, is used in information technology to enhance the performance of computers [9].
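The four components can be illustrated with a minimal, generic GA on bit strings; the toy fitness (count of ones), population size, and mutation rate below are assumptions for the sketch, not this paper's configuration.

    import random

    random.seed(1)

    def fitness(ind):                  # 2) evaluation function (toy: count of 1s)
        return sum(ind)

    def select(pop):                   # 3) tournament selection of a parent
        a, b = random.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    def crossover(p1, p2):             # 4) one-point crossover operator
        cut = random.randrange(1, len(p1))
        return p1[:cut] + p2[cut:]

    def mutate(ind, rate=0.05):        # 4) bit-flip mutation operator
        return [1 - g if random.random() < rate else g for g in ind]

    # 1) a population of candidate solutions
    pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
    for _ in range(40):
        pop = [mutate(crossover(select(pop), select(pop))) for _ in pop]

    print(max(fitness(ind) for ind in pop))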
C. Interval Type-2 Fuzzy Set
Interval type-2 fuzzy sets are particularly useful when it is difficult to determine the exact membership function, or in modeling the diverse opinions of different individuals. The membership function, by which an interval type-2 fuzzy inference system approximates expert knowledge and judgment under uncertain conditions, can be constructed from surveys or by using optimization algorithms. The basic framework consists of four parts: fuzzification, fuzzy rule base, fuzzy inference engine, and defuzzification, as shown in Fig. 1.
Fig. 1. Interval Type-2 Fuzzy System
We can describe the interval type-2 fuzzy logic system as follows: the crisp inputs are first fuzzified into input interval type-2 fuzzy sets. The fuzzifier creates the membership functions, which are characterized by the type of membership function, the linguistic variables, and the fuzzy rule base. There are many types of membership function, such as the triangular, trapezoidal, Gaussian, smooth, and Z-membership functions. The fuzzifier sends the interval type-2 fuzzy sets into the inference engine and the rule base to produce output type-2 fuzzy sets. The rules of an interval type-2 fuzzy logic system remain the same as in a type-1 fuzzy logic system, but the antecedents and/or consequents are represented by interval type-2 fuzzy sets. A finite number of fuzzy rules, represented in if-then form, are then integrated into the fuzzy rule base. A standard fuzzy rule base is shown below.
$R^1$: If $x_1$ is $\tilde{A}_1^1$ and $x_2$ is $\tilde{A}_2^1$, ..., $x_n$ is $\tilde{A}_n^1$, then $y$ is $B^1$.
$R^2$: If $x_1$ is $\tilde{A}_1^2$ and $x_2$ is $\tilde{A}_2^2$, ..., $x_n$ is $\tilde{A}_n^2$, then $y$ is $B^2$.
...
$R^M$: If $x_1$ is $\tilde{A}_1^M$ and $x_2$ is $\tilde{A}_2^M$, ..., $x_n$ is $\tilde{A}_n^M$, then $y$ is $B^M$.

where $x_1, \dots, x_n$ are state variables and $y$ is the control variable. The linguistic values $\tilde{A}_1^j, \dots, \tilde{A}_n^j$ and $B^j$ ($j = 1, 2, \dots, M$) are defined respectively on the universes $U_1, \dots, U_n$ and $V$. In fuzzification, the crisp input variable $x_i$ is mapped into the interval type-2
fuzzy set $\tilde{A}_{x_i}$, $i = 1, 2, \dots, n$. The inference engine combines
all the fired rules and gives a non-linear mapping from the input interval type-2 fuzzy sets to the output interval type-2 fuzzy sets. The multiple antecedents in each rule are connected by the Meet operation, the membership grades in the input sets are combined with those in the output sets by the extended sup-star composition, and multiple rules are combined by the Join operation. The type-2 fuzzy outputs of the inference engine are then processed by the type-reducer, which combines the output sets and performs a centroid calculation that leads to type-1 fuzzy sets called the type-reduced sets. After the type-reduction process, the type-reduced sets are defuzzified (by taking the average of the type-reduced set) to obtain crisp outputs [3].
In the interval type-2 fuzzy logic system design, we assumed the Z-membership function for the first membership function, the triangular membership function for the second membership function, and the smooth membership function for the last membership function, with center-of-sets type reduction and defuzzification using the centroid of the type-reduced set.
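As a rough illustration of these ideas, the sketch below forms an interval type-2 membership grade as a lower/upper pair bounding the footprint of uncertainty, and defuzzifies by averaging a type-reduced interval. The triangular upper MF, the scaled lower MF, and the placeholder type-reduced endpoints are simplifying assumptions; the actual design uses Z-, triangular, and smooth MFs with center-of-sets type reduction.

    import numpy as np

    def tri(x, a, b, c):
        # Type-1 triangular membership function with support [a, c], peak at b.
        return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

    def interval_t2(x, a, b, c, scale=0.8):
        # Lower and upper membership grades bounding the footprint of
        # uncertainty; the scaled lower MF is an assumed shape.
        upper = tri(x, a, b, c)
        return scale * upper, upper

    x = np.linspace(0.0, 10.0, 101)
    lmf, umf = interval_t2(x, 2.0, 5.0, 8.0)

    # Type reduction produces an interval [yl, yr]; defuzzification takes its
    # average, matching "the average of the type-reduced set". The endpoints
    # below are placeholders for the centroid calculation.
    yl, yr = 4.2, 5.6
    print((yl + yr) / 2.0)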
III. THE PROPOSED FRAMEWORK
In our framework, we present the new hybrid heuristic algorithm developed to optimize the interval type-2 fuzzy logic system, using the Iris and breast cancer datasets. The new algorithm optimizes the interval type-2 fuzzy sets and fuzzy rule base using hybrid heuristic searches, a sequential combination of GA and PSO. The proposed algorithm is used to optimize the number of linguistic variables, the parameters of the membership functions, and the rule base, under the constraints of a minimum number of linguistic variables, a minimum rule base, and maximum accuracy. The framework is shown in Fig. 2.
From the framework, we can describe the proposed method for optimizing the interval type-2 fuzzy set and fuzzy rule base using hybrid heuristic searches in the four steps given below.
Step 1: Determine the structure of the interval type-2 fuzzy system framework.
Step 2: Determine the fuzzy rule base using clustering.
Step 3: Determine the universes of the input and output variables, their types of membership functions, and the linguistic parameters of the membership functions.
Step 4: Determine and optimize the fuzzy inference engine using the hybrid heuristic algorithm, which is a combination of GA and PSO.
Fig. 2. Framework of the Optimization of the Interval Type-2 Fuzzy System Using Hybrid Heuristic Algorithms
1) Determine the structure of the interval type-2 fuzzy system framework

In Fig. 2, the framework shows the structure of the optimization of the interval type-2 fuzzy sets and rule base by the hybrid heuristic algorithms. The hybrid heuristic algorithm uses sequential hybridization. The GA is used for the first local optimization of the interval type-2 fuzzy sets, which consist of the interval type-2 membership functions, the interval type-2 linguistic parameters (LMF, UMF), and the rule base. The PSO is then used for the final optimization, which obtains the best-result don't-care rules.
2) Determine the fuzzy rule base using clustering

We used the K-means clustering algorithm [10] to group the dataset to determine a feasible fuzzy rule base. The standard K-means objective is as follows:
$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \| x_i^{(j)} - c_j \|^2$    (3)

where $k$ is the number of clusters, $\| x_i^{(j)} - c_j \|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, and $J$ is an indicator of the distance of the $n$ data points from their respective cluster centres.
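A minimal K-means implementation that reports the objective J of eq. (3) is sketched below; the random stand-in data and K = 7 (the value later used for the Iris data in Section IV) are illustrative.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        # Plain K-means: alternate assignment and centre updates, reducing J.
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
            labels = d2.argmin(axis=1)          # nearest centre per point
            new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
            if np.allclose(new, centres):
                break
            centres = new
        labels = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        J = sum(((X[labels == j] - centres[j]) ** 2).sum() for j in range(k))
        return labels, centres, J

    X = np.random.default_rng(1).normal(size=(150, 4))  # stand-in for Iris features
    labels, centres, J = kmeans(X, k=7)
    print(J)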
3) Determine the universes of the input and output variables and their types of membership functions

For the universes of the input and output variables and their primary membership functions, the Z-membership function, triangular membership function, and smooth membership function were used, as shown in Fig. 3. In Fig. 3, the four attributes of the Iris membership functions are displayed and graded as attribute1=2, attribute2=2, attribute3=5, and attribute4=5.
TABLE I. PREDEFINED MEMBERSHIP FUNCTIONS FOR FIVE LINGUISTIC VARIABLES.

Linguistic Index | Linguistic Term
0 | Don't Care
1 | Very Low
2 | Low
3 | Medium
4 | High
5 | Very High
The definitions of the linguistic labels and the numbers of linguistic variables are given in Table I.
Fig. 3. The Example of Interval type-2 Membership Functions
4) Determine and optimize the fuzzy inference engine using the hybrid heuristic algorithms
First, the fuzzy rule-based system is encoded into a genotype, or chromosome. Each chromosome represents a fuzzy system composed of the number of linguistic variables in each dimension, the membership function parameters of each linguistic variable, and the fuzzy rules, which include the don't-care rules from the PSO. A chromosome (chrom) consists of 4 parts, or genes:
$\mathrm{chrom} = [\overbrace{IM,\ IL,\ R}^{\text{by GA}},\ \overbrace{DcR}^{\text{by PSO}}]$    (4)

where $IL = [IL_1, IL_2, \dots, IL_n]$ is the set of numbers of interval linguistic variables; $IM = [im_{1,1}, im_{1,2}, \dots, im_{n,IL_n}]$ is the set of interval membership function parameters of the interval linguistic variables; and $R = [R_1, R_2, \dots, R_{IL_1 \times IL_2 \times \dots \times IL_n}]$ is the fuzzy rule part, where each $R_i$ is an integer indexing the linguistic variable of each dimension. $DcR = [Ra_{111}, Ra_{112}, \dots, Ra_{lmk}]$, with $L_1 \times L_2 \times \dots \times L_n$ entries, is the don't-care part, where $Ra_{lmk}$ is an integer indexing the don't-care rule of each rule. The length of a chromosome can vary depending on the fuzzy partition created by the cross sections of the linguistic variables from each dimension. The fitness function is then
$\mathrm{Fit} = \mathrm{Acc}(\mathrm{chrom}_i)$

where $\mathrm{chrom}_i \in \{\mathrm{chrom}_1, \mathrm{chrom}_2, \dots, \mathrm{chrom}_n\}$ is the set of chromosomes. The accuracy (Acc) is

$\mathrm{Acc} = \dfrac{\text{Number of Correct Classifications}}{\text{Total Number of Training Data}}$
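The sequential GA-then-PSO hybridization driven by this fitness can be sketched as follows. The acc_fitness function here is a toy surrogate: the real fitness decodes the chromosome of eq. (4) into an interval type-2 classifier and measures Acc on the training records; the dimensions, rates, and the 0.5 thresholding of the don't-care part are assumptions for the sketch.

    import numpy as np

    rng = np.random.default_rng(2)

    def acc_fitness(mf_rule_part, dont_care_mask):
        # Toy surrogate for Acc = correct / total on the training data.
        return -np.sum((mf_rule_part - 0.5) ** 2) - 0.1 * np.sum(dont_care_mask)

    def ga_stage(dim=10, pop=30, gens=40, step=0.1):
        # "by GA" part of eq. (4): evolve the IM/IL/R genes (here a real vector).
        P = rng.random((pop, dim))
        no_dc = np.zeros(1)                               # no don't-care rules yet
        for _ in range(gens):
            f = np.array([acc_fitness(p, no_dc) for p in P])
            parents = P[np.argsort(f)[-pop // 2:]]        # select the fittest half
            picks = rng.integers(0, len(parents), (pop, dim))
            kids = parents[picks, np.arange(dim)]         # uniform crossover
            P = kids + step * rng.normal(size=kids.shape) # mutation
        f = np.array([acc_fitness(p, no_dc) for p in P])
        return P[f.argmax()]

    def pso_stage(best_genes, n_rules=8, particles=5, iters=50):
        # "by PSO" part of eq. (4): search the DcR genes, thresholding the
        # continuous particle positions at 0.5 to obtain a binary mask.
        x = rng.random((particles, n_rules)); v = np.zeros_like(x)
        pb = x.copy()
        pbf = np.array([acc_fitness(best_genes, xi > 0.5) for xi in x])
        gb = pb[pbf.argmax()].copy()
        for _ in range(iters):
            r1, r2 = rng.random(x.shape), rng.random(x.shape)
            v = 0.7 * v + 1.5 * r1 * (pb - x) + 1.5 * r2 * (gb - x)
            x = np.clip(x + v, 0.0, 1.0)
            f = np.array([acc_fitness(best_genes, xi > 0.5) for xi in x])
            better = f > pbf
            pb[better], pbf[better] = x[better], f[better]
            gb = pb[pbf.argmax()].copy()
        return gb > 0.5

    genes = ga_stage()            # stage 1: GA optimizes MF parameters / rules
    dont_care = pso_stage(genes)  # stage 2: PSO optimizes the don't-care rules
    print(genes.round(2), dont_care)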
IV. THE EXPERIMENTAL EVALUATION SETUP
To evaluate the proposed Hybrid Heuristic Type-2 (HHType-2) algorithm for building interval type-2 fuzzy systems, two benchmark classification datasets from the UCI machine learning repository were used: Fisher's Iris data and the Wisconsin Breast Cancer data.
A. Datasets
The Iris dataset has 4 variables and 3 classes; 90 records were selected randomly for training, and the rest, 60 records, were used for testing. The Wisconsin Breast Cancer dataset has 699 records; the 16 records with missing attribute values were deleted. Each record consists of 9 features plus the class attribute; 500 records were selected randomly for training, and the rest, 183 records, were used for testing.
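Assuming simple index-based partitioning, the splits just described could be produced as follows; the seed and the helper function are illustrative, not part of the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def split(n_total, n_train):
        # Random partition of record indices into training and testing sets.
        idx = rng.permutation(n_total)
        return idx[:n_train], idx[n_train:]

    iris_train, iris_test = split(150, 90)   # 90 training / 60 testing records
    wbc_train, wbc_test = split(683, 500)    # 683 = 699 - 16 deleted; 500 / 183
    print(len(iris_test), len(wbc_test))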
Fig. 4 shows the scatter plot of the Iris dataset, and Fig. 5 illustrates the scatter plot of the Iris dataset clustered using the K-means algorithm (K=7). Fig. 6 shows the scatter plot of the Wisconsin Breast Cancer dataset, and Fig. 7 shows the scatter plot of the Wisconsin Breast Cancer dataset clustered using the K-means algorithm (K=4).
Fig. 4. The scatter plot of the Iris Dataset (* represents Setosa, × represents Versicolor, and ? represents Virginica)
B. Experimental Results
The experiments were performed on a MacBook Pro with an Intel Core 2 Duo CPU at 2.66 GHz and 4.00 GB RAM,
Fig. 5. The scatter plot of the Iris Dataset with Clustering (* represents Setosa, × represents Versicolor, and ? represents Virginica)
Fig. 6. The scatter plot of the Wisconsin Breast Cancer Dataset (* represents Class 2, and ? represents Class 4)
running on Mac OS. All algorithms were implemented in MATLAB.
For the first dataset (Iris), the algorithm ran 20 times with an average execution time of 662.2635 s. The simulation population was 100 individuals. The fittest individuals were then used by the PSO to optimize the "don't care" rules. In the PSO, each of the individuals was simulated with 50 swarms and 5 particles. The PSO completed 20 runs with an execution time of 429.7597 s.
For the second dataset, the Wisconsin Breast Cancer (WBC) data, the algorithm ran 20 times with an average execution time of 3679.2428 s. The simulation population was 100 individuals. The individuals were then used by the PSO to optimize the "don't care" rules. The individuals of the PSO were simulated with 50 swarms
Fig. 7. The scatter plot of the Wisconsin Breast Cancer Dataset with Clustering (* represents Class 2, and ? represents Class 4)
TABLE II. CONFUSION MATRIX FOR THE IRIS CLASSIFICATION DATA.

Dataset | Membership | Rule | Class
Iris | [2 2 5 5] | 0 0 1 1 | 1
 | | 0 1 3 3 | 2
 | | 2 1 5 5 | 3
Total Acc: 95%
WBC | [2 2 3 2 2 3 2 2 2] | 0 0 0 0 0 0 0 0 0 | 2
 | | 1 0 1 0 2 1 2 2 2 | 4
 | | 2 2 3 2 2 1 2 2 2 | 4
Total Acc: 98.71%
TABLE III. CONFUSION MATRIX FOR THE IRIS CLASSIFICATION DATA.

Attribute | Setosa | Versicolor | Virginica | Total Testing
Setosa | 20 | 0 | 0 | 20
Versicolor | 0 | 19 | 1 | 20
Virginica | 0 | 2 | 18 | 20
Total | 20 | 21 | 19 | 60
and 5 particles. The PSO then completed 20 runs with an execution time of 2387.5543 s. The optimal fuzzy system, optimized using the hybrid heuristic algorithm, achieved the accuracy performance shown in Tables II and III. An example of a chromosome from the WBC dataset is shown in Fig. 8.
Fig. 8. Chromosome of the Interval Type-2 Fuzzy Logic System for the WBC dataset. (The figure shows the membership part [2 2 3 2 2 3 2 2 2], followed by the linguistic parameters of the membership functions, e.g., 1.9782, 3.4612, 7.8462, 9.1217, ..., and the rule base part with don't-care entries, e.g., 111111111 1, 222123221 4, 223222221 4, 000000000 2, 101021222 4, 223221222 4.)
To demonstrate the performance of the proposed framework, we compared its accuracy with other well-known classifiers applied to the same problem. Table IV presents the accuracy performance of the classifiers for these algorithms. From Table IV, it can be seen that the accuracy of the proposed hybrid heuristic algorithm is among the best achieved.
In the same way, we compared the confidence obtained from experiments using the algorithm on the same problem with other algorithms. Table V shows the accuracy performance of the classifiers from these algorithms and the confidence on the Wisconsin Breast Cancer dataset using the Hybrid Heuristic Type-2 (HHType-2) algorithm, whose results were competitive with or better than those of any other algorithm. Although GA and PSO are not new, when the two come together they make a powerful new algorithm (Hybrid Heuristic Type-2) for optimization which is quite efficient in terms of performance.
Fig. 9. Bar chart comparing HHType-2 with the other algorithms for the Iris data.
Fig. 10. Bar chart comparing HHType-2 with the other algorithms for the WBC data.
TABLE IV. COMPARISONS OF THE HHTYPE-2 AND THE OTHER ALGORITHMS, FOR THE IRIS DATA.

Algorithm | Setosa | Versicolor | Virginica | Acc
1. VSM [11] | 100% | 93.33% | 94% | 95.78%
2. NT-growth [11] | 100% | 93.5% | 91.13% | 94.87%
3. Dasarathy [11] | 100% | 98% | 86% | 94.67%
4. C4 [11] | 100% | 91.07% | 90.61% | 93.87%
5. IRSS [12] | 100% | 92% | 96% | 96%
6. PSOCCAS [13] | 100% | 96% | 98% | 98%
7. HHTypeI [5] | 100% | 97% | 98% | 98%
8. HHType II | 100% | 95% | 90% | 95%

TABLE V. COMPARISONS OF THE HHTYPE-2 AND THE OTHER ALGORITHMS, FOR THE WBC DATA.

Algorithm | Accuracy
1. SANFIS [14] | 96.07%
2. FUZZY [15] | 96.71%
3. ILFN [15] | 97.23%
4. ILFN-FUZZY [15] | 98.13%
5. IGANFIS [16] | 98.24%
6. HHType II | 98.71%
V. CONCLUSION
In this paper, a methodology based on a hybrid heuristic algorithm, a combination of PSO and GA approaches, is proposed to build interval type-2 fuzzy sets for classification. The algorithm is used to optimize a model by minimizing the number of fuzzy rules, minimizing the number of linguistic variables, and maximizing the accuracy of the fuzzy rule base. The performance of the proposed hybrid heuristic algorithm was demonstrated by applying it to benchmark problems and comparing it with several other algorithms.
For future research, the application of the proposed algorithm to other problems, such as network intrusion detection and network forensics, and the use of larger datasets than in this research, such as breast cancer diagnosis and traffic network datasets, will be covered. An adaptive on-line inference engine for the interval type-2 fuzzy set will then be selected in future research on breast cancer diagnosis for medical training and testing.
REFERENCES
[1] J. M. Mendel, “Why we need type-2 fuzzy logic systems,” May 2001, http://www.informit.com/articles/article.asp.
[2] J. M. Mendel and R. I. B. John, “Type-2 fuzzy sets made simple,” IEEE Trans. Fuzzy Systems, vol. 10, pp. 117–127, April 2002.
[3] L. Zhao, “Adaptive interval type-2 fuzzy control based on gradient descent algorithm,” in Intelligent Control and Information Processing (ICICIP), vol. 2, 2011, pp. 899–904.
[4] D. Hidalgo, P. Melin, O. Castillo, and G. Licea, “Optimization of interval type-2 fuzzy systems based on the level of uncertainty, applied to response integration in modular neural networks with multimodal biometry,” in The 2010 International Joint Conference, 2010, pp. 1–6.
[5] A. Sangsongfa and P. Meesad, “Fuzzy rule base generation by a hybrid heuristic algorithm and application for classification,” in National Conference on Computing and Information Technology, vol. 1, 2010, pp. 14–19.
[6] Iris Dataset, http://www.ailab.si/orange/doc/datasets/Iris.htm.
[7] Breast Cancer Dataset, http://www.breastcancer.org.
[8] J. Zeng and L. Wang, “A generalized model of particle swarm optimization,” Pattern Recognition and Artificial Intelligence, vol. 18, pp. 685–688, 2005.
[9] H. Ishibuchi, T. Nakashima, and T. Murata, “Three-objective genetics-based machine learning for linguistic rule extraction,” Information Sciences, vol. 136, pp. 109–133, 2001.
[10] R. Salman, V. Kecman, Q. Li, R. Strack, and E. Test, “Fast k-means algorithm clustering,” Transactions on Machine Learning and Data Mining, vol. 3, p. 16, 2011.
[11] T. P. Hong and J. B. Chen, “Processing individual fuzzy attributes for fuzzy rule induction,” in Fuzzy Sets and Systems, vol. 10, 2000, pp. 127–140.
[12] A. Chatterjee and A. Rakshit, “Influential rule search scheme (IRSS), a new fuzzy pattern classifier,” in IEEE Transactions on Knowledge and Data Engineering, vol. 16, 2004, pp. 881–893.
[13] L. Hongfei and P. Erxu, “A particle swarm optimization-aided fuzzy cloud classifier applied for plant numerical taxonomy based on attribute similarity,” in Expert Systems with Applications, vol. 36, 2009, pp. 9388–9397.
[14] H. Song, S. Lee, D. Kim, and G. Park, “New methodology of computer aided diagnostic system on breast cancer,” in Second International Symposium on Neural Networks, 2005, pp. 780–789.
[15] P. Meesad and G. Yen, “Combined numerical and linguistic knowledge representation and its application to medical diagnosis,” in Component and Systems Diagnostics, Prognostics, and Health Management II, 2003.
[16] M. Ashraf, L. Kim, and X. Huang, “Information gain and adaptive neuro-fuzzy inference system for breast cancer diagnoses,” in Computer Sciences and Convergence Information Technology (ICCIT), 2010, pp. 911–915.
Neural Network Modeling for an Intelligent
Recommendation System Supporting SRM for
Universities in Thailand
Kanokwan Kongsakun
School of Information Technology
Murdoch University, South Street,
Murdoch,WA 6150 AUSTRALIA
Jesada Kajornrit
School of Information Technology
Murdoch University, South Street,
Murdoch,WA 6150 AUSTRALIA
Chun Che Fung
School of Information Technology
Murdoch University, South Street,
Murdoch,WA 6150 AUSTRALIA
Abstract— In order to support the academic management
processes, many universities in Thailand have developed
innovative information systems and services with an aim to
enhance efficiency and student relationship. Some of these
initiatives are in the form of a Student Recommendation System (SRM). However, the success or appropriateness of such a system depends on the expertise and knowledge of the counselor.
This paper describes the development of a proposed Intelligent
Recommendation System (IRS) framework and experimental
results. The proposed system is based on an investigation of the
possible correlations between the students’ historic records and
final results. Neural Network techniques have been used with an
aim to find the structures and relationships within the data, and
the final Grade Point Averages of freshmen in a number of
courses are the subjects of interest. This information will help the
counselors in recommending the appropriate courses for students
thereby increasing their chances of success.
Keywords-Intelligent Recommendation System; Student
Relationship Management; data mining; neural network
I. INTRODUCTION
The growing complexity of technology in educational
institutions creates opportunities for substantial improvements
for management and information systems. Many designs and
techniques have allowed for better results in analysis and
recommendations. With this in mind, universities in Thailand
are working hard to improve the quality of education and
many institutes are focusing on how to increase the student
retention rates and the number of completions. In addition, a
university’s performance is also increasingly being used to
measure its ranking and reputation [1]. One form of service
which is normally provided by all universities is Student
Counseling. Archer and Cooper [2] stated that the provision of
counseling services is an important factor contributing to
students’ academic success. In addition, Urata and Takano [3]
stated that the essence of student counseling should include
advice on career guidance, identification of learning strategies, and handling of inter-personal relations, along with self-
understanding of the mind and body. It can be said that a key
aspect of student services is to provide course guidance as this
will assist the students in their course selection and future
university experience.
On the other hand, many students have chosen particular
courses of study just because of perceived job opportunities,
peer pressure and parental advice. Issues may arise if a student
is not interested in the course, or if the course or career is not
suitably matched with the student's capability [4]. In
Thailand’s tertiary education sector, teaching staff may have
insufficient time to counsel the students due to high workload
and there are inadequate tools to support them. Hence, it is
desirable that some forms of intelligent recommendation tools
could be developed to assist staff and students in the
enrolment process. This forms the motivation of this research.
One of the initiatives designed to help students and staff is
the Student Recommendation System. Such system could be
used to provide course advice and counseling for freshmen in
order to achieve a better match between the student’s ability
and success in course completion. In the case of Thai
universities, this service is normally provided by counselors or
advisors who have many years of experience within the
organisation. However, with an increasing number of students and an expanded number of choices, the workload on the advisors
is becoming too much to handle. It becomes apparent that
some forms of intelligent system will be useful in assisting the
advisors.
In this paper, a proposed intelligent recommendation
system is reported. This paper is structured as follows. Section
2 describes literature reviews of Student Relationship
Management (SRM) in universities and issues faced by Thai
university students. Section 3 describes Neural Network
techniques which are used in the reported Intelligent
Recommendation System, and Section 4 focuses on the
proposed framework, which presents the main idea and the
research methodology. Section 5 describes the experiments
and the results. This paper then concludes with discussions on
the work to be undertaken and future development.
II. LITERATURE REVIEW
A. Student Relationship Management in Universities

According to the literature, the problem of low student
retention in higher education could be attributed to low student satisfaction, student transfers and drop-outs [5]. This issue leads to a reduction in the number of enrolments and revenue, and increasing cost of replacement. On the other hand, it was found that the quality and convenience of support services are other factors that influence students to change educational institutes [6]. Consequently, the concept of SRM has been implemented in various universities so as to assist the improvement of the quality of learning processes and student activities.
Definitions of SRM have been adopted from the
established practices of Customer Relationship Management
(CRM) which focuses on customers and are aimed to establish
effective competition and new strategies in order to improve
the performance of a firm [7]. In the case of SRM, the context
is within the education sector. Although there has been much research focused on CRM, few studies have concentrated on SRM. In addition, the technological supports are inadequate to sustain SRM in universities. For instance, an SRM system's architecture has been proposed to support
the SRM concepts and techniques that assist the university’s
Business Intelligent System [8]. This project provided a tool to
aid the tertiary students in their decision-making process. The
SRM strategy also provided the institution with SRM
practices, including the planned activities to be developed for
the students, as well other relevant participants. However, the
study verified that the technological support to the SRM
concepts and practices were insufficient at the time of writing
[8].
In the context of educational institutes, the students may
be considered as having the role of “customers”, and the objective
of Student Relationship Management is to increase their
satisfaction and loyalty for the benefits of the institute. SRM
may be defined under a similar view as CRM and aims at
developing and maintaining a close relationship between the
institute and the students by supporting the management
processes and monitoring the students’ academic activities
and behaviors. Piedade and Santos (2008) explained that
SRM involves the identification of performance indicators
and behavioral patterns that characterize the students and the
different situations under which the students are supervised.
In addition, the concept of SRM is “understood as a process
based on the student acquired knowledge, whose main
purpose is to keep a close and effective students institution
relationship through the closely monitoring of their academic
activities along their academic path” [9]. Hence, it can be said
that SRM can be utilised as an important means to support
and enhance a student’s satisfaction. Since understanding the
needs of the students is essential for their satisfaction, it is
necessary to prepare strategies in both teaching and related
services to support Student Relationship Management. This
paper therefore proposes an innovative information system to
assist students in universities in order to support the SRM
concept.
B. Issues Faced By Thai University Students
Another study at Dhurakij Pundit University, Thailand
looked at the relationship between learning behaviour and low
academic achievement (below 2.0 GPA) of the first year
students in the regular four-year undergraduate degree
programs. The results indicated that students who had low
academic achievement had a moderate score in every aspect of
learning behaviour. On average, the students scored highest in
class attendance, followed by the attempt to spend more time
on study after obtaining low examination grades. Some of the
problems and difficulties that mostly affected students’ low
academic achievement were the students’ lack of
understanding of the subject and lack of motivation and
enthusiasm to learn [10].
Moreover, some other studies had focused on issues
relating to students’ backgrounds prior to their enrolment,
which may have effects on the progress of the students’
studies. For example, a research group from the Department of
Education [11], Thailand, studied the backgrounds of 289,007 grade-twelve students that may have affected their academic achievements. The study showed that the factors
which could have effects on the academic achievement of the
students may be attributed to personal information such as
gender and interests, parental factors such as their jobs and
qualifications, and information on the schools such as their
sizes, types and ranking.
Therefore, in the recruitment and enrolment of students in
higher education, it is necessary to meet the student’s needs
and to match their capability with the course of their choice.
The students’ backgrounds may also have a part to play in the
matching process. Understanding the student’s needs will
implicitly enhance the student’s learning experience and
increase their chances of success, and thereby reduce the
wastage of resources due to dropouts, and change of programs.
These factors are therefore taken into consideration in the
proposed recommendation system in this study.
III. NEURAL NETWORK BASED INTELLIGENT
RECOMMENDATION SYSTEM TO SUPPORT SRM
In terms of education systems, Ackerman and Schibrowsky
[12] have applied the concept of business relationships and
proposed the business relationship marketing framework. The
framework provided a different view on retention strategies
and an economic justification on the need for implementing
retention programs. The prominent result is the improvement
of graduation rates by 65% by simply retaining one additional
student out of every ten. The researchers added that this framework is appropriate both to issues of place and to the quality of services. Although some problems could not be
solved directly, it is recognized that Information and
Communication Technologies (ICT) can be used and
contribute towards maintaining a stronger relationship with
students in the educational systems [8].
In this study, a new intelligent Recommendation System is proposed to support university students in Thailand. This system is a hybrid system based on Neural Network
and Data Mining techniques; however, this paper only focuses
on the aspect of Neural Network (NN) techniques.
With respect to the Neural Network algorithm used in this study, the feed-forward neural network, also called the Multilayer Perceptron, was used. In the training of a multilayer perceptron, the backpropagation (BP) learning algorithm was used to perform the supervised learning process [13]. In the feed-forward calculations used in this experiment, the activations of the input neurons are set to the values of the encoded input fields. The activation of each neuron in a hidden or output layer is calculated as follows:
$b_i = \sigma\left(\sum_j w_{ij} P_j\right)$    (1)
where $b_i$ is the activation of neuron $i$, $j$ ranges over the set of neurons in the preceding layer, $w_{ij}$ is the weight of the connection between neuron $i$ and neuron $j$, $P_j$ is the output of neuron $j$, and $\sigma(m)$ is the sigmoid or logistic transfer function, given by
$\sigma(m) = 1 / (1 + e^{-m})$    (2)
The implementation of backpropagation learning updates the network weights and biases in the direction in which the system performance improves most rapidly, i.e., along the negative gradient of the error.
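A direct transcription of equations (1) and (2) into a one-hidden-layer feed-forward pass is sketched below; the layer sizes and random weights are placeholders, and the bias terms mentioned above are omitted for brevity.

    import numpy as np

    def sigmoid(m):
        # Logistic transfer function of eq. (2): sigma(m) = 1 / (1 + e^-m).
        return 1.0 / (1.0 + np.exp(-m))

    def forward(P, W1, W2):
        # Eq. (1) applied layer by layer: each neuron's activation is
        # b_i = sigmoid(sum_j w_ij * P_j); biases are omitted in this sketch.
        hidden = sigmoid(P @ W1)
        return sigmoid(hidden @ W2)

    rng = np.random.default_rng(0)
    P = rng.random((4, 9))         # 4 records, 9 encoded input fields (assumed)
    W1 = rng.normal(size=(9, 6))   # input-to-hidden weights w_ij (sizes assumed)
    W2 = rng.normal(size=(6, 1))   # hidden-to-output weights
    print(forward(P, W1, W2))      # predicted outputs, e.g., scaled GPA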
This study used a feed-forward network architecture and the
Mean Absolute Error (MAE) to define the accuracy of the
models.
IV. THE PROPOSED FRAMEWORK
Several solutions have been proposed to support SRM in the
universities; however, not many systems in Thailand have
focused on recommendation systems using historic records
from graduated students. A recommendation system could
apply statistical, artificial intelligence and data mining
techniques by making appropriate recommendation for the
students. Figure 1 illustrates the proposed recommendation
system architecture. This proposal aims to analyse student
background such as the high school where the student studied
previously, school results and student performance in terms of
GPA’s from the university’s database. The result can then be
used to match the profiles of the new students. In this way, the
recommendation system is designed to provide suggestions on
the most appropriate courses and subjects for the students,
based on historical records from the university’s database.
A. Data-Preprocessing
Initially, data on the student records are collected from the
university enterprise database. The data is then re-formatted in
the stage of data transformation in order to prepare for
processing by subsequent algorithms. In the data cleaning
process, the parameters used in the data analysis are identified
and the missing data are either eliminated or filled with null
values [15]. Preparation of analytical variables is done in the
data transformation step or is completed in a separate
process. Integrity of the data is checked by validating the data
against the legitimate range of values and data types. Finally,
the data is separated randomly into training and testing data
for processing by the Neural Network.
B. Data Analysis
It can be seen in Fig. 1 that the Association rules, Decision
Tree, Support Vector Machines and Neural Network are used
to train the input data; however, this paper focuses on Neural
Network, which uses a feed-forward architecture to classify the data and approximate the underlying function. The backpropagation network is a multilayer network that uses the log-sigmoid transfer function (logsig). In the training process, a backpropagation training function for feed-forward networks is used to predict the output based on the input data.
Figure 1. Proposed Hybrid Recommendation System Framework to Support Student Relationship Management. (The figure shows the pipeline: student historic data; 1. data pre-processing, comprising data transformation and data cleaning; data analysis by neural network, decision tree, SVM, and association rules with result comparison to select the best result; 3. intelligent prediction models, comprising A. course recommendation with course ranking and likelihood of overall GPA, B. likelihood of GPA for years 1 to 4, and C. subject recommendation, with sub-models of A/B/C per department, e.g., Computer Business, Communication Arts, Information Technology; 4. model validation; and the Electronic Intelligent Recommendation System (e-IRS) serving the student.)
C. Intelligent Recommendation Model
The Integrated Recommendation Model is composed of three
parts: Course Recommendation for freshmen, Likelihood of
GPA for students (years 1 to 4), and Subject Recommendation
for students (years 1 to 4), respectively.
Part A focuses on the course recommendation for freshmen
and it is composed of two sections, which are the Overall GPA
Recommendation, and the Course Ranking Recommendation
respectively. In the Overall GPA section, the output of this recommendation is an expected overall GPA. The outputs of the Course Ranking Recommendation use the ranking of results in the first section to indicate five appropriate courses. The results of both parts can be used as suggestions to the
freshmen during the enrolment process. Some example results
from Part A are shown in this paper, and the input data of
these 2 sections in the model are shown in Table 1.
Another part of the framework focuses on Likelihood of
GPA for students in each year. After the students selected the
course to study and completed the enrolment process, the
Likelihood of GPA for year 1 results can be used to monitor
the performance of this group of students. The input data of
this process is the same as the one shown in Table 1, with the
addition of the GPA scores from the previous year. These are
used as the extended features in the input to the neural
network model. The result of the Recommendation is the GPA
score of the year. In the same way, the system may be used to
perform a Likelihood of GPA for Year 2 based on results from
the first year. Similar approach can be adopted for the
Likelihood of Year 3 and 4 results. Some example results of
this part are shown in this paper.
The final part of the recommendation model focuses on the
subject recommendation for students in each year. This can also help the counselor or the student's supervisor recommend the subjects a student should enroll in each semester.
To address the issue of imbalanced number of students in
each course, the prediction model shown in Fig. 1 can be
duplicated for different departments. The models’ computation
is entirely data-driven and not based on subjective opinion,
hence, the prediction models are unbiased and they will be
used as an integral part of an Electronic Intelligent
Recommendation System.
D. Electronic Intelligent Recommendation System (e-IRS)
It is planned that the new intelligent Recommendation
Models will form an integral part of an online system for
private universities in Thailand. The developed system will be
evaluated by the university management and feedback from
experienced counselors will be sought. The proposed system
will also be available for use by new students who will access
the online-application in their course selection during the
enrolment process. As for the recommendation of the Year 2
and subsequent years’ results, this could be used by the
counselors, staff, student’s supervisor and university
management to provide supports for students who are likely to
need help with their studies. This information will enable the
university to better focus on the utilisation of their resources.
In particular, this could be used to improve the retention rate
by providing additional supports to the group of students who
may be at risk.
V. EXPERIMENT DESIGN
The data preparation and selection process involves a
dataset of 3,550 student records from five academic years. All
the student data have included records from the first year to
graduation. Due to privacy issues, the data in this study do not
indicate any personal information, and no student is identified
in the research. The student data has been randomised, and all
private information has been removed. Example data from the
dataset is shown below.
TABLE I. EXAMPLE OF TRAINING SAMPLE DATASET

UniID | Pre-GPA | Type of school | No. of Awards | Talent and Interest | Channels | Admission Round | Guardian Occupation | Gender | Uni GPA (target)
4800 | 2.35 | C | 0.2 | 1 | Poster | 1 | Police | F | 3.75
4801 | 3.55 | B | 0.3 | 4 | Brochure | 2 | Governor | M | 3.05
5001 | 2.55 | A | 0.9 | 3 | Friend | 5 | Teacher | F | 2.09
5002 | 2.75 | G | 0.4 | 5 | Family | 4 | Nurse | F | 2.58
5003 | 3.00 | F | 0.2 | 7 | Newspaper | 3 | Teacher | M | 2.77
5101 | 2.00 | E | 0.1 | 2 | Others | 1 | Farmer | F | 2.11
Table 1 shows the randomized student ID, GPA from previous
study, the type of school, awards received, talent and interest,
channels to know the university, admission round, Guardian
Occupation, Gender and Overall GPA from university. Table 2
provides the definitions for the variables used in the above
table.
TABLE II. DEFINITIONS OF VARIABLES

No. | Variable | Definition
1. | UniID | Randomized student ID, not included in the clustering process; used only to identify different students
2. | GPA | Overall GPA from previous study prior to admission to university
3. | Type of school | A: High School; B: Technical College; C: Commercial College; D: Open School; E: Sports, Thai Dancing, Religion or Handcraft Training Schools; F: Other Universities (change of university or course); G: Vocational Training Schools
4. | Number of Awards | Awards received from previous study, normalized between 0.0 and 4.0 (0.0 = no award received, 4.0 = maximum number of awards in the dataset)
5. | Talent and Interest (group number) | 1 = sports, 2 = music and entertainment, 3 = presentation, 4 = academic, 5 = others, 6 = involved with 2 to 3 talents and interests, 7 = involved with more than 3 talents and interests
6. | Channels | The channel through which the student came to know the university, e.g., television, family
7. | Admission Round | Admission round of the university, from 1 to 5
8. | Guardian Occupation | The occupation of the guardian, e.g., teacher, governor
9. | Gender | Female or Male
10. | Uni GPA | Overall GPA at university, in the range 0 to 4
Figure 2. Number of samples in each department
The student records have been divided into 70% of training
data and 30% of testing data randomly. The dataset includes
both qualitative and quantitative information, as shown in Tables I and II. In terms of training, this study used a two-layer feed-forward network architecture. Moreover, this study used the Mean Absolute Error (MAE) to define the accuracy of the models.
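A sketch of the 70/30 split and the MAE measure is given below; the placeholder GPA targets and noise level are assumptions used only to exercise the computation.

    import numpy as np

    def mae(y_true, y_pred):
        # Mean Absolute Error: average of |actual - predicted|.
        return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

    rng = np.random.default_rng(0)
    n = 3550                                    # student records in the study
    idx = rng.permutation(n)
    cut = int(0.7 * n)
    train_idx, test_idx = idx[:cut], idx[cut:]  # 70% training / 30% testing

    y_true = rng.uniform(0.0, 4.0, len(test_idx))   # placeholder GPA targets
    y_pred = np.clip(y_true + rng.normal(0.0, 0.15, len(test_idx)), 0.0, 4.0)
    print(mae(y_true, y_pred))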
VI. EXPERIMENTAL RESULTS
Based on MAE, the experimental results have shown that
the Neural Network based models can be utilised to predict the
GPA results of students with a good degree of accuracy.
Figure 3. Comparison of MAE of testing data of sub-models for overall GPA and course ranking recommendation (bar chart of MAE values per department).
The testing was carried out in the final step of the
experiment in each model, which used 30% of the available
data. In Fig. 3, it is shown that the lowest value of Mean
Absolute Error (MAE) is 0.069 based on data from the
Department of Accounting. On the other hand, the highest
value is 0.344. The average of MAE of all models is 0.142.
The overall results indicate reasonable prediction performance.
Figure 4. Comparison of MAE of testing data on the Likelihood of GPA in
each Year
Fig. 4 shows a comparison of MAE of the results of the
sub-models from each department in each year. It can be seen
that the range of values of MAE is the lowest based on data
from the Department of Education. On the other hand, the
highest value is based on the Department of Communication
Arts, which is similar to the results for overall GPA. The average MAE of all models is 0.393. The Department of Public Administration gives similar results across the years, while the Department of Communication Arts and the Department of Industrial Management show the largest differences between years, with MAE much higher than the others in year 4 and year 2 respectively. It is possible that the differences in MAE are due to
the number of training and testing data. The overall results
obtained have indicated reasonable recommendation results.
VII. CONCLUSIONS
This article describes a recommendation system in support
of SRM and to address issues related to the problem of course
advice or counseling for university students in Thailand. The
recent work is focusing on the development and
implementation of each process in the framework. The
experiments have been based on Neural Network models and
the accuracy of the recommendation model is reasonable. It is
expected that the recommendation system will provide a
useful service for the university management, course
counselors, academic staff and students. The proposed system
will also support Student Relationship Management strategies
among the Thai private universities.
REFERENCES
[1] R. Ackerman and J. Schibrowsky, “A Business Marketing Strategy Applied to Student Retention: A Higher Education Initiative,” Journal of College Student Retention, vol. 9(3), pp. 330-336, 2007-2008.
[2] J. Archer, Jr. and S. Cooper, “Counselling and Mental Health Services on Campus. In A Handbook of Contemporary Practices and Challenges,” Jossey-Bass, ed. San Francisco, CA, 1998.
[3] A.L. Caison, “Determinates of Systemic Retention: Implications for improving retention practice in higher education,” Journal of College Student Retention, vol. 6, pp. 425-441, 2004-2005.
[4] K.L. Du and M.N.S. Swamy, “Neural Networks in a Softcomputing Framework,” Germany: Springer, vol. 1, 2006.
[5] Research Group of the Department of Education, “A study of the backgrounds of grade twelve students affecting different academic achievements,” Education Research, 2000.
[6] D.T. Gamage, J. Suwanabroma, T. Ueyama, S. Hada, and E. Sekikawa, “The impact of quality assurance measures on student services at the Japanese and Thai private universities,” Quality Assurance in Education, vol. 16(2), pp. 181-198, 2008.
[7] Y. Gao and C. Zhang, “Research on Customer Relationship Management Application System of Manufacturing Enterprises,” Wireless Communications, Networking and Mobile Computing, WiCom '08, 4th International Conference, Dalian, pp. 1-4, 2008.
[8] K. Harej and R.V. Horvat, “Customer Relationship Management Momentum for Business Improvement,” Information Technology Interfaces (ITI), pp. 107-111, 2004.
[9] P. Helland, H.J. Stallings, and J.M. Braxton, “The fulfillment of expectations for college and student departure decisions,” Journal of College Student Retention, vol. 3(4), pp. 381-396, 2001-2002.
[10] N. Jantarasapt, “The relationship between the study behavior and low academic achievement of students of Dhurakij Pundit University, Thailand,” Dhurakij Pundit University, 2005.
[11] K. Jusoff, S.A.A. Samah, and P.M. Isa, “Promoting university community's creative citizenry,” Proceedings of World Academy of Science, Engineering and Technology, vol. 33, pp. 1-6, 2008.
[12] M.B. Piedade and M.Y. Santos, “Student Relationship Management: Concept, Practice and Technological Support,” IEEE Xplore, pp. 2-5, 2008.
[13] S. Subyam, “Causes of Dropout and Program Incompletion among Undergraduate Students from the Faculty of Engineering, King Mongkut's University of Technology North Bangkok,” in The 8th National Conference on Engineering Education, Le Meridien Chiang Mai, Muang, Chiang Mai, Thailand, 2009.
[14] U. Urata and A. Takano, “Between psychology and college of education,” Journal of Educational Psychology, vol. 51, pp. 205-217, 2003.
[15] K.W. Wong, C.C. Fung, and T.D. Gedeon, “Data Mining Using Neural Fuzzy for Student Relationship Management,” International Conference of Soft Computing and Intelligent Systems, Tsukuba, Japan, 2002.
Recommendation and Application of Fault Tolerance Patterns to Services
Tunyathorn Leelawatcharamas and Twittie Senivongse Computer Science Program, Department of Computer Engineering
Faculty of Engineering, Chulalongkorn University Bangkok, Thailand
[email protected] , [email protected]
Abstract—Service technology such as Web services has been one of the mainstream technologies in today’s software development. Distributed services may suffer from communication problems or contain faults themselves, and hence service consumers may experience service interruption. A solution is to create services which can tolerate faults so that failures can be made transparent to the consumers. Since there are many patterns of software fault tolerance available, we end up with a question of which pattern should be applied to a particular service. This paper attempts to recommend to service developers the patterns for fault tolerant services. A recommendation model is proposed based on characteristics of the service itself and of the service provision environment. Once a fault tolerance pattern is chosen, a fault tolerant version of the service can be created as a WS-BPEL service. A software tool is developed to assist in pattern recommendation and generation of the fault tolerant service version.
Keywords - fault tolerance patterns; Web services; WS-BPEL
I. INTRODUCTION
Service technology has been one of the mainstream technologies in today's software development since it enables rapid, flexible development and integration of software systems. The current Web services technology builds software upon basic building blocks called Web services. They are software units that provide certain functionalities over the Web and involve a set of interface and protocol standards, e.g., the Web Services Description Language (WSDL) for describing service interfaces, SOAP as a messaging protocol, and the Business Process Execution Language (WS-BPEL) for describing business processes of collaborating services [1]. Like other software, services may suffer from communication problems or contain faults themselves, and hence service consumers may experience service interruption.
Different types of faults have been classified for services [2], [3], [4], and can be viewed roughly in three categories: (1) Logic faults comprise calculation faults, data content faults, and other logic-related faults thrown specifically by the service. Web service consumers can detect logic faults by WSDL fault messages or have a way to check correctness of service
responses. (2) System and network faults are those that can be identified, for example, through HTTP status code and detected by execution environment, e.g., communication timeout, server error, service unavailable. (3) SLA faults are raised when services violate SLAs, e.g., response time requirements, even though functional requirements are fulfilled. For service providers, one of the main goals of service provision is service reliability. Services should be provided in a reliable execution environment and prepared for various faults so that failures can be made as transparent as possible to service consumers. Service designers should therefore design services with a fault tolerance mindset, expecting the unexpected and preparing to prevent and handle potential failures.
There are many fault tolerance patterns or exception handling strategies that can be applied to make software and systems more reliable. Common patterns involve how to handle or recover from failures, such as communication retry or the use of redundant system nodes. In a distributed services context, we end up with a question of which fault tolerance pattern should be applied to a particular service. We argue that not all patterns are equally appropriate for any services. This is due to the characteristics of each service including service semantics and the environment of service provision. In this paper, we propose a mathematical model that can assist service designers in designing fault tolerant versions of services. The model helps recommend which fault tolerance patterns are suitable for particular services. With a supporting tool, service designers can choose a recommended pattern and have fault tolerant versions of the services generated as WS-BPEL services.
Section II discusses related work in Web services fault tolerance. Section III lists fault tolerance patterns that are considered in our work. Characteristics of the services and condition of service provision that we use as criteria for pattern recommendation are given in Section IV. Section V presents how service designers can be assisted by the pattern recommendation model. The paper concludes in Section VI with future outlook.
II. RELATED WORK
A number of research works in the area of fault tolerant services address the application of fault tolerance patterns to WS-BPEL processes, even though they may use different fault tolerance terminology for similar patterns or
strategies. For example, Dobson’s work [5] is among the first in this area which proposes how to use BPEL language constructs to implement fault tolerant service invocation using four different patterns, i.e., retry, retry on a backup, and parallel invocations to different backups with voting on all responses or taking the first response. Lau et al. [6] use BPEL to specify passive and active replication of services in a business process and also support a backup of BPEL engine itself. Liu et al. [2] propose a service framework which combines exception handling and transaction techniques to improve reliability of composite services. Service designers can specify exception handling logic for a particular service invocation as an Event-Condition-Action rule, and eight strategies are supported, i.e., ignore, notify, skip, retry, retryUntil, alternate, replicate, and wait. Thaisongsuwan and Senivongse [7] define the implementation of fault tolerance patterns, as classified by Hanmer [8], on BPEL processes. Nine of the architectural, detection, and recovery patterns are addressed, i.e., Units of Mitigation, Quarantine, Error Handler, Redundancy, Recovery Block, Limit Retries, Escalation, Roll-Forward, and Voting. These researches suggest that different patterns can be applied to different service invocations as appropriate but are not specific on when to apply which. Nevertheless we adopt their BPEL implementations of the patterns for the generation of our fault tolerant services.
Zheng and Lyu present interesting approaches to fault tolerant Web services which support strategies including retry, recovery block, N-version programming (i.e., parallel service invocations with voting on all responses), and active (i.e., parallel service invocations with taking the first response). For composite services, they propose a QoS model for fault tolerant service composition which helps determine which combination of the fault tolerance strategies gives a composite service the optimal quality [9]. In the context of individual Web services, they propose a dynamic fault tolerance strategy selection for a service [3]; the optimal strategy is one that gives optimal service roundtrip time and failure rate. Both user-defined service constraints and current QoS information of the service are considered in the selection algorithm. In [10], they view fault tolerance strategies as time-redundancy and space-redundancy (i.e., passive and active replication) as well as combination of those strategies. Although their approaches and ours share the same motivation, their fault tolerance strategy selection requires an architecture that supports service QoS monitoring and provision of replica services. This could be too much to afford for strategy selection, for example, if it turns out that expensive strategies involving replica nodes are not appropriate. This paper can be complementary to their approach but it is more lightweight by merely recommending which fault tolerance strategies are likely to match service characteristics that are of concern to service designers.
III. FAULT TOLERANCE PATTERNS
In our approach, the following fault tolerance patterns are supported (Fig. 1). They are addressed in Section II and can be expressed using BPEL which is the target implementation of our fault tolerant services. Here the term “service” to which a pattern will be applied refers to the smallest unit of service provision, e.g., an operation of a Web service implementation.
Figure 1. Fault tolerance patterns: (1) Retry, (2) Wait, (3) RB Replica, (4) RB NVP, (5) Active Replica, (6) Active NVP, (7) Voting Replica, (8) Voting NVP, (9) Retry + Wait. (The original figure shows the control flow of each pattern as Call Service/Call Replica actions guarded by [Fail], [RetryCondition] and [WaitCondition] conditions.)
1) Retry: When service invocation is not successful, invocation of the same service is repeated until it succeeds or a stopping condition evaluates to true. A common condition is reaching the allowed number of retries.
2) Wait: Service invocation is delayed until a specified time. If the service is expected to be busy or unavailable at a particular time, delaying invocation until a later time could help decrease failure probability.
3) RecoveryBlockReplica: When service invocation is not successful, invocation is made sequentially to a number of functionally equivalent alternatives (i.e., recovery blocks) until the invocation succeeds or all alternatives are used. Here the alternatives are replicas of the original service; they can be different copies of the original service but are provided in different execution environments.
4) RecoveryBlockNVP: This pattern is similar to 3) but adopts N-version programming (NVP). Here the original service and its alternatives are developed by different development teams or with different technologies, algorithms, or programming languages, and they may be provided in the same or different execution environments. This would be more reliable than having replicas of the original service as alternatives since it can decrease the failure probability caused by faults in the original service.
5) ActiveReplica: To increase the probability that service invocation will return in a timely manner, invocation is made to a group of functionally equivalent services in parallel. The first successful response from any service is taken as the invocation result. Here the group are replicas of each other; they can be different copies of the same service but are provided in different execution environments.
6) ActiveNVP: This pattern is similar to 5) but adopts NVP. Here the services in the group are developed by different development teams or with different technologies, algorithms, or programming languages, and they may be provided in the same or different execution environment. This would be more reliable than having the group as replicas of each other since it can decrease the failure probability caused by faults in the replicas.
7) VotingReplica: To increase the probability that service invocation will return a correct result despite service faults, invocation is made to a group of functionally equivalent services in parallel. Given that there will be several responses from the group, one of the voting algorithms, e.g., majority voting, can be used to determine the final result of the invocation. Here the group are replicas of each other; they can be different copies of the same service but are provided in different execution environments.
8) VotingNVP: This pattern is similar to 7) but adopts NVP. Here the services in the group are developed by different development teams or with different technologies, algorithms, or programming languages, but they may be provided in the same or different execution environment.
9) Retry + Wait: This pattern is an example of a possible combination of different patterns. When service invocation is not successful, invocation is retried a number of times and, if still unsuccessful, the pattern waits until a specified time before another invocation is made.
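To make the control flow concrete, the following is a minimal Python sketch of the Retry and Retry + Wait logic described above; the call_service callable, the retry limit and the wait time are illustrative placeholders of ours, not part of the paper's BPEL implementation.

import time

def retry_with_wait(call_service, max_retries=3, wait_until=None):
    # Retry: repeat the same invocation until it succeeds or the
    # allowed number of retries is exhausted.
    for _ in range(max_retries):
        try:
            return call_service()
        except Exception:
            continue
    # Wait: if still unsuccessful, delay until the specified time
    # before one more invocation is made (Retry + Wait).
    if wait_until is not None:
        time.sleep(max(0, wait_until - time.time()))
        return call_service()
    raise RuntimeError("service invocation failed after all retries")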
All patterns except Wait employ redundancy. Retry is a form of time redundancy taking extra communication time to tolerate faults whereas RecoveryBlock, Active, and Voting employ space redundancy using extra resources to mask faults [10]. RecoveryBlock uses the passive replication technique; invocation is made to the original (primary) service first and alternatives (backup services) will be invoked only if the original service or other alternatives fail. Active and Voting both use the active replication technique; all services in a group execute a service request simultaneously, but they determine the final result differently. Retry, Wait, and RecoveryBlock can help tolerate system and network faults. Voting can be used to mask logic faults, e.g., when majority voting is used and the majority of service responses are correct. It can even detect logic faults if a correct response is known. Active can help with SLA faults that relate to late service responses.
IV. SERVICE CHARACTERISTICS
The following are the criteria regarding service characteristics and the condition of the service execution environment that the service designer/provider will consider for a particular service. These characteristics will influence the recommendation of fault tolerance patterns for the service.
1) Transient Failure: The service environment is generally reliable and potential failure would only be transient. For example, the service may be inaccessible at times due to network problems, but a retry or invocation after a wait should be successful.
2) Instance Specificity: The service is specific and consumers are tied to use this particular service. It can be that there are no equivalent services provided by other providers, or the service maintains specific data of the consumers. For example, a CheckBalance service of a bank is specific because a customer can only check an account balance through the service of this bank with which he/she has an account, and not through the services of other banks.
3) Replica Provision: This relates to the ability of the service designer/provider to accommodate different replicas of the service. The replicas should be provided in different execution environments, e.g., on different machines or processing different copies of data. This ability helps improve reliability since service provision does not rely on a single service.
4) NVP Provision: This relates to the ability of the service designer/provider to accommodate different versions of the service. The service versions may be developed by different development teams or with different technologies, algorithms, or programming languages, and they may be provided in the same or different execution environment. This ability helps improve reliability since service provision does not rely on any single version of the service.
5) Correctness: The service designer expects that the service and execution environment should be managed to
provide correct results. This relates to the quality of service environment to provide reliable communication, including the mechanisms to check for correctness of messages even in the presence of logic faults.
6) Timeliness: The service designer expects that the service and execution environment should be managed to react quickly to requests and give timely results.
7) Simplicity: The service designer/provider may be concerned with simplicity of the service. Provision for fault tolerance can complicate service logic, add more interactions to the service, and increase latency of service access. When service provision is more complex, more faults can be introduced.
8) Economy: The service designer/provider may be concerned with the economy of making the service fault tolerant. Fault tolerance patterns consume extra time, cost, and computing resources. For example, sequential invocation is cheaper than parallel invocation of a group of services, and providing replicas of the service is cheaper than NVP.
V. FAULT TOLERANCE PATTERNS RECOMMENDATION
The recommendation of fault tolerance patterns to a service is based on what characteristics the service possesses and which patterns suit such characteristics.
A. Service Characteristics-Fault Tolerance Patterns Relationship
We first define a relationship between service characteristics and fault tolerance patterns as in Table I. Each cell of the table represents the relationship level, i.e., how well the pattern can respond to the service characteristic. The relationship level ranges from 0 to 8 since there are eight basic patterns. Level 8 means the pattern responds very well to the characteristic, level 7 responds well, and so on. Level 0 means there is no relationship between the pattern and service characteristic.
For example, for Economy, Retry and Wait are cheaper than other patterns that employ space redundancy since both of them require only one service implementation. But Wait responds best to economy (i.e., level 8) since there is only a single call to the service whereas Retry involves multiple invocations (i.e., level 7). Sequential invocation in RecoveryBlock is cheaper than parallel invocation in Active and Voting because not all service implementations will have to be invoked; a particular alternative of the service will be invoked only if the original service and other alternatives fail, whereas parallel invocation requires that different service implementations be invoked simultaneously. RecoveryBlockReplica (level 6) is cheaper than RecoveryBlockNVP (level 5) because providing replicas of the service should cost less than development of NVP. Similarly ActiveReplica (level 4) is cheaper than ActiveNVP (level 3) and VotingReplica (level 2) is cheaper than VotingNVP (level 1). Note that Voting is more expensive than Active due to development of a voting algorithm to determine the final result. For a combination of patterns such as Retry+Wait, the relationship level is an average of the levels of the combining patterns.
TABLE I. RELATIONSHIP BETWEEN SERVICE CHARACTERISTICS AND FAULT TOLERANCE PATTERNS

Service Characteristic    | Retry | Wait | RBReplica | RBNVP | ActiveReplica | ActiveNVP | VotingReplica | VotingNVP | Retry+Wait
Transient Failure (TF)    |   8   |  7   |     0     |   0   |       0       |     0     |       0       |     0     |    7.5
Instance Specificity (IS) |   8   |  8   |     7     |   6   |       5       |     4     |       5       |     4     |    8
Replica Provision (RP)    |   0   |  0   |     8     |   0   |       8       |     0     |       8       |     0     |    0
NVP Provision (NP)        |   0   |  0   |     0     |   8   |       0       |     8     |       0       |     8     |    0
Correctness (CO)          |   2   |  2   |     3     |   4   |       5       |     6     |       7       |     8     |    2
Timeliness (TI)           |   4   |  1   |     5     |   6   |       7       |     8     |       2       |     3     |    2.5
Simplicity (SI)           |   8   |  8   |     7     |   6   |       5       |     4     |       3       |     2     |    8
Economy (EC)              |   7   |  8   |     6     |   5   |       4       |     3     |       2       |     1     |    7.5
For the relationship between other characteristics and the patterns, we reason in a similar manner. Retry and Wait suit an environment with Transient Failure. The patterns that rely on the execution of a single service at a time respond better to Instance Specificity than those that employ multiple service implementations. Replica Provision and NVP Provision are relevant to the patterns that employ space redundancy. For Correctness, Voting is the best since it is the only pattern that can mask/detect Byzantine failure (i.e., the case where the services give incorrect results). Active is better than RecoveryBlock with regard to Byzantine failure because the result of Active can come from any one of the redundant services that are invoked in parallel, so the chance of getting an incorrect result should be lower than with RecoveryBlock. Retry and Wait do not suit Correctness since they rely on the execution of a single service. For Timeliness, the comparison of the patterns on time performance given in [2], [3] (ranked in descending order) is: Active, RecoveryBlock, Retry, Voting, Wait. For Simplicity, the logic of Retry and Wait, which involves a single service, is the simplest.
B. Assessment of Service Characteristics
The next step is to have the service designer assess what characteristics the service possesses; the characteristics would influence pattern recommendation.
1) Identify Dominant Characteristics: The service designer will consider service semantics and condition of service provision, and identify dominant characteristics that should influence pattern recommendation. For each characteristic that is of concern, the service designer defines a dominance level. Level 1 means the characteristic is the most dominant (i.e., ranked 1st), level 2 means less dominant (i.e., ranked 2nd), and so on. Level 0 means the service does not have the characteristic or the characteristic is of no concern.
For example, during the design of a CheckBalance service of a bank, the service designer considers Instance Specificity as the most dominant characteristic (i.e., dominance level 1) since
bank customers would be tied to their bank accounts that are associated with this particular service. From experience, the designer sees that the computing environment of the bank provides a reliable service and if there is a problem, it is generally transient, and hence a simple fault handling strategy is preferred (i.e., Transient Failure and Simplicity have dominance level 2). Nevertheless, the designer is able to afford exact replicas of the service if something more serious happens (i.e., Replica Provision has dominance level 3). Suppose the designer is not concerned with other characteristics, then the others would have dominance level 0. Table II shows the dominance level of all characteristics of this CheckBalance service.
2) Convert Dominance Level to Dominance Weight:
a) Convert Dominance Level to Raw Score: The dominance level of each characteristic will be converted to a raw score. The most dominant characteristic gets the highest score, which is equal to the dominance level of the least dominant characteristic that is considered. Less dominant characteristics get lower scores accordingly. From the example of the CheckBalance service, Replica Provision has the lowest dominance (level 3), so the raw score of the most dominant characteristic, Instance Specificity, is 3. Then the score for Transient Failure and Simplicity would be 2, and Replica Provision gets 1. Table III shows the raw scores of the service characteristics.
b) Compute Dominance Weight: First, divide 1 by the sum of the raw scores. For example, for the CheckBalance service, the sum of the raw scores in Table III is 8 (2+3+1+0+0+0+2+0), so the quotient is 1/8 (0.125). Then, multiply this quotient by the raw score of each characteristic. The results are the dominance weights of the characteristics (the weights sum to 1). The weights will be used later in the recommendation model. For the CheckBalance service, the dominance weights of all characteristics are shown in Table IV.
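As an illustration, the two conversion steps can be reproduced with a short Python sketch; the variable names are ours, and the numbers are those of the CheckBalance example.

# Dominance levels for TF, IS, RP, NP, CO, TI, SI, EC (Table II)
levels = [2, 1, 3, 0, 0, 0, 2, 0]
max_level = max(levels)  # dominance level of the least dominant characteristic
# a) Raw score: the most dominant characteristic (level 1) gets max_level,
#    less dominant characteristics get lower scores; level 0 stays 0.
scores = [max_level - lv + 1 if lv > 0 else 0 for lv in levels]
# b) Dominance weight: normalize the raw scores so that they sum to 1.
weights = [s / sum(scores) for s in scores]
print(scores)   # [2, 3, 1, 0, 0, 0, 2, 0]   (Table III)
print(weights)  # [0.25, 0.375, 0.125, 0.0, 0.0, 0.0, 0.25, 0.0]   (Table IV)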
C. Fault Tolerance Patterns Recommendation Model
We propose a model for fault tolerance patterns recommendation as in (1)
TABLE II. DOMINANCE LEVELS OF SERVICE CHARACTERISTICS

Characteristic | TF | IS | RP | NP | CO | TI | SI | EC
Level          |  2 |  1 |  3 |  0 |  0 |  0 |  2 |  0

TABLE III. RAW SCORES OF SERVICE CHARACTERISTICS

Characteristic | TF | IS | RP | NP | CO | TI | SI | EC
Score          |  2 |  3 |  1 |  0 |  0 |  0 |  2 |  0

TABLE IV. DOMINANCE WEIGHTS OF SERVICE CHARACTERISTICS

Characteristic | TF   | IS    | RP    | NP | CO | TI | SI   | EC
Weight         | 0.25 | 0.375 | 0.125 | 0  | 0  | 0  | 0.25 | 0
P = D x R (1)

where
P = a vector of fault tolerance pattern scores,
D = a vector of dominance weights of service characteristics as computed in Section V.B,
R = a relationship matrix between service characteristics and fault tolerance patterns as proposed in Section V.A.

Therefore, given R from Table I (rows TF, IS, RP, NP, CO, TI, SI, EC; columns Retry, Wait, RBReplica, RBNVP, ActiveReplica, ActiveNVP, VotingReplica, VotingNVP, Retry+Wait) as

    | 8  7  0  0  0  0  0  0  7.5 |
    | 8  8  7  6  5  4  5  4  8   |
    | 0  0  8  0  8  0  8  0  0   |
R = | 0  0  0  8  0  8  0  8  0   |
    | 2  2  3  4  5  6  7  8  2   |
    | 4  1  5  6  7  8  2  3  2.5 |
    | 8  8  7  6  5  4  3  2  8   |
    | 7  8  6  5  4  3  2  1  7.5 |

and, in the case of the CheckBalance service, D (over TF, IS, RP, NP, CO, TI, SI, EC) as

D = [0.25  0.375  0.125  0  0  0  0.25  0],

the pattern recommendation P (over Retry, Wait, RBReplica, RBNVP, ActiveReplica, ActiveNVP, VotingReplica, VotingNVP, Retry+Wait) would be

P = [7.00  6.75  5.38  3.75  4.12  2.50  3.62  2.00  6.88].
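The recommendation scores above can be reproduced with a few lines of Python; this is a sketch of ours using numpy, with R transcribed from Table I and D from Table IV.

import numpy as np

# Rows: TF, IS, RP, NP, CO, TI, SI, EC; columns: Retry, Wait, RBReplica,
# RBNVP, ActiveReplica, ActiveNVP, VotingReplica, VotingNVP, Retry+Wait.
R = np.array([[8, 7, 0, 0, 0, 0, 0, 0, 7.5],
              [8, 8, 7, 6, 5, 4, 5, 4, 8],
              [0, 0, 8, 0, 8, 0, 8, 0, 0],
              [0, 0, 0, 8, 0, 8, 0, 8, 0],
              [2, 2, 3, 4, 5, 6, 7, 8, 2],
              [4, 1, 5, 6, 7, 8, 2, 3, 2.5],
              [8, 8, 7, 6, 5, 4, 3, 2, 8],
              [7, 8, 6, 5, 4, 3, 2, 1, 7.5]])
D = np.array([0.25, 0.375, 0.125, 0, 0, 0, 0.25, 0])  # CheckBalance weights
P = D @ R  # vector of fault tolerance pattern scores, equation (1)
print(np.round(P, 2))  # [7.   6.75 5.38 3.75 4.12 2.5  3.62 2.   6.88]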
The recommendation says how well each pattern suits the service according to the characteristic assessment. The pattern with the highest score would be best suited for the service. Since the designer of the CheckBalance service pays most attention to Instance Specificity, Transient Failure, and Simplicity, the designer is inclined to rely on the reliable provision of a single service. The patterns that respond well to these characteristics, i.e., Retry, Wait, and Retry+Wait, are among the first to be recommended. Here, Retry is the best-suited pattern with the highest score. Since the designer can provide replica services as well but still has simplicity in mind, RecoveryBlockReplica is the next to be recommended. Voting
patterns and those which require NVP services are more complex strategies, so they get lower scores.
D. Generation of Fault Tolerant Service
A software tool has been developed to support fault tolerance pattern recommendation and the generation of fault tolerant services as BPEL services. The service designer will first be prompted to select the service characteristics that are of interest, and then specify a dominance level for each chosen characteristic. The tool will calculate and rank the pattern scores, as shown in Fig. 2 for the CheckBalance service. The designer can choose one of the recommended patterns, and the tool will prompt the designer to specify the WSDL of the service together with any parameters necessary for the generation of the BPEL version. For Retry, the parameter is the number of retries. For RecoveryBlock, Active, and Voting, the parameter is the set of WSDLs of all service implementations involved. For Wait, the parameter is the wait-until time. In this example, Retry is chosen and the number of retries is 5. Then, a fault tolerant version of the service will be generated as a BPEL service for GlassFish ESB v2.2, as shown in Fig. 3. The BPEL version invokes the service in a fault tolerant way, implementing the pattern structure we adopt from [2], [7].
Figure 2. Pattern recommendation by supporting tool.
Figure 3. BPEL structure for Retry.
VI. CONCLUSION
In this paper, we propose a model to recommend fault tolerance patterns for services. The recommendation considers service characteristics and the condition of the service environment. A supporting tool has been developed to assist in the recommendation and in the generation of fault tolerant service versions as BPEL services. As mentioned earlier, this is a lightweight approach which helps to identify fault tolerance patterns that are likely to match service characteristics according to the subjective assessment of service designers. At present the recommendation is aimed at a single service. The approach can be extended to accommodate pattern recommendation and generation of fault tolerant composite services. More combinations of patterns can also be supported. In addition, we are in the process of trying the model with services in business organizations for further evaluation.
REFERENCES
[1] M. P. Papazoglou, Web Services: Principles and Technology. Pearson Education Prentice Hall, 2008.
[2] A. Liu, Q. Li, L. Huang, and M. Xiao, "FACTS: A framework for fault-tolerant composition of transactional Web services", IEEE Trans. on Services Computing, vol.3, no.1, 2010, pp. 46-59.
[3] Z. Zheng and M. R. Lyu, "An adaptive QoS-aware fault tolerance strategy for Web services", Empirical Software Engineering, vol.15, issue 4, 2010, pp. 323-345.
[4] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, “Basic concepts and taxonomy of dependable and secure computing”, IEEE Trans. on Dependable and Secure Computing, vol.1, no.1, 2004, pp. 11-33.
[5] G. Dobson, “Using WS-BPEL to implement software fault tolerance for Web services”, In Procs. of 32nd EUROMICRO Conf. on Software Engineering and Advanced Applications (EUROMICRO-SEAA’06), 2006, pp. 126-133.
[6] J. Lau, L. C. Lung, J. D. S. Fraga, and G. S. Veronese, “Designing fault tolerant Web services using BPEL”, In Procs. of 7th IEEE/ACIS Int. Conf. on Computer and Information Science (ICIS 2008), 2008, pp. 618-623.
[7] T. Thaisongsuwan and T. Senivongse, “Applying software fault tolerance patterns to WS-BPEL processes ”, In Procs. of Int. Joint Conf. on Computer Science and Software Engineering (JCSSE2011), 2011, pp. 269-274.
[8] R. Hanmer, Patterns for Fault Tolerant Software. Chichester: Wiley Publishing, 2007.
[9] Z. Zheng and M. R. Lyu, “A QoS-aware fault tolerant middleware for dependable service composition”, In Procs. of IEEE Int. Conf. on Dependable Systems & Networks (DSN 2009), 2009, pp. 239-249.
[10] Z. Zheng and M. R. Lyu, “Optimal fault tolerance strategy selection for Web services”, Int. J. of Web Services Research, vol.7, issue 4, 2010, pp.21-40.
Development of Experience Base Ontology to Increase Competency of Semi-automated ICD-10-TM Coding System

Wansa Paoin
Faculty of Information Technology
King Mongkut's University of Technology North Bangkok
Bangkok, Thailand
[email protected]

Supot Nitsuwat
Faculty of Information Technology
King Mongkut's University of Technology North Bangkok
Bangkok, Thailand
[email protected]
Abstract— The objectives of this research were to create the International Classification of Diseases, 10th Revision, Thai Modification (ICD-10-TM) experience base ontology, to test the usability of the ICD-10-TM experience base together with the knowledge base in a semi-automated ICD coding system, and to increase the competency of the system. The ICD-10-TM experience base ontology was created by collecting 4,880 anonymous patient records coded into ICD codes by 32 volunteer expert coders working in different hospitals. Data were checked for misspellings and mismatched elements and converted into an experience base ontology using the n-triple (N3) format of the Resource Description Framework. The semi-automated coding software could search the experience base when an initial search of the ICD knowledge base yielded no result. Competency of the semi-automated coding system was tested using another data set containing 14,982 diagnoses from 5,000 medical records of anonymous patients. All ICD codes produced by the semi-automated coding system were checked against the correct ICD codes validated by ICD expert coders. When the system used only the ICD knowledge base for automated coding, it could find 7,142 ICD codes (47.67%), with recall = 0.477 and precision = 0.909; when it used the ICD knowledge base with experience base search, it could find 9,283 ICD codes (61.96%), with recall = 0.677 and precision = 0.928. This increase in the ability of the system was statistically significant (paired t-test p-value = 0.008 < 0.05). This research demonstrated a novel mechanism of using an experience base ontology to enhance the competency of a semi-automated ICD coding system. The model of interaction between knowledge base and experience base developed in this work could also be used as basic knowledge for the development of other computer systems that compute intelligent answers to complex questions.

Keywords-experience base, knowledge base, ontology, semi-automated ICD coding system
I. INTRODUCTION
An ontology is a data structure, a data representation tool to share and reuse knowledge between artificial intelligence systems that share a common vocabulary. An ontology could be used as a knowledge base for a computer system to compute intelligent answers to complex questions such as ICD-10-TM (International Statistical Classification of Diseases and Related Health Problems, 10th Revision, Thai Modification) [1] coding.
ICD-10 is a classification created by the World Health Organization (WHO) and maintained by it since 1992 [2]. The electronic versions of ICD-10 were released in 2004, as browsing software in a CD-ROM package [3] and as ICD-10 online on the WHO website [4]. Both electronic versions provide only a simple word search service that facilitates only a minor part of the complex ICD coding processes. Since 2000, some countries have added more codes from medical expert opinions into ICD-10, so ICD-10 has been modified in some countries, e.g., Australia, Canada, and Germany. In Thailand, ICD-10 has been modified as ICD-10-TM (Thai Modification) since 2000 [5] and is maintained by the Ministry of Public Health, Thailand.
ICD coding is an important task for every hospital. After a medical doctor completes treatment for a patient, the doctor must summarize all diagnoses of the patient in a diagnosis and procedures summary form. Then a clinical coder will start ICD coding for that case using the manual ICD coding process, which uses two ICD books as reference sources. All ICD codes for each patient will be used for morbidity and mortality statistical analysis and for reimbursement of medical care costs in the hospital. Manual ICD coding processes are complex. ICD coding cannot be finished merely by word matching between diagnosis words and a list of ICD codes/labels; a clinical coder may assign two different ICD codes for two patients with the same diagnosis word based on each patient's context. Unfortunately, this complexity of ICD coding was not recognized by most researchers who tried to develop semi-automated and automated ICD coding systems in the past.
Several research works have addressed the automated ICD coding process. The Diogene 2 program [6] built a medical terminology table and used it to map diagnosis words into a morphosemantem (word-form) layer, then converted the terms into a concept layer before matching them to the labels of ICD codes in an expression layer. Heja et al. [7] matched diagnosis words against a list of ICD code labels and suggested that a hybrid model yields better matching results. Pakhomov et al. [8]
designed an automated coding system to assign codes to out-patient diagnoses using example-based and machine learning techniques. Periera et al. [9] built a semi-automated coding help system using an automated MeSH-based indexing system and a mapping between MeSH and ICD-10 extracted from the UMLS metathesaurus. These previous works used only word-matching approaches and never covered the full standard ICD coding process, which is summarized in ICD-10 Volume 2 [10].
In our previous work [11], we created the ICD-10-TM ontology as a knowledge base for the development of semi-automated ICD coding. The ICD-10-TM ontology contains two main knowledge bases, i.e., a tabular list knowledge base and an index knowledge base, with 309,985 concepts and 162,092 relations. The tabular list knowledge base can be divided into an upper-level ontology, which defines the hierarchical relationships between the 22 ICD chapters, and a lower-level ontology, which defines the relations between chapters, blocks, categories, rubrics and basic elements (include, exclude, synonym, etc.) of the ICD tabular list. The index knowledge base describes the relations between keywords and modifiers in the general format and the table format of the ICD index.
The ICD-10-TM ontology was implemented in semi-automated ICD-10-TM coding software as a knowledge base. The software is distributed by the Thai Health Coding Center, Ministry of Public Health, Thailand [12]. The coding algorithm searches for matching keywords and modifiers in the index ontology and the diagnosis knowledge base, then verifies the code definition and the include and exclude conditions against the tabular list ontology. The program displays all ICD-10-TM codes found (or reports that none was found) to the clinical coder, who can then accept the codes or change them to other codes based on her judgment and the standard coding guideline. A user survey revealed good results from the ontology search, with high user satisfaction (>95%) regarding the usability of the ontology. When we tried to use the system for automated coding, i.e., to code all diagnoses before a clinical coder starts coding in order to reduce the number of diagnoses to be coded by the clinical coder, we found that automated coding based on the ICD-10-TM ontology could successfully code 24-50% of all diagnosis words. To increase the competency of the system, we created another ontology called an "experience base" to help the system code more diagnosis words than previously possible.
In this paper, we present the ICD-10-TM experience base and the application of a novel mechanism that uses an experience base ontology to enhance the competency of a semi-automated ICD coding system. The model of interaction between knowledge base and experience base developed in this work could also be used as basic knowledge for the development of other computer systems that compute intelligent answers to complex questions.
II. METHODOLOGY
To create the experience base, we asked all expert coders in Thailand to volunteer to participate in this project. To be eligible, an expert coder must have had at least 10 years of experience in ICD coding or have passed the examination for certified coder (intermediate level) of the Thai Health Coding Center, Ministry of Public Health. The project committee selected 42 expert coders from 198 volunteers based on their ability to devote time to the project, hospital size, the location of the hospital where the coder works, and competency in using computers and software.
All selected expert coders attended a one-day training on how to use the semi-automated coding system. Each of them was assigned to use the system to do ICD coding. They used medical records of patients admitted to their hospitals during January to November 2011 as input to the system. The input data did not include patient identification data. Only the sex, age and obstetrics condition of each patient had to be input into the system, since these data elements, as well as all diagnosis words, are essential for ICD code selection by the system. Each expert coder had to input at least 100 different cases into the system within 30 days. After finishing the task, each coder sent the saved data to the project coordinator by email. Data from all expert coders were checked for misspellings and mismatched elements (for example, a male patient could not be an obstetrics case). Records of the patient type with each diagnosis word and ICD code from every case were created using the n-triple (N3) format of the Resource Description Framework (RDF) [13] to build the experience base ontology. The ontology was built into the system as an inverted index structure by transforming it into the Lucene 3.4 [14] search engine library, which is the core engine of the semi-automated ICD coding system. The new semi-automated coding system thus has another ontology: the ICD experience base created from the expert coders' work. The automated coding algorithm has one new step, which is executed when searching the ICD knowledge base yields no result: the system then searches the ICD experience base. Sometimes the ICD code of a diagnosis with the same patient context varies from one expert opinion to another; in that case the system selects the ICD code with the highest frequency of expert opinion.
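A minimal Python sketch of this new lookup step, with dictionary stand-ins for the Lucene indexes (the function and variable names are ours, for illustration only):

from collections import Counter

def assign_icd_code(diagnosis, context, knowledge_base, experience_base):
    # First-round search: the ICD knowledge base built from the
    # ICD-10-TM index and tabular list.
    code = knowledge_base.get((diagnosis, context))
    if code is not None:
        return code
    # Fallback: search the experience base and select the ICD code
    # with the highest frequency among expert opinions.
    opinions = experience_base.get((diagnosis, context), [])
    if opinions:
        return Counter(opinions).most_common(1)[0][0]
    return None  # not found; left to the human coder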
Competency of the semi-automated coding system was tested using another set of patient data. This dataset contains 14,982 diagnoses from 5,000 medical records of patients admitted during January to June 2011 to another hospital, which did not participate in the experience base creation. Every ICD code in this dataset was validated for 100% accuracy by another three expert coders. All ICD codes produced by the semi-automated coding system, when using the knowledge base only and when using the knowledge base with the experience base, were checked for accuracy against the correct ICD codes in the dataset.
III. RESULTS
By the end of the project, 4,880 diagnosis words and patient contexts were collected from 32 expert coders. Ten expert coders did not send their cases within the deadline, so their data were excluded from analysis in this phase. All 4,880 diagnosis words and patient contexts were used to create the experience base ontology. A Python script was written and used to transform each record from comma-separated value (CSV) file format to RDF N3 files.
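A sketch of such a script is shown below; the file names and the CSV column layout (diagnosis word, patient-diagnosis id, patient context, expert id, ICD code) are assumptions of ours based on the examples that follow.

import csv

with open("expert_cases.csv", newline="") as src, open("cases.n3", "w") as out:
    for dx, ptdxid, context, expert, code in csv.reader(src):
        # Normalize the diagnosis word into an identifier-friendly form.
        dx = dx.strip().lower().replace(" ", "_")
        # Emit the four RDF triples for this expert-coded case.
        out.write(f"dxword:{dx} word:hasPtDxId ptdxid:{ptdxid} .\n")
        out.write(f"ptdxid:{ptdxid} pt:isA ptcontext:{context} .\n")
        out.write(f"ptdxid:{ptdxid} icd:codeBy expert:{expert} .\n")
        out.write(f"ptdxid:{ptdxid} icd:hasCode icd10:{code} .\n")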
The experience base ontology contains five concepts and four relations, as shown in Table I. Each diagnosis word in a patient record could be uniquely identified. Each ICD expert opinion on the ICD code that should be used for each diagnosis word, based on the patient context, was an important concept in the ontology. All these concepts and relations were used to construct the RDF statements in the experience base ontology. For example, if an expert '[email protected]' gave an opinion that a diagnosis word 'disseminated tuberculosis' in a patient context 'man not newborn' should be coded to ICD code 'A18.3', the RDF statements in N3 format would be written as follows:

dxword:disseminated_tuberculosis word:hasPtDxId ptdxid:001 .
ptdxid:001 pt:isA ptcontext:man_not_newborn .
ptdxid:001 icd:codeBy expert:[email protected] .
ptdxid:001 icd:hasCode icd10:A183 .
The experience base ontology concepts and relations can be presented as a graph, as in Figure 1.
TABLE I. ALL EXPERIENCE BASE CONCEPTS AND RELATIONS IN RDF N3 FORMAT

Concept/Relation | Ontology type | RDF format     | Example
Diagnosis Word   | Concept       | dxword:        | dxword:disseminated_tuberculosis
PatientDiag ID   | Concept       | ptdxid:        | ptdxid:001
Patient Context  | Concept       | ptcontext:     | ptcontext:man_not_newborn
Expert           | Concept       | expert:        | expert:[email protected]
ICD10 Code       | Concept       | icd10:         | icd10:A183
hasPtDxId        | Relation      | word:hasPtDxId | dxword:dyslipidemia word:hasPtDxId ptdxid:101
isA              | Relation      | pt:isA         | ptdxid:101 pt:isA ptcontext:man_not_newborn
codeBy           | Relation      | icd:codeBy     | ptdxid:101 icd:codeBy expert:abc
hasCode          | Relation      | icd:hasCode    | ptdxid:101 icd:hasCode icd10:E78.5
The system was used to automatically code the 14,982 diagnoses in the test dataset. When the system used only the ICD knowledge base, it could find 7,142 ICD codes (47.67%), but when it used the ICD knowledge base with experience base search, the system could find 9,283 ICD codes (61.96%). This increased ability was tested for statistical significance using a paired t-test with alpha = 0.05; t = -79.30 with p-value = 0.008 (< 0.05).
Recall and precision of the system were also calculated. The recall and precision when the system used the ICD knowledge base only were 0.477 and 0.909, while the recall and precision when the system used the ICD knowledge base with the experience base were 0.677 and 0.928.
Figure 1. A part of the ICD experience base. A diagnosis word "Dyslipidemia" in each patient record could be coded to various ICD codes, based on each expert opinion and each patient context.
IV. DISCUSSION
ICD-10 coding is not a simple word matching process. Qualified human ICD coders never do a simple diagnosis word search or browse for the diagnosis term in a list of ICD codes and labels. Unfortunately, research on semi-automated and automated ICD coding systems in the past [6-9] never recognized this important concept. This explains why there has been no truly workable automated ICD coding system until now.
The ICD index and tabular list of diseases were created in 1992, and the diagnosis words in ICD do not include every synonym, alternative name, or some specific diagnoses used in highly specialized medical services. On the other hand, the ICD classification added some patient context into the classification scheme, so coding one disease name may produce different ICD codes if the patient context changes. For example, the ICD code for the diagnosis "internal hemorrhoids" would be O22.4 when the patient is a pregnant woman, but I84.2 for an adult male patient. These facts make ICD coding a complex job that needs human coders. A clinical coder must know how to change some diagnosis words when the first-round search cannot find the code. She must have the patient record at hand all the time she is coding, to check any patient context that may affect the choice of the correct ICD code.
Our semi-automated ICD coding system was not developed to replace all of the clinical coders' work on ICD coding. But if the system can find initial ICD codes for some of the diagnosis words summarized by the medical doctor, the coders' work will be reduced to some extent. Our system used the ICD ontology created from the ICD-10-TM alphabetical index and tabular list of
diseases as knowledge bases to search for the correct ICD code for each diagnosis word plus patient context. Automated coding based on this knowledge could code 47.67% of all diagnoses with good accuracy (90.9%).
The recall ability of the old system was low because, in real-world medical records, there are many varieties of words that doctors may use for a diagnosis. Some words are new words which appeared after the creation of ICD-10; for example, "dyslipidemia, chronic kidney disease, diabetes mellitus type 2" are more commonly used by doctors today than the old words "hyperlipidemia, chronic renal failure, non-insulin dependent diabetes mellitus" found in ICD-10.
Adding an experience base created from real-world cases to the system could increase its recall ability. The ICD experience base ontology contains diagnosis words from real medical records with ICD codes assigned to these new words. The system therefore searches the experience base if the first-round search of the knowledge base yields no ICD code. The recall ability of the system increased from 0.477 to 0.677, with good precision (0.928).
Different expert opinions on the same diagnosis were anticipated in the experience base. In fact, a consensus of expert opinion was rarely found in the ICD coding experience base. The varieties of expert opinions on the coding of some diagnosis words are shown in Table II. The system chooses the code with the highest frequency to be used as the "correct" code. This strategy should work well unless there are too few opinions for some rare diagnosis words.
TABLE II. EXPERT OPINIONS ON SOME DIAGNOSIS WORDS IN THE ICD EXPERIENCE BASE ONTOLOGY

Diagnosis word           | ICD codes from expert opinion | Highest frequency code
Dyslipidemia             | E78.5, E78.6, E78.9           | E78.5 (64.5%)
Chronic kidney disease   | N18.0, N18.9, N19             | N18.9 (35.5%)
Triple vessels disease   | I21.4, I25.1, I25.9, N18.9    | I25.1 (80%)
Diabetes mellitus type 2 | E11.9, E11                    | E11.9 (95.8%)
Although the ICD experience base ontology at this stage contains only 4,880 cases, this experiment encourages the use of an experience ontology to increase the recall ability of the semi-automated ICD coding system. In future research, we plan to add more cases to the experience base and to test the ability of the system with more test data.
V. CONCLUSION
An ICD experience base ontology could be created using ICD codes from medical records coded by expert coders. This experience base ontology was implemented in the semi-automated ICD coding system. Searching the experience base was very useful when the first-round search of the knowledge base yielded no result. The recall ability of the system could be increased by adding experience base searching to its algorithm, while good precision ability was still preserved.
ACKNOWLEDGMENT
This research was supported by the Thai National Health Security Office, the Thai Health Standard Coding Center (THCC), Ministry of Public Health, Thailand, and the Thai Collaborating Center for WHO-Family of International Classification.
REFERENCES
[1] Bureau of Policy and Strategy, Ministry of Public Health, International Statistical Classification of Disease and Related Health Problems, 10th Revision, Thai Modification (ICD-10-TM). Nonthaburi, Thailand: The Ministry of Public Health, 2009.
[2] The World Health Organization, International Statistical Classification of Diseases and Related Health Problems, 10th Revision. Geneva, Switzerland: The World Health Organization, 1992.
[3] The World Health Organization, International Statistical Classification of Diseases and Related Health Problems, 10th Revision, 2nd Edition. Geneva, Switzerland: The World Health Organization, 2004.
[4] The World Health Organization. ICD-10 online [internet]. Geneva, Switzerland: The World Health Organization; 2011 [cited 2011 Jun 30]. Available from http://www.who.int/classifications/icd/en/.
[5] Bureau of Policy and Strategy, Ministry of Public Health, Thailand. International Statistical Classification of Disease and Related Health Problems, 10th Revision, Thai Modification (ICD-10-TM). Nonthaburi, Thailand: The Ministry of Public Health, Thailand: 2000.
[6] C. Lovis, R. Buad, A.M. Rassinoux, P.A. Michel and J.R. Scherrer, “Building medical dictionaries for patient encoding systems: A methodology,” in: Artificial Intelligence in Medicine. Heidelberg: Springer, 1997, pp. 373–380.
[7] G. Heja and G. Surjan, “Semi-automatic classification of clinical diagnoses with hybrid approach,” in: Proceedings of the 15th symposium on computer based medical system - CBMS 2002. IEEE Computer Society Press; 2002,pp. 347–352.
[8] S.V.S. Pakhomov, J.D. Buntrock and C.G. Chute. “Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques,” J Am Med Inform Assoc, 2006, 13 pp.516 –525.
[9] S. Periera, A. Neveol , P. Masari and M. Joubert, “Construction of a semi-automated ICD-10 coding help system to optimize medical and economic coding” in A. Hasman et al, editors. Ubiquity: Technologies for Better Health in Aging Societies, VA: IOS Press, 2006 pp.845-850.
[10] The World Health Organization. International Statistical Classification of Diseases and Related Health Problems, 10th Revision, 2nd Edition, Volume 2. Geneva, Switzerland: The World Health Organization; 2004. p.32.
[11] S. Nitsuwat and W. Paoin, “Development of ICD-10-TM ontology for semi-automated morbidity coding system in Thailand” Methods of Information in Medicine, in press.
[12] Semi-automated ICD-10-TM coding system [internet]. Nonthaburi, Thailand: The Thai Health Coding Center, Ministry of Public Health, Thailand; [cited 2011 Aug 12]. Available from : http://www.thcc.or.th/formbasic/regis.php.
[13] RDF Notation 3 [internet]: The World Wide Web Consortium; [cited 2011 Jun 12]. Available from: http://www.w3.org/DesignIssues/Notation3.
[14] Apache Lucene [internet]: The Apache Software Foundation; [cited 2012 Jan 24]. Available from http://lucene.apache.org/java/docs/index.html.
Collocation-Based Term Prediction for Academic Writing

Narisara Nakmaetee*, Maleerat Sodanil*
*Faculty of Information Technology
King Mongkut's University of Technology North Bangkok
Bangkok, Thailand
[email protected], [email protected]

Choochart Haruechaiyasak†
†Speech and Audio Technology Laboratory (SPT)
National Electronics and Computer Technology Center
Pathumthani, Thailand
Abstract—A research paper is a kind of academic writing, which is formal writing. Academic writing should not contain any mistakes; otherwise, they make the authors look unprofessional. In general, academic writing is a difficult task, especially for non-native speakers. Appropriate vocabulary selection and perfect grammar are two of the many important factors that make writing appear formal. In this paper, we propose and compare various collocation-based feature sets for training classification models to predict verbs and verb tense patterns for academic writing. The proposed feature sets include n-grams of both Part-of-Speech (POS) tags and collocated terms preceding and following the predicted term. From the experimental results, using the combination of Part-of-Speech (POS) and selected terms yielded the best accuracy of 50.21% for term prediction and 73.64% for verb tense prediction.
Keywords: Academic writing; collocation; n-gram; Part-of-Speech (POS)
I. INTRODUCTION
In a broad definition, academic writing is any writing done to fulfill a requirement of a college or university [16]. There are several academic document types, such as book reports, essays, dissertations and research papers. Academic writing is different from general writing because it is formal writing. Many factors contribute to the formality of a text; major influences include vocabulary selection, perfect grammar and writing structure. For researchers, academic writing is an important channel to publish their new knowledge, ideas or arguments. Mistakes should not occur in academic writing, because they make the researcher look unprofessional. Moreover, errors in academic writing may result in the rejection of a research paper. Thus, academic writing is a difficult task, especially for non-native speakers.
At present, there are many software packages that help researchers write research papers. The software can be classified into two groups: academic writing software and grammar checker software. Academic writing software provides academic writing style templates such as APA, MLA and Chicago, as well as page layout control and reference and citation features. Grammar checker software provides grammar checking, spelling, and dictionary-based grammar suggestion features. Some packages also provide general writing templates such as e-mail and business letter templates. From our review, we found that academic writing software cannot suggest suitable vocabulary for academic writing because it suggests vocabulary based on synonyms. Synonymous words may be formal or informal, whereas academic writing should use only formal words.
In this paper, we focus on two factors that impact formal writing: vocabulary selection and perfect grammar. For vocabulary selection, there are two associated problems. The first problem is appropriate word selection. Non-native speakers often have difficulty selecting appropriate vocabulary for academic writing because they tend to look up a word in a dictionary and use it without considering the word sense. They probably do not know the exact meaning of the word. Moreover, they often tend to use very basic vocabulary instead of a more sophisticated alternative. For example, consider the following two sentences.
(1) We talk about the main advantages of our methodology.
(2) We discuss the main advantages of our methodology.
Even though sentences (1) and (2) have the same meaning, they use different verbs, talk about and discuss. Sentence (2) is more formal than sentence (1). The second problem is collocation. Collocation errors are a common and persistent error type among non-native speakers. Due to collocation errors, a piece of writing may lack significant knowledge, which might cause a loss of precision. For example, consider the following two sentences.
(3) Numerous NLP applications rely search engine queries.
(4) Numerous NLP applications rely on search engine queries.
Sentence (3) contains a common error often made by non-native speakers. Regarding perfect grammar, non-native speakers frequently make grammatical errors such as fragments and wrong verb tense usage. In this paper, we focus on two specific tasks: verb prediction and verb tense pattern prediction. Verb prediction is for suggesting a verb in a sentence which is suitable
for a given context. Verb tense pattern prediction is for suggesting the correct tense for a given verb.
The remainder of this paper is organized as follows. In the next section, we review some related work in academic writing. Section III gives the details of our proposed approach. The experiments and discussion are given in Section IV. We conclude the paper and give directions for future work in Section V.
II. RELATED WORKS
There is some research related to academic writing, which can be classified into two groups: phrasal expression extraction [3][10] and word suggestion [4][13]. The phrasal expression extraction approach is based on statistical and rule-based algorithms for suggesting useful phrasal expressions. The word suggestion approach adopts probabilistic models or machine learning to discover word associations and to build a model for word suggestion.
A collocation is a group of two or more words that usually go together. Collocations are useful for helping English learners improve their fluency. Moreover, we can predict the meaning of an expression from the meaning of its parts [7]. Consequently, collocation information is useful for natural language processing. Collocations include noun phrases, phrasal verbs, and other stock phrases [7]. However, in our study we focus on phrasal verbs. There are many works related to collocation, which can be grouped into four areas: lexical ambiguity resolution [1][8][14], machine translation [5][6][11], collocation extraction [9][12], and collocation suggestion [4][13][15].
For collocation extraction and suggestion, Ward Church and Hanks [12] proposed techniques that use mutual information to measure the association between words. Pearce [9] described a collocation extraction technique using WordNet; the technique relies on a synonym mapping for each of the word senses. Futagi [2] discussed how dealing with "non-pertinent" factors in the development of an automated tool to detect miscollocations in learner texts significantly reduces possible tool errors; their work focused on the factors that affect the design of a collocation detection tool. Zaiu Inkpen and Hirst [15] presented an unsupervised method to acquire knowledge about the collocational behavior of near-synonyms. They used mutual information, Dice, chi-square, log-likelihood, and Fisher's exact test to measure the degree of association between two words. Li-E Liu, Wible, and Tsao [4] proposed a probabilistic collocation suggestion model which incorporates three features: word association strength, semantic similarity and the notion of shared collocations. Wu, Chang, Mitamura, and S. Chang [13] introduced a machine learning model based on classification results to provide verb-noun collocation suggestions. They extracted collocations whose components have a syntactic relationship with one another. In this paper, we construct feature sets based on collocations: POS tags and collocated terms.
III. OUR PROPOSED APPROACH
In this section, we describe the details of the different feature set approaches for verb and verb tense pattern prediction. Both approaches are based on collocations: Part-of-Speech (POS) tags and collocated terms. Fig. 1 illustrates the process of preparing the feature sets for training the prediction models. Firstly, we collect a large number of research papers from the ACL Anthology website to develop our corpus. Secondly, we convert the PDF papers into text files and extract the abstracts from the text files. Next, we extract sentences from the documents. Then, we tokenize the input sentences into tokens and tag the tokens in each sentence with POS tags. Given a sentence from our corpus in (5), the process of POS tagging yields the result in sentence (6). The POS tag set is based on the Penn Treebank II guideline [18].
Figure 1. The process of feature set extraction
(5) More specifically this paper focuses on the robust extraction of Named Entities from speech input where a temporal mismatch between training and test corpora occurs.
(6) More/RBR specifically/RB this/DT paper/NN focuses/VBZ on/IN the/DT robust/JJ extraction/NN of/IN Named/VBN Entities/NNS from/IN speech/NN input/NN where/WRB a/DT temporal/JJ mismatch/NN between/IN training/NN and/CC test/NN corpora/NN occurs/NNS ./.
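The tokenization and tagging step can be sketched in a few lines of Python, for example with the NLTK toolkit, which also uses the Penn Treebank tag set. The paper does not state which tagger was used, so this is only an illustration; the punkt and tagger models must be downloaded once via nltk.download.

import nltk

sentence = ("More specifically this paper focuses on the robust extraction "
            "of Named Entities from speech input")
tokens = nltk.word_tokenize(sentence)  # split the sentence into tokens
tagged = nltk.pos_tag(tokens)          # Penn Treebank POS tags
print(" ".join(f"{w}/{t}" for w, t in tagged))
# e.g. More/RBR specifically/RB this/DT paper/NN focuses/VBZ on/IN ...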
Next, we identify the verb and the verb tense pattern in each sentence. Table I and Table II give some examples of verbs and verb tense patterns.
TABLE I. EXAMPLE SENTENCES FOR VERB PREDICTION

Example sentence -> Verb tag
- We present the technique of Virtual Annotation as a specialization of Predictive Annotation for answering definitional what is questions. -> present
- This paper proposes a practical approach employing n-gram models and error correction rules for Thai key prediction and Thai English language identification. -> propose
- This paper investigates the use of linguistic knowledge in passage retrieval as part of an open-domain question answering system. -> investigate
- In this paper, we demonstrate a discriminative approach to training simple word alignment models that are comparable in accuracy to the more complex generative models normally used. -> demonstrate
- We evaluate the results through measuring the overlap of our clusters with clusters compiled manually by experts. -> evaluate
TABLE II. EXAMPLE POS TAGGED SENTENCES FOR VERB TENSE PATTERN PREDICTION

Example POS tagged sentence -> Verb tense pattern tag
- This/DT demonstration/NN will/MD motivate/VB some/DT of/IN the/DT significant/JJ properties/NNS of/IN the/DT Galaxy/NNP Communicator/NNP Software/NNP Infrastructure/NNP and/CC show/VB how/WRB they/PRP support/VBP the/DT goals/NNS of/IN the/DT DARPA/NNP Communicator/NNP program/NN ./. -> /MD /VB
- First/RB we/PRP describe/VBP the/DT CU/NNP Communicator/NNP system/NN that/WDT integrates/VBZ speech/NN recognition/NNS synthesis/NN and/CC natural/JJ language/NN understanding/NN technologies/NNS using/VBG the/DT DARPA/NNP Hub/NNP Architecture/NNP ./. -> /VBP
- BravoBrava/NNP is/VBZ expanding/VBG the/DT repertoire/NN of/IN commercial/JJ user/NN interfaces/NNS by/IN incorporating/VBG multimodal/JJ techniques/NNS combining/VBG traditional/JJ point/NN and/CC click/NN interfaces/NNS with/IN speech/NN recognition/JJ speech/NN synthesis/NN and/CC gesture/NN recognition/NN ./. -> is /VBG
- We/PRP have/VBP aligned/VBN Japanese/JJ and/CC English/JJ news/NN articles/NNS and/CC sentences/NNS to/TO make/VB a/DT large/JJ parallel/NN corpus/NN ./. -> have /VBN
- Recently/RB confusion/NN network/NN decoding/NN has/VBZ been/VBN applied/VBN in/IN machine/NN translation/NN system/NN combination/NN ./. -> has been /VBN
TABLE III. FEATURE LABELS FOR AN EXAMPLE POS-TAGGED SENTENCE
(Example sentence: "In/IN contrast/NN to/TO previous/JJ work/NN we/PRP particularly/RB focus/VBP exclusively/RB on/IN clustering/VBG polysemic/JJ verbs/NNS"; the predicted verb is focus/VBP.)

Token            | N-gram term and POS features | Selected term features
work/NN          | pre3-gram, prePOS3-gram      | (none)
we/PRP           | pre2-gram, prePOS2-gram      | preNoun/prePronoun
particularly/RB  | pre1-gram, prePOS1-gram      | preAdv
exclusively/RB   | post1-gram, postPOS1-gram    | postAdv
on/IN            | post2-gram, postPOS2-gram    | postPrepo
clustering/VBG   | post3-gram, postPOS3-gram    | (none)
TABLE IV. FEATURE SETS FOR VERB AND VERB TENSE PATTERN PREDICTION

Verb feature sets:
- Term-only: 1-gram; 2-gram; 3-gram
- Term&POS: 1-gram; 2-gram; 3-gram
- Selected Term-only: 3-gram + preNoun/prePronoun + postNoun; 3-gram + preNoun/prePronoun + postNoun + preAdv + postAdv; 3-gram + preNoun/prePronoun + postNoun + postPrepo; 3-gram + preNoun/prePronoun + postNoun + preAdv + postAdv + postPrepo
- Selected Term&POS: 3-gram + preNoun/prePronoun + postNoun; 3-gram + preNoun/prePronoun + postNoun + preAdv + postAdv; 3-gram + preNoun/prePronoun + postNoun + postPrepo; 3-gram + preNoun/prePronoun + postNoun + preAdv + postAdv + postPrepo

Verb tense pattern feature sets:
- Term-only: 1-gram; 2-gram; 3-gram
- POS-only: 1-gram; 2-gram; 3-gram
- Term&POS: 1-gram; 2-gram; 3-gram
- Selected Term&POS: 1-gram + preNoun/prePronoun + postNoun; 2-gram + preNoun/prePronoun + postNoun; 3-gram + preNoun/prePronoun + postNoun
Then, we observe the POS tagged sentences and assign the feature labels as shown in Table III. From the observation of POS tagged sentences, we find that a noun usually occurs in the positions preceding and following a verb. Therefore, based on linguistic knowledge and this observation, we select nouns as part of our feature set.
In Table III, "pre3-gram" and "prePOS3-gram" denote the term and the Part-of-Speech (POS) in the third position preceding the verb. Features "pre2-gram" and "prePOS2-gram" denote the term and the POS in the second position preceding the verb. Features "pre1-gram" and "prePOS1-gram" denote the term and the POS in the position immediately preceding the verb. "post1-gram" and "postPOS1-gram" indicate the term and the POS in the position immediately following the verb. Features "post2-gram", "postPOS2-gram", "post3-gram" and "postPOS3-gram" indicate the terms and the POSs in the second and third positions following the verb. Feature "preNoun/prePronoun" denotes a noun or pronoun that occurs in a position preceding the verb. Feature "preAdv" denotes an adverb in a position preceding the verb. Features "postAdv" and "postPrepo" indicate an adverb and a preposition that occur in positions following the verb.
The final feature labels based on the selected terms and POS tags are shown in Table IV. "Term-only" indicates feature sets that include the terms occurring in the positions preceding and following a verb. There are three Term-only feature sets: "1-gram Term-only" uses the term in the position immediately preceding and following the verb; "2-gram Term-only" uses the two terms in the first and second preceding and following positions; and "3-gram Term-only" uses the three terms in the first, second, and third preceding and following positions. "POS-only" represents feature sets consisting of the POS tags in the same positions; there are likewise three POS-only feature sets ("1-gram", "2-gram", and "3-gram" POS-only), covering one, two, or three POS tags preceding and following the verb. "Term&POS" indicates feature sets consisting of both the terms and the POS tags around the verb; its three variants ("1-gram", "2-gram", and "3-gram" Term&POS) cover one, two, or three terms together with their POS tags. "Selected Term-only" represents feature sets consisting of the 3-gram terms plus selected context words: the nouns or pronoun, the adverbs, or a preposition occurring around the verb. There are four such feature sets: (i) the 3-gram terms plus the nouns or pronoun; (ii) the 3-gram terms plus the nouns or pronoun and the adverbs; (iii) the 3-gram terms plus the nouns or pronoun and a preposition; and (iv) the 3-gram terms plus the nouns or pronoun, the adverbs, and a preposition.
In Table IV, there are two Selected Term&POS feature sets: the "Selected Term&POS verb feature set" and the "Selected Term&POS verb tense pattern feature set". The Selected Term&POS verb feature set has four patterns: (i) the 3-gram terms and 3-gram POSs plus the nouns or pronoun around the verb; (ii) the same plus the adverbs; (iii) the 3-gram terms and POSs plus the nouns or pronoun and a preposition; and (iv) the 3-gram terms and POSs plus the nouns or pronoun, the adverbs, and a preposition. The Selected Term&POS verb tense pattern feature set has three patterns, consisting of the 1-gram, 2-gram, or 3-gram terms and POSs plus the nouns or pronoun occurring around the verb.
IV. EXPERIMENTS AND DISCUSSION

From the research paper archive, ACL Anthology [17], we collected 3,637 abstracts from ACL and HLT conferences from 2000 to 2011, and extracted 15,151 sentences from them. To evaluate the performance of all feature set approaches, we use the Naive Bayes classification algorithm with 10-fold cross validation on the data set.
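A minimal sketch of this evaluation setup, assuming scikit-learn (the paper does not name its toolkit): each sentence's extracted feature labels are bundled into one string, vectorized as a bag of features, and scored with Naive Bayes under 10-fold cross validation. The function name and the feature-string format are hypothetical.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB

    def evaluate_feature_set(feature_strings, labels):
        """Mean 10-fold cross-validation accuracy for one feature set."""
        # each string bundles one sentence's feature labels, e.g.
        # "pre1=particularly prePOS1=RB post1=exclusively postPOS1=RB"
        X = CountVectorizer(token_pattern=r"\S+").fit_transform(feature_strings)
        return cross_val_score(MultinomialNB(), X, labels, cv=10).mean()

The same routine can be run once per feature set (Term-only, Term&POS, and so on) with the labels being either the target verbs or the tense patterns.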
For verb prediction, we selected the top 10 ranked verbs found in the corpus: be, describe, present, demonstrate, propose, achieve, use, evaluate, investigate, and compare. From the corpus, we selected the 3,149 sentences containing these top-10 verbs for evaluating the verb prediction feature sets. There are 14 feature set approaches, which can be classified into four groups: Term-only, Term&POS, Selected Term-only, and Selected Term&POS. Table V presents the performance evaluation of the verb prediction feature sets based on accuracy. It can be observed that performance improves as the n-gram length increases. Using POS alone does not improve verb prediction, because a POS tag is only the linguistic category of a word in the sentence. However, we found that combining POS with the selected noun terms performs better than the selected noun terms alone. Moreover, using adverbs does not help, whereas prepositions improve performance; the reason is that some prepositions usually collocate with a verb, as in "rely on". In summary, the best feature set uses the 3-gram terms and POSs with selected noun, pronoun, and preposition terms, and its accuracy is approximately 50%.
For verb tense pattern prediction, we used the full corpus of 15,151 sentences. As with verb prediction, the verb tense pattern feature sets can be classified into four groups: Term-only, POS-only, Term&POS, and Selected Term&POS. Table VI presents the performance evaluation based on accuracy. It can be observed that the performance of POS-only is quite low; when we combined selected terms with POS, the performance increased.
TABLE V. EVALUATION RESULTS FOR FEATURE SETS OF VERB PREDICTION

Approach | Accuracy (%)
Term-only
  1-gram | 42.89
  2-gram | 48.10
  3-gram | 49.03
Term&POS
  1-gram | 42.27
  2-gram | 47.31
  3-gram | 48.14
Selected Term-only
  3-gram, preNoun/pronoun, postNoun | 49.16
  3-gram, preNoun/pronoun, postNoun, preAdv, postAdv | 48.99
  3-gram, preNoun/pronoun, postNoun, postPrepo | 50.05
  3-gram, preNoun/pronoun, postNoun, preAdv, postAdv, postPrepo | 49.32
Selected Term&POS
  3-gram, preNoun/pronoun, postNoun | 49.29
  3-gram, preNoun/pronoun, postNoun, preAdv, postAdv | 49.16
  3-gram, preNoun/pronoun, postNoun, postPrepo | 50.21
  3-gram, preNoun/pronoun, postNoun, preAdv, postAdv, postPrepo | 49.95
TABLE VI. EVALUATION RESULTS FOR FEATURE SETS OF VERB TENSE PATTERN PREDICTION

Approach | Accuracy (%)
Term-only
  1-gram | 68.71
  2-gram | 70.27
  3-gram | 69.99
POS-only
  1-gram | 67.52
  2-gram | 65.58
  3-gram | 65.14
Term&POS
  1-gram | 72.72
  2-gram | 70.96
  3-gram | 70.49
Selected Term&POS
  1-gram Term&POS, preNoun/pronoun, postNoun | 73.64
  2-gram Term&POS, preNoun/pronoun, postNoun | 72.44
  3-gram Term&POS, preNoun/pronoun, postNoun | 71.92
However, the feature set combining POS with selected noun terms performs better than the other feature sets. The best feature set is the 1-gram Term&POS with selected preNoun/pronoun and postNoun terms. The reason is that nouns and pronouns provide a very good clue for predicting the verb tense, since they act as subjects in the sentence.
V. CONCLUSION AND FUTURE WORK

We performed a comparative study of various feature sets for predicting the verb and the verb tense pattern in sentences. Four groups of feature sets based on Part-of-Speech (POS) tags and selected terms, such as nouns and pronouns, were evaluated in the experiments. We ran the experiments using the abstract corpus as the data set and Naive Bayes as the classification algorithm. From the experimental results, verb prediction using 3-gram terms and POSs with selected noun, pronoun, and preposition terms yielded the best result of 50.21% accuracy. For verb tense pattern prediction, the highest accuracy of 73.64% was obtained using 1-gram terms and POSs with selected noun and pronoun terms.

For our future work, we will improve the performance of verb prediction by using WordNet, a large lexical database. Using WordNet will help find synonyms of a word with the appropriate word sense. Moreover, instead of a multi-class classification model, we will adopt a one-against-all classification model to improve the verb prediction results.
REFERENCES

[1] D. Biber, "Co-occurrence patterns among collocations: a tool
for corpus-based lexical knowledge acquisition,” Comput. Linguist. 19, pp. 531-538, 1993.
[2] Y. Futagi, “The effects of learner errors on the development of a collocation detection tool,” Proc. of the fourth workshop on Analytics for noisy unstructured text data, pp. 27-33, 2010.
[3] S. Kozawa, Y. Sakai, et al., “Automatic Extraction of Phrasal Expression for Supporting English Academic Writing”, Proc. of the 2nd KES International Symposium IDT 2010, pp.485-493, 2010.
[4] A. Li-E Liu, D. Wible, and N. Tsao, “Automated Suggestions for Miscollocations,” Proc. of the NAACL HLT Workshop on Innovative Use of NLP for Building Educational Applications, pp. 47-50, 2009.
[5] Z. Liu et al., “Improving Statistical Machine Translation with monolingual collocation,” Proc. of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 825-833, 2010.
[6] Y. Lu and M. Zhou, “Collocation translation acquisition using monolingual corpora,” Proc. of the 42nd Annual Meeting on Association for Computational Linguistics, 2004.
[7] C. Manning and H. Schütze, “Foundations of Statistical Natural Language Processing,” MIT Press. Cambridge, MA: May 1999.
[8] D. Martinez and E. Agirre, “One sense per collocation and genre/topic variations,” Proc. of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, pp. 207-215, 2000.
[9] D. Pearce, “Synonymy in Collocation Extraction,” Proc. of the NAACL 2001 Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, 2001.
[10] Y. Sakai, K. Sugiki, et al., “Acquisition of useful expressions from English research papers,” Natural Language Processing, SNLP '09, pp. 59-62, 2009.
[11] F. Smadja et al., “Translating collocations for bilingual lexicons: a statistical approach,” Comput. Linguist. Volume 22, pp.1-38, 1996.
[12] K. Ward Church and P. Hanks, "Word Association Norms, Mutual Information, and Lexicography," Proc. of the 27th Annual Meeting of the Association for Computational Linguistics, pp. 76-83, 1989.
[13] J. Wu et al., "Automatic Collocation Suggestion in Academic Writing," Proc. of the ACL 2010 Conference Short Papers, pp. 116-119, 2010.
[14] D. Yarowsky, “One sense per collocation,” Proc. of the workshop on Human Language Technology, pp. 266-271, 1993.
[15] D. Zaiu Inkpen and G. Hirst, “Acquiring collocations for lexical choice between near-synonyms,” Proc. of the ACL-02 workshop on Unsupervised lexical acquisition, pp. 67-76, 2002.
[16] “Academic writing definition”, available at: http://reference.yourdictionary.com/word-definitions/definition-of-academic-writing.html
[17] “ACL Anthology”, available at: http://aclweb.org/anthology-new/
[18] “Penn Treebank II Tags”, available at: http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
Thai Poetry in Machine Translation An Analysis of Poetry Translation using Statistical Machine Translation
Sajjaporn Waijanya Faculty of Information Technology
King Mongkut’s University of Technology North Bangkok
Bangkok, Thailand [email protected]
Anirach Mingkhwan Faculty of Industrial and Technology Management
King Mongkut’s University of Technology North Bangkok
Prachinburi, Thailand [email protected]
Abstract—Translating poetry from its original language into another is very different from general machine translation, because a poem is written with prosody. Thai poetry is composed of sets of syllables, and the rhymes tying together its stanzas, lines, and paragraph text may not form complete syntax. This research focuses on the Google and Bing machine translators and on tuning the prosody of their output with respect to syllable and rhyme. We compared the error percentages of the standard translators against the same translators with tuning. The rhyme error rate of both translators before tuning was 97%; after tuning, it decreased to about 60%. To evaluate the meaning of the results from both kinds of translators, we use the BLEU (Bilingual Evaluation Understudy) metric to compare reference and candidate translations. The BLEU score of Google is 0.287 and of Bing is 0.215. We conclude that current machine translators cannot provide good results for translating Thai poetry. This research should be the starting point for a new kind of innovative machine translator for Thai poetry, and a way to present this Thai language art to the global public.
Keywords-Thai poetry translation; translation evaluation; Poem machine translator
I. INTRODUCTION
Poetry is one of the fine arts of every country. The French poet Paul Valéry defined poetry as "a language within a language" [1]. Poetry can tell a story, communicate through sound and sight, and simply express feelings. Translating poetry from its original language into other languages is a way to propagate a country's culture to the rest of the world.

Machine translation of poetry is a challenge for researchers and developers [2]. According to Robert Frost's definition, "Poetry is what gets lost in translation". Considering this statement, it is very difficult to translate poetry from the original language into other languages while keeping the original prosody, because each poetry type has its own specific syntax (prosody): types differ in line length (number of syllables), rhyme, meter, and pattern. Many research efforts have tried to develop poetry machine translators for Chinese, Italian, Japanese (Haiku), and Spanish poetry into English, and to translate back from English into the original language (for example, the poetry of William Shakespeare). These efforts developed poetry machine translation based on statistical machine translation techniques.

As for Thai poetry and Thai poets, Phra Sunthorn Vohara, known as Sunthorn Phu (26 June 1786–1855), is Thailand's best-known royal poet [3]. In 1986, the 200th anniversary of his birth, Sunthorn Phu was honored by UNESCO as a great world poet. His Phra Aphai Mani poems describe a fantastical world where people of all races and religions live and interact together in harmony. In the machine translation area, however, we have found no research on Thai poetry machine translation. Thai poetry has five major types: Klong, Chann, Khapp, Klonn, and Raai.

In this paper we use the Thai prosody "Klon-Pad (Klon Suphap)" for translation into English. Klon-Pad has rules for syllables, lines (Wak), Baat, and Bot, and relational rules for the syllables in each Wak [4]. These relations create the beauty of the creative writing and differ across prosodies. Thai poetry has a complex structure of rhyme and syllable, and each line (Wak) is not a complete sentence. Furthermore, some Thai words can have several meanings when translated into English. These are the reasons why it is difficult to develop a Thai poetry machine translator. Our study translates two Bot of Klon-Pad Thai poetry with two statistical machine translators, Google Translator [5] and Bing Translator [6]. We then tune the prosody using a dictionary and compare the resulting English poetry against the Thai prosody in Section 3. We use a case study from "Sakura, TaJ Mahal" [7] by Professor Srisurang Poolthupya as a reference for evaluation with the BLEU (Bilingual Evaluation Understudy) metric in Section 4. Section 5 concludes the paper and points out possible further work in this direction.
II. RELATED WORKS
Although we cannot find any research related to machine translation of Thai poetry into English, there are several research papers related to poetry machine translation from Chinese, Italian, and French into English.
A. A Study of Computer Aided Poem Translation Appreciation [8]
This paper collects three English versions of "Yellow Crane Tower", a poem of the Tang dynasty, applies the available computational linguistic techniques for a quantitative analysis, and uses BLEU metrics for automatic machine translation evaluation. The conclusion is that currently available computational linguistic technology is not capable of semantic analysis, which is, without a doubt, a severe drawback for poetry translation evaluation.
B. “Poetic” Statistical Machine Translation: Rhyme and Meter[9]
This is a paper from the Google MT (Machine Translation) lab. The authors implement the ability to produce translations with meter and rhyme in phrase-based MT. They train a baseline phrase-based French-English system using WMT-09 corpora for training and evaluation, and use a proprietary pronunciation module to provide phonetic representations of English words. The evaluation uses the BLEU score. The baseline BLEU score of this approach is 10.27, which is quite low, and the system also has a performance problem: it is still slow.
C. Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation[10]
This paper applies unsupervised learning to reveal word-stress patterns in a corpus of raw poetry and uses these word-stress patterns, in addition to rhyme and discourse models, to generate English love poetry. Finally, the authors translate Italian poetry into English, choosing target realizations that conform to desired rhythmic patterns. For poetry generation, an FST (finite-state transducer) is used; however, this technology has various problems when the results have to be evaluated by humans. For poetry translation, they use phrase-based translation with meter. The advantage of poetry translation over generation is that the source text provides a coherent sequence of propositions and images, allowing the machine to focus on "how to say" instead of "what to say."
III. OUR PROPOSED APPROACH
A. Methodology

1) Machine Translation
MT (machine translation) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one natural language to another. MT has two major types: rule-based machine translation and statistical machine translation.

a) Rule-based machine translation: relies on countless built-in linguistic rules and millions of bilingual dictionary entries for each language pair. Rule-based machine translation includes the transfer-based, interlingual, and dictionary-based machine translation paradigms. A typical English sentence consists of two major parts: a noun phrase (NP) and a verb phrase (VP).

b) Statistical machine translation: based on bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation.

Both the Google and Bing translators are statistical machine translators. We use the Google and Bing translator APIs to translate Thai poetry.
2) English Syllable Rules and Phonetics
Syllables are very important in the prosody of Thai poetry: each Wak has a rule for the number of syllables, and the relations between Wak and Bot are checked through the sounds of syllables. Every syllable consists of a nucleus and an optional coda. The nucleus is the part of the syllable used in poetic rhyme, and the part that is lengthened or stressed when a person elongates or stresses a word in speech.

The simplest model of syllable structure [11] divides each syllable into an optional onset, an obligatory nucleus, and an optional coda. Figure 1 shows the structure of a syllable.

Figure 1. Structure of a syllable

Normally we can check the relation of rhymes by checking the relation of sounds within the syllable. This is the domain of phonetics, which tells us how a word is pronounced.
B. An Algorithm and Case Study
1) System Flowchart
To study Thai poetry in machine translation, we use Thai poetry Klon-Pad of 2 Bot (8 lines) as input to this process. Figure 2 shows the system flowchart.
[Figure 2. System flowchart of Thai poetry in machine translation: Thai poetry is fed to the Thai Poetry Translator, which consists of three modules (1. Language Translator, 2. Poetry Checking, 3. Poetry Prosody Tuning) and outputs the poetry in English with tuning.]
In Figure 2, we design three modules to translate Thai poetry into English.

a) Language Translator: we use the Google and Bing machine translator APIs to translate the Thai poetry into English.

b) Poetry Checking: checks the prosody of the poetry after translation into English. The result of this module is the Thai poetry in English together with its error points.

c) Poetry Prosody Tuning: after module 2 (Poetry Checking), we collect the error points and tune the poetry using a dictionary and a thesaurus. The expected result of this module is a decrease in the error percentage.

Case study: we process twenty Klon-Pad Thai poems through the three modules without a professional English reference translation, and we process one Thai poem from "Sakura, TaJ Mahal" by Professor Srisurang Poolthupya with its reference, using the results from the Google and Bing APIs as candidates to calculate the BLEU score.

The three modules are described in sub-sections 2), 3) and 4) and in Figures 3 and 4.
2) Language Translator Module
This module takes Thai poetry (Klon-Pad) in Thai as input and translates it into English using the Google and Bing machine translator APIs. Figure 3 shows the process of this module.
Figure 3. Language Translator Module
a) Case Study 1, Original Thai Poetry: the Thai poem "Deuan-chaai" from the book "Oh ja o dòk máai oie".
เปนพนธไมลมลกปลกแสนงาย ชองามเดน“เดอนฉาย”ใจถวล สเหลองบางแดงบางชางโศภน ดอกเฉดฉนสลางบานทงวน “เดอนฉาย”ใชจะแขงแสงเดอนสอง เพยงชอพองเพราะพรงสมจรงนน ยงไดงใสปยเดอนฉายยงพรายพรรณ เกนจะสรรเสกพราคาเยนยอ
b) Case Study 1, Translated by Google API: A herbaceous plant species growing easy.
A good performance ‘in film’ I Tawil. Some red, some yellow Ospin technician.
Eฉidฉin flowers and clean the whole day. ‘Month Movie’ I will be racing in the light shines. Just to name a synonym is real nice there. The fertilizer plant in projecting the profile. Keeps up the quality scale flattery.
c) Case Study 1, Translated by Bing API:
As the cultivation of plant species is very easy. Strong desire to make beautiful films a month name. Yellow and red are really sophin? Flores choetchin prominent pane all day. Last month, the race featured a Moonlight illuminates. Only the name Allied euphonic to life there. Even more sparkling variety fertilizer month projection We are too badly, the excessive praise.
3) Poetry Checking Module
This module processes the English versions of the Thai poetry produced by the Google and Bing APIs. We analyze the syntax and collect error points against the prosody of Klon-Pad Thai poetry for the twenty poems. Figure 4 shows the process of the Poetry Checking Module.
[Figure 4. Poetry Checking Module: the Thai poetry in English is checked for line length (number of syllables), rhyme (relation of phonetics), and words out of vocabulary; errors of each type are collected and the poetry is output in English with error marks.]
a) Check Line Length (Number of Syllables): Thai poetry has a prosody rule for the number of syllables per line: each line must contain 7 to 9 syllables. If a line has more than 9 or fewer than 7 syllables, a line-length error is recorded.
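The paper does not state how syllables are counted, so the following sketch uses a simple vowel-group heuristic to implement the 7-to-9-syllable check; a pronunciation dictionary would be more accurate.

    import re

    def count_syllables(word):
        """Approximate syllable count: groups of consecutive vowels."""
        groups = re.findall(r"[aeiouy]+", word.lower())
        n = len(groups)
        if word.lower().endswith("e") and n > 1:   # crude silent-e adjustment
            n -= 1
        return max(n, 1)

    def check_line_length(line, low=7, high=9):
        total = sum(count_syllables(w) for w in re.findall(r"[A-Za-z]+", line))
        return total, low <= total <= high

    line = "A herbaceous plant species growing easy."
    print(check_line_length(line))   # (11, False): matches Table I's count of 11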
From Case Study 1, translated by the Google API, we found 7 error lines, as shown in Table I below.

TABLE I. AN EXAMPLE: THAI POETRY "DEUAN-CHAAI" TRANSLATED BY GOOGLE API

Google Version | Syllable Count
A herbaceous plant species growing easy. | 11
A good performance ‘in film’ I Tawil. | 10
Some red, some yellow Ospin technician. | 10
Eฉidฉin flowers and clean the whole day. | 9 (a)
‘Month Movie’ I will be racing in the light shines. | 12
Just to name a synonym is real nice there. | 11
The fertilizer plant in projecting the profile. | 13
Keeps up the quality scale flattery. | 10

a. 9 syllables is not an error under the line-length rule.
TABLE II. AN EXAMPLE: THAI POETRY "DEUAN-CHAAI" TRANSLATED BY BING API

Bing Version | Syllable Count
As the cultivation of plant species is very easy. | 15
Strong desire to make beautiful films a month name. | 12
Yellow and red are really sophin? | 10
Flores choetchin prominent pane all day. | 10
Last month, the race featured a Moonlight illuminates | 13
Only the name Allied euphonic to life there. | 13
Even more sparkling variety fertilizer month projection | 17
We are too badly, the excessive praise. | 10
Tables I and II present the number of syllables in each Wak. With the Google API, only one Wak of this poem has the correct number of syllables. With the Bing API, not a single Wak has the correct number of syllables; all are error tagged.
b) Check Rhyme (Relation of Phonetics): Thai poetry has rules for its rhyme. For Klon-Pad, we present the rhyme rules in Figure 5.
Figure 5. Rhyme Prosody for Thai Poetry Klon-Pad (2 Bot)
Figure 5 shows Thai poetry Klon-Pad of two Bot with 14 rhyme rules, as follows:
• R1: relation of a1 and a2, or a1 and ax
• R2: relation of b1 and b2
• R3: relation of b1 and b3, or b1 and bx
• R4: relation of b2 and b3, or b2 and bx
• R5: relation of b1, b2 and b3, or b1, b2 and bx
• R6: relation of c1 and c2, or c1 and cx
• R7: relation of d1 and d2
• R8: relation of d1 and d3
• R9: relation of d1 and d4, or d1 and dx
• R10: relation of d2 and d3
• R11: relation of d2 and d4, or d2 and dx
• R12: relation of d2, d3 and d4, or d2, d3 and dx
• R13: relation of d3 and d4, or d3 and dx
• R14: relation of d1, d2, d3 and d4, or d1, d2, d3 and dx
In this process, we check the relation of the syllables according to these rules. A relation in Thai poetry means similar pronunciation, but not a duplicate (a sketch of this check follows the case-study results below):
• Example 1: "today" relates to "may"; this is correct under the rhyme rules.
• Example 2: "today" relates to "Monday"; this is an error (duplicate) under the rhyme rules.
• Example 3: "today" relates to "tonight"; this is an error (not related) under the rhyme rules.
• Case Study 1, translated by Google API: 13 of the 14 rules are violated; only rule R3 is satisfied.
• Case Study 1, translated by Bing API: 12 rules are violated; rules R1 and R3 are satisfied.
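A minimal sketch of the rhyme-relation check illustrated in the examples above, assuming NLTK's CMU Pronouncing Dictionary as the phonetic resource (the paper does not name one): two words relate when they share the sound from the last vowel onward, and are rejected as duplicates when the consonant before that vowel also matches, as in "today"/"Monday".

    import nltk
    from nltk.corpus import cmudict

    nltk.download("cmudict", quiet=True)
    PRON = cmudict.dict()

    def tail(word, with_onset=False):
        """Phonemes from the last vowel (optionally with its onset) onward."""
        phones = PRON[word.lower()][0]
        idx = max(i for i, p in enumerate(phones) if p[-1].isdigit())
        start = idx - 1 if with_onset and idx > 0 else idx
        return tuple(p.rstrip("012") for p in phones[start:])  # drop stress marks

    def rhyme_relation(w1, w2):
        if w1.lower() not in PRON or w2.lower() not in PRON:
            return "unknown word"
        if tail(w1) != tail(w2):
            return "error (not related)"
        if tail(w1, True) == tail(w2, True):
            return "error (duplicate)"
        return "correct"

    for pair in [("today", "may"), ("today", "Monday"), ("today", "tonight")]:
        print(pair, rhyme_relation(*pair))

Run on the three examples, this reports "correct", "error (duplicate)", and "error (not related)" respectively.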
c) Check Words out of Vocabulary: we used a dictionary and thesaurus to check the meanings of words. We found that MT tried to translate some words by writing them out phonetically. Those words may well have a meaning in Thai, but they are too complex to translate from Thai to English in a single step; many should first be translated from Thai to Thai before being sent to MT. Words that MT was not able to translate are called "words out of vocabulary" in this paper, and these words are error tagged.
• Case Study 1, translated by Google API: we found 3 words out of vocabulary: Tawil, Ospin and Eฉidฉin. Tawil means 'to miss someone' or 'to think of someone'; Ospin means 'beautiful'; and Eฉidฉin means 'beautiful'.
• Case Study 1, translated by Bing API: we found 2 words out of vocabulary: sophin and choetchin. Sophin means 'beautiful' and choetchin means 'beautiful'.
4) Poetry Prosody Tuning
To study basic tuning of poetry translated by MT, we applied the approach to the twenty poems. Our basic approaches are:

a) Words out of vocabulary: translate Thai to Thai before translating with MT.
b) Number-of-syllable errors: the majority of these errors are lines with more syllables than allowed. We used a dictionary and thesaurus to reduce the length of the sentences with the help of shorter words. Omitting short function words such as "a", "and", and "the" was an additional way to decrease the length.

c) Rhyme errors: we tune these errors by using a dictionary and thesaurus to change the words in rhyme positions.

C. Measurement Design
In this paper we use two major kinds of measurement.
1) Error Percentage
We process the twenty Thai poems and calculate their prosody error percentages as shown in the equations below.

Es = (Ps / Ts) × 100% (1)

In (1), Es is the syllable error percentage of a Bot, Ps is the number of syllable errors, and Ts is the total number of Wak (8 Waks) in a Bot.

We calculate the rhyme error percentage by (2):

Er = (Pr / Tr) × 100% (2)

In (2), Er is the rhyme error percentage of a Bot, Pr is the number of rhyme errors, and Tr is the total number of rhymes (14 rhyme positions) in a Bot.

We calculate the error percentage of words out of vocabulary by (3):

Ew = (Pw / Tw) × 100% (3)

In (3), Ew is the percentage of vocabulary errors per Bot, Pw is the number of wrong words, and Tw is the total number of words per Bot (at most 72 words).

Finally, we calculate the average percentage of each error type over all twenty poems; in this way we create a summary to evaluate the results.
2) BLEU Score
BLEU (Bilingual Evaluation Understudy) [12] is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human. BLEU uses a modified form of precision to compare a candidate translation against multiple reference translations. The metric modifies simple precision because machine translation systems are known to generate more words than appear in a reference text. The BLEU equation is shown in (4):

BLEU = BP × exp( Σ(n=1..N) wn log pn ) (4)

where pn is the modified n-gram precision (the exponentiated sum of logs gives the geometric mean of p1, p2, ..., pN) and BP is the brevity penalty, with c the length of the MT hypothesis (candidate) and r the length of the reference:

BP = 1 if c > r; BP = e^(1 - r/c) if c ≤ r (5)

In our baseline, we use N = 4 and uniform weights wn = 1/N.
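A minimal sketch implementing equations (4) and (5) directly, with N = 4 and uniform weights; a real evaluation would use an established implementation, but this shows the computation.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate, reference, N=4):
        c_tok, r_tok = candidate.split(), reference.split()
        log_p_sum = 0.0
        for n in range(1, N + 1):
            cand = Counter(ngrams(c_tok, n))
            ref = Counter(ngrams(r_tok, n))
            clipped = sum(min(cnt, ref[g]) for g, cnt in cand.items())
            p_n = clipped / max(sum(cand.values()), 1)
            if p_n == 0:            # any zero n-gram precision drives BLEU to 0
                return 0.0
            log_p_sum += (1.0 / N) * math.log(p_n)
        c, r = len(c_tok), len(r_tok)
        bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty, eq. (5)
        return bp * math.exp(log_p_sum)

    print(bleu("I pay my respect to you my guru",
               "I pay my respect to you my guru"))   # 1.0 for an exact match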
IV. EXPERIMENT RESULTS

In our experiments we translated twenty poems with two machine translators, the Google API and the Bing API; both are statistical machine translators. In case study 2 we use poetry from "Sakura, TaJ Mahal" by Professor Srisurang Poolthupya as the reference and translate the Thai poetry from this book with Google and Bing as candidates; this case study is evaluated with the BLEU score. Finally, we summarize the results in the following part.
A. Results of Thai Poetry in the Google and Bing Translators
In Table III, we show the percentage of errors of the three error types before tuning. Most of these errors are rhyme mistakes, because MT is not able to understand the rhyme and meter of poetry. The "with Tuning" columns show the percentage of errors after tuning for the three error types.
TABLE III. PERCENT OF LINE-LENGTH ERROR, RHYME ERROR AND WORDS OUT OF VOCABULARY BEFORE TUNING AND AFTER TUNING

Items | Google | Google with Tuning | Bing | Bing with Tuning
Total lines: 160
Line-length (number of syllable) errors | 50 | 28 | 87 | 33
Percent of syllable errors | 31% | 18% | 54% | 21%
Total rhymes: 280
Rhyme errors | 271 | 158 | 272 | 147
Percent of rhyme errors | 97% | 56% | 97% | 62%
Total words: 1440
Words out of vocabulary | 50 | 15 | 87 | 22
Percent of words out of vocabulary | 2% | 1% | 3% | 2%
B. Case Study 2: Poetry from "Sakura, TaJ Mahal" and BLEU Evaluation
The original poetry in Thai and English is shown in Table IV.
TABLE IV. THAI POETRY FROM THE BOOK "SAKURA, TAJ MAHAL"

Original Thai poetry:
ขอนอมเกศกราบครกลอนสนทรภ โปรดรบรศษยนขอนบไหว ทานโปรดชวยอานวยพรแตงกลอนใด องกฤษไทยของใหคลองตองกระบวน สอความหมายหลายหลากไมยากเยน ตรงประเดนเปรยบเทยบไดครบถวน จบใจผวจารณอานทงมวล ชวยชชวนใหผอนคลายสบายใจ

Reference (translated by the owner of the poetry):
Sunthon Phu, the great Thai poet,
I pay my respect to you, my guru.
May you grant me the flow of rhyme,
Both in Thai and in English,
That I may express my thoughts,
In a fluent and precise way,
Pleasing the audience and critics,
Inspiring peace and well-being
We use the original English poetry as the reference and compare it against the outputs of both the Google and Bing translators. The calculated BLEU scores are shown in Table V.
TABLE V. BLEU SCORE OF CANDIDATES FROM THE GOOGLE AND BING TRANSLATORS

Google candidate | BLEU
I bow my head respectfully Soonthornphu teachers. | 0.840
Please get to know us, this makes me respect. | 0.905
Please help with any poem. | 0.000
Thai English proficiently to process | 0.549
Various meanings can be very difficult. | 0.000
Relevant comparative information. | 0.000
Reading comprehension and critical mass. | 0.000
The prospectus provides a relaxed feel. | 0.000
Average BLEU score | 0.287

Bing candidate | BLEU
We also ketkrap the teacher verse harmonious Mussel | 0.840
Please recognize this request for a given by the audience. | 0.000
What a blessing you, help facilitate poem | 0.309
Fluent in English, Thai, tongkrabuan | 0.574
Describe the various not complicated | 0.000
Completely irrelevant comparisons. | 0.000
Catching someone reviews read all | 0.000
To help you relax, prospectus | 0.000
Average BLEU score | 0.215
V. CONCLUSION AND FUTURE WORK
The generated results show that these machine translators have many problems when translating poetry. MT translates poetry without prosody: it is not able to understand poetry patterns, difficult original words, or the sentences themselves. The reason lies in the operating principle of MT itself. MT uses phrase-based methods to translate from the original into another language, but Thai poetry can be written in incomplete sentences. Moreover, Thai words, and especially words in poetry, are very complex; some should be translated from Thai to Thai before they can be sent to MT. Poets use such difficult words as a matter of feeling, for the beauty of the words and of the poetry itself.

The results in this paper show that the error percentage is very high when only MT is used to translate poetry, especially the rhyme error. It is, however, possible to decrease the rhyme error rate down to about 60% by tuning the results of MT. Moreover, the errors from words out of vocabulary could be decreased to between 1% and 2% by a prior Thai-to-Thai translation.

Concerning the BLEU scores, in this paper we use only one reference for evaluation. For BLEU, having many references is better than a single reference; however, it is very difficult to find reliable references for such an evaluation, other than verified English translations from the owner of the original Thai poetry itself.

This paper is the first research dealing with machine translation of Thai poetry into English. In the future, we hope to establish rules and poetry patterns to use in combination with MT to translate Thai poetry into English while keeping the prosody. Both the prosody and the meaning of poetry are very important when translating into other languages, because they present the arts and culture of the country.
ACKNOWLEDGMENT
The poems for translation in this work were provided by The Contemporary Poet Association and Professor Srisurang Poolthupya. Thanks also go to Google and Bing, the owners of the machine translators used.
REFERENCES
[1] Poetry, How the Language Really Works: The Fundamentals of Critical Reading and Effective Writing. [online], Available: http://www.criticalreading.com/ poetry.htm
[2] Ylva Mazetti, Poetry Is What Gets Lost In Translation, [online], Available: http://www.squidproject.net/pdf/ 09_Mazetti_Poetry.pdf
[3] P.E.N. International Thailand-Centre Under the Royal Patronage of H.M. The King. Anusorn Sunthorn Phu 200 years. Amarin printing. 2529. ISBN 974-87416-1-3
[4] Tumtavitikul, Apiluck. (2001). “Thai Poetry: A Metrical Analysis. Essays in Tai Linguistics”, M.R. Kalaya Tingsabadh and Arthur S. Abramson, eds. Bangkok: Chulalongkorn University, pp.29-40.
[5] Google Code, Google Translate API v2, [online], Available: http://code.google.com/apis/language/ translate/overview.html
[6] Bing Translator, [online], Available: http://www.microsofttranslator.com/
[7] Srisurang Poolthupya, Sakura Taj Mahal, Bangkok, Thailand, 2010, pp. 1-2.
[8] Lixin Wang, Dan Yang, Junguo Zhu, "A Study of Computer Aided Poem Translation Appreciation", Second International Symposium on Knowledge Acquisition and Modeling, 2009.
[9] Dmitriy Genzel, Jakob Uszkoreit, Franz Och, ““Poetic” Statistical Machine Translation: Rhyme and Meter”, Google, Inc., Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, USA, 2010, pp. 158–166.
[10] Erica Greene, Tugba Bodrumlu, Kevin Knight “Automatic Analysis of Rhythmic poetry with Applications to Generation and Translation”, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, MIT, Massachusetts, USA, 9-11 October 2010, pp. 524–533.
[11] Syllable rule, [online], Available: http://www.phonicsontheweb.com/syllables.php
[12] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu (IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA), "BLEU: a Method for Automatic Evaluation of Machine Translation", Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318.
[13] L. Balasundararaman, S. Ishwar, S.K. Ravindranath, “Context Free Grammar for Natural Language Constructs an Implementation for Venpa Class of Tamil Poetry”, in Proceedings of Tamil Internet, India, 2003.
[14] Martin Tsan WONG and Andy Hon Wai CHUN, "Automatic Haiku Generation Using VSM", 7th WSEAS Int. Conf. on APPLIED COMPUTER & APPLIED COMPUTATIONAL SCIENCE (ACACOS '08), Hangzhou, China, April 6-8, 2008.
Keyword Recommendation for Academic Publications using Flexible N-gram

Rugpong Grachangpun¹, Maleerat Sodanil²
Faculty of Information Technology
King Mongkut's University of Technology North Bangkok
Bangkok, Thailand
[email protected], [email protected]

Choochart Haruechaiyasak
Human Language Technology (HLT) Laboratory
National Electronics and Computer Technology Center
Pathumthani, Thailand
Abstract—This paper presents a method of keyword/keyphrase recommendation for academic literature. The proposed method generates phrases of flexible length (flexible-gram keywords/keyphrases) to increase the chance of accurate and descriptive results. Several techniques are applied: part-of-speech tagging (POS tagging), term co-occurrence measured by the correlation coefficient, term frequency-inverse document frequency (TF-IDF), and weighting techniques. The results of the experiment were found to be very interesting. Moreover, comparisons against other keyword/keyphrase extraction techniques were also investigated.

Keywords-keyword recommendation; flexible N-gram; information retrieval; POS Tagging
I. INTRODUCTION
At present, in the age of information technology, a large volume of academic literature from many fields is published frequently and offered to readers via the Internet. Searching for a desired document can therefore be difficult. If there were reliable ways to generate accurate keywords and keyphrases showing the main idea or overall picture of a document, it would be easier for readers to select the particular document they need. Keywords and keyphrases (combinations of multiple words) are a technique that quickly tells readers the basic roots of a document. They not only briefly convey the main idea of a document but also help people who work with documents professionally. For example, a librarian may take a long time to group enormous numbers of documents and arrange them on a shelf or in a database; keywords/keyphrases can serve as part of a tool to classify those documents into groups.

Automatically extracting keywords/keyphrases from a document is challenging because it involves natural language, among other difficulties we will cover later. Over the last decade, several studies have proposed methods of keyword/keyphrase annotation and extraction from different written media such as web pages, political records, and so on. Several models have been used to cope with these tasks, such as fuzzy logic, neural network models, and others [1,2,3,6]. However, these models suffer from the same disadvantages: lack of speed, difficulty of reuse, and accuracy that is quite low compared to statistical models. Statistical models are also used widely in this area; mathematical and statistical term properties, such as frequency and location, provide reliable and accurate results.

This study relies not only on statistical techniques to solve this problem but also combines them with a natural language processing technique called part-of-speech tagging (POST) [4] in order to filter the answer set generated by our algorithm. The main objective of this experiment is to extract a set of keywords/keyphrases whose length is not firmly fixed. This paper is arranged as follows: Section II describes related works that contributed to our experiment, Section III describes the proposed framework, techniques, and methodology, Section IV focuses on the results and evaluation, and finally Section V closes with a conclusion and future work.
II. RELATED WORKS
In this section, previous related works are described which were helpful to our experiment.
A. Term Frequency Inverse Document Frequency (TF-IDF)
Traditionally, TF-IDF is used to measure a term's importance by focusing on the frequency with which the term appears in a topical document and in the corpus. TF-IDF can be computed by (1) [5,7]:

TF-IDF = [fre(p,d) / n] × [-log2(dfp / N)] (1)

Herein, fre(p,d) is the number of times phrase "p" occurs in document "d"; n is the number of words in document "d"; dfp is the number of documents in our corpus that contain the topical phrase; and N is the total number of documents in our corpus.
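A minimal sketch of equation (1); the counts passed in are hypothetical.

    import math

    def tf_idf(freq_p_d, n_words_in_d, df_p, n_docs):
        """TF-IDF = [fre(p,d)/n] x [-log2(df_p/N)], as in (1)."""
        return (freq_p_d / n_words_in_d) * -math.log2(df_p / n_docs)

    # e.g. a phrase occurring 4 times in a 200-word document,
    # appearing in 18 of 400 corpus documents:
    print(tf_idf(4, 200, 18, 400))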
B. Correlation Coefficient (r)
The correlation coefficient is a statistical technique used to measure the degree of relationship between a pair of objects or variables, here a pair of terms. The correlation coefficient, represented as "R" or "r", can be computed by (2):

r = [n∑xy - (∑x)(∑y)] / [√(n∑x² - (∑x)²) × √(n∑y² - (∑y)²)] (2)

where n is the total number of word pairs, and x and y are the counts of elements x and y respectively. The value of r lies in the range -1 to 1. When r is close to 1, the relationship between the elements is very tight. When r is positive, x and y have a linearly proportional positive relationship: for example, y increases when x increases. When r is negative, x and y have a linear negative relationship: for example, y decreases when x increases. When r is 0 or close to it, x and y are not related.

Usually, in statistical theory, the relationship between two variables is considered strong when r is greater than 0.8 and weak when it is less than 0.5 [7]. In our experiment, the correlation coefficient measures the degree of relationship between terms in order to form a correct phrase. Section IV describes the best value of r and where it is applied in our experiment.
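A minimal sketch of equation (2) applied to two count sequences; the per-document counts in the example are hypothetical.

    import math

    def pearson_r(x, y):
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxy = sum(a * b for a, b in zip(x, y))
        sxx = sum(a * a for a in x)
        syy = sum(b * b for b in y)
        denom = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
        return (n * sxy - sx * sy) / denom if denom else 0.0

    # per-document counts of two candidate terms (hypothetical numbers)
    print(pearson_r([1, 0, 2, 3, 1], [2, 0, 3, 4, 1]))   # about 0.97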
C. Part-of-Speech Tagging (POST)
Part-of-speech tagging (POST) is a technique used in natural language processing (NLP) [4]. POST is also called grammatical tagging or word-category disambiguation [8]. The process identifies word functions such as noun, verb, adjective, adverb, determiner, and so on [9]. Knowing word functions helps us form accurate machine-generated phrases that are readable and understandable by a human.

There are two different kinds of tagger [4]. The first is a stochastic tagger, which uses statistical techniques; the second is a rule-based tagger, which focuses on the surrounding words to find a tag and assigns a word function to each term. Rule-based tagging seems to be the better choice, because a word can have many functions, which also affects the word's meaning. In this study, rule-based tagging is applied.

In this study, the algorithm extracts part-of-speech patterns (POS patterns) by running all keywords from the training documents through the POS tagger; we then collect the patterns and store them in our repository.
D. Performance Measurement
The performance is measured by three parameters: precision (P), recall (R), and the F score (harmonic mean). These parameters are widely used in information extraction studies. Precision tells us how well the algorithm found the right answers, recall tells us how well it picked the right answers, and the F score measures the balance of precision and recall. They are calculated by the following equations:

P = #Correct Extracted Words/Phrases / #Retrieved Words/Phrases (3)

R = #Correct Extracted Words/Phrases / #Relevant Words/Phrases (4)

F = 2PR / (P + R) (5)
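A minimal sketch of equations (3) to (5) with hypothetical counts:

    def precision_recall_f(correct, retrieved, relevant):
        p = correct / retrieved                       # precision, equation (3)
        r = correct / relevant                        # recall, equation (4)
        f = 2 * p * r / (p + r) if p + r else 0.0     # F score, equation (5)
        return p, r, f

    # e.g. 6 correct phrases out of 10 proposed, against 9 author keyphrases:
    print(precision_recall_f(6, 10, 9))   # about (0.60, 0.67, 0.63)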
III. PROPOSED FRAMEWORK
In this experiment, the algorithm has two parts:
• the training phase, which creates the N-gram language model and extracts the POS patterns;
• the testing phase, which extracts candidate phrases from a new document and calculates the degree of phrase importance.
FIGURE 1. Proposed framework
A. Preprocessing
Our experiment focuses on academic literature, so the raw documents must come from this area. All the data used comes from academic literature downloaded from IEEE and SpringerLink. After the documents are collected, they are transformed into ".txt" format; this is the task of the preprocessing step.

Normally, the raw documents are in ".pdf" format and need to be converted into ".txt" format for convenient processing. Raw documents are sectioned into several major parts such as title, abstract, keywords, and conclusion. Three of them are required in this experiment: title, abstract, and conclusion. Thus, all words from those sections are collected. In preprocessing, the two units are very similar: they only extract raw content text from the three sections, and nothing more is done in this training phase.
B. N-Gram Process
In this paper, we focus on two major techniques. Uni-gram extraction is a simple step that extracts all words which do not appear in the stopword list. Second, bi-gram extraction builds a list of all possible two-word phrases which do not begin or end with stopwords. In each list, additional fields are also
added in order to increase the speed of processing. They are the number of documents that contain words/phrases and the number of times that words/phrases occur in the corpus.
The criteria for tokenizing possible words/phrases are (see the sketch after this list):
• a pair of words is not considered a phrase when the words are divided by a punctuation mark (full stop, comma, colon, semicolon, etc.);
• digits are ignored: any number is treated as a stopword.
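A minimal sketch of these tokenization criteria; the stopword list here is a tiny hypothetical sample.

    import re

    STOPWORDS = {"and", "the", "to", "are", "other", "most", "use"}

    def bigrams(text):
        out = []
        # punctuation splits the text into independent chunks
        for chunk in re.split(r"[.,:;]", text.lower()):
            words = chunk.split()
            for a, b in zip(words, words[1:]):
                if a in STOPWORDS or b in STOPWORDS:
                    continue
                if a.isdigit() or b.isdigit():   # numbers count as stopwords
                    continue
                out.append("%s %s" % (a, b))
        return out

    print(bigrams("Web search engines are the most visible IR application."))
    # -> ['web search', 'search engines', 'visible ir', 'ir application']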
C. Candidate Phrases Extraction
All phrases from a converted document are extracted as bi-grams. The bi-gram tokenization process is similar to the N-gram processing of the training phase, but some criteria differ: words/phrases that appear only once are ignored [5]; each such word is removed and replaced by a punctuation mark so that the remaining phrases are tokenized correctly. Tokenization across punctuation marks is not allowed.

The reason for tokenizing all phrases as bi-grams is that most keywords/keyphrases are already composed of bi-grams. Another reason is that a phrase tokenized as a tri-gram is inconvenient to shorten, and one tokenized as a uni-gram is inconvenient to lengthen. From our literature review, the ratio of uni-grams to bi-grams to tri-grams is 1:6:3. Table I shows an example of tokenization.
TABLE I. EXAMPLE OF TOKENIZATION

Original: Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR application.

Uni-gram: universities, public, libraries, IR, system, books, journals, document, web, search, engines, visible, IR and applications

Bi-gram: public library, IR systems, web search, search engine, visible IR, IR applications
D. Weight Calculation and Ranking
Weight calculation is used to score each phrase in our list; the score is called its "rank". While the experiment was being conducted, (1) was applied to identify words/phrases in the document, but it did not generate good enough results, so (1) was modified to obtain a better end result.

In this experiment, two parameters are added: the area weight (AW) and the word frequency f, where f is the term/phrase frequency in a document. Thus, (1) is modified as (6):

DI = [(fre(p,d))² / n] × IDF × AW (6)

where DI is the degree of importance of each phrase; fre(p,d) is the frequency of phrase "p" in document "d"; dfp is the number of documents that contain phrase "p" in the corpus; AW is an integer mark assigned to each section of the document; and IDF stands for inverse document frequency, which can be computed by the second term of (1).
AW is assigned arbitrarily and adjusted until it generates the best result. The strategy for assigning the weighted score is to focus intuitively on both the physical and logical characteristics of each document section, such as the size of its space and the likelihood of important information appearing there. For instance, the space covered by each area in this experiment is obviously different: the title is the smallest but certainly contains the most important information of a paper. AW is then adjusted per section until the best result is obtained. The experiment provides the best result when the title, abstract, and conclusion sections are weighted at 7, 2, and 1 respectively.
The DI of every phrase is computed for each section, and finally the average value is computed. For example, if the phrase "information retrieval" occurs 1, 3, and 2 times in the three sections of a document, its DI is computed as shown in (7):

DI = IDF × [(1² × 7) + (3² × 2) + (2² × 1)] / (6 × 3) (7)

The digit 3 in the divisor is the number of document sections considered in this experiment.
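A minimal sketch of equations (6) and (7) with the area weights reported above (title = 7, abstract = 2, conclusion = 1); the IDF value and frequencies are hypothetical inputs, and n is taken as the total phrase frequency, matching the worked example in (7).

    AREA_WEIGHTS = {"title": 7, "abstract": 2, "conclusion": 1}

    def degree_of_importance(freqs, idf):
        """freqs: per-section phrase frequencies; returns the averaged DI."""
        n = sum(freqs.values())        # total frequency, as in the divisor of (7)
        weighted = sum((f ** 2) * AREA_WEIGHTS[sec] for sec, f in freqs.items())
        return idf * weighted / (n * len(freqs))

    # the worked example from (7): frequencies 1, 3 and 2, with IDF = 1.0
    freqs = {"title": 1, "abstract": 3, "conclusion": 2}
    print(degree_of_importance(freqs, idf=1.0))   # (7 + 18 + 4) / 18, about 1.61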
E. Phrase Filtration
This process filters the results before releasing the final keywords/keyphrases recommended by the algorithm. A result from the previous step may be lengthened into a tri-gram (the maximum length in this experiment). Since all phrases were computed as bi-grams in the previous processes, some phrases may be incorrect because a word is probably missing; when that word is added, the phrase becomes more descriptive.

For example, an expected phrase is "natural language processing" but the phrases in our list are "natural language" and "language processing". In this case, we may concatenate those phrases using "language" as the joint. In this example the technique works properly, but it may not for other pairs of phrases; the correlation coefficient, a statistical technique, is therefore applied in order to concatenate only pairs of phrases which share an identical joint word [12].

After that, some improper phrases may still remain due to improper arrangement. Thus, the POS patterns obtained in the previous process are applied: we compare the function of each word in each phrase generated by our algorithm against the patterns extracted from the corpus.

We also have to consider subsets of word functions. For example, the phrase "multiple compound sentences" has the POS pattern "JJ, NN, NNS" (adjective, singular noun, plural noun); on the other hand, our algorithm generates
"multiple compound sentence", whose pattern is "JJ, NN, NN" (adjective, singular noun, singular noun); these phrases are treated as identical. When the position of a word function is swapped, the phrase is discarded.
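A minimal sketch (our own reading, not the authors' code) of this filtration step: bi-grams sharing a joint word are concatenated when the correlation threshold is met (r > 0.2, as reported in Section IV), and a candidate's POS pattern is kept when it matches a known pattern, treating singular and plural noun tags as equivalent.

    KNOWN_POS_PATTERNS = {("JJ", "NN", "NNS")}     # e.g. learned from the corpus

    def concatenate(bigram1, bigram2, r, threshold=0.2):
        """Join e.g. 'natural language' + 'language processing' on the joint."""
        w1, joint1 = bigram1.split()
        joint2, w2 = bigram2.split()
        if joint1 == joint2 and r > threshold:
            return "%s %s %s" % (w1, joint1, w2)
        return None

    def pattern_matches(pos_tags, known=KNOWN_POS_PATTERNS):
        equiv = {"NN": "NN", "NNS": "NN"}          # noun subset: NN ~ NNS
        def normalize(tags):
            return tuple(equiv.get(t, t) for t in tags)
        return any(normalize(pos_tags) == normalize(k) for k in known)

    print(concatenate("natural language", "language processing", r=0.35))
    print(pattern_matches(("JJ", "NN", "NN")))     # True: NNS ~ NN subset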
IV. EXPERIMENTAL RESULT
In our experiment, the algorithm's behavior and results were observed, and the best value of r was found to be 0.2. This means that if the r of "natural" and "language", and also the r of "language" and "processing" (from the example described above), are greater than 0.2, those phrases are concatenated as "natural language processing". The preset r value in this experiment is lower than general statistical theory (Section II.B) suggests, which could be due to the data in the corpus being scattered. For instance, the phrase "natural language" occurs 31 times in 18 documents, while "natural" occurs 28 times in 27 documents and "language" occurs 203 times in 53 documents.
The algorithm was trained on 400 documents in the corpus and applied to 30 academic papers randomly downloaded from the same sources as the training data. All papers were also converted to ".txt" format before processing.

Our algorithm is measured at different numbers of extracted phrases: 1-10, 15, and "at best" ("best" means all phrases in the proposed list are considered; precision and recall are calculated from the last matched phrase in the proposed answer list). In Fig. 2, the recall and precision using both the correlation coefficient (CC) and the POS pattern (POS) are shown and compared against using the correlation coefficient alone.
Figure 2. Performance comparison with and without POST.
R_1 and P_1 represent recall and precision when only the correlation coefficient technique is applied. R_2 and P_2 represent recall and precision when both the correlation coefficient and POST are applied.
Figure 3. Detailed measurement of Precision and Recall
In this paper, the author also presents the best number of phrases that the algorithm should propose, the "maximize spot". The best number of phrases is determined by the F score. Considering Figures 3 and 4, the algorithm should propose no fewer than 5 phrases to the end user.
Figure 4. Maximize Spot
Finally, our algorithm is compared to a standard method of keyword extraction, TF-IDF, in which the degree of importance of each term is calculated by (1). The result is shown in Table II.
TABLE II. PERFORMANCE COMPARISON
Average Performance (%)
Method | Recall | Precision | F score
Standard TF-IDF | 47.53 | 14.37 | 22.07
Proposed method | 60.11 | 39.62 | 47.83
Tables III and IV show examples of keyphrases from the proposed method.
[Figure 2 data: Recall and Precision (%) at 5, 10, 15 and "at best" proposed phrases. R_1: 29.26, 42.41, 47.57, 59.28. P_1: 20.77, 15.77, 12.05, 34.19. R_2: 40.97, 49.82, 56.14, 60.11. P_2: 28.46, 18.46, 14.36, 39.62.]
[Figure 3 data: Precision and Recall (%) when CC and POS are applied, at 1 to 7 proposed phrases. P: 57.69, 40.38, 32.05, 30.77, 28.46, 25.64, 23.08. R: 17.69, 25.10, 28.43, 35.20, 40.97, 43.66, 45.26.]
[Figure 4 data: F score (%) at 1 to 7 proposed phrases ("Maximize Spot"): 27.08, 30.96, 30.13, 32.83, 33.59, 32.31, 30.57.]
TABLE III. EXAMPLE OF PROPOSED RESULT 1

Literature title: Concept Detection and Keyframe Extraction Using a Visual Thesaurus
Original keywords: Concept Detection, Keyframe Extraction, Visual thesaurus, region types
Top 7 keywords proposed in bi-gram: Visual Thesaurus; model vector; Concept Detection; region types; keyframe extraction; detection performance; vector representation
Top 3 keywords proposed in tri-gram: Concept detection performance; shot detection scheme; exploiting latent relations
TABLE IV. EXAMPLE OF PROPOSED RESULT 2
Literature title: An Improved Keyword Extraction Method Using Graph Based Random Walk Model
Original keywords: keyword extraction, random walk model, mutual information, term position, information gain
Top 7 keywords proposed in bi-gram: Keyword Extraction; improved keyword; information gain; extraction method; method using; mutual information; extraction using
Top 3 keywords proposed in tri-gram: Random walk model; mutual information gain; using inspect benchmark
V. DISCUSSION AND FUTURE WORK
This paper proposes an algorithm able to extract phrases that match more than half of the original keyphrases assigned by the authors, so the result can be considered acceptable. Moreover, it uses a smaller training set to achieve this result, with an F score of 47.83%. Furthermore, this study focuses on applying the method to develop an application for real situations; the proposed model is therefore built as simply as possible. The only disadvantage of the corpus is that the data set is not clustered in a single narrow dimension but a broad one. However, the broad dimension of the dataset makes the training set rather natural and close to ordinary language, which is its biggest advantage.
As this experiment is still in progress, there are some tasks that need to be revised. The authors plan to expand the size of the corpus, in terms of the number of documents in the training set, and to cover other fields of educational literature in order to observe a larger set of end results.
ACKNOWLEDGMENT
The authors would like to thank Asst. Prof. Dr. Supot Nitsuwat for sharing good ideas and for his consultations; Dr. Gareth Clayton and Dr. Sageemas Na Wichian for statistical techniques and their experience; Mr. Ian Barber for contributing the POST tool; and Acting Sub Lt. Nareudon Khajohnvudtitagoon for his development techniques. Last but not least, the authors thank the Faculty of Information Technology at King Mongkut's University of Technology North Bangkok.
REFERENCES
[1] Md. R. Islam and Md. R. Islam, "An Improved Keyword Extraction Method Using Graph Based Random Walk Model," 11th Int. Conference on Computer and Information Technology, pp. 225-299, 2008.
[2] Z. Qingguo and Z. Chengzh, "Automatic Chinese Keyword Extraction Based on KNN for Implicit Subject Extraction," Int. Symposium on Knowledge Acquisition and Modeling, pp. 689-602, 2008.
[3] H. Liyanage and G.E.M.D.C. Bandara, "Macro-Clustering: improved information retrieval using fuzzy logic," Proc. of the 2004 IEEE Int. Symposium on Intelligent Control, pp. 413-418, 2004.
[4] E. Brill, "A simple rule-based part of speech tagger," Proc. of the third conference on Applied natural language processing, pp. 152-155, 1992.
[5] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin and C. G. Nevill-Manning, "KEA: Practical Automatic Keyphrase Extraction," Proc. of the fourth ACM conference on Digital libraries, ACM, 1999.
[6] K. Sarkar, M. Nasipuri and S. Ghose, "A New Approach to Keyphrase Extraction Using Neural Networks," Int. Journal of Computer Science Issues, vol. 7, issue 2, 2010.
[7] Mathbits.com, Correlation Coefficient. Available at: http://mathbits.com/mathbits/tisection/statistics2/correlation.htm
[8] Wikipedia, Part-of-Speech Tagging. Available at: http://en.wikipedia.org/wiki/Part-of-speech_tagging
[9] University of Pennsylvania, "Alphabetical list of part-of-speech tags used in the Penn Treebank project." Available at: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
[10] X. Hu and B. Wu, "Automatic Keyword Extraction Using Linguistic Features," Sixth IEEE International Conference on Data Mining Workshops (ICDMW '06), pp. 19-23, 2006.
Using Example-based Machine Translation for English – Vietnamese Translation
Minh Quang Nguyen, Dang Hung Tran and Thi Anh Le Pham
Software Engineering Department, Faculty of Information Technology, Hanoi National University of Education
quangnm, hungtd, [email protected]
Abstract
Recently, there have been significant advances in Machine Translation in Vietnam. Most approaches are based on combining grammar analysis with a statistics-based or rule-based method. However, their results are still far from human expectations. In this paper, we introduce a new approach which uses example-based machine translation. The idea of this method is to use aligned pairs of Vietnamese and English sentences and an algorithm that retrieves, from the data resource, the English sentence most similar to the input sentence. Then we produce a translation from the retrieved sentence. We applied the method to English-Vietnamese translation using a bilingual corpus of 6000 sentence pairs. The system achieves feasible translation ability as well as good performance. Compared to other methods applied in English-Vietnamese translation, our method can achieve higher translation quality.
I. Introduction
Machine translation has been studied and developed for many decades. For Vietnamese, there are some projects which have proposed several approaches. Most of them use systems based on analyzing and reflecting grammar structure (e.g., rule-based and corpora-based approaches). Among them, the rule-based approach, with a carefully built bilingual corpus and grammatical rules, is currently a popular direction in this field [7].
One of the biggest difficulties in rule-based translation, as in other methods, is data resources. An important resource required for translation is a thesaurus, which takes a lot of effort and work to build [9]. This dataset, however, does not yet meet users' requirements. In addition, most traditional methods also require knowledge about the languages involved, so it takes time to build a system for new languages [5, 6]. Example Based Machine Translation (EBMT) is a newer method which relies on large corpora and tries, to some extent, to reject traditional linguistic notions [5].
EBMT systems are attractive in that their output translations should be more sensitive to context than those of rule-based systems, i.e., of higher quality in appropriateness and idiomaticity. Moreover, EBMT requires a minimum of prior knowledge beyond the corpora that make up the example set, and can therefore be quickly adapted to many language pairs [5].
EBMT has been applied successfully to Japanese and American English in some specific fields [1]. In Japan, a system was built that achieves high-quality translation and efficient processing in the travel-expression domain [1]. For Vietnamese, however, there is no research following this method, even though applying it to English-Vietnamese translation does not require many resources or much linguistic knowledge. We only have the English-Vietnamese corpus data set at Ho Chi Minh National University, a significant data resource with 40,000 sentence pairs (in Vietnamese and English) and about 5,500,000 words [8].
We already have an English thesaurus and an English-Vietnamese dictionary. For the set of aligned corpora, we have prepared 5,500 items for this research.
In this paper, we use EBMT knowledge to build a system for English-Vietnamese translation. We apply the graph-based method of [1] to the Vietnamese language. In this paradigm, we have a set whose items are pairs of sentences: one in the source language and one in the target language. Given an input sentence, we retrieve from the set the item whose source sentence is most similar to the input. Finally, from the example and the input sentence, we make adjustments to produce a final sentence in the target language. Unfortunately, we do not have a Vietnamese thesaurus, so we propose some solutions for this problem. In addition, this paper proposes a method to adapt the example sentence to provide the final translation.
1. EBMT overview: There are three components in a conventional example-based system:
- Matching fragment component.
- Word alignment.
- Combination of the input and the example sentence to produce the final target sentence.
For example:
(1) He buys a book on international politics.
(2) a. He buys a notebook.
    b. Anh ấy mua một quyển sách.
(3) Anh ấy mua một quyển sách về chính trị thế giới.
With the input sentence (1), the translation (3) can be provided by matching and adapting from (2a, b).
One of the main advantages of this method is that we can easily improve translation quality by widening the example set: the more items are added, the better the result. It is especially useful for a specific field, because the sentence forms occurring in such fields are limited. For example, we can use it to translate product manuals, weather forecasts, or medical diagnoses.
The difficulty in applying EBMT to Vietnamese is that there is no Vietnamese wordnet, so we propose some new solutions to this problem.
We build a system with three steps:
- Form the set of example sentences; the result is a set of graphs.
- Retrieve the example sentence most similar to the input sentence. Given an input sentence, the system finds the most similar sentences using the "edit distance" measure. Edit distance is used for fast approximate matching between sentences: the smaller the distance, the greater the similarity between sentences.
- Adjust the gap between the example and the input.
2. Data resource: We use three data resources:
- Bilingual corpus: this is the set of example sentences. The set consists of sentence pairs, each sentence represented as a word sequence. Increasing the size of the set will improve translation quality.
- The Thesaurus: A list of words showing similarities, differences, dependencies, and other relationships to each other.
- Bilingual Dictionary: We used the popular English Vietnamese dictionary file provided by Socbay Company.
3. Build the graph of the example set. Sentences are word sequences. We divide words into two groups:
- Functional words: functional words (or grammatical words) are words that have little lexical meaning or have ambiguous meaning, but instead serve to express grammatical relationships with other words within a sentence, or specify the attitude or mood of the speaker.
- Content words: words that are not function words are called content words (or open-class words or lexical words); these include nouns, verbs, adjectives, and most adverbs, although some adverbs are function words (e.g., then and why).
We partition the example set into subsets. Each subset contains sentences with the same number of content words and the same number of functional words.
Based on this division, we build a group of word graphs:
- They are directed graphs with a start node and a goal node. They contain nodes and edges; each edge is labeled with a word and has its own source node and destination node.
- Each graph represents one subset, i.e., sentences with the same total of content words and the same total of functional words.
- Each path from the start node to the goal node represents a candidate sentence. To optimize the system, we have to minimize the size of the word graph: common word sequences in different sentences use the same edges.
Figure 1: Example of Word Graph
The word graphs have to be optimized to the minimum number of nodes; we use the method of converting finite state automata [3, 4]. A minimal construction sketch is given below. After preparing all resources for this method, we execute its two steps: example retrieval and adaptation.
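As a rough illustration (not the authors' implementation), a word graph can be seeded as a prefix-sharing trie over word sequences; full minimization, merging common suffixes as well, would follow the finite-state-automaton conversion of [3, 4]:

class Node:
    def __init__(self):
        self.edges = {}        # word -> Node
        self.is_goal = False   # a start-to-here path is a complete candidate sentence

def build_word_graph(sentences):
    start = Node()
    for sentence in sentences:
        node = start
        for word in sentence.split():
            node = node.edges.setdefault(word, Node())  # share common prefixes
        node.is_goal = True
    return start

g = build_word_graph(["he buys a book", "he buys a notebook"])
print(list(g.edges["he"].edges["buys"].edges["a"].edges))  # ['book', 'notebook']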
4. Example retrieval: We use the A* search algorithm to retrieve the most similar sentences from the word graph. The result of matching two word sequences is a sequence of substitutions, deletions and insertions. The search process in a word graph finds the least distance between the input sentence and all the candidates represented in the graph.
As a result, matching sequences along a path are obtained as records, each consisting of a label and one or two words:
Exact match: E(word)
Substitution: S(word, word)
Deletion: D(word)
Insertion: I(word)
For example, the matching sequence between the input sentence "We close the door" and the example "She closes the teary eyes" is:
S("She", "We") – E("close") – E("the") – D("teary") – S("eyes", "door").
The problem here is that we have to pick the sentence with the least distance to the input sentence. We first compare the number of E-records in each matching sequence, then the number of S-records, and so on.
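A minimal sketch of such a matcher (our illustration, using the record conventions of Section 5: I for input-only words, D for example-only words) is a word-level edit distance with a backtrace; note that an exact matcher treats "closes"/"close" as a substitution unless words are stemmed first:

def match_records(example, inp):
    e, s = example.split(), inp.split()
    n, m = len(e), len(s)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for a in range(n + 1): d[a][0] = a
    for b in range(m + 1): d[0][b] = b
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            cost = 0 if e[a - 1] == s[b - 1] else 1
            d[a][b] = min(d[a - 1][b] + 1, d[a][b - 1] + 1, d[a - 1][b - 1] + cost)
    recs, a, b = [], n, m
    while a > 0 or b > 0:
        if a > 0 and b > 0 and d[a][b] == d[a - 1][b - 1] + (e[a - 1] != s[b - 1]):
            recs.append('E("%s")' % e[a - 1] if e[a - 1] == s[b - 1]
                        else 'S("%s", "%s")' % (e[a - 1], s[b - 1]))
            a, b = a - 1, b - 1
        elif a > 0 and d[a][b] == d[a - 1][b] + 1:
            recs.append('D("%s")' % e[a - 1])   # word occurs only in the example
            a -= 1
        else:
            recs.append('I("%s")' % s[b - 1])   # word occurs only in the input
            b -= 1
    return recs[::-1]

print(" - ".join(match_records("She closes the teary eyes", "We close the door")))
# S("She", "We") - S("closes", "close") - E("the") - D("teary") - S("eyes", "door")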
5. Adaptation: From the retrieved example, we adapt it to produce the final sentence in the target language by insertions, substitutions and deletions. To find the meaning of English words, we used a morphological method.
5.1. Substitution, deletion and exact match: We find the right position for each word in the final sentence for substitution, deletion and exact match records. For deletion, we do nothing, but for substitution and insertion records we have to find the meaning of the word. Two problems arise:
- A word has several different meanings; which one should be chosen?
- Words in the dictionary are all in infinitive form, while they can appear in many other forms in the input sentence.
We solve this problem carefully. First, we find the type of each word (noun, verb, adverb, ...) in the sentence, using the Penn Treebank tagging system to specify the form of each word. Second, based on the form of the word, we look the word up in the dictionary.
If the word is plural (NNS):
- If it ends with "CHES", we try deleting "ES" and then "CHES"; when a deletion yields an infinitive form, we find its meaning in the dictionary. Otherwise, it is a specific noun.
- If it ends with "XES" or "OES", we delete "XES" or "OES" and find the meaning.
- If it ends with "IES": replace "IES" by "Y".
- If it ends with "VES": replace "VES" by "F" or "FE".
- If it ends with "ES": replace "ES" by "S".
- If it ends with "S": delete "S".
After finding the meaning of a plural, we add "những" before its meaning.
If the word is a gerund:
- Delete "ING" at the end of the word. We try two cases: first the word without "ING", and second the word without "ING" but with "IE" at the end.
If the word is VBP:
- If the word is "IS": it is "TO BE".
- If it ends with "IES": replace "IES" by "Y".
- If it ends with "SSES": erase "ES".
- If it ends with "S": erase "S".
If the word is in the past participle or past form:
- Check whether the word is in the list of irregular verbs. If it is, we use the infinitive form to find the meaning. The list of irregular verbs is stored as a red-black tree to make searching easier and faster.
- If it ends with "IED": erase "IED".
- If it ends with "ED": check the two letters just before "ED"; if they are identical, we erase the last three letters of the word. Otherwise, we erase "ED".
If the word is in present continuous form, we find the word in the same way as gerunds, and afterwards add "đang" before the meaning.
If the word is JJS: delete the last three or four letters and find the meaning in the dictionary.
Once the infinitive form of the word is found, we use the bilingual dictionary to look up its meaning. The remaining problem is that, even for the infinitive form, a word of a given kind may have many meanings, and we have to choose the right one. In our experiment, we take the first meaning in the bilingual dictionary.
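A literal sketch of just the plural-noun (NNS) rules above (our illustration; the dictionary lookup that validates each candidate stem is omitted, and the gerund, VBP and past-form rules would be coded in the same style):

def nns_stem(word):
    w = word.lower()
    if w.endswith("ches"):
        return w[:-2]           # try deleting "es" (then "ches" if the lookup fails)
    if w.endswith(("xes", "oes")):
        return w[:-2]           # deleting the "es" yields the stem, e.g. boxes -> box
    if w.endswith("ies"):
        return w[:-3] + "y"     # ladies -> lady
    if w.endswith("ves"):
        return w[:-3] + "f"     # "f" or "fe", whichever the dictionary contains
    if w.endswith("es"):
        return w[:-2] + "s"     # the rule as stated: replace "es" by "s"
    if w.endswith("s"):
        return w[:-1]           # books -> book
    return w

print(nns_stem("churches"), nns_stem("ladies"), nns_stem("books"))
# church lady book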
5.2. Insertion: The problem here is that we do not know the exact position at which to place the Vietnamese meaning. If we simply use the position of the Insertion record in the matching sequence, the final Vietnamese sentence will be of low quality. We use the theory of rule-based machine translation to solve this problem: for some specific phrases, it finds a better position than the order of the records.
Firstly, link grammar system will parse the grammatical structure of sentence. The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" representation of a sentence (showing noun phrases, verb phrases, etc.).
From the grammatical structure of the sentence, we find the English phrases whose word order must be changed when translating into Vietnamese. For example, for the noun phrase "nice book", with the two I-records I(nice) and I(book), a record-order translation would yield "hay quyển sách" instead of "quyển sách hay". With link grammar, we know the exact order in which to translate. Some phrases to process are listed in Table 1.
Table 2: Value of threshold
Length of sentence (words)    0-5    5-10    10-15    15-30
Threshold                      2      3       4        6
5.3. Example:
Input sentence: This nice book has been bought.
Example retrieval: the most similar example found for the input sentence is "This computer has been bought by him".
Sequence of records: E("This") – I("nice") – S("computer", "book") – E("has") – E("been") – E("bought") – D("by") – D("him").
With link grammar, there is a noun phrase within the sentence, "This nice book", corresponding to the records E("This"), I("nice"), S("computer", "book"). We reorder the sequence: S("computer", "book") – I("nice") – E("This") – E("has") – E("been") – E("bought") – D("by") – D("him").
Based on the new record sequence and the example, the adaptation phase proceeds:
- Exact match: keep the order and the meaning of the word. "" – "" – "" – "này" – "được" – "mua" – "" – "" – ""
- Substitution: find the meaning of the word in the input sentence and replace the word in the example with it. "Quyển sách" – "" – "" – "này" – "được" – "mua" – "" – "" – ""
- Deletion: just erase the word in the example. "Quyển sách" – "" – "" – "này" – "được" – "mua" – "" – "" – ""
- Insertion: we now have the right order of records, so we just find the meaning of the word in each Insertion record and put it in the order of the record in the sequence. "Quyển sách" – "hay" – "" – "này" – "được" – "mua" – "" – "" – "".
After the four steps of adaptation, we obtain the final sentence: "Quyển sách hay này được mua".
6. Evaluation:
6.1. Experimental Condition:
We manually built an English-Vietnamese corpus of 6000 sentence pairs.
To evaluate translation quality, we employed a subjective measure.
Each translation result was graded into one of four ranks by a bilingual human translator who is a native Vietnamese speaker. The four ranks were:
A: Perfect. No problems with either grammar or information; translation quality is nearly equal to a human translator.
B: Fair. The translation is easy to understand but has some grammatical mistakes or misses some trivial information.
C: Acceptable. The translation is broken but understandable with effort.
D: Nonsense. Important information was translated incorrectly.
The English-Vietnamese dictionary used includes 70,000 words. To optimize processing time, a threshold is used to limit the result set of the example retrieval phase. Table 2 shows the thresholds used for sentence lengths smaller than 30; if the length of the input sentence is greater than or equal to 30, the threshold is 8.
Table 1: Some phrases to process with Link Grammar
1. Noun phrase: POS(1, 2) = (JJ, NN). Reorder: (NN, JJ)
2. Noun phrase: POS(1, 2, 3) = (DT, JJ, NN) && word1 = this, that, these, those. Reorder: (NN, JJ, DT)
3. Noun phrase: POS(1, 2) = (NN1, NN2). Reorder: (NN2, NN1)
4. Noun phrase: POS(1, 2) = (PRP$, NN). Reorder: (NN, PRP$)
5. Noun phrase: POS(1, 2, 3) = (JJ1, JJ2, NN). Reorder: (NN, JJ2, JJ1)
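A small sketch (our illustration) of how the reorder rules of Table 1 could be applied to a POS-tagged phrase before insertion; each rule maps a POS pattern to a permutation of positions:

RULES = {
    ("JJ", "NN"):        (1, 0),     # rule 1
    ("DT", "JJ", "NN"):  (2, 1, 0),  # rule 2 (DT must be this/that/these/those)
    ("NN", "NN"):        (1, 0),     # rule 3
    ("PRP$", "NN"):      (1, 0),     # rule 4
    ("JJ", "JJ", "NN"):  (2, 1, 0),  # rule 5
}
DEMONSTRATIVES = {"this", "that", "these", "those"}

def reorder(words, tags):
    key = tuple(tags)
    perm = RULES.get(key)
    if perm is None or (key[0] == "DT" and words[0].lower() not in DEMONSTRATIVES):
        return words
    return [words[i] for i in perm]

print(reorder(["nice", "book"], ["JJ", "NN"]))               # ['book', 'nice']
print(reorder(["this", "nice", "book"], ["DT", "JJ", "NN"])) # ['book', 'nice', 'this']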
Table 3: Set of edited sentences and performance
Rank                          A     B     C     D
Total                         25    11    3     11
Average length of sentence    9.3   6.3   7.8   8.4
Precision: 70%

Table 4: Set of random sentences and performance
Rank                          A     B     C     D
Total                         15    10    3     22
Average length of sentence    5.7   5.6   6.0   8
Precision: 50%
6.2. Performance: For the experiment, we created two test sets: a set of random sentences with complex grammatical structure, and a set of 50 sentences edited from the training set. Under these conditions, the average processing time is less than 0.5 second per translation. Although the processing time increases with corpus size, the growth is not linear but roughly proportional to the square root of the corpus size. Compared to DP-matching [2], the retrieval method using the word graph and A* search achieves efficient processing. Using the threshold with random sentences, processing time is significantly decreased, but translation quality is low. The reason is that the bilingual corpus we used is too small; as a result, the retrieved examples are not similar enough to the input sentence. There are two ways to increase translation quality. First, we can widen the example set. Second, since we do not yet have an appropriate way to choose the right meaning from the bilingual dictionary, we can apply context-based translation to the EBMT system. Tables 3 and 4 illustrate the evaluation results.
For the set of edited sentences (Table 3), the system reached high translation quality, with a Precision of 70%. Items in this set have grammar structures and word types similar to the example set, which helps the EBMT system find suitable sentences to translate.
For the set of random sentences (Table 4), which contains a number of complex sentences, the examples the EBMT system retrieves are not similar enough to the input; consequently, the results have low quality (only 50% of sentences were translated with quality rank A or B; the rest are at rank C or D).
The system can translate sentences with complex grammatical structure as long as the retrieved example is similar enough to the input sentence.
7. Conclusion:
We reported a retrieval method for an EBMT system using edit distance and evaluated its performance on our corpus. In the performance-evaluation experiments, we used a bilingual corpus comprising 6000 sentences from many fields. The reason for some low-quality translations is the small size of the bilingual corpus. The EBMT system will perform better when it is run on a specific field; for example, using EBMT to translate product manuals or introductions in the travel domain. Experimental results show that the EBMT system achieved feasible translation ability, and also achieved efficient processing by using the proposed retrieval method.
Acknowledgements:
The authors' heartfelt thanks go to Professor Thu Huong Nguyen, Computer Science Department, Hanoi University of Science and Technology, for supporting the project, and to the Socbay linguistic specialists for providing resources and helping us to test the system.
Reference
[1] Takao Doi, Hirofumi Yamamoto and Eiichiro Sumita, 2005. Graph-based Retrieval for Example-based Machine Translation Using Edit-distance.
[2] Eiichiro Sumita, 2001. Example-based machine translation using DP-matching between word sequences.
[3] John Edward Hopcroft and Jeffrey Ullman, 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA.
[4] Janusz Antoni Brzozowski, 1962. Canonical regular expressions and minimal state graphs for definite events. Mathematical Theory of Automata, MRI Symposia Series, Polytechnic Press, Polytechnic Institute of Brooklyn, NY, 12, 529-561.
[5] Steven S. Ngai and Randy B. Gullett, 2002. Example-Based Machine Translation: An Investigation.
[6] Ralf Brown, 1996. "Example-Based Machine Translation in the PanGloss System." In Proceedings of the Sixteenth International Conference on Computational Linguistics, pp. 169-174, Copenhagen, Denmark.
[7] Michael Carl, 1999. "Inducing Translation Templates for Example-Based Machine Translation." In Proceedings of MT-Summit VII, Singapore.
[8] Đinh Điền, 2002. Building a training corpus for word sense disambiguation in English-to-Vietnamese Machine Translation.
[9] Chunyu Kit, Haihua Pan and Jonathan J. Webster, 1994. Example-Based Machine Translation: A New Paradigm.
[10] Kenji Imamura, Hideo Okuma, Taro Watanabe, and Eiichiro Sumita, 2004. Example-based Machine Translation Based on Syntactic Transfer with Statistical Models.
Cross-Ratio Analysis for Building up The Robustness of Document Image Watermark
Wiyada Yawai
Department of Computer Science, Faculty of Science King Mongkut’s Institute of Technology Ladkrabang
Bangkok, Thailand E-mail: [email protected]
Nualsawat Hiransakolwong
Department of Computer Science, Faculty of Science King Mongkut’s Institute of Technology Ladkrabang
Bangkok, Thailand E-mail: [email protected]
Abstract—This research presents the cross-ratio theory effectively applied to building up the robustness of invisible watermarks embedded in multi-language (English, Thai, Chinese, and Arabic) grayscale document images against geometric distortion attacks (scaling, rotating, shearing) and other manipulations, such as noise addition, compression, sharpening, blurring, and brightness and contrast adjustment, that occur while scanning the embedded watermarks for verification. These attacks are simulated to test the effectiveness of the cross-ratio theory, used here for the first time to enhance watermark robustness for document images of any language, which is normally the main limitation of other watermarking methods. The theory uses the 4 corners and two diagonals of a document image as references for watermark embedding lines located between text lines, crossing the two diagonals and the vertical border lines of both sides according to specified cross-ratio values. The watermark embedding positions on each line are calculated from another set of cross-ratios, and the cross-ratio values of each line differ according to preset patterns. Watermark detection in document images requires neither converting the image nor comparing it with the original image; our approach detects through calculations referring to the 4 corners of the image and applies the correlation coefficient equation to compare directly against the original watermarks. Testing revealed that it could build up reasonable robustness against scaling (from 11% up), shearing (0 - 0.05), rotating (1 - 4 degrees), compression (quality 60% up), contrast (1 - 45%), sharpness (0 - 100%) and blur filtering at mask sizes less than 13x13.
Keywords-Digital watermarking; Document image; Robustness; Geometric distortion; Cross ratio; Collinear points
I. INTRODUCTION
Digital watermarking is one of the processes of hiding data for protecting copyright of digital media either in forms of audio, video, text, etc. There are two categories of watermarking; visible watermarking and invisible watermarking. The major purpose of watermarking is to protect copyright of media through creating various forms of obstructions to violators.
This research particularly emphasizes applying the cross-ratio theory to create robust watermark data embedded in grayscale document images, which must survive and remain easily detectable even after being attacked in many possible ways, especially by geometric distortion attacks, which have mostly not been explored in other document image watermarking research. Most existing research focuses on watermarking an electronic text or document file instead of a document image, on one specified language instead of multiple languages, and emphasizes the watermark embedding technique instead of watermark robustness. These existing document watermarking approaches can be categorized by technique into the following three groups.
Technique I: Watermark embedding by rearranging the physical layout/pattern/structure of the text document, such as shifting of lines [1] and words, particularly binding of word spacing, word-shift coding or word classification [2][3][4][5]. This technique can be applied both to electronic document files and to document images. However, it has some disadvantages; for instance, the line-shifting technique of Brassil et al. has low robustness to document processing, page skewing/rotation between -3 and +3 degrees, noise-adding attacks and short text lines. Another limitation is that it can only be applied to documents with spaces between words, spacing of letters, shifting of baselines or line-shift coding. A word-shifting algorithm has also been developed by Huang et al. [5]; it is based on adjusting inter-word spaces in a text document so that mean spaces across different lines follow a sine wave, in which information or a watermark can be encoded in both horizontal and vertical directions.
Min Du et al. [6] proposed a text digital watermarking algorithm based on human visual redundancy. Because the human eye is not sensitive to slight changes in text color, watermarks were embedded by changing the low 4 bits of the RGB color components of characters. This method has good invisibility and robustness, which depend on its redundancy; however, its robustness was tested only against word deletion and modification.
Technique II: Embedding a text watermark by modifying character/letter features. For example, Brassil et al. [2] adjusted letters by reducing or increasing letter length, such as increasing the length of the letters b, d, or h. In principle, the hidden data can be extracted by comparing the marked document against the original document. The limitation of this process is that the
hidden data has very little robustness to document processing.
W. Zhang et al. [7] applied arithmetic expressions to replace characteristics of letters with close components (in square form), hiding data by adjusting the sizes of those characters in the document file. This process is robust against attacks and difficult to observe. Their tests indicated that the hiding is more durable and harder to observe than line-shift coding, word-shift coding, and character coding, but the paper does not present robustness testing information. Moreover, this method applies only to Chinese characters and requires further research, since only the character replacement attack was tested, not the various other forms of watermark attack.
Shirali-Shahreza et al. [8] exploited character changes in Persian, where most characters are distinguished by their dots (e.g., the Persian letter NOON), for hiding data. Due to defects in using OCR to read Persian and Arabic document images, reading printed text from these characters to extract the hidden data is complicated, especially after attacks, which have not yet been tested.
Suganya et al. [9] proposed modifying perceptually significant portions of an image so that the watermark is hidden in the locations of the dots of the English letters i and j. The first few bits are used to indicate the length of the hidden bits to be stored; the cover text is then scanned and, to store a one, the dot is slightly shifted up, otherwise it remains unchanged. However, this research did not report any robustness testing results.
Technique III: Watermarking with semantic schemes or word/vowel substitution. Topkara et al. [10] developed a technique for embedding secret data without changing the meaning of the text by replacing words with synonyms. This method deteriorates the quality of the document, and a large synonym dictionary is needed.
Samphaiboon et al. [11] proposed a steganographic scheme for electronic Thai plain-text documents that exploits redundancies in the way particular vowel, diacritical, and tonal symbols are composed in TIS-620, the standard Thai character set. The scheme is blind in that the original carrier text is not required for decoding. However, it can be used only with Thai text documents, and its watermark data is easily destroyed by re-editing with a word processor.
The research presented here studies watermarking of text document images (not electronic document files) scanned or copied from an original paper document, applying the cross-ratio theory of collinear points in order to build up watermark robustness against various forms of attack, particularly geometric distortions (scaling, shearing and rotation) and other manipulations (data compression, noise addition, and brightness, contrast, scale, sharpness and blur adjustment).
II. THE CROSS RATIO OF FOUR COLLINEAR POINTS
The cross-ratio is a basic invariant in projective geometry (i.e., all other projective invariants can be derived from it). A brief introduction to the cross-ratio invariance property is given here.
Let A, B, C, D be four collinear points (three or more points A, B, C, ... are said to be collinear if they lie on a single straight line [12]) as shown in Fig. 1. The cross-ratio is defined as the "double ratio" in Eq. (1):

(A, B; C, D) = (AC x BD) / (BC x AD)    (1)
where all the segments are thought to be signed. The cross-ratio does not depend on the selected direction of the line ABCD, but does depend on the relative position of the points and the order in which they are listed. Based on a fundamental theory, any homography preserves the cross-ratio. Thus central projection, linear scaling, skewing, rotation, and translation preserve the cross-ratio [13].
Figure 1. Collinear points A, B, C, and D
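As a quick numerical check (our illustration, with made-up point values), four collinear points keep the same cross-ratio after an arbitrary 1-D projective map:

def cross_ratio(a, b, c, d):
    # signed cross-ratio (A, B; C, D) = (AC * BD) / (BC * AD), Eq. (1)
    return ((c - a) * (d - b)) / ((c - b) * (d - a))

def homography(x, p=2.0, q=1.0, r=0.5, s=3.0):
    # a 1-D projective map x -> (p*x + q) / (r*x + s), with p*s - q*r != 0
    return (p * x + q) / (r * x + s)

pts = [0.0, 1.0, 3.0, 7.0]                          # positions of A, B, C, D on a line
print(cross_ratio(*pts))                            # 1.2857...
print(cross_ratio(*[homography(x) for x in pts]))   # same value: the invariant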
III. ANALYSIS OF DIGITAL WATERMARKING FOR DOCUMENT IMAGE
To apply the cross ratio to digital image watermarking,
three reference points are required. In this section, a method for deriving such reference points is detailed.
A. Definition

C_aC_b is the line from an origin point C_a to a destination point C_b, where a = 1, 2, 3, 4 and b = 1, 2, 3, 4.

Cr = (CA/CD) : (BA/BD) = (CA/CD) x (BD/BA)
R = (AC/CD) / Cr
BA = (AD x R) / (1 + R)
DsB is the distance ratio BA/AD.
B. Embedding Scheme
Let’s start by considering the embedding part. The method is described algorithmically below.
1) Predefine the set of cross-ratio values, to be used in subsequent steps.
2) Find the image center, denoted by D_c, using the line intersection formula [14] (the two diagonal lines of the image) as described by Eqs. (2)-(3) below.
x_c = x_t / x_b    (2)
y_c = y_t / y_b    (3)

where

x_t = | a    x1 - x4 |        y_t = | a    y1 - y4 |
      | b    x3 - x2 |              | b    y3 - y2 |

x_b = | x1 - x4    y1 - y4 |  = y_b
      | x3 - x2    y3 - y2 |

a = | x1  y1 |        b = | x3  y3 |
    | x4  y4 |            | x2  y2 |

In addition, (x_i, y_i) is the coordinate of the corner point C_i, i = 1, ..., 4 (see Fig. 2). In the above equations, x_c is the x-coordinate of the point D_c where the two diagonals intersect (C1C4 intersects C2C3), y_c is the y-coordinate of the same point, and | . | denotes the determinant operator, as shown in Fig. 2.
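A small numerical sketch (our illustration, taking C1 and C2 as the top corners and C3 and C4 as the bottom ones) of the determinant form of Eqs. (2)-(3):

def det2(a, b, c, d):
    # | a  b |
    # | c  d |
    return a * d - b * c

def image_center(c1, c2, c3, c4):
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = c1, c2, c3, c4
    a = det2(x1, y1, x4, y4)
    b = det2(x3, y3, x2, y2)
    denom = det2(x1 - x4, y1 - y4, x3 - x2, y3 - y2)   # x_b = y_b
    xc = det2(a, x1 - x4, b, x3 - x2) / denom
    yc = det2(a, y1 - y4, b, y3 - y2) / denom
    return xc, yc

# for an axis-aligned rectangle the diagonals meet at the midpoint
print(image_center((0, 0), (10, 0), (0, 8), (10, 8)))  # (5.0, 4.0)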
3) Find each of the primary-level watermark embedding points (D_LU,i and D_LD,i) on the left diagonal line (see Fig. 2) as described by Eqs. (4)-(7) below. Those points can be identified by using the two corner points of the left diagonal line (C1 and C4), in combination with the image center point D_c, as shown in Fig. 2(a), and the predefined cross-ratio values (Cr):

x_LU,i = x1 + DsB x (x4 - x1)    (4)
y_LU,i = y1 + DsB x (y4 - y1)    (5)
x_LD,i = x1 + DsB x (x4 - x1)    (6)
y_LD,i = y1 + DsB x (y4 - y1)    (7)

where (x_LU,i, y_LU,i), i = 1, ..., M_LU, is the coordinate of the point D_LU,i, with A = C1, B = D_LU,i, C = D_c and D = C4. In addition, (x_LD,i, y_LD,i), i = 1, ..., M_LD, is the coordinate of the point D_LD,i, with A = C1, B = D_LD,i, C = D_c and D = C4.
4) Find each of the watermark embedding points (D_RU,i and D_RD,i) on the right diagonal line (see Fig. 2(b)) by following steps and equations similar to those detailed in Step 3. However, the point A in Eqs. (8)-(11) now represents the point C2, while the point D represents the point C3. With these substitutions, the embedding points are given by

x_RU,i = x2 + DsB x (x3 - x2)    (8)
y_RU,i = y2 + DsB x (y3 - y2)    (9)
x_RD,i = x2 + DsB x (x3 - x2)    (10)
y_RD,i = y2 + DsB x (y3 - y2)    (11)

where (x_RU,i, y_RU,i), i = 1, ..., M_RU, is the coordinate of the point D_RU,i, with A = C2, B = D_RU,i, C = D_c and D = C3. In addition, (x_RD,i, y_RD,i), i = 1, ..., M_RD, is the coordinate of the point D_RD,i, with A = C2, B = D_RD,i, C = D_c and D = C3.
Figure 2. Notations of collinear points A, B, C, and D, defined in the cross-ratio equation, on the left (a) and right (b) diagonal lines of the document image.
5) For each pair of levels D_LU,i, D_RU,i and D_LD,i, D_RD,i, find the intersection (x_i, y_i) of the crossing line of each level, drawn across the left-side (L_LU,1 ... L_LD,1) and right-side (L_RU,1 ... L_RD,1) document image borders (see Fig. 3(a)), i.e., C1C3 and C2C4, by applying Eqs. (12)-(13):
x_i = x_t / x_b    (12)
y_i = y_t / y_b    (13)

where

x_t = | a    x1 - x2 |        y_t = | a    y1 - y2 |
      | b    x3 - x4 |              | b    y3 - y4 |

x_b = | x1 - x2    y1 - y2 |  = y_b
      | x3 - x4    y3 - y4 |

a = | x1  y1 |        b = | x3  y3 |
    | x2  y2 |            | x4  y4 |
6) Find each of the watermark embedding points (E_HU,i,k and E_HD,i,k) on the watermark embedding lines (see Fig. 3(b)). Eqs. (14)-(17) give the embedding points E_HU,i,k and E_HD,i,k:

x_HU,i,k = x_LU,i + DsB x (x_RU,i - x_LU,i)    (14)
y_HU,i,k = y_LU,i + DsB x (y_RU,i - y_LU,i)    (15)
x_HD,i,k = x_LD,i + DsB x (x_RD,i - x_LD,i)    (16)
y_HD,i,k = y_LD,i + DsB x (y_RD,i - y_LD,i)    (17)

where, for Eqs. (14)-(15), i = 1, ..., M_LU, A = L_LU,i, B = E_HU,i,k, C = D_RU,i and D = L_RU,i. In addition, for Eqs. (16)-(17), i = 1, ..., M_LD, A = L_LD,i, B = E_HD,i,k, C = D_RD,i and D = L_RD,i.
7) From all watermark embedding points, embed the watermark patterns by means of a spread-spectrum principle [15] using the following equations.

Given the set of watermark embedding points E_k = (x_k, y_k), k = 1, ..., M, and the watermarking pattern bits w_k, w_k in {1, -1}, k = 1, ..., M, each watermarking pattern bit is embedded into the original image by Eq. (18):

I_e(x_km, y_kn) = I(x_km, y_kn) + α w_k    (18)

where x_km = x_k + m, m = -P, ..., P; y_kn = y_k + n, n = -Q, ..., Q; and α is the strength of the watermark.
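A minimal sketch of Eq. (18) (our illustration; the point coordinates and bits are made up, with P = Q = 2 giving the 5x5 blocks and α = 3 used in the experiment below):

import numpy as np

def embed(image, points, bits, alpha=3, P=2, Q=2):
    # add alpha * w_k over a (2P+1) x (2Q+1) block centred on each point E_k
    out = image.astype(np.int32)
    for (x, y), w in zip(points, bits):
        out[y - Q:y + Q + 1, x - P:x + P + 1] += alpha * w
    return np.clip(out, 0, 255).astype(np.uint8)

img = np.full((100, 100), 128, dtype=np.uint8)   # toy grayscale page
points = [(20, 30), (60, 70)]                    # E_k from the cross-ratio construction
bits = [1, -1]                                   # watermark pattern bits w_k
marked = embed(img, points, bits)
print(marked[30, 20], marked[70, 60])            # 131 125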
Figure 3. (a) Notations of the horizontal lines intersecting the 2 diagonal lines and the left and right border lines in a text document image. (b) Notations of the collinear points used for embedding 20 invisible actual watermark pattern bits in document images of the English, Thai, Chinese, and Arabic languages.
C. Detection Scheme
To detect a watermark from the document image I'_e, the four image corner points must first be detected. This can be achieved, for example, by using any existing corner detection algorithm. Once the four corner points are detected, the watermark embedding points must be identified. Each point
can be calculated by a method similar to that of the embedding stage (see Section B for details). By extracting the values of the pixels at those watermark embedding points, denoted by I'_e(x_k, y_k), a watermark can be detected using any existing watermark detector. Here, we adopt the correlation coefficient detector [15]. The correlation coefficient value is computed by Eq. (19):

Z_cc(I'_e, w_k) = Σ(I~_e x W~_k) / sqrt( Σ(I~_e x I~_e) x Σ(W~_k x W~_k) )    (19)

where I~_e = I'_e - mean(I'_e) and W~_k = W_k - mean(W).

A watermark is detected if the correlation coefficient value is greater than a detection threshold. For example, in the experiment that follows, the detection threshold is 0.5.
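A self-contained sketch of Eq. (19) (our illustration; the toy image, points and bits are made up):

import numpy as np

def detect(image, points, bits, threshold=0.5):
    I = np.array([float(image[y, x]) for x, y in points])   # extracted pixel values
    W = np.array(bits, dtype=float)
    It, Wt = I - I.mean(), W - W.mean()                     # I~ and W~ of Eq. (19)
    z = float(It @ Wt) / np.sqrt(float(It @ It) * float(Wt @ Wt))
    return z, z > threshold

# toy page: flat background with +/- alpha marks at three embedding points
img = np.full((50, 50), 128.0)
points, bits = [(10, 10), (20, 30), (40, 25)], [1, -1, 1]
for (x, y), w in zip(points, bits):
    img[y, x] += 3 * w
print(detect(img, points, bits))   # (1.0, True) on the unattacked image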
IV. EXPERIMENT
In this computer simulation experiment, 35 grayscale multi-language document images of size 1240x1754 pixels were used, with 20 different invisible actual watermark patterns of length 100 bits, α = 3, and a watermark block size of 5x5 pixels per watermark pattern bit. 120 cross-ratio values were used for watermark embedding and detection.
The digital embedding experiment on the 35 grayscale document images, comprising images with text in English, Thai, Chinese, and Arabic and applying 20 different watermark patterns (see Fig. 4), measured the watermark values via the correlation coefficient with a fixed threshold of 0.5 (if a watermark is present in a text document image, the value is 0.5 or above; if no watermark is present, the value must be less than 0.5). The results revealed a reasonable enhancement of watermark robustness from applying the cross-ratio.
Figure 4. Some examples of the 20 random invisible actual watermark pattern bits created and embedded between the text lines of English, Thai, Chinese, and Arabic text document images.
First, the control document image without a watermark was tested: compared against the watermark pattern, it yielded a correlation coefficient of 0, while the image with the watermark pattern yielded a correlation coefficient of 1.
After that, the robustness of the cross-ratio watermarking was tested against 9 attacks: 3 geometric distortions (shearing, scaling and rotating) and 6 manipulations (compression, sharpness, brightness, contrast, blur masking and noise addition). The effects on the actual watermark detection results can be classified into three groups, as follows.
Group I: No effect on actual watermark robustness was found under the sharpness attack, which showed a correlation coefficient of 1 for all percentages of sharpness filtering, over the range 0 - 100%.
Group II: Very low effect on actual watermark robustness was found under attacks of compression, at 60 - 100% JPEG compression quality (see Fig. 5); scaling, at scaling factors of 11 - 60% (see Fig. 6); blur, at blur filtering mask sizes of 3x3 - 13x13 (see Fig. 7); contrast, in the range 1 - 45%; shearing, in the range 0 - 0.05; and rotating, at angles between 1 and 4 degrees (see Fig. 8). These showed acceptable correlation coefficient values between 0.5 and 1 for all the attack values specified above.
Group III: High effect on actual watermark robustness was found under attacks of brightness at ranges higher than 5%, Salt & Pepper noise addition at ranges higher than 1.5%, and Gaussian noise addition at all ranges, which showed unacceptable correlation coefficient values, mostly near 0.
Figure 5. Correlation coefficient of watermarked multi-language document images, showing that all invisible actual watermarks could still be reasonably detected with the JPEG compression quality (%) reduced down to the 60% level.
Figure 6. Correlation coefficient of watermarked multi-language document images at different scaling factors, showing robustness for scaling factors between 11% and 120%.
Figure 7. Correlation coefficient of watermarked multi-language document images after attacks with blur filtering mask sizes between 3x3 and 15x15, showing that robustness could be kept up to a blur filtering mask size of 13x13.
Figure 8. Correlation coefficient values of watermarked multi-language document images with the rotation angle varied from 1 to 4 degrees, still showing robustness.
Noise-disturbing signals are the most complicated factor affecting watermark detection: the more disturbing signals there are, the more difficult watermark detection becomes.
V. CONCLUSIONS
The correlation coefficient measurement, with acceptable values between 0.5 and 1, used for detecting the invisible grayscale watermark in multi-language document image files, has shown that applying the cross-ratio theory can effectively build up reasonable watermarking robustness against geometric distortion attacks (scaling, especially at ranges higher than 11%, shearing at 0 - 0.05, and rotating at 1 - 4 degrees) and some manipulation attacks (compression at quality higher than 60%, contrast at 1 - 45%, sharpness at 0 - 100%, and blur filtering with mask sizes no greater than 13x13). This built-up robustness is based on four collinear points, which are used both as the watermark embedding patterns and as the reference points for watermark detection. The image does not need to be inversely transformed before detecting watermark positions; the positions can be detected directly, and no comparison with the original document image without a watermark is necessary. Ownership of our document can be proved directly from watermark detection through comparison with the existing watermark pattern.
The experiment has also shown that the method can be applied to all multi-language document images, without depending on specific language attributes, unlike some methods mentioned above, which mostly focused on testing only one specific language and did not thoroughly explore the possible attacks that affect watermark robustness. This is the first step in applying the cross-ratio theory to grayscale multi-language document image watermarking. In the next step we hope to improve it to build up significantly higher robustness, especially against the noise addition, rotation and brightness attacks.
REFERENCES
[1] J. T. Brassil, et al., ”Electronic Marking and Identification Techniques to Discourage Document Copying”, IEEE Journal on Selected Areas in Communications, Vol.13, No.8, Oct 1995, pp.1495-1504.
[2] S.H. Low, N.F. Maxemchuk, J.T. Brassil, and L. O’Gorman, “Document marking and identification using both line and word shifting”, Proceedings of the Fourteenth Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM’95), vol.2, 1995, pp. 853-860.
[3] Y. Kim, K. Moon, and I. Oh, “A Text Watermarking Algorithm based on Word Classification and Inter-word space Statistics”, Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR’03), 2003, pp. 775-779.
[4] A.M. Alattar and O.M. Alattar, "Watermarking electronic text documents containing justified paragraphs and irregular line spacing", Proceedings of SPIE - Volume 5306, Security, Steganography, and Watermarking of Multimedia Contents VI, 2004, pp. 685-695.
[5] D. Huang, and H. Yan, “Interword Distance Changes Represented by Sine Waves for Watermarking Text Images”, IEEE Trans. Circuits and Systems for Video Technology, Vol. 11, No. 12, pp. 1237-1245, 2001.
[6] Du Min and Zhao Quanyou, “Text Watermarking Algorithm based on Human Visual Redundancy”, AISS Journal, Advanced in Information Sciences and Service Sciences. Vol. 3, No. 5, pp. 229-235, 2011.
[7] W. Zhang, Z. Zeng, G. Pu, and H. Zhu, “ Chinese Text Watermarking Based on Occlusive Components”, IEEE, pp. 1850-1854, 2006.
[8] M.H. Shirali-Shahreza, and M. Shirali-Shahreza, “A New Approach to Persian/ Arabic Text Steganography”, IEEE International Conference on Computer and Information Science, 2006.
[9] Ranganathan Suganya, Johnsha Ahamed, Ali, Kathirvel.K & Kumar, Mohan, “Combined Text Watermarking”, International Journal of Computer Science and Information Technologies, Vol. 1 (5) , pp. 414-416, 2010.
[10] U. Topkara, M. Topkara, M. J. Atallah, "The Hiding Virtues of Ambiguity: Quantifiably Resilient Watermarking of Natural Language Text through Synonym Substitutions", In Proc. of ACM Multimedia and Security Conference, 2006.
[11] Samphaiboon Natthawut and Dailey Matthew N., "Steganography in Thai text", In Proc. of the 5th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, IEEE ECTI-CON 2008, pp. 133-136, 2008.
[12] Coxeter, H. S. M. and Greitzer, S. L. “Collinearity and Concurrence.”, Geometry Revisited, Ch. 3, Math. Assoc. Amer, 1967, pp. 51-79.
[13] R. Mohr and L. Morin, “Relative Positioning from Geometric Invariants,” Proceedings of the Conference on Computer Vision and Pattern Recognition, 1991, pp. 139-144.
[14] Antonio, F. “Faster Line Segment Intersection”, Graphics Gems III, Ch. IV.6, Academic Press, 1999, pp. 199-202 and 500-501.
[15] I. J. Cox, M. L. Miller, and J. A. Bloom, Digital Watermarking, Morgan Kaufmann Publishers, 2002.
PCA Based Handwritten Character Recognition System Using Support Vector Machine & Neural Network
Ravi Sheth1, N C Chauhan2, Mahesh M Goyani3, Kinjal A Mehta4
1 Information Technology Dept., A.D. Patel Institute of Technology, New V V Nagar-388120, Gujarat, India
2 Information Technology Dept., A.D. Patel Institute of Technology, New V V Nagar-388121, Gujarat, India
3 Computer Engineering Dept., L.D. College of Engineering, Ahmedabad, Gujarat, India
4 Electronics and Communication Dept., L.D. College of Engineering, Ahmedabad, Gujarat, India
Abstract— Pattern recognition deals with categorization of input data into one of the given classes based on extraction of features. Handwritten Character Recognition (HCR) is one of the well-known applications of pattern recognition. For any recognition system, an important part is feature extraction; a proper feature extraction method can increase the recognition ratio. In this paper, a Principal Component Analysis (PCA) based feature extraction method is investigated for developing an HCR system. PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension. The PCA features of the character images have been used for training and testing with Neural Network (NN) and Support Vector Machine (SVM) classifiers. HCR is also implemented with PCA and Euclidean distance.
Keywords: Pattern recognition, handwritten character recognition, feature extraction, principal component analysis, neural network, support vector machine, Euclidean distance.
I. INTRODUCTION
Handwritten character recognition is an area of pattern recognition that has become the subject of research during the last few decades. Handwriting recognition has always been a challenging task in pattern recognition, and many systems and classification algorithms have been proposed in past years. Techniques ranging from statistical methods such as PCA and Fisher discriminant analysis [1] to machine learning such as neural networks [2] or support vector machines [3] have been applied to this problem. The aim of this paper is to recognize handwritten English characters by using PCA with the three different methods mentioned above. Handwritten characters show an infinite variety of styles, varying from person to person. Due to this wide range of variability, it is very difficult for a machine to recognize a handwritten character, and the ultimate target is still out of reach. There is huge scope for development in the field of handwritten character recognition; any future progress in the field will increase communication between machines and people. Generally, HCR is divided into four major parts, as shown in Fig. 1 [4]: binarization, segmentation, feature extraction and classification. Major problems faced when dealing with segmented handwritten character recognition are the ambiguity and illegibility of the characters. The accurate recognition of segmented characters is important for the recognition of words based on segmentation [5]. Feature extraction is the most difficult part of an HCR system.
Figure 1: Block diagram of HCR system (Input → Binarization → Segmentation → Feature Extraction → Classification → Output).
Before recognition, the handwritten characters have to be processed to make them suitable for recognition. Here, we consider the processing of an entire document containing multiple lines and many characters in each line; our aim is to recognize characters from the entire document. The handwritten document has to be free from noise, skewness, etc. The lines and words have to be segmented, and the characters of any word have to be free from any slant angle so that the characters can be separated for recognition. By this assumption, we avoid the more difficult case of cursive writing. Segmentation of unconstrained handwritten text lines is difficult because of inter-line distance variability, baseline skew variability, different font sizes and the age of the document [5]. In the next step of this process, features are extracted from the segmented character. Feature extraction is a very important part of the character recognition process. The extracted features are fed to classifiers, which recognize the character based on the trained features. In the second section, we describe the
feature extraction method in brief, including the principal component analysis method. In the following sections we discuss the neural network, SVM and Euclidean distance methodologies.
II. FEATURE EXTRACTION
Any given image can be decomposed into several features. The term 'feature' refers to similar characteristics; therefore, the main objective of a feature extraction technique is to accurately retrieve these features. The term "feature extraction" can thus be taken to include a very broad range of techniques and processes for the generation, update and maintenance of discrete feature objects or images [6]. Feature extraction is the most difficult part of an HCR system. This approach gives the recognizer more control over the properties used in identification. The character classification task recognizes a character by comparing it with the standard values learned from the training characters, where each character corresponds to the document image matching the document style chosen in the document style setting part. Here we have investigated and developed a PCA-based feature extraction method.
Principal component analysis
PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension [7]. It is a way of identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences. Since patterns can be hard to find in data of high dimension, where graphical representation is not available, PCA is a powerful tool for analyzing data [7]. The other main advantage of PCA is that once you have found these patterns in the data, you can compress the data, i.e., by reducing the number of dimensions, without much loss of information [7]. Principal component analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has as high a variance as possible (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to (uncorrelated with) the preceding components. Principal components are guaranteed to be independent only if the data set is jointly normally distributed. Before presenting the methodology, it is important to discuss the following terms related to PCA [7].
Eigenvectors and Eigenvalues
The eigenvectors of a square matrix are the non-zero vectors that, after being multiplied by the matrix, remain proportional to the original vector (i.e., they change only in magnitude, not in direction). For each eigenvector, the corresponding eigenvalue is the factor by which the eigenvector is scaled when multiplied by the matrix. Another property of eigenvectors is that even if we scale the vector by some amount before we multiply it, we still get the same multiple of it as a result. It is also worth knowing that when mathematicians find eigenvectors, they like to find eigenvectors whose length is exactly one: as noted, the length of a vector does not affect whether it is an eigenvector, whereas the direction does. So, to keep eigenvectors standard, whenever we find an eigenvector we usually scale it to have a length of 1, so that all eigenvectors have the same length [7].
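A quick numerical illustration of these properties (our sketch, with a made-up symmetric matrix):

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
vals, vecs = np.linalg.eigh(A)        # eigenvalues/eigenvectors of a symmetric matrix
v = vecs[:, 1]                        # eigenvector for the largest eigenvalue

print(np.allclose(A @ v, vals[1] * v))              # True: A v = lambda v
print(np.allclose(A @ (5 * v), vals[1] * (5 * v)))  # True: a scaled v is still an eigenvector
print(np.linalg.norm(v))                            # 1.0: returned with unit length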
Steps for generating principal components of character and digit images:
Step 1: Get some data and find the mean of the data set. In this work we have used our own made-up data set. The data set consists of handwritten characters A-J and digits 1-5, with 30 samples of each character or digit. The mean is found using equation (1).
$M = \frac{1}{N}\sum_{k=1}^{N} X_k \qquad (1)$

where $M$ is the mean, $N$ is the total number of input images, and $X_k$ is the $k$-th input image.
Step 2: Subtract the mean. For PCA to work properly, we subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension (equation (2)), where M is the mean calculated using equation (1). So all the X values have the mean of the x values of all the data points subtracted, and likewise for the Y values. This produces a data set whose mean is zero.
$\bar{X}_n = X_n - M \qquad (2)$
Step 3: Calculate the covariance matrix. The next step is to compute the covariance matrix using equation (3).
$C = \frac{1}{N}\sum_{k=1}^{N} (X_k - M)(X_k - M)^{T} \qquad (3)$
Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix. Since the covariance matrix is square, we have calculated the eigenvectors and eigenvalues for this matrix. By this process
of taking the eigenvectors of the covariance matrix, we have been able to extract lines that characterize the data. The rest of the steps involve transforming the data so that it is expressed in terms of these lines.
Step 5: Choose components and form a feature vector. Here is where the notion of data compression and reduced dimensionality comes in. In general, once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance. What needs to be done now is to form a feature vector, which is just a fancy name for a matrix of vectors. It is constructed by taking the eigenvectors that we want to keep from the list of eigenvectors and forming a matrix with these eigenvectors in the columns.
Step 6: Derive the new data set. This final step in PCA is also the easiest. Once we have chosen the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of the vector and multiply it on the left of the original data set, transposed.
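To make steps 1-6 concrete, the following is a minimal sketch of the whole procedure (our own illustrative code, not the authors' Matlab implementation; the array shapes and the number of retained components are assumptions):

import numpy as np

def pca_features(images, n_components=25):
    """Steps 1-6: images is an (n_samples, n_pixels) array of flattened
    character/digit images; returns the projected feature matrix."""
    # Steps 1-2: compute the mean image and subtract it (equations (1)-(2)).
    mean = images.mean(axis=0)
    centered = images - mean
    # Step 3: covariance matrix of the centered data (equation (3)).
    cov = np.cov(centered, rowvar=False)
    # Step 4: eigenvectors/eigenvalues of the square, symmetric covariance.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 5: order components by eigenvalue, highest first, keep the top ones.
    order = np.argsort(eigenvalues)[::-1]
    feature_vector = eigenvectors[:, order[:n_components]]
    # Step 6: project the centered data onto the retained components.
    return centered @ feature_vector

# Hypothetical usage: 450 samples of 32x32 images flattened to 1024 pixels.
# PC_A = pca_features(np.random.rand(450, 1024))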
III. CLASSIFICATION METHODS
A. Neural Network
Artificial neural networks (ANNs) provide powerful simulation of information processing and are widely used in pattern recognition applications. The most commonly used neural network is the multilayer feed-forward network, which maps an input layer of nodes onto an output layer through a number of hidden layers. In such networks, the back-propagation algorithm is usually used as the training algorithm for adjusting weights [9]. The back-propagation model, or multi-layer perceptron, is a neural network that utilizes a supervised learning technique. Typically there are one or more layers of hidden nodes between the input and output nodes. A single network can be trained to reproduce all the visual parameters, or many networks can be trained so that each network estimates a single visual parameter. Many parameters, such as training data, transfer function, topology, learning algorithm and weights, can be controlled in the neural network [9].
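For illustration only, the following sketch trains a multilayer feed-forward network with back-propagation on PCA features; it uses scikit-learn rather than the toolchain of this work, and the stand-in data and hyperparameters (apart from the hidden layout loosely mirroring the [25 30 6 25] structure of Table 1) are our assumptions:

import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in data: 450 training samples with 25 PCA features and 15 class
# labels (A-J and 1-5), mirroring the dataset sizes of Section IV.
rng = np.random.default_rng(0)
PC_A = rng.random((450, 25))
labels = np.repeat(np.arange(15), 30)

# Hidden layout (30, 6) mirrors the middle layers of the [25 30 6 25]
# structure in Table 1; the remaining settings are our assumptions.
clf = MLPClassifier(hidden_layer_sizes=(30, 6), max_iter=2000, random_state=0)
clf.fit(PC_A, labels)

PC_B = rng.random((30, 25))   # stand-in features of segmented test characters
predicted = clf.predict(PC_B)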
B. Support Vector Machine
The main purpose of any machine learning technique is to achieve the best generalization performance, given a specific amount of time and a finite amount of training data, by striking a balance between the goodness of fit attained on a given training dataset and the ability of the machine to achieve error-free recognition on other datasets [10].
Figure 2: Neural network design
With this concept as the basis, support vector machines have proved to achieve good generalization performance with no prior knowledge of the data. The main goal of an SVM [10] is to map the input data onto a higher-dimensional feature space nonlinearly related to the input space and to determine a separating hyperplane with maximum margin between the two classes in the feature space.
Figure 3: SVM margin and support vectors [10]
The main task of an SVM is to find this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors). Let the data D be (Z1, y1), ..., (Z|D|, y|D|), where Zi is the set of training tuples associated with the class labels yi, which take either the value +1 or -1 [11]. There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data). SVM searches for the hyperplane with the largest margin, i.e., the Maximum Marginal Hyperplane (MMH) [11]. The basic concept of SVM can be summarized as follows.
A separating hyperplane can be written as [11]

$X \cdot Z + C = 0 \qquad (4)$

where $X = (X_1, X_2, \ldots, X_n)$ is a weight vector and $C$ a scalar (bias). For 2-D it can be written as [11]

$X_0 + X_1 Z_1 + X_2 Z_2 = 0,$

where $X_0 = C$ is an additional weight.
The hyperplanes defining the sides of the margin are:

$H_1: X_0 + X_1 Z_1 + X_2 Z_2 \geq 1$ for $Y_i = +1$, and
$H_2: X_0 + X_1 Z_1 + X_2 Z_2 \leq -1$ for $Y_i = -1$.

Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors [11].
If the data were 3-D (i.e., with three attributes), then we would have to find the best separating plane.
After we have a trained support vector machine, we use it to classify test (new) tuples. Based on the Lagrangian formulation [11], the MMH can be rewritten as the decision boundary

$d(Z^T) = \sum_{i=1}^{L} Y_i \alpha_i Z_i \cdot Z^T + C \qquad (5)$

where $Y_i$ is the class label of support vector $Z_i$, $Z^T$ is a test tuple, $\alpha_i$ is a Lagrangian multiplier, and $L$ is the number of support vectors.
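As a hedged illustration of this classification step, the sketch below uses scikit-learn's SVC (which wraps the libsvm library used in this work [12]) with the RBF kernel, cost 1 and gamma 1 reported in Table 1; the stand-in feature matrices are our assumptions:

import numpy as np
from sklearn.svm import SVC  # scikit-learn's SVC wraps the libsvm library

rng = np.random.default_rng(0)
PC_A = rng.random((450, 25))              # stand-in PCA training features
labels = np.repeat(np.arange(15), 30)     # 15 classes: A-J and 1-5

# Kernel and parameters follow Table 1: RBF kernel, cost C=1, gamma=1.
svm = SVC(kernel='rbf', C=1.0, gamma=1.0)
svm.fit(PC_A, labels)

PC_B = rng.random((30, 25))               # stand-in test features
predicted = svm.predict(PC_B)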
C. Euclidean Distance
Euclidean distance is the most popular technique for finding the distance between two matrices or images. Let X, Y be two $n \times m$ images, $X = (X_1, X_2, \ldots, X_{nm})$, $Y = (Y_1, Y_2, \ldots, Y_{nm})$. The Euclidean distance between X and Y is given by

$d(X, Y) = \sqrt{\sum_{k=1}^{nm} (X_k - Y_k)^2} \qquad (6)$
IV. EXPERIMENT AND RESULTS
In this work the PCA method discussed in Section II was implemented in the Matlab environment. The extracted data is used as features for two classifiers, namely a neural network and a support vector machine. We prepared a real-time dataset comprising the characters A to J and the digits 1 to 5. The data set was prepared by taking the handwriting of different persons in a specific format. We took 30 samples of each character and digit, so our dataset contains a total of 450 samples for characters A to J and digits 1 to 5. We applied the PCA method to this database and prepared the feature matrix PC_A. On the other side, for testing purposes, we took 30 different images. Binarization and segmentation are applied one by one to each input image. The same feature matrix PC_B is prepared for all the segmented characters.
A. Implementation Results of ANN & PCA Based Character Recognition
The prepared PC_A matrix is given as input to the neural network for training. Similarly, the PC_B matrix is given to this trained network for testing. An overall accuracy of 85% was obtained for the test data using the ANN.
B. Implementation Results of SVM & PCA Based Character Recognition
As described above, the PC_A matrix is given as input to the SVM for training, and the PC_B matrix is given to the trained model for testing. We used the libsvm package [12] for classification. An overall accuracy of 92% was obtained for the test data using the SVM.
C. Implementation Results of Euclidean Distance & PCA Based Character Recognition
In this method, for recognition we compute the Euclidean distance between PC_A and PC_B, find the minimum index, and based on this index determine which character is recognized. PC_A and PC_B are prepared using the steps discussed in the previous section. The measured overall accuracy of this method is 90%.
D. Comparison of Recognition Using ANN, SVM Classifiers and Euclidean Distance
Table 1 lists the different methods and their accuracies. As shown in the table, the overall accuracy of PCA (SVM) is better than that of the PCA (NN) and PCA (Euclidean distance) methods. If we compare these methods on the basis of training time, the SVM method also requires less time than the neural network and Euclidean distance methods. The drawback of the SVM method, however, is that SVM-format training and testing files must be generated, which is not required for the other methods. Comparing individual character accuracy, PCA (SVM) again gives better results than the other methods.
Table 1: Comparison of Overall Accuracy

Sr. no | Method                   | Structure/Parameter                                      | Accuracy
1      | PCA (Neural Network)     | [25 30 6 25]                                             | 85%
2      | PCA (SVM)                | Kernel: RBF (Radial Basis Function), Cost = 1, Gamma = 1 | 92%
3      | PCA (Euclidean distance) | -                                                        | 90%
Table 2: Comparison of Individual Character Accuracy

Sr. no | Letter or Digit | PCA-SVM (%) | PCA-ANN (%) | PCA-Euclidean Distance (%)
1      | A               | 96          | 80          | 98
2      | B               | 99          | 80          | 98
3      | C               | 99          | 100         | 96
4      | D               | 95          | 70          | 96
5      | E               | 96          | 80          | 95
6      | F               | 97          | 80          | 95
7      | G               | 96          | 90          | 95
8      | H               | 95          | 80          | 98
9      | I               | 98          | 75          | 96
10     | J               | 97          | 80          | 96
11     | 1               | 97          | 80          | 95
12     | 2               | 96          | 90          | 95
13     | 3               | 95          | 80          | 95
14     | 4               | 99          | 80          | 98
15     | 5               | 96          | 80          | 95
V. CONCLUSION
A simple and efficient off-line handwritten character recognition system using a new type of feature extraction, namely PCA, is investigated. The selection of the feature extraction method is the most important factor for achieving a high recognition ratio. In this work, we have implemented a PCA-based feature extraction method. With the obtained features, we have trained both a neural network and an SVM to recognize characters. We have also implemented character recognition with PCA and Euclidean distance. In the investigated work, the three methods showed overall recognition of 85% for the PCA-based neural network, 92% for the PCA-based SVM, and 90% for PCA with Euclidean distance.
REFERENCES
[1] S. Mori, C. Y. Suen and K. Yamamoto, "Historical review of OCR research and development," Proceedings of the IEEE, vol. 80, pp. 1029-1058, July 1992.
[2] V. K. Govindan and A. P. Shivaprasad, "Character recognition - a review," Pattern Recognition, vol. 23, no. 7, pp. 671-683, 1990.
[3] H. Fujisawa, Y. Nakano and K. Kurino, "Segmentation methods for character recognition: from segmentation to document structure analysis," Proceedings of the IEEE, vol. 80, pp. 1079-1092, 1992.
[4] Ravi K. Sheth, N. C. Chauhan and Mahesh M. Goyani, "A Handwritten Character Recognition System using Correlation Coefficient," V V P Rajkot, 8-9 April 2011, ISBN 978-81-906377-5-6, pp. 395-398.
[5] U. Pal and B. B. Chaudhuri, "Indian script character recognition: a survey," Pattern Recognition, vol. 37, no. 9, pp. 1887-1899, 2004.
[6] Ravi K. Sheth, N. C. Chauhan, M. G. Goyani and Kinjal A. Mehta, "Chain code based handwritten character recognition system using neural network and SVM," ICRTITCS-11, 9-10 December, Mumbai.
[7] Lindsay I. Smith, "A Tutorial on Principal Components Analysis," February 26, 2002.
[7] Dewi Nasien, Habibollah Haron and Siti Sophiayati Yuhaniz, "The Heuristic Extraction Algorithms for Freeman Chain Code of Handwritten Character," International Journal of Experimental Algorithms (IJEA), vol. 1, issue 1.
[8] S. Arora, "Features Combined in a MLP-based System to Recognize Handwritten Devnagari Character," Journal of Information Hiding and Multimedia Signal Processing, vol. 2, no. 1, January 2011.
[9] H. Izakian, S. A. Monadjemi, B. Tork Ladani and K. Zamanifar, "Multi-Font Farsi/Arabic Isolated Character Recognition Using Chain Codes," World Academy of Science, Engineering and Technology, vol. 43, 2008.
[10] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, pp. 121-167, 1998.
[11] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques," 2nd ed., Morgan Kaufmann, 2006, pp. 337-343.
[12] Chih-Jen Lin, "A Library for Support Vector Machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Web Mining Using Concept-based Pattern Taxonomy Model
Sheng-Tang Wu, Dept. of Applied Informatics and Multimedia, Asia University, Taichung, Taiwan
Yuefeng Li, Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia, [email protected]
Yung-Chang Lin, Dept. of Applied Informatics and Multimedia, Asia University, Taichung, Taiwan
Abstract—In the last decade, most Pattern-based Knowledge Discovery systems have used statistical analyses only (e.g. occurrence or frequency) in the phase of pattern discovery. The downside of these approaches is that two different patterns may have the same statistical features, yet one of them may contribute more to the meaning of the text than the other. Therefore, how to extract concept patterns from the data and then apply these patterns to the Pattern Taxonomy Model is the main purpose of this project. In order to analyze the concepts of documents, Natural Language Processing (NLP) techniques are used. Moreover, with support from a lexical ontology (e.g. Propbank), a novel concept-based pattern structure called "verb-argument" is defined and equipped into the proposed Concept-based Pattern Taxonomy Model (CPTM). Hence, by combining techniques from several fields (including NLP, Data Mining, Information Retrieval and Text Mining), this paper aims to develop an effective and efficient model, CPTM, to address the aforementioned problem. The proposed model is examined by conducting real Web mining tasks, and the experimental results show that the CPTM model outperforms other methods such as Rocchio, BM25 and SVM.
Keywords- Concept Pattern; Pattern Taxonomy; Knowledge Discovery; Web Mining; Data Mining
I. INTRODUCTION
Due to the rapid growth of digital data made available in recent years, knowledge discovery and data mining have attracted great attention, with an imminent need for turning such data into useful information and knowledge. Knowledge discovery [3, 5] can be viewed as the process of nontrivial extraction of information from large databases: information that is implicitly present in the data, previously unknown and potentially useful for users. In the whole process of knowledge discovery, this study especially focuses on the phase between the transformed data and the discovered knowledge. As a result, the most important issue is how to mine useful patterns using data mining techniques, and then transform them into valuable rules or knowledge.
The field of Web mining has drawn a lot of attention with the constant development of the World Wide Web. Most Web content mining techniques try to use keywords as representatives to describe the concept of documents [4, 14]. In other words, the semantics of documents are represented by a set of words frequently appearing in these articles. Unfortunately, apart from frequency, no other features, such as the relations between words, are considered. Natural Language Processing (NLP) is one of the sub-fields of Artificial Intelligence (AI). The main objective of NLP is to transform human language or text into a form that a machine can deal with. Generally speaking, the process of analyzing human language or text is very complex for a machine. Firstly, the text is broken into partitions or segments, and then each word is tagged with a label according to its part of speech (POS). Finally, appropriate representatives are generated using a parser, based on the analysis of the relationships between words, to describe the semantic information. Therefore, the relationships between discovered patterns can be evaluated instead of using the statistical features of words. The integration of NLP and the pattern taxonomy model (PTM) [17] can be expected to find more useful patterns and construct more effective concept-based pattern taxonomies.
In order to extract and analyze the concepts in documents, the statistical mechanism of the information retrieval model is insufficient during the phase of pattern discovery. One possible solution is to utilize the information provided by an ontology (such as WordNet, Treebank and Propbank [10]). Therefore, a novel Concept-based Pattern Taxonomy Model (CPTM) with support from NLP is proposed in this study for the purpose of overcoming the aforementioned problems caused by the use of statistical methods.
The typical process of Pattern-based Knowledge Discovery (PKD) has two main steps. The first step is to find proper patterns, which can represent the concept or semantics, from training data using machine learning or data mining approaches. The second step is how to effectively use these patterns to meet the user's needs. However, in most cases the relationship between patterns is ignored and not taken into account while dealing with patterns. For example, although two words may have exactly the same statistical properties, the contributions of each word are sometimes not equal [15]. Therefore, the main objective of this work is to extract and quantify the concepts in documents using the proposed PTM-based method.
II. LITERATURE REVIEW
The World Wide Web provides rich information on an extremely large number of linked Web pages. Such a repository contains not only text data but also multimedia objects, such as images, audio and video clips. Data mining on the World Wide Web can be referred to as Web mining, which has gained much attention with the rapid growth in the amount of information available on the internet. Web mining is classified into several categories, including Web content mining, Web usage mining and Web structure mining [9].
Data mining is the process of pattern discovery in a dataset from which noise has previously been eliminated and which has been transformed in such a way as to enable the pattern discovery process. Data mining techniques are developed to retrieve information or patterns and to implement a wide range of knowledge discovery tasks. In recent years, several data mining methods have been proposed, such as association rule mining [1], frequent itemset mining [21], sequential pattern mining [20], maximum pattern mining [6] and closed pattern mining [19]. Most of them attempt to develop efficient mining algorithms for the purpose of finding specific patterns within a reasonable period of time. However, how to effectively use this large number of discovered patterns is still an unsolved issue. Therefore, the pattern taxonomy mechanism [16] was proposed to replace keyword-based methods by using tree-like taxonomies as concept representatives. A taxonomy is a structure that contains information describing the relationship between a sequence and its sub-sequences [18]. In addition, the performance of PTM-based models is improved by adopting closed sequential patterns. The removal of non-closed sequential patterns also increases the efficiency of the system due to the reduced dimensionality.
III. CONCEPT-BASED PTM MODEL
The Concept-based PTM (CPTM) model is developed using a sentence-based framework proposed to address text classification problems. CPTM adopts NLP techniques by parsing and tagging each word based on its POS and generating semantic patterns as a result [15]. Different from traditional approaches, CPTM treats each sentence as a unit rather than the entire article during the phase of semantic analysis. In addition, in traditional methods the weight of terms (words) or phrases is estimated according to their statistical characteristics (such as the number of occurrences). However, words may have different descriptive capabilities even though they have exactly the same statistical value. Therefore, the more effective conceptual patterns are obtained, the more precisely the system can determine the concept.
How can we obtain more useful conceptual patterns by using NLP techniques? Our strategy is described below. An example sentence is stated as follows:
“We have investigated that the Data Mining field, developed for many years, has encountered the issues of low frequency and high dimensionality.”
In this sentence, we first label the words based on their POS. The verbs, written in bold, can then be used as nodes in a specific structure describing the semantic meaning of the sentence. By expanding words from each verb, a structure called "Verb-Argument" [10] is formed, which is defined as a conceptual pattern in this study. The following conceptual patterns are obtained from the example sentence using the above definition:
[ARG0 We] have [TARGET investigated] [ARG1 the Data Mining field, developed for many years, has encountered the issues of low frequency and high dimensionality]
[ARG1 Data Mining field] [TARGET developed] [ARGM-TMP for many years] has encountered the issues of low frequency and high dimensionality
[ARG1 Data Mining field developed for many years] has [TARGET encountered] [ARG2 the issues of low frequency and high dimensionality]
TARGET denotes the verb in the sentence. ARG0, ARG1 and ARGM-TMP are arguments appearing around the TARGET. Therefore, a set of "Verb-Argument" patterns can be discovered when applying this to a whole document. After the above process, our proposed CPTM can analyze these conceptual patterns in the next phase.
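A full Propbank-style semantic role labeler is needed to produce the ARG labels above; as a hedged sketch of only the first step, the following finds the candidate TARGET verbs by POS tagging with NLTK (the tokenizer and tagger models must be downloaded beforehand):

import nltk  # may require: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = ("We have investigated that the Data Mining field, developed for "
            "many years, has encountered the issues of low frequency and "
            "high dimensionality.")

# Tag each token with its part of speech; verb tags in the Penn Treebank
# tagset start with 'VB', and each such verb is a candidate TARGET.
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
targets = [word for word, tag in tagged if tag.startswith('VB')]
print(targets)  # e.g. ['have', 'investigated', 'developed', 'has', 'encountered']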
From the data mining point of view, conceptual patterns are of two types, sequential patterns and non-sequential patterns, defined as follows. Firstly, let T = {t1, t2, ..., tk} be a set of terms, which can be viewed as words or keywords in a dataset. A non-sequential pattern is a non-ordered list of terms {s1, s2, ..., sm} (si ∈ T), which is a subset of T. A sequential pattern, defined as S = ⟨s1, s2, ..., sn⟩ (si ∈ T), is an ordered list of terms. Note that the duplication of terms is allowed in a sequence. This is different from the usual definition, where a pattern consists of distinct terms.
After mining conceptual patterns, the relationship between patterns has to be defined in order to establish the pattern taxonomies. The sub-sequence relation is defined as follows: if there exist integers 1 ≤ i1 < i2 < ... < in ≤ m such that a1 = b_{i1}, a2 = b_{i2}, ..., an = b_{in}, then the sequence α = ⟨a1, a2, ..., an⟩ is a sub-sequence of another sequence β = ⟨b1, b2, ..., bm⟩. For example, the sequence ⟨s1, s3⟩ is a sub-sequence of ⟨s1, s2, s3⟩. However, ⟨s3, s1⟩ is not a sub-sequence of ⟨s1, s2, s3⟩, since the order of terms is considered. In addition, we can say that the sequence ⟨s1, s2, s3⟩ is a super-sequence of ⟨s1, s3⟩. The problem of mining sequential patterns is to find the complete set of sub-sequences, from a set of sequences, whose support is greater than a user-predefined threshold (minimum support).
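The sub-sequence relation can be checked directly; the following is a minimal sketch under the definition above (gaps allowed, order preserved):

def is_subsequence(alpha, beta):
    """Return True if sequence alpha is a sub-sequence of beta, i.e. all
    terms of alpha occur in beta in the same order (gaps are allowed)."""
    it = iter(beta)
    return all(term in it for term in alpha)

# <s1, s3> is a sub-sequence of <s1, s2, s3>; <s3, s1> is not.
assert is_subsequence(['s1', 's3'], ['s1', 's2', 's3'])
assert not is_subsequence(['s3', 's1'], ['s1', 's2', 's3'])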
We can then acquire a set of frequent sequential conceptual patterns CP for all documents d ∈ D+, such that CP = {p1, p2, ..., pn}. The absolute support supp_a(pi) for all pi ∈ CP is obtained as well. We first normalize the absolute support of each discovered pattern using the following equation:
$support: CP \rightarrow [0, 1] \qquad (1)$

such that

$support(p_i) = \frac{supp_a(p_i)}{\sum_{p_j \in CP} supp_a(p_j)} \qquad (2)$
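A minimal sketch of this normalization (our own illustration; the pattern names and supports are hypothetical):

def normalize_support(absolute_support):
    """Equation (2): scale each pattern's absolute support by the total,
    so that the normalized supports lie in [0, 1] and sum to 1."""
    total = sum(absolute_support.values())
    return {p: s / total for p, s in absolute_support.items()}

print(normalize_support({'p1': 6, 'p2': 3, 'p3': 1}))
# {'p1': 0.6, 'p2': 0.3, 'p3': 0.1}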
As mentioned above, statistical properties (such as support and confidence) are usually adopted to evaluate patterns when using data mining techniques to mine frequent patterns. However, these properties are not effective in the stages of pattern deployment and evolution [17]. The reason is that short patterns will always be the major factors affecting performance, due to their high frequency. Therefore, what we need is to adopt long patterns, which provide more descriptive information. Another effective way is to construct a new pattern structure that gathers relevant information by using the above-mentioned NLP techniques. Figure 1 shows the flowchart of the proposed CPTM model.
Figure 1. The flow chart of the CPTM Web mining model.
The pattern evolution shown in Figure 1 is used to map the pattern taxonomies into a feature space for the purpose of solving the low-frequency problem of long patterns. Two approaches are proposed to achieve this goal: Independent Pattern Evolving (IPE) and Deployed Pattern Evolving (DPE). IPE and DPE provide different representation manners for pattern evolving, as shown in Figure 2. IPE deals with patterns in their individual form at the early state, instead of manipulating patterns in deployed form at the late state. DPE is constructed by compounding discovered patterns from PTM into a hypothesis space, which means this action involves all the features, including some that may come from other patterns at the "P Level". Therefore, both methods can be used for pattern evolution and evaluation in the CPTM model.
Figure 2. Two types of Pattern Evolving.
As the CPTM model is established, we apply it to the Web mining task using a real Web dataset for performance evaluation. Several standard benchmark datasets are available for experimental purposes, including the Reuters Corpora, OHSUMED and the 20 Newsgroups Collection. The dataset used in our experiments in this study is the Reuters Corpus Volume 1 (RCV1) [13]. An example RCV1 document is illustrated in Figure 3.
Figure 3. An example RCV1 document.
RCV1 includes 806,791 English-language news stories produced by Reuters journalists between 20 August 1996 and 19 August 1997. The documents are formatted using a structured XML scheme. Each document is identified by a unique item ID and has a title in the field marked by the tag <title>. The main content of the story is in a distinct <text> field consisting of one or several paragraphs. Each paragraph is enclosed by the XML tag <p>. In our experiments, both the "title" and "text" fields are used, and each paragraph in the "text" field is viewed as a transaction in a document.
Figure 4 shows the primary result of pattern analysis using the Propbank scheme. The marked terms in parentheses are the verbs defined by Propbank. All conceptual patterns can then be generated on a "Verb-Argument" frame basis. At the next stage, the IPE and DPE methods are used for pattern evolving. Figure 5 illustrates an example output of pattern discovery using CPTM.
Sentence no. 1: [polic] [search] properti [own] marc dutroux chief [suspect] belgium child sex [abus] [murder] [scandal] tuesdai [found] decompos bodi two adolesc adult medic sourc
Sentence no. 2: [found] two bodi [advanc] [state] decomposit sourc told [condit] anonym
...
Sentence no. 7: fate two girl [remain] mysteri
Sentence no. 8: belgian girl gone [miss] recent year
Figure 4. The primary result of pattern analysis.
Figure 5. The output of pattern discovery.
In addition, the effect of patterns derived from negative examples cannot be ignored, due to their useful information [11]. There is no doubt that negative documents contain much useful information for identifying ambiguous patterns during concept learning. Therefore, it is necessary for a CPTM system to exploit these ambiguous patterns from the negative examples in order to reduce their influence. Algorithm NDP is shown below.
Algorithm NDP(Ω, D+, D-)
Input: a list of deployed patterns Ω; lists of positive and negative documents D+ and D-.
Output: a set of term-weight pairs d.
Method:
1:  d ← Ø
2:  τ = Threshold(D+)
3:  foreach negative document nd in D-
4:      if Threshold(nd) > τ
5:          ∆p = {dp in Ω | termset(dp) ∩ nd ≠ Ø}
6:          weight shuffling for each p in ∆p
7:      end if
8:      foreach deployed pattern dp in Ω
9:          d ← d ∪ (pattern merging of dp)
10:     end for
11: end for
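Since the paper leaves Threshold, weight shuffling and pattern merging unspecified, the following hedged sketch only makes the control flow of NDP concrete, with those steps passed in as placeholder functions:

def ndp(deployed_patterns, positive_docs, negative_docs,
        threshold, shuffle_weights, merge_pattern):
    """A sketch of Algorithm NDP. `deployed_patterns` maps each pattern to
    its term set; `threshold`, `shuffle_weights` and `merge_pattern` stand
    in for the paper's unspecified Threshold, weight-shuffling and
    pattern-merging steps."""
    d = {}                                  # term -> weight pairs
    tau = threshold(positive_docs)          # line 2
    for nd in negative_docs:                # line 3
        if threshold([nd]) > tau:           # line 4: ambiguous document
            # Line 5: deployed patterns whose term set overlaps nd.
            delta_p = [p for p, terms in deployed_patterns.items()
                       if terms & set(nd)]
            for p in delta_p:               # line 6
                shuffle_weights(p)
        for p in deployed_patterns:         # lines 8-10
            merge_pattern(d, p)
    return d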
IV. EXPERIMENTAL RESULTS
The effectiveness of the proposed CPTM Web mining model is evaluated by performing an information filtering task on the real Web dataset RCV1. The experimental results of CPTM are compared to those of other baselines, such as TFIDF, Rocchio, BM25 [12] and support vector machines (SVM) [2, 7, 8], using several standard measures. These measures include Precision, Recall, Top-k (k = 20 in this study), Breakeven Point (b/e), Fβ-measure, Interpolated Average Precision (IAP) and Mean Average Precision (MAP).
Table 1. Contingency table.
The precision is the fraction of retrieved documents that are relevant to the topic, and the recall is the fraction of relevant documents that have been retrieved. For a binary classification problem, the judgment can be defined within a contingency table, as depicted in Table 1. According to the definitions in this table, the measures of Precision and Recall are given by TP/(TP+FP) and TP/(TP+FN) respectively, where TP (True Positives) is the number of documents the system correctly identifies as positive; FP (False Positives) is the number of documents the system falsely identifies as positive; and FN (False Negatives) is the number of relevant documents the system fails to identify.
The precision of the top-K returned documents refers to the proportion of relevant documents among the first K returned documents. The value of K used in the experiments is 20, denoted "t20". The breakeven point (b/e) provides another measurement for performance evaluation. It indicates the point where the value of precision equals the value of recall for a topic.
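A minimal sketch of these measures (our own illustration; the counts and relevance judgments are hypothetical):

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN), per Table 1."""
    return tp / (tp + fp), tp / (tp + fn)

def top_k_precision(ranked_relevance, k=20):
    """Fraction of relevant documents among the first k returned ones
    ('t20' when k=20); ranked_relevance is a list of 0/1 judgments."""
    top = ranked_relevance[:k]
    return sum(top) / len(top)

print(precision_recall(tp=15, fp=5, fn=10))   # (0.75, 0.6)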
Both the b/e and the F1-measure are single-valued measures, in that they use only one figure to reflect the performance over all the documents. However, we need more figures to evaluate the system as a whole. Therefore,
another measure, Interpolated Average Precision (IAP), is introduced. This measure is used to compare the performance of different systems by averaging precision at 11 standard recall levels (i.e., recall = 0.0, 0.1, ..., 1.0). The 11-points measure is used in our comparison tables, with the first of the 11 points indicating the value where recall equals zero. Moreover, Mean Average Precision (MAP) is used in our evaluation; it is calculated by first measuring precision at each relevant document, and then averaging precision over all topics.
The decision function of SVM is defined as:
$h(x) = sign(w \cdot x + b) = \begin{cases} +1 & \text{if } w \cdot x + b > 0 \\ -1 & \text{else} \end{cases} \qquad (3)$

where $x$ is the input; $b \in \mathbb{R}$ is a threshold and

$w = \sum_{i=1}^{l} y_i \alpha_i x_i$

for the given training data:

$(x_1, y_1), \ldots, (x_l, y_l) \qquad (4)$

where $x_i \in \mathbb{R}^n$ and $y_i = +1$ ($-1$) if document $x_i$ is labeled positive (negative). $\alpha_i \in \mathbb{R}$ is the weight of the training example $x_i$ and satisfies the following constraints:

$\forall i: \alpha_i \geq 0 \;\text{ and }\; \sum_{i=1}^{l} \alpha_i y_i = 0 \qquad (5)$
Since all positive documents are treated equally before the process of document evaluation, the value of αi is set to 1.0 for all positive documents, and the αi for the negative documents can then be determined using equation (5).
Figure 5. The results of CPTM compared to other methods.
Figure 5 shows the interpolated 11-point precision-recall curves of CPTM compared to other methods. It indicates that CPTM outperforms the others at both low and high recall values. Figure 6 reveals a similar result: CPTM has better performance in all measures compared to the other approaches, including the data mining method and the traditional probability method.
Figure 6. The comparison results shown in several standard measures.
V. CONCLUSION
In general, a significant number of patterns can be retrieved by using data mining techniques to extract information from Web data. However, how to effectively use these discovered patterns is still an unsolved problem. Another typical issue is that only statistical properties (such as support and confidence) are used when evaluating the effectiveness of patterns. The useful information hidden in the relationships between patterns remains unutilized. The drawback of traditional methods is that longer patterns usually have lower support, resulting in low performance. Therefore, NLP techniques can be adopted to help define and generate conceptual patterns. In this paper, a novel concept-based PTM Web mining model, CPTM, is proposed. CPTM provides effective solutions to the aforementioned problems by integrating NLP techniques and a lexical ontology. The experimental results show that the CPTM model outperforms other methods such as Rocchio, BM25 and SVM.
REFERENCES
[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in ACM SIGMOD, 1993, pp. 207-216.
[2] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[3] V. Devedzic, "Knowledge discovery and data mining in databases," in Handbook of Software Engineering and Knowledge Engineering. vol. 1, S. K. Chang, Ed., ed: World Scientific Publishing Co., 2001, pp. 615-637.
[4] L. Edda and K. Jorg, "Text categorization with support vector machines. how to represent texts in input space?," Machine Learning, vol. 46, pp. 423-444, 2002.
[5] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus, "Knowledge discovery in databases: an overview," AI Magazine, vol. 13, pp. 57-70, 1992.
[6] K. Gouda and M. J. Zaki, "Genmax: An efficient algorithm for mining maximal frequent itemsets," Data Mining and Knowledge Discovery, vol. 11, pp. 223-242, 2005.
[7] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," in ICML, 1997, pp. 143-151.
[8] T. Joachims, "Transductive inference for text classification using support vector machines," in ICML, 1999, pp. 200-209.
[9] C. Kaur and R. R. Aggarwal, "Web Mining Tasks and Types: A Survey," IJRIM, vol. 2, pp. 547-558, 2012.
[10] P. Kingsbury and M. Palmer, "Propbank: the next level of Treebank," in Treebanks and Lexical Theories, 2003.
[11] Y. Li, X. Tao, A. Algarni, and S.-T. Wu, "Mining Specific and General Features in Both Positive and Negative Relevance Feedback," in TREC, 2009.
[12] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu, "Experimentation as a way of life: Okapi at trec," Information Processing and Management, vol. 36, pp. 95-108, 2000.
[13] T. Rose, M. Stevenson, and M. Whitehead, "The Reuters Corpus Volume 1 - from yesterday's news to today's language resources," in Inter. Conf. on Language Resources and Evaluation, 2002, pp. 29-31.
[14] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, pp. 1-47, 2002.
[15] S. Shehata, F. Karray, and M. Kamel, "A concept-based model for enhancing text categorization," in KDD, 2007, pp. 629-637.
[16] S.-T. Wu, Y. Li, and Y. Xu, "An effective deploying algorithm for using pattern-taxonomy," in iiWAS05, 2005, pp. 1013-1022.
[17] S.-T. Wu, Y. Li, and Y. Xu, "Deploying approaches for pattern refinement in text mining," in ICDM, 2006, pp. 1157-1161.
[18] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, "Automatic pattern-taxonomy extraction for web mining," in IEEE/WIC/ACM International Conference on Web Intelligence, 2004, pp. 242-248.
[19] X. Yan, J. Han, and R. Afshar, "Clospan: mining closed sequential patterns in large datasets," in SIAM Int. Conf. on Data Mining (SDM03), 2003, pp. 166-177.
[20] C.-C. Yu and Y.-L. Chen, "Mining sequential patterns from multidimensional sequence data," IEEE Transactions on Knowledge and Data Engineering, vol. 17, pp. 136-140, 2005.
[21] S. Zhang, X. Wu, J. Zhang, and C. Zhang, "A decremental algorithm for maintaining frequent itemsets in dynamic databases," in International Conference on Data Warehousing and Knowledge Discovery (DaWaK05), 2005, pp. 305-314.
A New Approach to Cluster Visualization Methods Based on Self-Organizing Maps
Marcin Zimniak, Department of Computer Science, Chemnitz University of Technology, Chemnitz, Germany
Johannes Fliege, Department of Computer Science, Chemnitz University of Technology, Chemnitz, Germany
Wolfgang Benn, Department of Computer Science, Chemnitz University of Technology, Chemnitz, Germany
Abstract—The Self-Organizing Map (SOM) is one of the artificial neural networks that perform vector quantization and vector projection simultaneously. Due to this characteristic, a SOM can be visualized in two ways: through the output space, which means considering the vector projection perspective, and through the input data space, emphasizing the vector quantization process.
This paper presents the idea of displaying high-dimensional clusters, which are disjoint objects, as groups of pairwise disjoint simple geometrical objects, such as 3D spheres. We expand current cluster visualization methods to gain a better overview of and insight into the existing clusters. We analyze the classical SOM model, focusing on the topographic product as a measure of the degree of topology preservation, and treat that measure as a judgment tool for the admissible neural net dimension in the dimension reduction process. To achieve better performance and more precise results, we use the SOM batch algorithm with toroidal topology. Finally, a software solution of the approach for mobile devices such as the iPad is presented.
Keywords—Self-organizing maps (SOM); topology preservation; clustering; data visualization; dimension reduction; data mining
I. INTRODUCTION
Neural maps are biologically inspired data representations that combine aspects of vector quantization with the property of function continuity. Self-Organizing Maps (SOMs) have been successfully applied as tools for visualization, for clustering of multidimensional datasets, for image compression, and for speech and face recognition.
A SOM is basically a method of vector quantization, i.e. this technique is obligatory in a SOM. Regarding dimensionality reduction, a SOM models data in a nonlinear and discrete way by representing it in a deformed lattice. The mapping, however, is given explicitly and well defined only for the prototypes, and in most cases only offline algorithms implement SOMs. For our purpose we consider the so-called 'batch' version of the SOM, which can easily be derived from the basic model: instead of updating prototypes one by one, they are all moved simultaneously at the end of each run, as in a standard gradient descent. In order to reduce border effects in the neural network, we use a toroidal topology. For more details concerning the degree of organization we refer the reader to [1]. Applying this approach, we work with a so-called well-organized neural grid. One of our main tasks concerning the application of Self-Organizing Maps is to implement a suitable mapping procedure that results in a topology-preserving projection of high-dimensional data onto a low-dimensional lattice.
In our project we consider only three admissible dimensions of the output space, namely d_A = 1, 2, 3 for a given neuronal grid A. However, in general, the choice of the dimension of the neural net does not guarantee a topology-preserving mapping; the interpretation of the resulting map may then fail. Therefore, we introduce the very important concept of a topology-preserving mapping, which means that similar data vectors are mapped onto the same or neighboring locations in the lattice, and vice versa.
In this paper we propose a new concept of cluster visualization; we illustrate clusters as pairwise disjoint simple geometrical objects, such as spheres in 3D centered at the best matching units' (BMUs') coordinates within a neural network of admissible dimension.
Our paper is organized as follows: in Section 2 we give a precise mathematical description of the SOM, including the topology preservation measure (topographic product) as a measure for an admissible dimension of the output space. In Section 3 we present existing methods of cluster visualization, followed by the extension of a graphical visualization method providing a new solution. In Section 4 we demonstrate a software realization approach for our new visualization concept. Finally, we outline our conclusions and emerging further work in Section 5.
II. MATHEMATICAL BACKGROUND OF THE SOM
One of the powerful approaches to adopt our cluster considerations within SOM is the application of Self-Organizing Maps to implement a suitable mapping procedure, which should result in a topology-preserving projection of the high-dimensional data onto a low-dimensional lattice. In most applications a two- or three-dimensional SOM lattice is the common choice of lattice structure because of its easy visualization. However, in general, this choice does not guarantee a topology-preserving mapping. Thus, the interpretation of the resulting map may fail. Topology-preserving mapping means that similar data vectors are mapped onto the same or neighboring locations in the lattice, and vice versa.
A. SOM Algorithm and Topology Preservation
Within the framework of dimensionality reduction, SOM can be interpreted intuitively as a kind of nonlinear but discrete PCA. Formally, Self-Organizing Maps, as a special kind of artificial neural network, project data from some (possibly high-dimensional) input space V ⊆ ℝ^{D_V} onto a position in some output space (neural map) A, such that a continuous change of a parameter of the input data should lead to a continuous change of the position of a localized excitation in the neural map. This property of neighborhood preservation depends on an important feature of the SOM, its output space topology, which has to be predefined before the learning process is started. If the topology of A (i.e. its dimensionality and its edge length ratios) does not match that of the data shape, neighborhood violations will occur. This can be written in a formal way by defining the output space positions as r = (i1, i2, ..., im), 1 ≤ ik ≤ nk, with N = n1 × n2 × ... × nm, where nk, k = 1..m, represents the dimension of A (i.e. the length of the edge of the lattice) in the k-th direction. In general, other arrangements are possible, e.g. the definition of a connectivity matrix. Nevertheless, we consider hypercubes in our project. We associate a weight vector or pointer w_r ∈ V with each neuron r ∈ A.
The mapping Ψ_{V→A} is realized by the winner-takes-all (WTA) rule. It updates only one prototype (the BMU) at each presentation of a datum. WTA is the simplest rule and includes classical competitive learning as well as frequency-sensitive competitive learning:
$\Psi_{V \to A}: v \mapsto s = \operatorname{argmin}_{r \in A} \| v - w_r \| \qquad (1)$
where the corresponding reverse mapping is defined as Ψ_{A→V}: r ↦ w_r. These two functions together determine the map

$M = (\Psi_{V \to A}, \Psi_{A \to V}) \qquad (2)$
realized by the SOM network. All data points v ∈ ℝ^{D_V} that are mapped onto the neuron r make up its receptive field. The masked receptive field of neuron r is defined as the intersection of its receptive field with V, namely

$\Omega_r = \{ v \in V : r = \Psi_{V \to A}(v) \}. \qquad (3)$
Therefore, the masked receptive fields Ω_r are closed sets. All masked receptive fields form the Voronoi tessellation (diagram) of V. If the intersection of two masked receptive fields Ω_r, Ω_{r'} is non-vanishing (Ω_r ∩ Ω_{r'} ≠ ∅), we call them neighbored. The neighborhood relations form a corresponding graph structure G_V in A: two neurons are connected in G_V if and only if their masked receptive fields are neighbored. The graph G_V is called the induced Delaunay graph. For further details we refer the reader to [2]. Due to the bijective relation between neurons and weight vectors, G_V also represents the Delaunay graph of the weights (Fig. 1).
To achieve the map M, the SOM adapts the pointer positions during the presentation of a sequence of data points v ∈ V selected from a data distribution P(V) as follows:

$\Delta w_r = \varepsilon \cdot h_{rs} \, (v - w_r), \qquad (4)$

where 0 ≤ ε ≤ 1 denotes the learning rate, and h_{rs} is the neighborhood function, usually chosen to be of Gaussian shape:

$h_{rs} = \exp\left( -\frac{\| r - s \|^2}{2\sigma^2} \right). \qquad (5)$

We note that h_{rs} depends on the best matching neuron s defined in (1).
Topology preservation in SOMs is defined as the preservation of the continuity of the mapping from the input space onto the output space. More precisely, it is equivalent to the continuity of M (in the mathematical topological sense) between the topological spaces with properly chosen metrics in both A and V. Thus, to indicate topographic violations we need metric and topological conditions; e.g. in Fig. 2 a) a perfect topographic map is shown, whereas in Fig. 2 b) topography is violated. The pair of nearest neighbors w_1, w_3 is mapped onto the neurons 1 and 3, which are not nearest neighbors. The distance relation between both is inverted as well: d^V(w_1, w_2) > d^V(w_1, w_3) but d^A(1, 2) < d^A(1, 3). Thus, topological and metric conditions are violated. For detailed considerations we refer to [3]. The topology-preserving property can be used for immediate evaluations of the resulting map, e.g. for interpretation as a color space, which we apply in Sec. 3.
As we already pointed out in the introduction, violations of the topographic mapping may lead to false interpretations. Several approaches have been developed to measure the degree of topology preservation of a given map. We chose the topographic product P, which relates the sequence of input space neighbors to the sequence of output space neighbors for each neuron. Instead of using the Euclidean distances between the
Figure 1. The Delaunay triangulation and the Voronoi diagram are dual to each other in the graph-theoretical sense.
weight vectors, this measure uses the respective distances d_{G_V}(w_r, w_{r'}) of minimal path lengths in the induced Delaunay graph G_V. During the computation of P, the sequences n_l^A(r), describing the l-th neighbor of r in A, and n_l^V(r), describing the l-th neighbor of w_r in V, have to be determined for each node r. These sequences, and further averaging over neighborhood orders m and nodes r, finally lead to
$P = \frac{1}{N^2 - N} \sum_{r} \sum_{m=1}^{N-1} \frac{1}{2m} \log \left( \prod_{l=1}^{m} \frac{d_{G_V}\left(w_r, w_{n_l^A(r)}\right)}{d_{G_V}\left(w_r, w_{n_l^V(r)}\right)} \cdot \frac{d^A\left(r, n_l^A(r)\right)}{d^A\left(r, n_l^V(r)\right)} \right) \qquad (6)$
The sign of P approximately indicates the relation between the input and output space topology: P < 0 corresponds to a too low-dimensional output space, P ≈ 0 indicates an approximate match, and P > 0 corresponds to a too high-dimensional output space.
In the definition of P, the topological and metric properties of a map are mixed. This mixture provides a simple mathematical characterization of what P actually measures. In the case of perfect preservation of an order relation, identical sequences n_l^A(r) and n_l^V(r) result in P taking the value P = 0.
The application of SOMs to very high-dimensional data can produce difficulties resulting from the so-called 'curse of dimensionality': the problem of sparse data caused by the high data dimensionality. We refer to the approach proposed by Kaski in [4].
B. Application of the Topographic Product Involving Real-World Data
The data set consists of speech feature vectors (D_V = 19, the dimension of the input space) obtained from several speakers uttering the German numerals¹. We see (Fig. 3) that in this case the topographic product singles out d_A ≈ 3.
¹ The data is available at the III. Physikalisches Institut, Goettingen; previously investigated in [8], [9].
C. Batch Version of Kohonen's Self-Organizing Map
Depending on the application, data observations may arrive consecutively or, alternatively, the whole data set may be available at once. In the first case, an online algorithm is applied. In the second case, an offline algorithm suffices. More precisely, offline or batch algorithms cannot work until the whole set of observations is known. On the contrary, online algorithms typically work with no more than a single observation at a time. For most methods, the choice of the model largely orients the implementation towards one or the other type of algorithm. Generally, the simpler the model, the more freedom is left to the implementation. In our project we apply the batch version of the SOM, described in the following algorithm:
1) Define the lattice by assigning the low-dimensional coordinates of the prototypes in the embedding space.
2) Initialize the coordinates of the prototypes in the data space.
3) Assign to ε and to the neighborhood function h_{rs} their scheduled values for epoch q.
4) For all points v in the data set, compute the best matching units as in (1) and update all prototypes according to (4).
5) Continue with step 3 until convergence is reached (i.e. updates of the prototypes become negligible).
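For illustration, a minimal sketch of this batch algorithm on a 2-D toroidal lattice follows (our own code, not the authors' implementation; the Gaussian neighborhood of equation (5), the width schedule and the grid size are assumptions):

import numpy as np

def batch_som(data, grid=(10, 10), epochs=30, sigma0=3.0):
    """A minimal batch SOM on a 2-D toroidal lattice (steps 1-5).
    data: (n_samples, dim) array. Returns prototypes of shape (N, dim)."""
    rng = np.random.default_rng(0)
    # Step 1: lattice coordinates r = (i1, i2) of the N prototypes.
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    # Step 2: initialize prototype positions in data space.
    W = data[rng.choice(len(data), len(coords), replace=False)]
    for q in range(epochs):
        sigma = sigma0 * (0.1 / sigma0) ** (q / (epochs - 1))  # step 3: schedule
        # Step 4: BMUs for all data points (equation (1)).
        bmu = np.argmin(((data[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
        # Toroidal lattice distance between every neuron and every BMU.
        diff = np.abs(coords[:, None, :] - coords[None, bmu, :])
        diff = np.minimum(diff, np.array(grid) - diff)       # wrap around
        h = np.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))  # equation (5)
        # Batch update: all prototypes move simultaneously (weighted means).
        W = (h @ data) / h.sum(axis=1, keepdims=True)
    return W

# prototypes = batch_som(np.random.rand(500, 19))  # e.g. 19-D speech features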
III. DATA MINING WITH SOM
If a proper SOM is trained according to the above-mentioned criteria, several methods for representation and post-processing can be applied. In the case of a two-dimensional lattice of neurons, many visualization approaches are known. The most common method for the visualization of SOMs is to project the weight vectors onto the space spanned by the first principal components of the data, connecting those units whose nodes in the lattice are neighbored. However, if the shape of the SOM lattice is hypercubical, there are several more ways to visualize the properties of the map. For our purpose we focus only on those that are of interest in our application. An extensive overview can be found in [6].
A. Current Cluster Visualization Methods of SOM
An interesting evaluation is the so-called U-matrix introduced by [5] (Fig. 4). The elements of the matrix U represent the distances between the weight vectors that are neighbors in the neural network A. The matrix U can be used to determine clusters within the weight vector set and, hence, within the data space. Assuming that the map is topology preserving, large values indicate cluster boundaries. If the lattice is a two-dimensional array, the U-matrix can easily be viewed and provides a powerful tool for cluster analysis. Another visualization technique can be used if the lattice is three-dimensional. The data points can then be mapped onto the neuron r, which can be identified by the combination of the colors red, green and blue (Fig. 5) assigned to the location r. In this way we are able to assign a color to each data point
Figure 2. Metric vs. topological conditions for map topography.
Figure 3. Values of the topographic product for the speech data.
according to equation (1); similar colors will then encode groups of input patterns that were mapped close to one another in the lattice A. It should be emphasized that, for a proper interpretation of this color visualization, as well as for the analysis of the U-matrix, topology preservation of the map M is a strict requirement. Furthermore, the topology-preserving property of M must be proven prior to any evaluation of the map.
B. A New Concept for Cluster Visualization
We provide a new idea for visualizing clusters as pairwise disjoint simple objects, such as 3D spheres, independently of the resulting admissible output space. In this manner, in addition to the existing visualization methods, we are able to distinguish and illustrate the "volume" of each cluster through the radius of the constructed spheres.
In the following steps we describe our visualization approach in further detail. At the very beginning, the input data set is predefined as a clustered data set after the GNG [11] learning process has finished. Afterwards, the batch version of the SOM algorithm is performed, and all BMUs are computed for all input clusters respectively. Finally, the dimension reduction of the input space is achieved by utilizing the topographic product as a judgment tool for an admissible output space.
Affine spaces provide a better framework for doing geometry. In particular, it is possible to deal with points, curves, surfaces, etc., in an intrinsic manner, i.e., independently of any specific choice of a coordinate system. Naturally, coordinate systems have to be chosen to finally carry out computations, but one should resist the temptation to resort to coordinate systems until it becomes necessary. Thus, we treat the admissible output space as an affine space in an intrinsic manner, where no special origin is predefined. We set the origin at the neuron numbered 1 (Fig. 6). For simplicity, the distances between all directly neighboring neurons in the neuronal grid are set to 1.
Let |C_i| denote the power of a cluster C_i (the number of entities for a given C_i). We aim to construct a presentation space in homogeneous form, in the sense of space dimension, for any case of d_A. We calculate the radius of the spheres² centered on the corresponding BMUs as follows:
² In our considerations we use the term "spheres" for all cases of d_A, regarding the topology amongst them.
$r_i = 0.5 \cdot \left( 1 - \frac{|C_i|}{\sum_j |C_j|} \right) \qquad (7)$
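A minimal sketch of equation (7) (our own illustration; the cluster sizes are hypothetical):

def sphere_radii(cluster_sizes):
    """Equation (7): radius of the sphere for each cluster, computed from
    the cluster cardinalities |C_i|; every radius stays below 0.5, half
    the distance between directly neighboring grid neurons."""
    total = sum(cluster_sizes)
    return [0.5 * (1 - c / total) for c in cluster_sizes]

print(sphere_radii([100, 30, 20]))
# [0.1666..., 0.4, 0.4333...]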
Obviously, spheres constructed in this manner in the output space of dimension d_A do not have any point in common. In our calculations we apply a parametric equation of a sphere. In order to keep the presentation space homogeneous with dimension 3 (Fig. 7), with no relative topology present, we extend the output space as described below.
In the case of d_A = 3 we perform no operation, since no extension is needed (identity map). In the case of d_A = 2, the mapping

$(r \cos x, \; r \sin x) \mapsto \left( r \cos x, \; r \sin x, \; \pm\sqrt{r_i^2 - r^2} \right), \qquad (8)$

where 0 ≤ x < 2π, 0 ≤ r ≤ r_i, needs to be applied. Finally, in the case of d_A = 1, the application of the function composition

$r \mapsto (r \cos x, \; r \sin x) \mapsto \left( r \cos x, \; r \sin x, \; \pm\sqrt{r_i^2 - r^2} \right), \qquad (9)$

where 0 ≤ x < 2π, 0 ≤ r ≤ r_i, becomes necessary. In our method, we propose to describe clusters as disjoint spheres whose centers are located at the respective BMU positions after the batch SOM algorithm has finished. In all cases of the topology preservation criterion's result (1, 2 or 3 being the admissible dimension of the neuronal net after the dimension reduction process), we are able to construct a group of disjoint spheres in 3D.
C. Comments
The novelty of our approach is to present clusters via suitably separated objects, namely spheres in a homogeneous 3D presentation space. In contrast to the k-clustering concept [12], we apply the modern Growing Neural Gas unsupervised learning process, which returns separated objects in the form of a clustered probability distribution for a given input data set of possibly high dimension. Finally, we link this concept with the Self-Organizing Maps framework in order to illustrate clusters in a space of admissible reduced dimension.
Figure 4. Representation of the positions of neurons in the three-dimensional neuron lattice A as a vector c = (r, g, b) in the color space C, where r, g, b denote the intensities of the colors red, green and blue. Thus, colors are assigned to categories (winner neurons).
Figure 5. Cluster visualization via U-matrix.
For a comprehensive source on the dimension reduction of high-dimensional data, the reader is referred to [13].
IV. VISUALIZING CLUSTER INFORMATION VIA SOM ON MOBILE DEVICES
The following example describes the realization of a SOM-based cluster visualization technique for information visualization, displaying a semantic-based database index cluster structure on mobile platforms. The aim was to represent the internal database index organization structure intuitively to a user. Our realization had to meet several requirements.
A. Requirements
The implementation of a SOM-based cluster visualization platform to display a database index's cluster data on mobile devices had to fulfill certain requirements. First of all, the requirement to run our application on mobile devices with potentially low computational power was a challenge. Second, the functionality of our application had to be ensured using any type of network connection provided by the mobile device, including mobile networks with low bandwidth. As a functional requirement, it was requested to visualize clusters as spheres, where the number of data tuples contained in each cluster should be represented implicitly.
B. Requirements Analysis
Due to the computational limitations of mobile platforms, running SOM transformations on a mobile device could not be regarded as feasible. Thus, separating our desired application into a client part and a server part was regarded as the most promising solution. Based on the result of the analysis of our first requirement, we did not regard it as suitable to transmit all the cluster data required for SOM computations. We decided to transmit only the results of the SOM process, since this also guarantees a smaller data volume compared to the SOM's input data. Furthermore, we intended to reduce possible error causes with this decision, regarding the possible necessity of different implementations for different mobile platforms. Finally, the requirements analysis led us to centralize the computational effort, utilizing the application on a mobile device only as an interface for visualization and user interaction.
C. Realization
We separated our application into two parts: a server application, and a client application for mobile devices. As described in our requirements analysis, we decided to centralize computational effort on the server side, thus realizing SOM computations there. For the SOM computations we made use of the SOM Toolbox contained in Matlab® by building a bridge to C++, enabling our server application to run the necessary SOM transformations easily. This tool chain allowed us to prepare the cluster data for visualization by dimension reduction through SOM efficiently.
The mobile application was designed to run on mobile platforms with touch interfaces but comparably low computational resources. An example screenshot of our user interface is given in Fig. 8, showing clusters, i.e. spheres, that were transformed from n-dimensional input space to 3-dimensional output space using SOM.
As shown in Fig. 8, the spheres are of different sizes. We decided to use a sphere's size to implicitly visualize the number of data tuples contained in its corresponding cluster. For determining a sphere's actual size we put the number of data tuples in a cluster into relation to the number of data tuples contained in all clusters. To prevent the spheres from intersecting each other, we limited their size by taking the minimum Euclidean distance δmin among all pairs of spheres into consideration. At first glance we considered the radius of a sphere for determining its size, making the radius proportionally dependent on the number of data tuples contained in the underlying cluster. Nevertheless, data is contained in a cluster, which leads to the volume of spheres as the more natural representation. Therefore, we decided to represent the number of tuples in a cluster by making a sphere's volume dependent on it. Thus, we were able to implicitly represent the amount of data contained in a cluster.
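As an illustration of this sizing rule, here is a minimal sketch (NumPy assumed; the name sphere_radii is hypothetical): sphere volumes are made proportional to each cluster's share of data tuples, and the largest radius is capped at δmin/2, so that the sum of any two radii never exceeds the distance between their centers:

```python
import numpy as np

def sphere_radii(centers, tuple_counts):
    """Volume-proportional sphere sizing with a non-intersection cap."""
    centers = np.asarray(centers, dtype=float)
    counts = np.asarray(tuple_counts, dtype=float)

    # minimum Euclidean distance delta_min over all pairs of sphere centers
    diffs = centers[:, None, :] - centers[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)
    delta_min = dists.min()

    # volume share -> radius via the cube root (V = 4/3 * pi * r^3)
    radii = np.cbrt(counts / counts.sum())
    # cap: the largest sphere gets radius delta_min / 2, so no two spheres
    # can overlap (r_i + r_j <= delta_min <= distance between any pair)
    return radii * (0.5 * delta_min) / radii.max()
```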
Our example was based on a data set with 998 dimen-sions in input space.
D. Capabilities of our Example
The software system presented in our example is capable of visualizing information on the clustering state of a semantics-based database index, allowing the user to navigate through the index's cluster structure. This may be performed either by using the visualization feature of the index hierarchy or by utilizing the realized SOM-based visualization feature.
Figure 6. Neurons and best matching units in a chosen admissible output space with the origin neuron intrinsically numbered with 1.
Figure 7. Expansion of output space A to presentation space depending on admissible output space dimension dA.
In future development our aim is to present more detailed information and to increase user interaction possibilities, potentially influencing the clustering process.
V. CONCLUSION AND FURTHER WORK
In our paper we have described SOM in depth from the mathematical point of view, giving a precise description of that kind of neuronal net and emphasizing the role of the topographic product as a criterion for admissible neuronal net dimensions in the dimension reduction process.
We have proposed a new illustration method for cluster visualization, linking existing color-based (RGB) visualization methods with methods of separated objects like 3D spheres, providing a better understanding of clusters as disjoint objects. Finally, the software realization approach has been presented.
In our further research we will consider a data-driven version of SOM, the so-called growing SOM (GSOM). Its output is a structure-adapted hypercube A, produced by adaptation of both the dimensions and the respective edge length ratios of A during the learning, in addition to the usual adaptation of the weights. In comparison to the standard SOM, the overall dimensionality and the dimensions along the individual directions in A are variables that evolve into the hypercube structure most suitable for the input space topology.
REFERENCES [1] G. Andreu, A. Crespo, and J. M. Valiente. Selecting the toroidal self-
organizing feature maps (TSOFM) best organized to object recognition. In Proceedings of International Conference on Artificial Neural Networks, Houston (USA), volume 1327 of Lecture Notes in Computer Science, pages 1341–1346, June 1997.
[2] T. Martinetz and K. Schulten, “Topology representing networks”. Neural Networks, vol. 7, no. 3, pp. 507–522, 1994.
[3] T. Villmann, R. Der, M. Herrmann, and T. Martinetz. Topology Preservation in Self–Organizing Feature Maps: Exact Definition and Measurement. IEEE Transactions on Neural Networks, 8(2):256–266, 1997.
[4] S. Kaski, J. Nikkilä, and T. Kohonen. Methods for interpreting a self-organized map in data analysis. In Proc. of European Symposium on Artificial Neural Networks (ESANN'98), pages 185–190, Brussels, Belgium, 1998. D facto publications.
[5] A. Ultsch. Self organized feature maps for monitoring and knowledge acquisition of a chemical process. In S. Gielen and B. Kappen, editors, Proc. ICANN'93, Int. Conf. on Artificial Neural Networks, pages 864–867, London, UK, 1993. Springer.
[6] J. Vesanto. SOM-based data visualization methods. Intelligent Data Analysis, 3(2):111–126, 1999.
[7] T. Kohonen. Self-Organizing Maps. Springer, Berlin, Heidelberg, 1995. (Second Extended Edition 1997).
[8] H.-U. Bauer and K. Pawelzik, Quantifying the neighborhood preservation of self-organizing feature maps, IEEE Trans. on Neural Networks, 3(4):570–579, 1992.
[9] T. Gramss, H.W. Strube, Recognition of Isolated Words Based on Psychoacoustics and Neurobiology. Speech Communication, 9:35–40, 1990.
[10] T. Kohonen, "Self-Organization and Associative Memory", 2nd Edition, Berlin, Germany: Springer-Verlag, 1988.
[11] Fritzke, B. (1995a). A growing neural gas network learns topologies. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 625-632. MIT Press, Cambridge MA.
[12] F. P. Preparata and M. I. Shamos, "Computational Geometry: An Introduction", Springer-Verlag, 1985.
[13] John A. Lee, Michel Verleysen, Nonlinear Dimensionality Reduction, Springer, 2007.
Figure 8. Visualization of clusters in three-dimensional output space after applying SOM.
Detecting Source Topics using Extended HITS
Mario Kubek Faculty of Mathematics and Computer Science
FernUniversität in Hagen Hagen, Germany
Email: [email protected]
Herwig Unger Faculty of Mathematics and Computer Science
FernUniversität in Hagen Hagen, Germany
Email: [email protected]
Abstract—This paper describes a new method to determine the sources of topics in texts by analysing their directed co-occurrence graphs using an extended version of the HITS algorithm. This method can also be used to identify characteristic terms in texts. In order to obtain the needed directed term relations to cover asymmetric real-life relationships between concepts it is described how they can be calculated by statistical means. In the experiments, it is shown that the detected source topics and characteristic terms can be used to find similar documents and those that mainly deal with the source topics in large corpora like the World Wide Web. This approach also offers a new way to follow topics across multiple documents in such corpora. This application will be elaborated on as well.
Keywords-Source topic detection; Co-occurrence analysis; Extended HITS; Text Mining; Web Information Retrieval
I. INTRODUCTION AND MOTIVATION
The selection of characteristic and discriminating terms in texts through weights, often referred to as keyword extraction or terminology extraction, plays an important role in text mining and information retrieval. In [1] it has been pointed out that graph-based methods are well suited for the analysis of co-occurrence graphs, e.g. for the purpose of keyword extraction, and deliver results comparable to classic approaches like TF-IDF [2] and difference analysis [3]. Especially the proposed extended version of the PageRank algorithm, which takes into account the strength of the semantic term relations in these graphs, is able to return such characteristic terms and does not rely on reference corpora. In this paper, the authors extend this approach by introducing a method to not only determine these keywords, but also to determine terms in texts that can be referred to as source topics. These terms strongly influence the main topics in texts, yet are not necessarily important keywords themselves. They are helpful when it comes to applications like following topics to their roots by analysing documents that cover them primarily. This process can span several documents.
In order to automatically determine source topics of single texts, the authors present the idea to apply an extended version of the HITS algorithm [4] on directed co-occurrence graphs for this purpose. This solution will not only return the most characteristic terms of texts like the extended PageRank algorithm, but also the source topics in them. Usually, co-occurrence graphs are undirected, which is suitable for the flat visualisation of term relations and for applications like query expansion via spreading activation techniques. However, real-life associations are mostly directed: e.g. an Audi is a German car, but not every German car is an Audi. The association of Audi with German car is therefore much stronger than the association of German car with Audi. Therefore, it actually makes sense to deal with directed term relations.
The HITS algorithm [4], which was initially designed to evaluate the relative importance of nodes in web graphs (which are directed), returns two lists of nodes: authorities and hubs. Authorities, which are also determined by the PageRank algorithm [5], are nodes that are linked to by many other nodes. Hubs are nodes that link to many other nodes. Nodes are assigned both a score for their authority and for their hub value. For undirected graphs the authority and the hub score of a node would be the same, which is naturally not the case for the web graph. Applied to the analysis of directed co-occurrence graphs with HITS, the authorities are the characteristic terms of the analysed text, whereas the hubs represent its source topics. Therefore, it is necessary to describe the construction of directed co-occurrence graphs before getting into the details of the method to determine the source topics and its applications.
Hence, the paper is organised as follows: the next section explains the methodology used. In this section it is outlined how to calculate directed term relations from texts by using co-occurrence analysis in order to obtain directed co-occurrence graphs. Afterwards, section three presents a method that applies an extended version of the HITS algorithm that considers the strength of these directed term relations to calculate the characteristic terms and source topics in texts. Section four focuses on the conducted experiments using this method. It is also shown that the results of this method can be used to find similar and related documents in the World Wide Web. Section five concludes the paper and provides a look at options to employ this method in solutions to follow topics in large corpora like the World Wide Web.
II. METHODOLOGY
Well-known measures to gain co-occurrence significance values on sentence level are, for instance,
the mutual information measure [6], the Dice [7] and Jaccard [8] coefficients, the Poisson collocation measure [9] and the log-likelihood ratio [10]. While these measures return the same value for the relation of a term A with another term B and vice versa, an undirected relation of both terms often does not represent real-life relationships very well, as has been pointed out in the introduction. Therefore, it is sensible to deal with directed relations of terms. To measure the directed relation of term A with term B, which can also be regarded as the strength of the association of term A with term B, the following formula of the conditional relative frequency can be used, whereby $n_{AB}$ is the number of times the terms A and B co-occurred in the text on sentence level and $n_A$ is the number of sentences term A occurred in:

$\mathrm{sig}(A \to B) = n_{AB} / n_A$    (1)
Often, this significance differs greatly with regard to the two directions of the relation when the difference of the involved term frequencies is high. The association of a less frequently occurring term A with a frequently occurring term B could reach a value of 1.0 when A always co-occurs with B; however, B's association with A could be almost 0. This means that B's occurrence with term A is insignificant in the analysed text. That is why it is sensible to only take into account the direction of the dominant association (the one with the higher value) to generate a directed co-occurrence graph for the further considerations. However, the dominant association should be additionally weighted. In the example above, term A's association with B is 1.0. If another term C, which appears more frequently in the text than A, also co-occurs with term B each time it appears, then its association value with B would be 1.0, too. Yet, this co-occurrence is more significant than the co-occurrence of A with B. An additional weight that influences the association value and considers this fact could be determined by
• the (normalised) number of sentences, in which both terms co-occur or
• the (normalised) frequency of the term A. The normalisation basis could be the maximum number of sentences, which any term of the text has occurred in.
The association Assn of term A with term B can then be calculated using the second approach by:

$\mathrm{Assn}(A \to B) = \mathrm{sig}(A \to B) \cdot n_A / n_{\max}$    (2)

Hereby, $n_{\max}$ is the maximum number of sentences any term has occurred in. A thus obtained relation of term A with term B with a high association strength can be interpreted as a recommendation of A for B. Relations gained by this means are more specific than undirected relations between terms because of their direction. They resemble a hyperlink on a website to another one. In this case, however, it has not been manually and explicitly set, and it carries an additional weight that indicates the strength of the term association. The set of all such relations obtained from a text represents a directed co-occurrence graph. The next step is now to analyse such graphs with an extended version of the HITS algorithm that regards these association strengths in order to find the source topics in texts. Therefore, in the next section the extension of the HITS algorithm is explained and a method that employs it for the analysis of directed co-occurrence graphs is outlined.
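To illustrate how Equations (1) and (2) turn sentence-level co-occurrence counts into a weighted directed graph, here is a minimal sketch in Python; the function and variable names are illustrative, and stopword removal and stemming are assumed to have been done already:

```python
from collections import defaultdict

def directed_associations(sentences):
    """Build the weighted directed term relations of Eqs. (1) and (2).
    `sentences` is a list of token lists (one list per sentence)."""
    n_term = defaultdict(int)          # n_A: number of sentences a term occurs in
    n_pair = defaultdict(int)          # n_AB: sentences in which A and B co-occur
    for sent in sentences:
        terms = set(sent)
        for t in terms:
            n_term[t] += 1
        for a in terms:
            for b in terms:
                if a != b:
                    n_pair[(a, b)] += 1
    n_max = max(n_term.values())       # normalisation basis of Eq. (2)

    edges = {}
    for (a, b), n_ab in n_pair.items():
        if a >= b:                     # handle each unordered pair only once
            continue
        sig_ab = n_ab / n_term[a]      # Eq. (1): association of A with B
        sig_ba = n_ab / n_term[b]      # Eq. (1): association of B with A
        # keep only the dominant direction, weighted according to Eq. (2)
        if sig_ab >= sig_ba:
            edges[(a, b)] = sig_ab * n_term[a] / n_max
        else:
            edges[(b, a)] = sig_ba * n_term[b] / n_max
    return edges                       # the directed co-occurrence graph
```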
III. THE ALGORITHM
With the knowledge of how to generate directed co-occurrence graphs it is now possible to introduce a new method to analyse them in order to find source topics in the texts they represent. For this purpose the application of the HITS algorithm on these graphs is sensible due to its working method that has been outlined in the introduction. The list of hub nodes in these graphs returned by HITS contains the terms that can be regarded as the source topics of the analysed texts, as they represent their inherent concepts. Their hub value indicates their influence on the most important topics and terms that can be found in the list of authorities.
For the calculation of these lists using HITS, it is sensible to also include the strength of the associations between the terms. These values should also influence the calculation of the authority and hub values. The idea behind this approach is that a random walker is likely to follow links in co-occurrence graphs that lead to terms that can be easily associated with the current term he is visiting. Nodes that contain terms linked with a low association value, however, should not be visited very often. This also means that nodes that lie on paths with links of high association values should be ranked highly, as they can be reached easily. Therefore, the formulas for the update rules of the HITS algorithm can be modified to include the association values Assn. The authority value of node x can then be determined using the following formula:
$a(x) = \sum_{y \to x} \mathrm{Assn}(y \to x) \cdot h(y)$    (3)

Accordingly, the hub value of node x can be calculated using the following formula:

$h(x) = \sum_{x \to y} \mathrm{Assn}(x \to y) \cdot a(y)$    (4)
The following steps are necessary to obtain a list for the authorities and hubs based on these update rules:
1. Remove stopwords and apply a stemming algorithm on all terms in the text. (Optional)
2. Determine the dominant association for all co-occurrences using formula 1, apply the additional weight on it according to formula 2 and use the set of all these relations as a directed co-occurrence graph G.
3. Determine the authority value a(x) and the hub value h(x) iteratively for all nodes x in G using the formulas 3 and 4 until convergence is reached (the calculated values do not change significantly in two consecutive iterations) or a fixed number of iterations has been executed.
4. Order all nodes in descending order according to their authority and hub values and return these two ordered lists with the terms and their authority and hub values.
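The core of this procedure (steps 3 and 4) can be sketched as follows in Python with NumPy, assuming the directed co-occurrence graph is given as a mapping from edges to association values; this is an illustration of the weighted update rules (3) and (4) under those assumptions, not the authors' implementation:

```python
import numpy as np

def extended_hits(edges, n_iter=100, tol=1e-8):
    """Weighted HITS per Eqs. (3) and (4); `edges` maps a directed pair
    (source, target) to the association value Assn of that relation."""
    nodes = sorted({n for edge in edges for n in edge})
    idx = {n: i for i, n in enumerate(nodes)}
    W = np.zeros((len(nodes), len(nodes)))
    for (s, t), assn in edges.items():
        W[idx[s], idx[t]] = assn

    a = np.ones(len(nodes))                 # authority values
    h = np.ones(len(nodes))                 # hub values
    for _ in range(n_iter):
        a_new = W.T @ h                     # Eq. (3): weighted sum over in-links
        h_new = W @ a_new                   # Eq. (4): weighted sum over out-links
        a_new /= np.linalg.norm(a_new) or 1.0
        h_new /= np.linalg.norm(h_new) or 1.0
        converged = (np.allclose(a, a_new, atol=tol)
                     and np.allclose(h, h_new, atol=tol))
        a, h = a_new, h_new
        if converged:                       # values no longer change significantly
            break

    rank = lambda score: sorted(zip(nodes, score), key=lambda p: -p[1])
    return rank(a), rank(h)                 # authorities, hubs (source topics)
```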
Now, the effectiveness of this method will be illustrated by experiments.
IV. EXPERIMENTS
A. Detection of Authorities and Hubs
The following tables show, for two documents of the English Wikipedia, the lists of the 10 terms with the highest authority and hub values. To conduct these experiments the following parameters have been used:
• removal of stopwords
• restriction to nouns
• baseform reduction
• activated phrase detection
TABLE I. TERMS AND PHRASES WITH HIGH AUTHORITY AND HUB VALUES OF THE WIKIPEDIA-ARTICLE ”LOVE”:
Term Authority value Term/Phrase Hub value
love 0.54 friendship 0.19
human 0.30 intimacy 0.17
god 0.29 passion 0.14
attachment 0.26 religion 0.14
word 0.21 attraction 0.14
form 0.21 platonic love 0.13
life 0.20 interpersonal love 0.13
feel 0.18 heart 0.13
people 0.17 family 0.13
buddhism 0.14 relationship 0.12
TABLE II. TERMS AND PHRASES WITH HIGH AUTHORITY AND HUB VALUES OF THE WIKIPEDIA-ARTICLE ”EARTHQUAKE”:
Term Authority value Term/Phrase Hub value
earthquake 0.48 movement 0.18
earth 0.30 plate 0.16
fault 0.27 boundary 0.15
area 0.23 damage 0.15
boundary 0.18 zone 0.15
plate 0.16 landslide 0.14
structure 0.16 seismic activity 0.14
rupture 0.15 wave 0.13
aftershock 0.15 ground rupture 0.13
tsunami 0.14 propagation 0.12
The examples show that the extended HITS algorithm can determine the most characteristic terms (authorities) and source topics (hubs) in texts by analysing their directed co-occurrence graphs. Especially the hubs provide useful information for finding suitable terms that can be used as search words in queries when background information on a specific topic is needed. However, the terms found in the authority lists can also be used as search words in order to find similar documents. This will be shown in the next subsection.
B. Search Word Extraction
The suitability of these terms as search words will now be shown. For this purpose, the five most important authorities and the five most important hubs of the Wikipedia article "Love" have been combined as search queries and sent to Google. The following results have been obtained using the determined authorities:
Figure 1: Search results for the authorities of the Wikipedia article ”Love”
The search query containing the hubs of this article will lead to these results:
Figure 2: Search results for the hubs of the Wikipedia article ”Love”
These search results clearly show that they primarily deal with either the authorities or the hubs. More experiments confirm this correlation. Using the authorities as queries to Google, it is possible to find documents in the Web that are similar to the analysed one. Usually, the analysed document itself is found among the first search results, which is not surprising. However, it shows that this approach could be a new way to detect plagiarised documents. It is also interesting to point out the topic drift in the results when the hubs have been used as queries. This observation indicates that the hubs of documents can be used as a means to follow topics across several related documents with the help of Google. This possibility will be elaborated on in more detail in the next and final section of this paper.
V. CONCLUSION
In this paper, a new graph-based method to determine source topics in texts based on an extended version of the HITS algorithm has been introduced and described in detail. Its effectiveness has been shown in the experiments. Furthermore, it has been demonstrated that the characteristic terms and the source topics that this method finds in texts can be used as search words to find similar and related documents in the World Wide Web. Especially the determined source topics can lead users to documents that primarily deal with these important aspects of their originally analysed texts. This goes beyond a simple search for similar documents as it offers a new way to search for related documents, although similar documents can still be found when the source topics are used in queries. This functionality can be seen as a useful addition to Google Scholar (http://scholar.google.com/), which offers users the possibility to search for similar scientific articles. Additionally, interactive search systems can employ this method to provide their users with functions to follow topics across multiple documents. The iterative use of source topics as search words in found documents can provide a basis for a fine-grained analysis of the topical relations that exist between the search results of two consecutive queries. Documents found in later iterations of such search sessions can give users valuable background information on the content and topics of their originally analysed documents. Another interesting application of this method can be seen in the automatic linking of related documents in large corpora. If a document A primarily deals with the source topics of another document B, then a link from A to B can be set. This way, the herein described approach to obtain directed term associations is transferred to the document level, namely to calculate recommendations for specific documents. These automatically determined links can be very useful in terms of positively influencing the ranking of search results, because they represent semantic relations between documents that have been verified, in contrast to manually set links, e.g. on websites, which additionally can be automatically evaluated regarding their validity by using this approach. Also, these
automatically determined links provide a basis to rearrange returned search results based on the semantic relations between them. These approaches will be examined in detail in later publications.
REFERENCES
[1] M. Kubek and H. Unger: Search Word Extraction Using Extended PageRank Calculations, In Herwig Unger, Kyandoghere Kyamakya, and Janusz Kacprzyk, editors, Autonomous Systems: Developments and Trends, volume 391 of Studies in Computational Intelligence, pages 325–337, Springer Berlin / Heidelberg, 2012
[2] G. Salton, A. Wong and C. S. Yang: A vector space model for automatic indexing, Commun. ACM, 18:613–620, November 1975
[3] G. Heyer, U. Quasthoff and Th. Wittig: Text Mining – Wissensrohstoff Text, W3L Verlag Bochum, 2006
[4] J. M. Kleinberg: Authoritative Sources in a Hyperlinked Environment, In Proc. of ACM-SIAM Symp. Discrete Algorithms, San Francisco, California, pages 668–677, January 1998
[5] L. Page, S. Brin, R. Motwani and T. Winograd: The pagerank citation ranking: Bringing order to the web, Technical report, Stanford Digital Library Technologies Project, 1998
[6] M. Buechler: Flexibles Berechnen von Kookkurrenzen auf strukturierten und unstrukturierten Daten, Master’s thesis, University of Leipzig, 2006
[7] L. R. Dice: Measures of the Amount of Ecologic Association Between Species, Ecology, 26(3):297–302, July 1945
[8] P. Jaccard: Étude Comparative de la Distribution Florale dans une Portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901
[9] U. Quasthoff and Chr. Wolff: The Poisson Collocation Measure and its Applications, In: Proc. Second International Workshop on Computational Approaches to Collocations, Wien, 2002
[10] T. Dunning: Accurate methods for the statistics of surprise and coincidence, Computational Linguistics, 19(1):61–74, 1994
Blended value based e-business modeling approach: A sustainable approach using QFD
Mohammed Naim A. Dewan Curtin Graduate School of Business
Curtin University Perth, Australia
Mohammed A. Quaddus Curtin Graduate School of Business
Curtin University Perth, Australia
Abstract—‘E-business’ and ‘sustainability’ are the two current major global trends. But surprisingly, none of the e-business modeling ideas covers the sustainability aspects of the business. Recently researchers have been introducing the ‘green IS/IT/ICT’ concept, but none of them clearly explains how those concepts will be accommodated inside e-business models. This research approach, therefore, aims to develop an e-business model in conjunction with sustainability aspects. The model explores and determines the optimal design requirements in developing an e-business model. This research approach also investigates how the sustainability dimensions can be integrated with the value dimensions in developing an e-business model. This modeling approach is unique in the sense that in developing the model the sustainability concept is integrated with the customer’s value requirements, the business’s value requirements, and the process’s value requirements instead of only the customer’s requirements. QFD, AHP, and the Delphi method are used for the analysis of the data. Besides developing the blended value based e-business model, this research approach also develops a framework for modeling e-business in conjunction with blended value and sustainability which can be implemented by almost any other business in consideration of the business context.
Keywords- E-business, Business model, Sustainability, Blended value, QFD, AHP.
I. INTRODUCTION
Business modeling is not new and has had significant impacts on the way businesses are planned and operated today. Whilst business models exist for several narrow areas, broad comprehensive e-business models are still very informal and generic. The majority of business modeling ideas consider only the economic value aspects of the business and do not focus on social or environmental aspects. It is surprising that although ‘e-business’ and ‘sustainability’ are the two current major global trends, none of the e-business modeling ideas covers the sustainability aspects of the business. Researchers are now introducing the ‘green IS/IT/ICT’ concept, but none of them clearly explains how those concepts will be accommodated inside e-business models. Therefore, this research approach aims to develop an e-business model in conjunction with sustainability aspects. The model will be based on ‘blended value’ and will explore and determine the optimal design requirements in developing an e-business model. This research approach will also investigate how the sustainability dimensions can be integrated with the value dimensions in developing an e-business model. This modeling approach is distinct in the sense that in developing the model the sustainability concept is integrated with the customer’s value requirements, the business’s value requirements, and the process’s value requirements instead of only the customer’s requirements. For the analysis of the data, Quality Function Deployment (QFD), the Analytic Hierarchy Process (AHP), and the Delphi method are used. Besides developing the blended value based e-business model, this research approach will also develop a framework for modeling e-business in conjunction with blended value and sustainability which can be implemented by almost any other business in consideration of the business context. The following section clarifies the purpose of the approach. Definitions of terms used in the approach are given in Section 3. An extensive literature review is covered in Section 4. Sections 5 and 6 explicate the research methodology and the research process respectively. The research analysis is explained in Section 7, and finally, Section 8 concludes the article with a discussion.
II. PURPOSE OF THE APPROACH
The majority of research into business models in the IS field has been concerned with e-business and e-commerce [1]. There exist a number of ideas about e-business models, but most of them provide only a conceptual overview and concentrate only on the economic aspects of the business. None of the e-business modeling ideas exclusively considers the sustainability aspects. Similarly, there is a growing body of literature about the sustainability of businesses which does not focus on e-business. But the intersection of these two global trends, e-business and sustainability, needs to be addressed. Although recently a very few researchers talk about the green IT/IS/ICT concept, none of them clearly explains how that concept will fit in an e-business model to make it sustainable and, at the same time, to protect the interests of the customers. This research approach will develop an e-business model based on ‘blended value’ which will be sustainable and will safeguard the interests of the customers. The ‘blended value’ requirements will identify and select the ‘optimal design requirements’ necessary to be implemented for the sustainability of the businesses. Therefore, the main research questions of the approach are as follows:
Q1. What are the optimal/appropriate design requirements in developing an e-business model?
Q2. How can the sustainability dimensions be integrated with the value dimensions in developing an e-business model?
Based on the above research questions, this research approach consists of the following objectives:
• To explore and determine the optimal design requirements of an e-business model.
• To investigate how the concept of blended value dimensions can be used in developing an e-business model.
• To investigate how the sustainability dimensions can be integrated with the value dimensions in developing an e-business model.
• To develop a ‘value-sustainability’ framework for modeling e-business in conjunction with blended value and sustainability concepts.
III. DEFINITION OF TERMS
A. Blended Value
Blended value is the integration of economic value, social value, and environmental value for customers, businesses, and value processes. It is different from CSR value in the sense that CSR value is separate from profit maximization and its agenda is determined by external reporting, whereas blended value is integral to profit maximization and its agenda is company specific and internally generated.
B. Value Requirements
Value requirements are the demands for value by customers (for satisfaction), businesses (for profit), and business processes (for efficient value processes). Value can be economic and/or social and/or environmental, demanded by customers and/or businesses and/or business processes to fulfill the customer’s requirements and/or to achieve strategic goals and/or to ensure efficient value processes.
C. Design Requirements
Design requirements, also known as HOWs, are the requirements that must be met to fulfill the ‘blended value’ requirements in the QFD process. After the needs are revealed, the company’s technicians or product development team develop a set of design requirements in measurable and operable technical terms [2] to fulfill the value requirements.
IV. LITERATURE REVIEW
A. Business model and e-business model
Scholars have referred to the business model as a statement, a description, a representation, an architecture, a conceptual tool or model, a structural template, a method, a framework, a pattern, and as a set, as found by Zott et al. [3]. A study by Zott et al. [3] found that of a total of 49 conceptual studies in which the business model is clearly defined, almost one fourth are related to e-business. The majority of research into business models in the IS field has been concerned with e-business and e-commerce, and there have been some attempts to develop convenient classification schemas [1]. For example, definitions, components, and classifications of e-business models have been suggested [4, 5]. Timmers [6] was the first to define the e-business model in terms of its elements and their interrelationships. Applegate [7] introduces six e-business models: focused distributors, portals, producers, infrastructure distributors, infrastructure portals, and infrastructure producers. Weill and Vitale [8] suggest a subdivision into so-called atomic e-business models, which are analyzed according to a number of basic components. There exist a few more e-business modeling approaches, such as Rappa [9], Dubosson-Torbay et al. [10], Tapscott, Ticoll and Lowy [11], Gordijn and Akkermans [12], and more. But the sustainability concept is still entirely absent from all of the e-business modeling ideas.
B. Sustainability of Business
A sustainable business means a business with a “dynamic balance among three mutually interdependent elements: (i) protection of ecosystems and natural resources; (ii) economic efficiency; and (iii) consideration of social wellbeing such as jobs, housing, education, medical care and cultural opportunities” [13]. Even though many scholars have focused their studies on sustainability incorporating economic, social, and environmental perspectives, still “most companies remain stuck in a social responsibility mind-set in which societal issues are at the periphery, not the core. The solution lies in the principle of shared (blended) value, which involves creating economic value in a way that also creates value for society by addressing its needs and challenges” [14]. Moreover, most scholars mainly express the need for blended value, and very few of them provide more than hypothetical ideas for maintaining sustainability. A complete business model for sustainability with operational directions is still lacking.
C. E-business and Sustainability
E-business is the point where economic value creation and information technology/ICT come together [15]. ICT can have both positive and negative impacts on society and the environment. Positive impacts can come from dematerialization and online delivery, transport and travel substitution, a host of monitoring and management applications, greater energy efficiency in production and use, and product stewardship and recycling; negative impacts can come from energy consumption and the materials used in the production and distribution of ICT equipment, energy consumption in use directly and for cooling, short product life cycles and e-waste, and exploitative applications [16]. Technology is a source of environmental contamination during product manufacture, operation, and disposal [17-19]. “Corporations have the knowledge, resources, and power to bring about enormous positive changes in the earth’s ecosystems” [20]. Consistent with the definition of the environmental sustainability of IT [21], the sustainability of e-business can be defined as the activities within the e-business domain to minimize the negative impacts and maximize the positive impacts on society and the environment through the design, production, application, operation, and disposal of information technology and information technology-enabled products and services throughout their life cycle.
V. RESEARCH METHODOLOGY
In this approach, initially ‘a sustainable e-business modeling approach based on blended value’ is proposed after considering the previous literature and the research objectives. This proposed model can be tested with sample data to justify its capability and validity along with the progress of the research. Any business can be chosen for data collection. Sample data can be collected from a field study by conducting semi-structured interviews with the customers and through focus group meetings with the dept-in-charges. Once the model’s capability is proven, a large volume of data will be collected from the customers and the organisations by organizing surveys and focus group meetings to test the comprehensive model. Therefore, both qualitative and quantitative methods will be used in this research approach for data collection and analysis.
A. Research Elements
This research approach uses ‘blended value requirements’ and ‘sustainability’ as the main elements. According to our approach, blended value consists of three values: customer value, business value, and process value. Sustainability of business includes economic value, social value, and environmental value. Therefore, to be competitive in the market the value needs to be measured from three dimensions:
− What total value is demanded by the customers?
− What total value is required by the businesses based on their strategy to reach their goals?
− What process value is required by the businesses to have efficient and sustainable value processes?
Consequently, based on the measurement from these three dimensions, blended value requirements can be categorised into 9 (nine) groups which will be used as the main elements of this approach. They are as follows:
1) Economic value for customer requirements: This means any of the customer’s value requirements which is somehow economically related, directly or indirectly, to the product or service that is to be delivered to the customer. In other words, these requirements mean all types of economic benefits that the customers are looking for. For example, the price of the product or service, quality, after-sales service, availability or ease of access, delivery, etc. fall under this category.
2) Social value for customer requirements: Social value requirements for the customer include any value delivered by the businesses for the customer’s society. These social value requirements are not the social responsibilities that the business organisations are planning to perform; rather, they are the requirements that the customers are expecting or indirectly demanding for their society from the products or services or from the supplier of the products or services.
3) Environmental value for customer requirements: Environmental value requirements stand for all the environmental factors related, directly or indirectly, to the product or service delivered to the customer, or related to the operations of the supplier of the product or service, such as emissions (air, water, and soil), waste, radiation, noise, vibration, energy intensity, material intensity, heat, direct intervention on nature and landscape, etc. [22]. This environmental value is demanded or expected by the customers.
4) Economic value for business requirements: These requirements are those which add some economic value to the business, directly or indirectly, if they are fulfilled. These economic requirements are not demanded by the customers; instead, they are identified by the businesses to be fulfilled to achieve the planned future goals. For example, reducing the cost of production, increasing sales and/or profit, getting cheaper raw materials, minimizing packaging and delivery costs, replacing employees with more efficient machinery, etc.
5) Social value for business requirements: Social value requirements are to add some value to the society from the business’s point of view if they are fulfilled. These value requirements reflect what social value the business is planning and willing to deliver to the customers’ society in time, regardless of the customers’ demand. For instance, Lever Bros Ltd. uses a few principles to focus on social value, such as emphasising employees’ personal development, training, health, and safety, and improving the well-being of the society at large [23].
6) Environmental value for business requirements: Adding environmental value can be a competitive advantage for businesses, since they can differentiate themselves by creating products or processes that offer environmental benefits. By implementing environmentally friendly operations businesses may achieve cost reductions, too. For example, reduced contamination, recycling of materials, improved waste management, minimized packaging, etc. reduce the impact on the environment and the costs.
7) Economic value for process requirements: These are mainly related to cost savings within the existing value processes which can later be transferred to the customers. The managers identify these value-creating inefficiencies within the existing processes and try to correct them, which results in some sort of economic benefit. For example, up-to-date technologies, an adequate amount of training, using efficient energy, improved supply chain management systems, etc. can increase the efficiency of the value processes, which can certainly add some economic value to the organisation.
8) Social value for process requirements: To identify these requirements, managers look at the whole value process of the organisation and see whether there is any scope to add some value to the society they are operating in within the existing value process systems. For instance, educating disadvantaged children, organising skills training for unemployed people, employing disabled people, establishing schools and colleges, sponsoring social events, organising social gatherings, organising awareness programs, etc. can add value to the society, and most of these requirements can be easily fulfilled by the businesses with little or no investment or effort.
Figure 1: Research approach.
9) Environmental value for process requirements: To fulfil these requirements, the businesses try to find and implement all the necessary steps within the existing value processes that will stop or reduce the chances of negative impacts and facilitate positive impacts on the environment, thus adding some value to the environment. For example, leakage of water/oil/heat, inefficient disposal and recycling of materials, unplanned pollution (air, water, sound) management, heating and lighting inefficiency, etc. within the existing value processes result in damage to the environment. Thus these requirements need to be fulfilled to minimize the impact of the current value processes on the environment.
B. Research Tools
1) Quality Function Deployment (QFD): QFD supports the product design and development process; it was laid out in the late 1960s to early 1970s in Japan by Akao [24]. QFD is based on collecting and analysing the voice of the customer, which helps to develop products with higher quality that meet customer needs [25]. Therefore, it can also be used to analyse business needs and value process needs. The popular application fields of QFD are product development, quality management and customer needs analysis; however, the utilisation of the QFD method has spread to other manufacturing fields over time [26]. Recently, companies have been successfully using QFD as a powerful tool that addresses strategic and operational decisions in businesses [27]. This tool is used in various fields for determining customer needs, developing priorities, formulating annual policies, benchmarking, environmental decision making, etc. Chan and Wu [26] and Mehrjerdi [27] provide a long list of areas where QFD has been applied. QFD, in this approach, will be applied as the main tool to analyse customer needs, business needs, and process value needs. It will also be used to develop and select optimised design requirements based on the organisation’s capability to satisfy the blended value requirements for the sustainability of the businesses.
QFD, in this approach, will be applied as the main tool to address customers’ requirements (CRs) and integrate those requirements into design requirements (DRs) to meet the sustainability requirements of buyers and stakeholders. In QFD modeling, ‘customer requirements’ are referred to as WHATs and ‘how to fulfil the customer’s requirements’ are referred to as HOWs. The process of using appropriate HOWs to meet the given WHATs is represented as a matrix (Fig. 2). Different users build different QFD models involving different elements, but the most simple and widely used QFD model contains at least the customer requirements (WHATs) and their relative importance, the technical measures or design requirements (HOWs) and their relationships with the WHATs, and the importance ratings of the HOWs. Six sets of input information are required in a basic QFD model: (i) WHATs: attributes of the product as demanded by the customers, (ii) IMPORTANCE: relative importance of the above attributes as perceived by the customers, (iii) HOWs: design attributes of the product or the technical descriptors, (iv) Correlation Matrix: interrelationships among the design requirements, (v) Relationship Matrix: relationships between WHATs and HOWs (strong, medium or weak), and (vi) Competitive Assessment: assessment of customer satisfaction with the attributes of the product under consideration against the product produced by its competitor or the best manufacturer in the market [32]. The following steps are followed in a QFD analysis:
Step 1: Customers are identified and their needs are collected as WHATs;
Step 2: Relative importance ratings of WHATs are determined;
Step 3: Competitors are identified, customer competitive analysis is conducted, and customer performance goals for WHATs are set;
Step 4: Design requirements (HOWs) are generated;
Step 5: Correlations between design requirements (HOWs) are determined;
Step 6: Relationships between WHATs and HOWs are determined;
Step 7: Initial technical ratings of HOWs are determined;
Step 8: Technical competitive analysis is conducted and technical performance goals for HOWs are set;
Step 9: Final technical ratings of HOWs are determined.
Lastly, based on the rankings of weights of HOWs the design requirements are selected.
Figure 2: QFD layout.
2) Analytic Hierarchy Process (AHP): Saaty [28] developed the analytic hierarchy process, an established multi-criteria decision making approach that employs a unique
method of hierarchical structuring of a problem and subsequent ranking of alternative solutions by a paired comparison technique. The strength of AHP lies in its robust and well-tested method of solution and its capability of incorporating both quantitative and qualitative elements in evaluating alternatives [29]. AHP is a powerful and widely used multi-criteria decision-making technique for prioritizing decision alternatives of interest [30]. AHP is frequently used in the QFD process, for instance by Han et al. [31], Das and Mukherjee [29], Park and Kim [30], Mukherjee [32], Bhattacharya et al. [33], Chan and Wu [34], Xie et al. [35], Wang et al. [36] and more. In this research approach, AHP will be used to prioritize the blended value requirements before developing design requirements in the QFD process, based on customer value requirements, business value requirements, and process value requirements.
3) Delphi Method: The Delphi method has proven a popular tool in information systems (IS) research [37-42]; it was originally developed in the 1950s by Dalkey and his associates at the Rand Corporation [43]. Okoli and Pawlowski [42] and Grisham [43] provide lists of examples of research areas where Delphi was used as the major tool. This research approach will use the Delphi method in designing and selecting optimised design requirements for the company in the QFD process to develop the blended value based e-business model.
VI. RESEARCH PROCESS
Data will be collected from face-to-face interviews and structured focus group meetings. In this stage, blended value requirements (economic value, social value, and environmental value for customer requirements, business requirements and value process requirements) for particular products will be identified based on the existing value proposition, value process and value delivery. Customer requirements will be identified through open-ended semi-structured questionnaires. Business requirements and value process requirements will be identified through focus group meetings with the dept-in-charges. The required number of questionnaires will be collected from the customers, and based on the feedback from the customers the necessary data will be collected from structured focus group meetings. The collected data will be analyzed using AHP and QFD. A few steps will be used to complete the data analysis: (i) The blended value requirements will be grouped and categorized into classifications based on the type of requirements. Then they will be prioritized using AHP to find out the importance level of each of the requirements; (ii) The target level for each total requirement will be set depending on the importance level of each requirement and the organisation’s capability and strategy. After prioritizing, total requirements will be benchmarked, if necessary, to set the target levels of the requirements; (iii) Based on the target levels of each requirement, design requirements will be developed. Design requirements will be developed through the Delphi method after structured discussions or focus group meetings with the related dept-in-charges. Design requirements will be benchmarked, if necessary, before setting target values for those requirements. Also, costs will be determined for elevating each design requirement; (iv) A relationship matrix between blended value requirements and design requirements will be developed using QFD to get the weight of each design requirement. Then, based on the weights (how much each design requirement contributes to meeting each of the total requirements), certain design requirements will be selected initially; (v) Then trade-offs among the initially selected design requirements will be identified for cost savings, since improving one design requirement can have a positive, negative, and/or no effect on other design requirements; (vi) Finally, design requirements will be chosen based on the following criteria:
− initial technical ratings based on the relationship matrix between total requirements and design requirements;
− technical priorities depending on the organisation’s capability; and
− trade-offs among the design requirements.
VII. RESEARCH ANALYSIS
In the QFD process the relationship between a blended value requirement (BVR) and a design requirement (DR) is described as Strong, Moderate, Little, or No relationship, which is later replaced by weights (e.g. 9, 3, 1, 0) to give the relationship values needed for the design requirement importance weight calculations. These weights represent the degree of importance attributed to the relationship. Thus, as shown in Table 1, the importance weight of each design requirement can be determined by the following equation:

$D_w = \sum_{i=1}^{m} C_i R_{iw}, \quad \forall w,\ w = 1, \ldots, n$    (1)

where $D_w$ is the importance weight of the w-th design requirement, $C_i$ is the importance weight of the i-th blended value requirement, $R_{iw}$ is the relationship value between the i-th blended value requirement and the w-th design requirement, $n$ is the number of design requirements, and $m$ is the number of blended value requirements.

In Table 1, customer requirements, business requirements, and process requirements are considered as parts of the blended value requirements. The importance weights of the blended value requirements will be calculated using AHP after getting data from the customers and businesses, and the importance weights of the design requirements will be decided by the managers through the Delphi method. According to the QFD matrix, the absolute importance of a blended value requirement can be determined by the following equation:

$AI_i = \sum_{w=1}^{n} R_{iw} D_w, \quad \forall i,\ i = 1, \ldots, m$    (2)

where $AI_i$ is the absolute importance of the i-th blended value requirement ($BVR_i$).

Therefore, the absolute importance of the 1st blended value requirement ($BVR_1$) will be:

$AI_1 = R_{11} D_1 + R_{12} D_2 + \ldots + R_{1n} D_n$
TABLE I. QFD MATRIX

Requirements        DR_1        DR_2        .....   DR_n        A.I.   R.I.
CRs   BVR_1         R_11·D_1    R_12·D_2    .....   R_1n·D_n    AI_1   RI_1
      ....          ....        ....        .....   ....        ....   ....
BRs   BVR_i         R_i1·D_1    R_i2·D_2    .....   R_in·D_n    AI_i   RI_i
      ....          ....        ....        .....   ....        ....   ....
PRs   BVR_m         R_m1·D_1    R_m2·D_2    .....   R_mn·D_n    AI_m   RI_m
A.I.                AI(DR_1)    AI(DR_2)    .....   AI(DR_n)
R.I.                RI(DR_1)    RI(DR_2)    .....   RI(DR_n)

Note: A.I. = Absolute importance; R.I. = Relative importance; DR = Design requirements; CR = Customer requirements; BR = Business requirements; PR = Process requirements; BVR = Blended value requirements.
Thus, the relative importance of the 1st blended value requirement ($BVR_1$) will be:

$RI_1 = AI_1 / \sum_{i=1}^{m} AI_i$    (3)

where $RI_1$ is the relative importance of the 1st blended value requirement and $AI_1$ its absolute importance.

Similarly, the absolute importance and the relative importance of all other blended value requirements can be determined by following Equations (2) and (3). Therefore, the absolute importance of the first design requirement ($DR_1$) will be:

$D_1 = C_1 R_{11} + C_2 R_{21} + \ldots + C_m R_{m1}$

In the same way, the relative importance of the 1st design requirement can be determined by the following equation:

$RI(DR_1) = D_1 / \sum_{w=1}^{n} D_w$    (4)

where $RI(DR_1)$ is the relative importance of the 1st design requirement and $D_1$ its absolute importance.

The absolute importance and the relative importance of all other design requirements can be determined by following Equations (1) and (4). Once the absolute importance and relative importance of the blended value requirements and design requirements are determined, the cost trade-offs will be identified through the correlation matrix of QFD as mentioned in Section IV. The trade-offs among the selected design requirements are identified based on whether improving one design requirement has a positive, negative, and/or no effect on other design requirements. Finally, after considering the initial technical ratings obtained from the absolute and relative importances of the blended value requirements and design requirements, the organisation’s capability, and the cost trade-offs, optimized design requirements will be selected to develop the blended value based sustainable e-business model.
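As a concrete illustration of the calculations in Equations (1) through (4), the following is a minimal sketch in Python with NumPy; the function name and the toy numbers are hypothetical and not data from this research:

```python
import numpy as np

def qfd_importance(C, R):
    """Importance calculations of Eqs. (1)-(4).
    C : importance weights of the m blended value requirements (from AHP)
    R : m x n relationship matrix with weights in {9, 3, 1, 0}
    """
    C = np.asarray(C, dtype=float)
    R = np.asarray(R, dtype=float)
    D = C @ R                     # Eq. (1): D_w = sum_i C_i * R_iw
    AI = R @ D                    # Eq. (2): AI_i = sum_w R_iw * D_w
    RI_bvr = AI / AI.sum()        # Eq. (3): relative importance of each BVR
    RI_dr = D / D.sum()           # Eq. (4): relative importance of each DR
    return D, RI_dr, AI, RI_bvr

# hypothetical toy case: 3 blended value requirements, 2 design requirements
D, RI_dr, AI, RI_bvr = qfd_importance([0.5, 0.3, 0.2], [[9, 1], [3, 3], [0, 9]])
```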
VIII. CONCLUSION AND DISCUSSION There are number of ideas and proposals about business
modeling and e-business modeling. But there is no clear proposal or idea about sustainable e-business modeling. Similarly, there are only few thoughts in the literature about ‘blended value’ or shared value. But all of them considered blended value only from customer’s value requirements point of view. In this approach, all of the value requirements (customer, business, and process) are taken into consideration to develop the model. Therefore, this modeling approach is significant for four reasons. Firstly, there are few modeling approaches exist about ‘e-business’ and ‘sustainable business’ separately, but there is no approach available about e-business modeling and sustainability. Secondly, ‘blended (economic, social, environmental)value’ is considered not only from customer’s point of view but also from business’s point of view and value process’s point of view since the fulfillment of only customer’s requirements cannot guarantee long run sustainability. Thirdly, what was not shown before is how the sustainability dimensions can be integrated with the value dimensions in developing an e-business model. Fourthly, this modeling approach shows the way for efficient allocation of resources for the businesses by indicating theimportance level of the value requirements for the sustainability.
We have shown how the proposed model is to be implemented with detailed formulas, after reviewing the extensive literature in this field. We have also identified the necessary tools for this approach and explained the whole research process step by step. Our further research will be directed at the implementation of this approach in real-life businesses. There should not be much difficulty in implementing this approach in any real-life business, beyond accommodating the elements of this approach in different business contexts.
Protein Structure Prediction in 2D Triangular Lattice Model using Differential Evolution Algorithm

Aditya Narayan Hati, IT Department, NIT Durgapur, Durgapur, India. Email: [email protected]
Nanda Dulal Jana, IT Department, NIT Durgapur, Durgapur, India. Email: [email protected]
Sayantan Mandal, IT Department, NIT Durgapur, Durgapur, India. Email: [email protected]
Jaya Sil, Dept. of CS and Tech., Bengal Eng. and Sc. University, West Bengal, India. Email: [email protected]
Abstract—Protein structure prediction from the primary structure of a protein is a very complex and hard problem in computational biology. Here we propose a differential evolution (DE) algorithm on the 2D triangular hydrophobic-polar (HP) lattice model for predicting the structure of a protein from its primary structure. We propose an efficient and simple backtracking algorithm to avoid overlapping of the given sequence. This methodology is experimented on several benchmark sequences and compared with other similar implementations. We see that the proposed DE performs better and more consistently than the previous ones.

Keywords—2D triangular lattice model; hydrophobic-polar model; evolutionary computation; differential evolution; protein; backtracking; protein structure prediction.
I. INTRODUCTION

Protein plays a key role in all biological processes. It is a long sequence of the 20 basic amino acids [2]. The exact way proteins fold just after being synthesized in the ribosome is unknown. As a consequence, the prediction of protein structure from its amino acid sequence is one of the most prominent problems in bioinformatics. There are several experimental methods for protein structure determination, such as nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography, but these methods are expensive in terms of equipment, computation and time. Therefore computational approaches to protein structure prediction have attracted much attention. The HP lattice model [1] is the simplest and most widely used model. In this paper, the 2D triangular lattice model is used for protein structure prediction because this model resolves the parity problem of the 2D HP lattice model and gives better structures.

From a computational point of view, protein structure prediction (PSP) in the 2D HP model is NP-complete [6]. It can be transformed into an optimization problem. Recently, several methods have been proposed to solve the protein structure prediction problem, but no efficient method exists since the problem is NP-hard. Here we introduce a differential evolution algorithm with a simple backtracking correction to make the sequence self-avoiding. The objective of this work is to evaluate the applicability of DE to PSP using the 2D triangular HP model and to compare its performance with other contemporary methods.
II. 2D TRIANGULAR LATTICE MODEL

The HP model is the most widely used lattice model. It was introduced by Dill et al. [1]. In it, the 20 basic amino acids are classified into two categories, (i) hydrophobic and (ii) polar, according to their affinity towards water.

When a peptide bond occurs between two amino acids, those two amino acids are said to be consecutive; otherwise they are non-consecutive. When two non-consecutive amino acids are placed side by side in the lattice, we say that they are in topological contact. We have to design the model in such a way that the sequences are self-avoiding and the non-consecutive H amino acids form a hydrophobic core. But the HP lattice model possesses a flaw referred to as the parity problem: two residues at an even distance from one another in the sequence cannot be placed so that they are in topological contact.
In order to solve this parity problem, the 2D triangular HP lattice model is introduced [10]. In the triangular lattice model, let X and Y be the two primary axes of a square lattice. Take an auxiliary axis W = X + Y along the diagonal (Fig 1) and skew it until the angle between the axes becomes 120° (Fig 2). In this way, we obtain the 2D triangular HP lattice model. For example (as in Fig 3), the lattice point P = (x, y) has six neighbors: (x+1, y) as R, (x-1, y) as L, (x, y+1) as LU, (x, y-1) as RD, (x+1, y+1) as RU and (x-1, y-1) as LD.
Fig 1: Adding an auxiliary axis along the diagonal of a square lattice.
Fig 2: Skewing the square lattice into a 2D triangular lattice.
Fig 3: The 2D triangular lattice model: neighbors of vertex (x, y).
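The six moves can be written down directly as coordinate offsets; a minimal sketch (the names are ours, not the paper's):

# The six neighbor moves of a point (x, y) in the 2D triangular lattice,
# as described in the text and in Fig 3.
MOVES = {
    "R":  (1, 0),
    "L":  (-1, 0),
    "LU": (0, 1),
    "RD": (0, -1),
    "RU": (1, 1),
    "LD": (-1, -1),
}

def neighbor(point, move):
    """Return the lattice point reached from `point` by a relative move."""
    dx, dy = MOVES[move]
    return (point[0] + dx, point[1] + dy)

print(neighbor((0, 0), "RU"))  # (1, 1)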
III. PREVIOUS WORK

In the 2D HP lattice model, a lot of work has been done using evolutionary algorithms, such as the simple genetic algorithm and its variations [9], and the differential evolution algorithm with a vector encoding scheme for initialization [11] [12]. But the parity problem in this lattice model is a severe bottleneck, and therefore the 2D triangular lattice model is considered [13]. Recently, SGA, HHGA and ERS-GA have been applied to this model [10]. Tabu search, particle swarm optimization and hybrid algorithms (hybrids of GA and PSO) have also been applied to this model.
IV. METHODOLOGY

In this section, the strategies proposed to improve the performance of the DE algorithm, applied to protein structure prediction using the 2D triangular lattice model, are described.

A. Differential Evolution Algorithm

The differential evolution algorithm was introduced by Storn and Price [3] [4]. It is an evolutionary algorithm used for optimization problems. It is particularly useful if the gradient of the problem is difficult or even impossible to derive. Consider a fitness (objective) function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. To minimize the function $f$, find $a \in \mathbb{R}^n$ such that $f(a) \leq f(b)$ for all $b \in \mathbb{R}^n$. Then $a$ is called a global minimum of the function $f$. It is usually not possible to pinpoint the global minimum exactly, so candidate solutions with sufficiently good fitness values are acceptable for practical purposes.
There are several variants of DE proposed by Storn. We consider DE/rand/1/bin and DE/best-to-rand/2/exp for this problem. At first, the first strategy is used. When stagnation persists for 100 generations, the second strategy is used; when stagnation occurs again for 100 generations, the first strategy is resumed, and so on. The algorithm (Fig 4) is described below.

After initialization of the population, the algorithm performs a number of operations and calculates the fitness values of the individuals. First, mutation happens, switching between the two strategies on stagnation as described above. When the first strategy is chosen, binomial crossover is applied; when the second strategy is chosen, exponential crossover is applied. After that, a repair function is called to convert infeasible solutions into feasible ones. Then the selection procedure is carried out based on a greedy strategy. The initialization, mutation, crossover and selection steps are described in the following subsections.
1) Initialization

In DE, the upper and lower bounds of each individual component are stored in a 2 × D matrix, called the initialization matrix, where D is the dimension of each individual. The vector components are created in the following way:

$x_{j,i,0} = b_{j,L} + rand_j(0,1) \cdot (b_{j,U} - b_{j,L})$

where $b_{j,L}$ and $b_{j,U}$ are the lower and upper bounds of the j-th component and $rand_j(0,1)$ is a uniform random number with $0 \leq rand_j(0,1) < 1$.

There are basically three types of coordinates used to represent the amino acids in a lattice: Cartesian coordinates, internal coordinates and relative coordinates. The proposed DE uses relative coordinates. Based on this model, there are 6 possible movements L, R, LU, LD, RU, RD from a point P(x, y), defined as follows: (x, y+1) as LU, (x-1, y) as L, (x-1, y-1) as LD, (x, y-1) as RD, (x+1, y) as R, (x+1, y+1) as RU. If the number of amino acids in the given sequence is n, then the total number of moves in the amino acid sequence is (n-1). For each target vector, we randomly choose (n-1) moves from 1 to 6. In this way we initialize the whole population matrix and calculate the energy of each target vector using the fitness function. If a target vector is infeasible, we set its fitness value to 1. The population size np is a parameter of DE; we take np as 5 times the dimension of the target vector.
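A minimal sketch of the initialization in the relative-move encoding (np taken as 5 times the dimension, per the text; function names are ours):

import random

def init_population(sequence_length, np_factor=5):
    """Randomly initialize a DE population of relative-move vectors.

    Each individual has (n - 1) moves, each drawn from the six directions
    encoded as integers 1..6; the population size is np_factor times the
    dimension of the target vector.
    """
    dim = sequence_length - 1
    np_size = np_factor * dim
    return [[random.randint(1, 6) for _ in range(dim)] for _ in range(np_size)]

population = init_population(20)
print(len(population), len(population[0]))  # 95 19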
2) Mutation

Mutation is a kind of exploration technique which can explore very rapidly. It creates np donor (trial) vectors. The mutation process of the first strategy (DE/rand/1) is:

$V_{i,g} = X_{r_0,g} + F (X_{r_1,g} - X_{r_2,g})$, where $r_0 \neq r_1 \neq r_2 \neq i$.

The mutation process of the second strategy (DE/best-to-rand/2) is:

$V_{i,g} = X_{r_0,g} + F (X_{best,g} - X_{r_1,g}) + F (X_{best,g} - X_{r_2,g})$, where $r_0 \neq r_1 \neq r_2 \neq i$.

Here $X_{best}$ is the best target vector in the current generation and $F \in (0,1)$ is a parameter called the weighting factor. We have taken F = 0.25 * (1 + rand), where rand is a uniform random number.
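A sketch of the two mutation strategies as reconstructed above; the vectors are treated as numeric lists, and we assume (our assumption, not stated in the paper) that the donor components are later rounded and clamped back to the valid move range 1..6 before repair:

import random

def mutate_rand_1(pop, i, F):
    """DE/rand/1: V_i = X_r0 + F * (X_r1 - X_r2), with r0 != r1 != r2 != i."""
    r0, r1, r2 = random.sample([k for k in range(len(pop)) if k != i], 3)
    return [x0 + F * (x1 - x2)
            for x0, x1, x2 in zip(pop[r0], pop[r1], pop[r2])]

def mutate_best_to_rand_2(pop, i, best, F):
    """DE/best-to-rand/2: V_i = X_r0 + F*(X_best - X_r1) + F*(X_best - X_r2)."""
    r0, r1, r2 = random.sample([k for k in range(len(pop)) if k != i], 3)
    return [x0 + F * (b - x1) + F * (b - x2)
            for x0, x1, x2, b in zip(pop[r0], pop[r1], pop[r2], best)]

F = 0.25 * (1 + random.random())  # weighting factor as used in the paper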
3) Crossover

Crossover is also an exploration technique, but it explores in a more restricted manner. In DE there are two basic crossover techniques: (i) binomial crossover and (ii) exponential crossover. As we use DE/rand/1/bin and DE/best-to-rand/2/exp, we employ both.

Binomial crossover is defined as:

$U_{j,i,g} = V_{j,i,g}$ if $rand(0,1) \leq Cr$ or $j = j_{rand}$, and $U_{j,i,g} = X_{j,i,g}$ otherwise.

Exponential crossover is defined as:

$U_{j,i,g} = V_{j,i,g}$ for $j = \langle n \rangle_D, \langle n+1 \rangle_D, \ldots, \langle n+L-1 \rangle_D$, and $U_{j,i,g} = X_{j,i,g}$ for all other $j \in [1, D]$.

$Cr \in (0,1)$ is a parameter called the crossover probability; in the exponential crossover, the $\langle \rangle_D$ operator denotes the modulo-D operator. We have taken Cr = 0.8. $j_{rand}$ is a random index chosen from 1 to D, where D is the dimension of the target vector. The pseudo codes of binomial and exponential crossover are given in Fig 5 and Fig 6. Here $U_{j,i}$ is the trial vector, $V_{j,i}$ is the donor vector and $X_{j,i}$ is the target vector.
4) Selection

Selection is an exploitation technique which drives the search from local minima towards the global minimum. After mutation and crossover, we apply a repair function to repair the infeasible solutions, and then calculate the fitness function from the coordinates. The fitness function is the free energy of the sequence calculated from the model; a lower energy implies higher stability of the molecule. Here we consider the hydrophobic-polar model. The free energy calculation and the repair process are described later. Selection is a tournament procedure based on the value of the fitness function:

$X_{i,g+1} = U_{i,g}$ if $f(U_{i,g}) \leq f(X_{i,g})$, and $X_{i,g+1} = X_{i,g}$ otherwise.
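The greedy selection rule admits a short sketch (fitness here stands for the free-energy function described later):

def select(target, trial, fitness):
    """Greedy tournament: keep the trial vector if its free energy is no
    worse (no higher) than the target's; otherwise keep the target."""
    return trial if fitness(trial) <= fitness(target) else target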
B. Repair function

After the mutation and crossover operations, a target vector may become infeasible, i.e. the sequence becomes non-self-avoiding. There are three ways to handle this problem: discarding the infeasible solution, using a penalty function, or repairing the solution using backtracking. We propose here the repair function using backtracking; it is illustrated in Fig 7.

The random movement is stored in 'S'. Each node has a value 'back' which stores the number of invalid directions tried so far. Whenever the back value becomes greater than 5, a backtrack occurs. A pointer 'i' keeps track of the current working node. Every time a backtrack occurs, the 'i' pointer is decreased and the back value of the node is reset to 0. For a particular node, each placement attempt increases its back value by 1. If a particular direction is not available, a fixed strategy is followed to place that amino acid in a new direction: if right is not available, it tries the down direction; if down is not available, then left; if left is not available, it moves to the up direction; and if up is not available, then right. The number of attempts to place a particular amino acid with respect to a particular coordinate is 4. When the value of 'i' equals the length of the amino acid sequence, the whole folding has been repaired.
Fig 7: Repair function using backtracking
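A simplified sketch of the repair procedure, assuming the six-move integer encoding from the initialization step; it follows the back-counter idea of Fig 7 (retrying the remaining directions and backtracking once a node's back value exceeds 5), though the exact direction-cycling order here is our simplification:

MOVES = {1: (1, 0), 2: (-1, 0), 3: (0, 1), 4: (0, -1), 5: (1, 1), 6: (-1, -1)}

def repair(moves):
    """Make a move string self-avoiding by backtracking, per Fig 7."""
    moves = list(moves)
    pos = [(0, 0)]                      # lattice points placed so far
    back = [0] * len(moves)             # invalid attempts per node
    i = 0
    while i < len(moves):
        dx, dy = MOVES[moves[i]]
        nxt = (pos[-1][0] + dx, pos[-1][1] + dy)
        if nxt not in pos:              # valid placement: advance
            pos.append(nxt)
            i += 1
            continue
        back[i] += 1
        if back[i] > 5:                 # node exhausted: backtrack one step
            back[i] = 0
            pos.pop()
            i -= 1
            if i < 0:
                raise ValueError("sequence cannot be repaired")
            back[i] += 1
        moves[i] = moves[i] % 6 + 1     # try the next direction
    return moves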
jr = floor(rand(0,1) * D);           // 0 <= jr < D
j = jr;
do {
    U[j][i] = V[j][i];               // child inherits a mutant parameter
    j = (j + 1) % D;                 // increment j modulo D
} while (rand(0,1) < Cr && j != jr); // take another mutant parameter?
while (j != jr) {                    // take the rest, if any, from the target
    U[j][i] = X[j][i];
    j = (j + 1) % D;
}

Fig 5: Pseudo code of exponential crossover.
jr = floor(rand(0,1) * D);           // 0 <= jr < D
for j = 1 to D
    if rand(0,1) < Cr or j == jr
        U[j][i] = V[j][i];
    else
        U[j][i] = X[j][i];

Fig 6: Pseudo code of binomial crossover.
C. The free energy calculation procedure

The free energy of an amino acid sequence is determined by the topological contacts of non-consecutive H-H residues. The free energy of a protein can be calculated by the following formula:

$E = \sum_{i,j} e_{ij} \cdot r_{ij}$

where $e_{ij} = -1.0$ if residues $S_i$ and $S_j$ are a pair of H and H residues and $e_{ij} = 0.0$ otherwise, and $r_{ij} = 1$ if $S_i$ and $S_j$ are adjacent in the lattice but not connected amino acids and $r_{ij} = 0$ otherwise.
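A sketch of the free-energy computation, assuming the same move encoding as above; it decodes a self-avoiding move string into coordinates and counts the non-consecutive H-H contacts:

MOVES = {1: (1, 0), 2: (-1, 0), 3: (0, 1), 4: (0, -1), 5: (1, 1), 6: (-1, -1)}

def free_energy(sequence, moves):
    """sequence: string over {'H', 'P'}; moves: n-1 move codes (self-avoiding)."""
    # Decode the relative moves into lattice coordinates.
    coords = [(0, 0)]
    for m in moves:
        dx, dy = MOVES[m]
        coords.append((coords[-1][0] + dx, coords[-1][1] + dy))
    occupied = {c: i for i, c in enumerate(coords)}
    energy = 0
    for i, c in enumerate(coords):
        if sequence[i] != 'H':
            continue
        for dx, dy in MOVES.values():
            j = occupied.get((c[0] + dx, c[1] + dy))
            # Topological contact: adjacent in the lattice, not consecutive
            # in the chain; count each H-H pair once (j > i + 1).
            if j is not None and sequence[j] == 'H' and j > i + 1:
                energy -= 1
    return energy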
V. RESULTS AND COMPARISON

For the experiments, 8 benchmark sequences of synthetic proteins in the 2D HP lattice model have been chosen [10]. The minimum energy of these sequences is still unknown in the 2D triangular lattice model. For this model, the Simple Genetic Algorithm (SGA), Hybrid Genetic Algorithm (HGA) and Elite-based Reproduction Strategy with Genetic Algorithm (ERS-GA) have been proposed earlier. Comparing our results with these algorithms' results, it is seen that the DE scheme outperforms the previous algorithms and works more consistently in the 2D triangular lattice model.

For these experiments, a machine with a Pentium Core 2 Duo (1.6 GHz) and 2 GB RAM running Linux is used, with Octave as the testing platform. We ran the algorithm 20 times for each benchmark sequence and compared with the available results. Table 1 shows the benchmark sequences on which we apply our algorithm. Table 2 shows the comparison of the minimum energies found by SGA, HGA, ERS-GA and DE. Table 3 shows the comparison of the minimum energy (avg/best) between ERS-GA and DE.

From Table 2 it can be observed that the results obtained from DE are better than the results of the others for the second and eighth benchmark sequences. The results in bold represent the better ones. For the first, third and fourth sequences, both ERS-GA and DE work fine. For the rest of the benchmark sequences, ERS-GA gives better results. However, from Table 3 it can be concluded that DE works most consistently on these benchmark sequences, as it gives better average results for 6 of the 8 benchmark sequences.
TABLE I. LIST OF 8 BENCHMARK SEQUENCES

Sequence  Length  Amino acid sequence
1         20      (HP)2PH(HP)2(PH)2HP(PH)2
2         24      H2P2(HP2)6H2
3         25      P2HP2(H2P4)3H2
4         36      P(P2H2)2P5H5(H2P2)2P2H(HP2)2
5         48      P2H(P2H2)2P5H10P6(H2P2)2HP2H5
6         50      H2(PH)3PH4PH(P3H)2P4(HP3)2HPH4(PH)3PH2
7         60      P(PH3)2H5P3H10PHP3H12P4H6PH2PHP
8         64      H12(PH)2((P2H2)2P2H)3(PH)2H11

TABLE II. THE BEST FREE ENERGY OBTAINED BY EACH ALGORITHM

Sequence  SGA  HGA  ERS-GA  DE
1         -11  -15  -15     -15
2         -10  -13  -13     -15
3         -10  -10  -12     -12
4         -16  -19  -20     -20
5         -26  -32  -32     -28
6         -21  -23  -30     -27
7         -40  -46  -55     -49
8         -33  -46  -47     -50

TABLE III. COMPARISON OF THE RESULTS (AVG/BEST) BETWEEN ERS-GA AND DE

Sequence  ERS-GA (avg/best)  DE (avg/best)
1         -12.5/-15          -14.8/-15
2         -10.2/-13          -13.4/-15
3         -8.47/-12          -11.2/-12
4         -16/-20            -19.4/-20
5         -28.13/-32         -27/-28
6         -25.3/-30          -26/-27
7         -49.43/-55         -47.2/-49
8         -42.37/-47         -48.2/-50
VI. FUTURE WORK AND CONCLUSION

There has been little exploration in the field of protein structure prediction in the 2D triangular lattice using evolutionary strategies. In this paper, a differential evolution algorithm is implemented for protein structure prediction, and a new type of encoding scheme is also proposed. Invalid conformations are repaired using a backtracking method to produce valid conformations. Our experimental results are very promising and show the method to be more efficient than other evolutionary algorithms. In future, better results may be found by upgrading this DE strategy; there is much scope for improving the initialization, mutation, crossover and selection operations, which could lead to better sub-optimal solutions for this problem.
REFERENCES
[1] K. Lau and K. A. Dill, "A lattice statistical mechanics model of the conformation and sequence spaces of proteins," Macromolecules, vol. 22, pp. 3986-3997, 1989.
[2] C. J. Epstein, R. F. Goldberger, and C. B. Anfinsen, "The genetic control of tertiary protein structure: Studies with model systems," in Cold Spring Harbor Symposium on Quantitative Biology, vol. 28, pp. 439-449, 1963.
[3] R. Storn and K. Price, "Differential Evolution - A Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces," Technical Report TR-95-012, ICSI, Berkeley, 1995.
[4] R. Storn, "On the usage of differential evolution for function optimization," Biennial Conference of the North American Fuzzy Information Processing Society (NAFIPS), IEEE, Berkeley, 1996, pp. 519-523.
[5] R. Agarwala, S. Batzoglou, V. Dancik, S. Decatur, M. Farach, S. Hannenhalli, S. Muthukrishnan, and S. Skiena, "Local rules for protein folding on a triangular lattice and generalized hydrophobicity in the HP model," Journal of Computational Biology, 4(2):275-296, 1997.
[6] B. Berger and T. Leighton, "Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete," Journal of Computational Biology, 5(1):27-40, 1998.
[7] C. Huang, X. Yang, and Z. He, "Protein folding simulations of 2D HP model by the genetic algorithm based on optimal secondary structures," Computational Biology and Chemistry, 34:137-142, 2010.
[8] J. Gillespie, M. Mayne, and M. Jiang, "RNA folding on the 3D triangular lattice," BMC Bioinformatics, 10:369, 2009.
[9] M. T. Hoque, M. Chetty, and L. S. Dooley, "A hybrid genetic algorithm for 2D FCC hydrophobic-hydrophilic lattice model to predict protein folding," Advances in Artificial Intelligence, Lecture Notes in Computer Science, 4304:867-876, 2006.
[10] S.-C. Su, C.-J. Lin, and C.-K. Ting, "An effective hybrid of hill climbing and genetic algorithm for 2D triangular protein structure prediction," International Workshop on Computational Proteomics, Hong Kong, China, 18-21 December 2010.
[11] H. S. Lopes and R. Bitello, "A differential evolution approach for protein folding using a lattice model," Journal of Computer Science and Technology, 22(6):904-908, Nov. 2007.
[12] N. D. Jana and J. Sil, "Protein structure prediction in 2D HP lattice model using differential evolutionary algorithms," in S. C. Satapathy et al. (Eds.), Proc. of InConINDIA 2012, AISC 132, pp. 281-290, 2012.
[13] W. E. Hart and A. Newman, "Protein structure prediction with lattice models," CRC Press, 2001.
Elimination of Materializations from Left/Right Deep Data Integration Plans
Janusz R. Getta
School of Computer Science and Software Engineering
University of Wollongong, Wollongong, NSW, Australia
Email: [email protected]
Abstract—Performance of distributed data processing is one of the central problems in the development of modern information systems. This work considers a model of distributed system where a user application running at a central site submits data processing requests to the remote sites. The results of processing at the remote sites are transmitted back to the central site and are simultaneously integrated into the final outcomes of the application.

Due to external factors like network congestion or high processing load at the remote sites, transmission of data to the central site can be delayed or completely stopped. Then it is necessary to dynamically change the original data integration plan to a new one, which allows for more efficient data integration in the changed environment.

This work uses a technique of elimination of materializations from data integration plans to create alternative data integration plans. We propose an algorithm which finds all possible data integration plans for a given sequence of data transmitted to a central site. We show how data integration plans can be dynamically changed in response to the dynamically changing frequencies of data transmission.
I. INTRODUCTION
Distributed data processing faces an ever increasing demand for more efficient processing of user applications accessing data at numerous different locations and integrating the partial results at a central site. A distributed system based on a global view of data processes the information resources available at the remote sites through the applications running at a central site. A typical user application submits a data processing request to a global view of data, which integrates the data resources available at the remote sites. The request is automatically decomposed into elementary requests, which later on are submitted for processing at the remote sites. The results of processing at the remote sites are transmitted back to a central site and integrated with data already available there. Data integration is performed according to a data integration plan, which is prepared when a request issued by a user application is decomposed into the individual requests, each one related to a different remote site. A data integration plan determines the order in which the individual requests are submitted for processing at the remote sites and the way the partial results are combined into the final results. Due to factors beyond the control of a central system, the transmissions of partial results can be delayed or even completely stopped. Then the current data integration plan must be dynamically adjusted to the changing conditions. This work investigates when and how the current data integration plan must be changed in response to the increasing or decreasing intensity of transmission of data from the remote sites.
The individual requests obtained from the decomposition of a global request are submitted for processing at the remote sites according to an entirely sequential, an entirely parallel, or a hybrid, i.e. mixed sequential and parallel, strategy. According to an entirely sequential strategy, a request q_i can be submitted for processing at a remote site only when all results of the requests q_1, ..., q_{i-1} are available at a central site. An entirely sequential strategy is appropriate when the results received so far can be used to reduce the complexity of the remaining requests q_i, ..., q_{i+k}. According to an entirely parallel strategy, all requests q_1, ..., q_i, ..., q_{i+k} are submitted simultaneously for parallel processing at the remote sites. An entirely parallel strategy is beneficial when the computational complexity and the amounts of data transmitted are more or less the same for all requests. According to a mixed sequential and parallel strategy, some requests are submitted sequentially while the others are submitted in parallel. Optimization of data integration plans is either static, when the plans are optimized before the stage of data integration, or dynamic, when the plans are changed during the processing of the requests. A static optimization of data integration plans is more appropriate for a parallel strategy than for a sequential strategy because the plans cannot be changed after submission to the remote sites. A dynamic optimization of data integration plans allows for the modification of the individual requests and a change of their order during the processing of an entire request. This work considers dynamic optimization of data integration plans for the entirely parallel processing strategy of the individual requests.
The problem of dynamic optimization of data integration plans in the entirely parallel processing model can be formulated in the following way. Given is a global information system that integrates a number of remote and independent sources of data. A user request q is decomposed into the elementary requests q_1, ..., q_n such that q = E(q_1, ..., q_n). The requests q_1, ..., q_n are simultaneously submitted for processing at the remote sites. Let r_1, ..., r_n be the individual
results obtained from the processing of the respective requests q_1, ..., q_n. Then the final result of the request q can be obtained from the evaluation of an expression E(r_1, ..., r_n). If evaluation of an expression E(r_1, ..., r_n) can be performed in many different ways, for example by changing the order of operations, then integration of the individual results r_1, ..., r_n can also be performed in many different ways. If some of the individual results are not available due to network congestion, then the way E(r_1, ..., r_n) is evaluated can be changed to avoid a deadlock. The problem investigated in this paper is how to dynamically adjust the evaluation of E(r_1, ..., r_n) in response to the changing parameters of data transmission.
One of the specific approaches to data integration is online data integration. In online data integration we consider an individual reply r_i as a sequence of data packets r_{i_1}, r_{i_2}, ..., r_{i_{k-1}}, r_{i_k}, and we perform re-computation of E(r_1, ..., r_n) each time a new packet of data is received at a central site. Such an approach to data integration is more efficient because there is no need to wait for the complete results to start the evaluation of E(r_1, ..., r_n). Instead, each time a new packet of data is received at a central site, it is immediately integrated into the intermediate result, no matter which remote site it comes from. To perform online data integration, an expression E(r_1, ..., r_n) must be transformed into a collection of sequences of elementary operations called data integration plans, p_{r_1}, ..., p_{r_n}. Each of the data integration plans p_{r_i} determines how E(r_1, ..., r_n) is recomputed for the sequences of packets r_{i_1}, r_{i_2}, ..., r_{i_{k-1}}, r_{i_k}, where i = 1, ..., n. If an expression E(r_1, ..., r_n) can be computed in many different ways, then it is possible to find many online data integration plans. Dynamic optimization of online data integration plans finds the best processing plan for the sequences of packets of data obtained in the latest period of time. If the frequencies of transmission of the individual results r_1, ..., r_n change in time, then dynamic optimization finds a data integration plan which is the best for the most recent frequencies of data transmission.
A starting point for the dynamic optimization is a data integration expression E(r_1, ..., r_n). Next, a data integration expression is transformed into a set of data integration plans where each plan represents an integration procedure for the increments of one argument of E(r_1, ..., r_n). Some of the plans assume that temporary results of the processing must be stored in so called materializations, while the other plans allow for processing of the same data integration expression without the materializations. A data integration system stores all plans and starts data integration according to a plan with the largest possible number of materializations. Then, whenever the frequency of data transmission of a given individual result grows beyond a given threshold, the dynamic optimizer finds a better data integration plan and changes data integration accordingly.
The paper is organized in the following way. Section II overviews the related works in the area of optimization of data integration in distributed systems based on a global data model. Next, Section III introduces data integration expressions and shows how they can be processed incrementally. Section IV describes data integration plans that include the largest possible number of materializations. In Section V we show when and how materializations can be eliminated from left/right deep data integration plans, and when and how to dynamically change the current data integration plan. Finally, Section VI concludes the paper.
II. PREVIOUS WORKS
The early works [1], [2] on optimization of query processing in distributed database systems, multidatabase systems, and federated database systems are a starting point of the research on efficient processing of data integration.
Reactive query processing starts from a pre-optimized plan, and whenever external factors like network problems, unexpected congestion at a local site, or unavailability of data make the current plan ineffective, further processing is continued according to an updated plan. The early works on reactive query processing techniques were based either on partitioning [3], [4] or on dynamic modification of query processing plans [5], [6], [7]. If further computations are no longer possible, then partitioning decomposes a query execution plan into a number of sub-plans and attempts to continue processing according to the sub-plans. Dynamic modification of query processing plans finds a new plan equivalent to the original one and such that it is possible to continue integration of the available data. The techniques of query scrambling [8], [9], dynamic scheduling of operators [10], and Eddies [11] dynamically change the order in which the join operations are executed depending on the join arguments available on hand.
As data integration requires efficient processing of sequences of data items, an important research direction has been the improvement of pipelined implementations of the join operation. These works include new versions of the pipelined join operation such as the pipelined join operator XJoin [12], ripple join [13], double pipelined join [14], and hash-merge join [15].
A technique of redundant computations simultaneously integrates data according to a number of plans [16]. A concept of state modules described in [17] allows for concurrent processing of the tuples through the dynamic division of data integration tasks.
An adaptive and online processing of data integration plans proposed in [18] and later in [19] considers the sets of elementary operations for data integration and the best integration plan for recently transmitted data. The recent work [20] considers an integration model where the packets of data coming from the external sites are simultaneously integrated into the final result. Another work [21] describes a system of data integration where the initial simultaneous data integration plans are automatically transformed into hybrid plans, where some tasks are processed sequentially while the others are processed simultaneously.
This work concentrates on simultaneous processing of a specific class of data integration plans whose syntax trees are only left/right deep and involve the operations of join and
antijoin from the relational algebra. We show when and how data integration plans must be dynamically changed due to the changing frequencies of data transmission.
Reviews of the most important data integration techniques proposed so far are included in [22], [23], [24], [25], [26].
III. DATA INTEGRATION EXPRESSIONS

This work applies a relational model of data to formally represent the data containers at the remote systems. Let x be a nonempty set of attribute names, later on called a schema, and let dom(a) denote the domain of an attribute a ∈ x. A tuple t defined over a schema x is a full mapping t : x → ∪_{a∈x} dom(a) such that ∀a ∈ x, t(a) ∈ dom(a). A relational table created on a schema x is a set of tuples over the schema x.

Let r, s be relational tables such that schema(r) = x and schema(s) = y, and let z ⊆ x, v ⊆ (x ∩ y), and v ≠ ∅. The symbols σ_φ, π_z, ⋈_v, ∼_v, ⋉_v, ∪, ∩, − denote the relational algebra operations of selection, projection, join, antijoin, semijoin, and the set algebra operations of union, intersection, and difference. All join operations are considered to be equijoin operations over a set of attributes v.
A modification of a relational table r is denoted by δ and is defined as a pair of disjoint relational tables <δ−, δ+> such that r ∩ δ− = δ− and r ∩ δ+ = ∅. A data integration operation that applies a modification δ to a relational table r is denoted by r ⊕ δ and is defined as the expression (r − δ−) ∪ δ+.
Let E(r_1, ..., r_i, ..., r_n) be a data integration expression. In order to perform data integration simultaneously with data transmission, each time a data packet δ_i arrives at a central site and is integrated with an argument r_i, the expression E(r_1, ..., r_i ⊕ δ_i, ..., r_n) must be recomputed. Obviously, processing of the entire expression from the very beginning is too time consuming. It is faster to do it in an incremental way, through processing of an increment δ_i together with the previous result of the expression E(r_1, ..., r_i, ..., r_n).
Let P(r, s) be an operation of relational algebra. Then incremental processing of P(r ⊕ δ, s) can be computed as P(r, s) ⊕ α_P(δ, s), where α_P(δ, s) is an incremental/decremental operation (id-operation) of the argument r. The incremental processing of P(r, s ⊕ δ) can be computed as P(r, s) ⊕ β_P(r, δ), where β_P(r, δ) is an incremental/decremental operation (id-operation) of the argument s. The id-operations α_P(δ, s) and β_P(r, δ) for the union, join and antijoin operations of the relational algebra are as follows [20]:
α_∪(δ, s) = <δ− − s, δ+ − s>   (1)
β_∪(r, δ) = <δ− − r, δ+ − r>   (2)
α_⋈(δ, s) = <δ− ⋈_v s, δ+ ⋈_v s>   (3)
β_⋈(r, δ) = <δ− ⋈_v r, δ+ ⋈_v r>   (4)
α_∼(δ, s) = <δ− ∼_v s, δ+ ∼_v s>   (5)
β_∼(r, δ) = <r ⋉_v δ+, r ⋉_v δ−>   (6)
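For concreteness, a small sketch of the modification operation r ⊕ δ and the id-operation of Equation (3), with relational tables represented as Python sets of attribute-value tuples; this representation and the function names are ours, not from the paper:

# Tables are sets of frozenset-encoded tuples; a modification is a pair
# (delta_minus, delta_plus) of such sets.
def apply(r, delta):
    """r ⊕ δ = (r − δ−) ∪ δ+."""
    minus, plus = delta
    return (r - minus) | plus

def join(r, s, v):
    """Equijoin of r and s over the attributes in v."""
    return {row | other for row in r for other in s
            if all(dict(row)[a] == dict(other)[a] for a in v)}

def alpha_join(delta, s, v):
    """Equation (3): α⋈(δ, s) = <δ− ⋈v s, δ+ ⋈v s>."""
    minus, plus = delta
    return (join(minus, s, v), join(plus, s, v))

r = {frozenset({"a": 1, "b": 2}.items())}
s = {frozenset({"a": 1, "c": 3}.items())}
delta = (set(), {frozenset({"a": 1, "b": 9}.items())})
print(apply(r, delta))
print(alpha_join(delta, s, ["a"]))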
Fig. 1. A syntax tree of the data integration expression ((v ∪ (r ⋈_x s)) ∼_y t) ⋈_z δ_w.
In this work we consider data integration expressions where the operation of projection (π) is applied only to the final result of the computations and the operation of selection (σ) is performed together with the binary operations of join (⋈) and antijoin (∼). The operation of union is distributive over the operations of join and antijoin: it is true that (r ∪ s) ⋈_x t = (r ⋈_x t) ∪ (s ⋈_x t), that (r ∪ s) ∼_x t = (r ∼_x t) ∪ (s ∼_x t), and that t ∼_x (r ∪ s) = (t ∼_x r) ∼_x s. It means that union operations can always be processed at the end of the computation of a data integration expression. Hence, without losing generality, we consider only data integration expressions built over the operations of join and antijoin. A sample data integration expression ((v ∪ (r ⋈_x s)) ∼_y t) ⋈_z δ_w has the syntax tree given in Figure (1).
As a simple example consider a data integration expression E(r, s, t) = t ∼_v (r ⋈_z s). Assume that we would like to find how an increment δ_s = <∅, δ_s^+> of an argument s can be processed in an incremental way, i.e. we would like to recompute the expression E(r, s ⊕ δ_s, t) using the previous result E(r, s, t) and the increment δ_s. Application of equations (4) and (6) provides the solution E(r, s, t) − (t ⋉_v (δ_s^+ ⋈_z r)).
Next, we consider the processing of an increment δ_t = <∅, δ_t^+> of a remote data source t. In this case we need either a materialization of the intermediate result of the subexpression (r ⋈_z s) or a transformation of the data integration expression into an equivalent one with either a left- or right-deep syntax tree and with the argument t in the leftmost or rightmost position of the tree.

If a materialization m_rs = r ⋈_z s is maintained during the processing of the data integration expression, then from equation (5) we get δ_rst = <∅, δ_t^+ ∼_v m_rs>, and the incremental processing is performed according to E(r, s, t) ∪ (δ_t^+ ∼_v m_rs).

Maintenance of a materialization m_rs decreases the performance of data integration because each time the increments δ_r^+ and δ_s^+ are processed, the materialization m_rs must be updated with the results δ_r^+ ⋈_z s and r ⋈_z δ_s^+. If the increments δ_r^+ and δ_s^+ arrive frequently at the data integration site, then the materialization m_rs must be frequently integrated with the partial results.
If the schema of δ_t has common attributes x with r, then it is possible to transform the expression δ_t^+ ∼_v (r ⋈ s) into δ_t^+ ∼_v ((r ⋉_x δ_t^+) ⋈ s). Then the computation of (r ⋉_x δ_t^+) ⋈ s can be performed faster than (r ⋈ s) because
the increment δ_t is small and can be kept all the time in fast transient memory. We shall call such a transformation an elimination of materialization from a data integration expression.
IV. DATA INTEGRATION PLANS

A data integration expression is transformed into a set of data integration plans where each plan represents an integration procedure for the increments of one argument of the original expression. In our approach a data integration plan is a sequence of so called id-operations on the increments or decrements of data containers and other fixed size containers. In order to reduce the size of arguments, static optimization of data integration plans moves the unary operations towards the beginning of a plan. Additionally, the frequently updated materializations are eliminated from the plan, and constant arguments and subexpressions are replaced with pre-computed values. A data integration plan is a sequence of assignment statements s_1, ..., s_m where the right hand side of each statement is either an application of a modification to a data container (m_j := m_j ⊕ δ_i) or an application of a left or right id-operation (δ_j := α_j(δ_i, m_k)). A transformation of a data integration expression into data integration plans is described in [20]. As a simple example consider a data integration expression E(r, s, t) = t ∼_v (r ⋈_z s). The data integration plans p_r and p_s for the increments of arguments r and s are the following:

p_r: δ_rs := δ_r ⋈_z s;  m_rs := m_rs ⊕ δ_rs;  δ_rst := t ⋉ δ_rs;  result := result − δ_rst;

p_s: δ_rs := r ⋈_z δ_s;  m_rs := m_rs ⊕ δ_rs;  δ_rst := t ⋉ δ_rs;  result := result − δ_rst;

The data integration plan for the argument t is the following:

p_t: δ_rst := δ_t ∼_v m_rs;  result := result ∪ δ_rst;
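A sketch of executing the plan p_s for an increments-only modification of s (δ− = ∅); the set-based table representation and the function names are ours, not from the paper:

# Tables are sets of frozenset-encoded tuples; the state holds r, t, the
# materialization m_rs and the current result.
def join(r, s, v):
    """Equijoin over the attributes in v."""
    return {row | other for row in r for other in s
            if all(dict(row)[a] == dict(other)[a] for a in v)}

def semijoin(r, s, v):
    """r ⋉v s: tuples of r that v-match at least one tuple of s."""
    return {row for row in r
            if any(all(dict(row)[a] == dict(o)[a] for a in v) for o in s)}

def plan_ps(state, delta_s_plus, z, v):
    delta_rs = join(state["r"], delta_s_plus, z)   # δ_rs := r ⋈z δ_s
    state["m_rs"] |= delta_rs                      # m_rs := m_rs ⊕ δ_rs
    delta_rst = semijoin(state["t"], delta_rs, v)  # δ_rst := t ⋉v δ_rs
    state["result"] -= delta_rst                   # result := result − δ_rst
    return state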
A data integration plan can also be represented as an extended syntax tree where the materializations are represented as square boxes attached to the edges of the tree, see for example Figure (2) or Figure (4). We say that a data integration plan is a left/right deep data integration plan if it has a left/right deep syntax tree, i.e. a tree such that every non-leaf node has at most one non-leaf descendant node, see Figures (2) or (3). In this work we consider only left/right deep data integration plans.
V. ELIMINATION OF MATERIALIZATIONS

Elimination of materializations from data integration plans is motivated by performance reasons. When the stream of data passing through an operation of integration with a materialized view, for example in the statement m_rs := m_rs ⊕ δ_rs in the example above, is too large, then the integration takes too
Fig. 2. A case when a materialization m_rs cannot be removed from a data integration plan for δ_de.
much time. Integration of the increments of data with a materialization is needed in left/right deep data integration plans when the incremented argument is not one of the two arguments at the lowest level of the syntax tree. A simple solution to this problem would be to transform a left/right deep data integration plan such that the incremented argument is located at the bottom level of the syntax tree. Such a transformation is always possible when a data integration plan is built only over join operations. When a data integration expression is built over join and antijoin operations, then in some cases the materializations cannot be removed. For example, it is impossible to eliminate the materialization m_rs from a data integration plan whose syntax tree is given in Figure (2), because the increments δ(de) have no common attributes with the arguments s(ab) and r(ac). Elimination of materializations from data integration plans is controlled by the following algorithm.

Algorithm (1)
(1) Consider a fragment of a data integration plan where an increment δ(z) and a materialization m(y) are involved in an operation α(δ(z), m(y)), see Figure (3). The operation is performed over a set of attributes x_α. The objective is to eliminate the materialization m(y) from the computation of the operation α ∈ {⋈, ∼, ⋉}. The materialization m(y) is the result of an operation β(r(v), s(w)), where β ∈ {⋈, ∼, ⋉}. At most one of the arguments of the operation β(r(v), s(w)) is a materialization.

(2) If r(v) is not a materialization and x_α ∩ v is not empty, then r(v) can be reduced to r(v) ⋉ δ(x_α), where δ(x_α) is the projection of δ(z) on x_α.

(3) If s(w) is not a materialization and x_α ∩ w is not empty, then s(w) can be reduced to s(w) ⋉ δ(x_α).

(4) If both r(v) and s(w) are reduced, then no more materializations can be eliminated because a leaf level of the left/right deep syntax tree of the data integration expression has been reached.

(5) If either r(v) or s(w) is a materialization, then consider the subtree with the operation ⋉ in the root node as the operation α. Next, consider δ(z) as one of the arguments of the operation α, and either r(v) or s(w) as the second argument of the operation α. Finally, consider the operation β whose result is either r(v) or s(w), and re-apply the algorithm from step (1).
Correctness of the Algorithm (1) comes from the following observations.
Fig. 3. Syntax tree of a fragment of a data integration plan.
Fig. 4. Syntax tree of a data integration plan.
A result of the operation α_{x_α}(δ(z), m(y)) does not change if m(y) is reduced by δ(z), i.e. it is the same as the result of α_{x_α}(δ(z), m(y) ⋉ δ(z)) for any α ∈ {⋈, ∼, ⋉}. As m(y) is a result of β_{x_β}(r(v), s(w)) and the operation ⋉ is distributive over β ∈ {⋈, ∼, ⋉}, δ(z) can be used to reduce either one or both of r(v) and s(w). Hence the operation β can be effectively recomputed on the reduced arguments and there is no need to store a materialization m(y).
As an example, consider the syntax tree of a data integration plan, together with the materializations required for its computation, given in Figure (4). Application of steps (1), (2) and (3) of Algorithm (1) provides the reductions of r(ac) to r(ac) ⋉_a δ(a) and of m_1(bc) to m_1(bc) ⋉_b δ(b). This allows for the elimination of the materialization m_2(abc). Application of step (5) and then the repetition of steps (1), (2), and (3) provide the reductions s(bc) ⋉_b δ(b) and t(bd) ⋉_b δ(b). This allows for the elimination of the materialization m_1(bc).
Algorithm (1) can be used for the generation of all alternative data integration plans for processing all arguments of a data integration expression. An important problem is when a materialization should be removed from a data integration plan or, put another way, when a plan that uses a materialization should be replaced with another plan that does not use the materialization when processing the increments of the same argument.
A decision whether a materialization must be deleted depends on the time spent on its maintenance, i.e. the time spent on recomputation of the materialization after one of its arguments has changed. A more efficient way to "refresh" a materialization is to integrate the previous state of the materialization with the increments of data "passing" through the "materialization node" in the syntax tree of the data integration plan. Then, elimination of a materialization simply depends on the amounts of increments of data to be integrated with the materialization. If these amounts of data exceed a given threshold in a given fixed period of time, then an alternative plan that does not use the materialization must be considered. If, due to the large processing costs, a materialization must be removed from a left/right deep data integration plan, then all materializations "located" above the considered materialization in the left/right deep syntax tree must be eliminated. This is because in left/right deep syntax trees there is only one path of data processing from the leaf nodes to the root node, and the increments of the arguments located at the higher levels of the tree add to the increments coming from the arguments at the lower levels of the tree. It means that at any moment of the data integration process there is a topmost materialization in the syntax tree still beneficial for the processing, and whenever it is possible, all materializations above it are not used for data integration.
Of course, it may happen that the amounts of data passing through a "materialization node" drop below the threshold, and a plan that involves such a materialization and all materializations above it must be restored. In order to quickly restore the present state of a materialization without recomputing it from scratch, we record the increments of data passing through the "materialization nodes". The saved increments are integrated with the latest state of the materialization to get its most up-to-date state. Elimination and restoration of materializations is controlled by the following algorithm.
Algorithm (2)

(1) Consider a left/right deep syntax tree of a data integration plan where the materializations m_1, m_2, ..., m_k are located along the edges of the tree, starting from the lowest edge in the tree. The amounts of data that have to be integrated with the materializations in a given period of time τ are recorded at each materialization node. Initially, all materializations are empty and all materializations are maintained in the data integration plans.

(2) At the end of every period of time τ, check whether the amounts of data to be integrated with m_1, m_2, ..., m_k exceed a threshold value d_max. If the amounts of data that have to be integrated with a materialization m_i exceed d_max in the latest period of time τ, then, whenever it is possible, the plans that use the materializations m_i, m_{i+1}, ..., m_k are replaced with plans that do not use these materializations. Additionally, the increments of data δ_i^(1), δ_i^(2), ..., δ_{i+1}^(1), δ_{i+1}^(2), ..., δ_k^(1), δ_k^(2), ... passing through the "materialization nodes" m_i, m_{i+1}, ..., m_k are recorded by the system.

(3) If the amounts of data passing through a "materialization node" m_j, j < i, increase above d_max, then the materializations m_j, m_{j+1}, ..., m_{i-1} must be removed from the data integration plans in the same way as in step (2).

(4) If the amounts of data passing through a "materialization node" m_j, j > i, increase above d_max, then the plans are not changed.
(5) If the amounts of data passing through a "materialization node" m_j, j > i, decrease below d_max, then the current states of the materializations m_j, m_{j-1}, ..., m_i must be restored from the recorded sequences of increments δ_j^(1), δ_j^(2), ..., δ_{j-1}^(1), δ_{j-1}^(2), ..., δ_i^(1), δ_i^(2), ... and the old states of the materializations.

(6) If the amounts of data passing through a "materialization node" m_j, j < i, decrease below d_max and the amounts of data passing through the materialization node m_j and the nodes above it do not change, then the plans are not changed.
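The threshold rule of steps (2) and (3) can be sketched as follows; the traffic volumes and d_max are hypothetical inputs, and the sketch only decides which materializations stay active (once m_i is dropped, everything above it is dropped too, as required for left/right deep plans):

def active_materializations(traffic, d_max):
    """traffic[i] = increment volume at node m_{i+1} in the latest period
    tau, lowest node first; returns one active/inactive flag per node."""
    active = []
    dropped = False
    for amount in traffic:
        # Once one node's traffic exceeds the threshold, that node and all
        # nodes above it in the left/right deep tree are dropped.
        dropped = dropped or amount > d_max
        active.append(not dropped)
    return active

print(active_materializations([10, 80, 5], d_max=50))  # [True, False, False]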
The time complexity of the algorithm is O(n), where n is the total number of operation nodes in the left/right deep syntax tree of the data integration expression. The algorithm sequentially updates the total amounts of data passing through the materialization nodes at the end of each period of time τ. Whenever a change of the execution plan is required, the new plans are taken from a table, which is also searched sequentially.
VI. SUMMARY, OPEN PROBLEMS
Elimination of materializations from data integration plans is required when the maintenance of materializations becomes too time consuming due to the increased intensity of data increments passing through the "materialization nodes" in the data integration tree. Then it is worth replacing the current plans with new ones that do not use the materializations. This work shows how to construct left/right deep data integration plans that do not use given materializations, and when the construction of such plans is possible. In particular, we describe an algorithm that generates all data integration plans for a given data integration expression and a given set of arguments. We also show when a materialization cannot be removed from a data integration plan. Next, we propose a procedure that dynamically changes data integration plans in response to the increasing costs of maintenance of the selected materializations.
The data integration plans considered in this work are limited to left/right deep plans, i.e. plans whose syntax tree is a left/right deep syntax tree. In the general case, some distributed database applications do not have left/right deep plans, or their "bushy" plans cannot be transformed into equivalent left/right deep plans. More research is needed to consider the elimination of materializations from "bushy" data integration plans.
Another area that still needs more research is a more precise evaluation of the costs and benefits coming from the elimination of materializations. The algorithm proposed in this work considers only the benefits coming from the elimination of data integration at the materialization "maintenance nodes". The costs include the additional operations that must be performed on the increments and other arguments of the data integration plans. An interesting problem is what happens when a materialization must be restored due to the changing intensity of arriving increments of data; the costs involved are not included in the balance of costs and benefits in the current model. It is also interesting how the materializations can be restored to the most up-to-date state in a more efficient way than by re-applying the stored modifications.
A variable neighbourhood search heuristic for the design of codes
R. Montemanni, M. Salani
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
Scuola Universitaria Professionale della Svizzera Italiana
Galleria 2, 6928 Manno, Canton Ticino, Switzerland
Email: roberto, [email protected]

D. H. Smith, F. Hunt
Division of Mathematics and Statistics
University of Glamorgan
Pontypridd CF37 1DL, Wales, United Kingdom
Email: dhsmith, [email protected]
Abstract—Codes play a central role in information theory. A code is a set of words of a given length from a given alphabet of symbols. The words of a code have to fulfil some application-dependent constraints in order to guarantee some form of separation between the words. Typical applications of codes are for error-correction following transmission or storage of data, or for modulation of signals in communications. The target is to have codes with as many codewords as possible, in order to maximise efficiency and provide freedom to engineers designing the applications. In this paper a variable neighbourhood search framework, used to construct codes in a heuristic fashion, is described. Results on different types of codes of practical interest are presented, showing the potential of the new tool.
Index Terms—Code design, heuristic algorithms, variable neighbourhood search.
I. INTRODUCTION
Code design is a central problem in the field of information theory. A code is a set of words of a given length, composed from a given alphabet, and with some application-dependent characteristics that typically guarantee some form of separation between the words. Codes are usually adopted for error correction of data, or for modulation of signals in communications. Codes have also found use in some biological applications recently [1]. The target is normally to have codes with as many words as possible. In the engineering applications this maximises efficiency and provides engineers with the maximum possible freedom when designing communication systems or other specific applications, as described in the second part of this paper. This choice of target makes it natural to formalise the problem as a combinatorial optimisation problem. Depending on the underlying real applications, several types of code can be of practical interest.
Many approaches to solve these problems have been proposed in recent decades. So far, most research effort to construct good codes has been based on abstract algebra and group theory [2], [3], while only a marginal exploration of heuristic algorithms has been carried out. In fact codes for error correction do need an algebraic construction to ensure efficient decoding. This is not the case in some other applications, for which heuristic techniques can be used.
In this paper a set of heuristic algorithms will be described, and results obtained with them on some code design problems will be summarised. These problems, which have been previously studied in the literature, are used in different practical applications. The paper demonstrates that heuristics are a valuable additional tool that can be successfully used in designing good codes.
II. CODE DESIGN PROBLEMS
A code is a set of words of a given length defined over a given alphabet that fulfils some defined properties. The most typical constraint is on the Hamming distance between each pair of words. The Hamming distance d(x,y) between two words x and y is defined as the number of positions in which they differ. The minimum distance of a code is the minimum Hamming distance between any pair of words of the code. Some side-constraints, which depend on the specific application for which codes are defined, are also present. The objective of the problem is to find a code that fulfils all the constraints and contains the maximum possible number of words.
Code design problems can easily be described in terms of combinatorial optimisation, making it possible to apply heuristic optimisation algorithms to them.
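As a small illustration of these definitions, the following self-contained C sketch computes d(x,y) and the minimum distance of a code; the word length of 8 and the example words are arbitrary assumptions.

#include <stdio.h>

#define N 8   /* word length, an arbitrary assumption */

/* Hamming distance d(x,y): number of positions in which x and y differ */
int hamming(const int *x, const int *y)
{
    int d = 0;
    for (int i = 0; i < N; i++)
        if (x[i] != y[i]) d++;
    return d;
}

/* minimum distance: minimum d(x,y) over all pairs of distinct words */
int min_distance(int code[][N], int m)
{
    int best = N + 1;
    for (int i = 0; i < m; i++)
        for (int j = i + 1; j < m; j++) {
            int d = hamming(code[i], code[j]);
            if (d < best) best = d;
        }
    return best;
}

int main(void)
{
    int code[2][N] = { {0,1,0,1,0,1,0,1}, {0,1,1,0,0,1,1,0} };
    printf("minimum distance = %d\n", min_distance(code, 2));  /* prints 4 */
    return 0;
}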
III. A VARIABLE NEIGHBOURHOOD SEARCH FRAMEWORK
A Variable Neighbourhood Search (VNS) algorithm [4] that combines a set of local search routines is presented. First, the local searches embedded in the algorithm are briefly described. The interested reader is referred to [5] for more details.
A. Seed Building
A simple heuristic method to build codes examines all possible words in a given order, and incrementally accepts words that are feasible with respect to already accepted ones. The Seed Building method is built on these orderings, which are combined with the concept of seed words [6]. These seed words are an initial set of feasible words to which words are added in a given (problem dependent) order if they satisfy the necessary criteria.
The set of seeds is initially empty, and one feasible random seed is added at a time. If the new seed set leads to good results when a code is built from it, the seed is kept and a new random seed is designated for testing. This increases the size of the seed set. The same rationale, which is based on some simple statistics, is used to decide whether to keep subsequent seeds or not. If after a given number of iterations the quality of the solutions provided by a set of seeds is judged to be not good enough, the most recent seed is eliminated from the set, which results therefore in a reduction in the size of the seed set. In this way the set of seeds is expanded or contracted depending on the quality of the solutions built using the set itself. What happens in practice is that the size of the seed set oscillates through a range of small values. The algorithm is stopped after a given time has elapsed.
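A minimal C sketch of this expand/contract scheme for binary codes follows (words as bitmasks, n ≤ 16); the acceptance rule used here, keeping a seed whenever the greedy code built from it is at least as large as the best found so far, is a simplification of the statistics used by the authors.

#include <stdio.h>
#include <stdlib.h>

static int popcount16(unsigned x) { int c = 0; while (x) { c += x & 1; x >>= 1; } return c; }
static int ok(unsigned x, unsigned y, int d) { return popcount16(x ^ y) >= d; }

/* greedily extend the seed words by scanning all 2^n words in a fixed order */
static int build_greedy(unsigned *code, int nseeds, int n, int d)
{
    int size = nseeds;
    for (unsigned w = 0; w < (1u << n); w++) {
        int feas = 1;
        for (int i = 0; i < size && feas; i++) feas = ok(w, code[i], d);
        if (feas) code[size++] = w;
    }
    return size;
}

int main(void)
{
    const int n = 10, d = 4, iters = 200;
    unsigned seeds[256], code[1 << 10];
    int nseeds = 0, best = 0;
    srand(1);
    for (int t = 0; t < iters; t++) {
        unsigned s = (unsigned)rand() % (1u << n);    /* random candidate seed */
        int feas = 1;
        for (int i = 0; i < nseeds && feas; i++) feas = ok(s, seeds[i], d);
        if (!feas) continue;
        seeds[nseeds++] = s;                          /* tentatively expand    */
        for (int i = 0; i < nseeds; i++) code[i] = seeds[i];
        int size = build_greedy(code, nseeds, n, d);
        if (size >= best) best = size;                /* good seed set: keep it */
        else nseeds--;                                /* poor results: contract */
    }
    printf("best code size found: %d\n", best);
    return 0;
}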
B. Clique Search
The idea at the basis of this local search method is that it is possible to complete a partial code in the best possible way by solving a maximum clique problem (see [7], [8]). More precisely, given a code, a random subset of the words is removed, leaving a partial code. It is possible to identify all the feasible words compatible with those already in the code, and build a graph from these words, where words are represented by vertices. Two vertices are connected if and only if the pair of words respects all of the constraints considered. It is then possible to run a maximum clique algorithm on the graph in order to complete the partial code in the best possible way. Heuristic or exact methods can be used to solve the maximum clique problem. In the implementation described here an exact or truncated version of the algorithm presented in [9] is used. The search is run repeatedly, with different random subsets. The algorithm is stopped after a given time has elapsed.
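The following self-contained C sketch illustrates the structure of one Clique Search step, assuming binary words as bitmasks and a hypothetical partial code; a greedy loop stands in for the exact or truncated clique solver of [9].

#include <stdio.h>

static int pc(unsigned x) { int c = 0; while (x) { c += x & 1; x >>= 1; } return c; }
static int compat(unsigned x, unsigned y, int d) { return pc(x ^ y) >= d; }

int main(void)
{
    const int n = 8, d = 4;
    unsigned partial[] = { 0x00, 0x0F, 0x33 };   /* partial code after random removal */
    const int np = 3;

    unsigned cand[256]; int nc = 0;              /* vertices: words compatible with the partial code */
    for (unsigned w = 0; w < (1u << n); w++) {
        int feas = 1;
        for (int i = 0; i < np && feas; i++) feas = compat(w, partial[i], d);
        if (feas) cand[nc++] = w;
    }

    unsigned clique[256]; int k = 0;             /* greedy stand-in for a maximum clique solver */
    for (int i = 0; i < nc; i++) {
        int feas = 1;
        for (int j = 0; j < k && feas; j++) feas = compat(cand[i], clique[j], d);
        if (feas) clique[k++] = cand[i];
    }
    printf("%d partial words completed with %d more\n", np, k);
    return 0;
}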
C. Hybrid Search
This Hybrid Search method merges the main concepts at the basis of the two local search algorithms described in Sections III-A and III-B. There is a (small) set SeedSet of words, which play the role of the seeds of algorithm Seed Building. A set of words which are compatible with the elements of SeedSet (as in Clique Search), and which are also compatible with each other in a weaker form, is identified and saved in a set V. More precisely, there is a parameter µ which is used to model the concept of weak compatibility: the words in V have to be compatible with each other according to a relaxed distance d′ = d − µ.
Weak compatibility is used to keep the set V at a reasonable size, even in the case of a very small set of seeds. A compatibility graph, for the creation of which the original distance d is used, is built on the vertex set V, as described in Section III-B. A maximum clique problem is solved on this graph. The mechanism described in Section III-A for the management of seed words is adopted unchanged here. In this way the set SeedSet is expanded and contracted during the computation. The algorithm is stopped after a given time has elapsed.
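A minimal C sketch of the weak-compatibility filter is given below; the bitmask representation and the parameter values (n = 8, d = 4, µ = 2) are illustrative assumptions.

#include <stdio.h>

static int pc(unsigned x) { int c = 0; while (x) { c += x & 1; x >>= 1; } return c; }

int main(void)
{
    const int n = 8, d = 4, mu = 2;
    unsigned seeds[] = { 0x00 }; const int ns = 1;
    unsigned V[256]; int nv = 0;

    for (unsigned w = 1; w < (1u << n); w++) {
        int feas = 1;
        for (int i = 0; i < ns && feas; i++)      /* full distance d against seeds   */
            feas = pc(w ^ seeds[i]) >= d;
        for (int i = 0; i < nv && feas; i++)      /* relaxed distance d' = d - mu in V */
            feas = pc(w ^ V[i]) >= d - mu;
        if (feas) V[nv++] = w;
    }
    printf("|V| = %d (clique search with the original d then runs on V)\n", nv);
    return 0;
}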
D. Iterated Greedy Search
This method is different from the local search approaches previously described in the way solutions are handled. The algorithms described in the previous sections maintain a set of feasible words (i.e. respecting all the constraints) and try to enlarge this set. The Iterated Greedy Search method, which is inspired by the method discussed in [10], works on an infeasible set of words (i.e. not all of the words are compatible with each other, according to the constraints). The method evolves by modifying words of a current solution S with the target of reducing a measure Inf(S) of the constraint violations. If no violation remains, then a feasible solution has been retrieved.
In more detail, the local search method works as follows. An (infeasible) solution S is created by replacing a given percentage of the words of a given feasible solution by randomly generated words. A random word is added to solution S. The following operations are then repeated until a feasible solution has been retrieved (i.e. Inf(S) = 0), or a given number of iterations has been spent without improvement. A word cw of solution S is selected at random, and the change of one of its coordinates that guarantees the maximum decrease in the infeasibility measure is selected (ties among possible modifications of cw are broken randomly). The word cw is then modified accordingly.
When a feasible solution is retrieved, it is saved and the procedure is repeated starting from the new solution; otherwise the last feasible solution is restored and the procedure is applied to this solution. The algorithm is stopped after a given time has elapsed.
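The following C sketch illustrates one repair move of this method under simplifying assumptions: Inf(S) counts distance-violating pairs, ties are not broken randomly, and all parameters are arbitrary.

#include <stdio.h>
#include <stdlib.h>

#define M 5   /* words       */
#define N 6   /* word length */
#define Q 4   /* alphabet    */
#define D 4   /* distance    */

static int dist(const int *x, const int *y)
{ int d = 0; for (int i = 0; i < N; i++) if (x[i] != y[i]) d++; return d; }

static int infeas(int S[M][N])                 /* Inf(S): violated pairs */
{
    int v = 0;
    for (int i = 0; i < M; i++)
        for (int j = i + 1; j < M; j++)
            if (dist(S[i], S[j]) < D) v++;
    return v;
}

int main(void)
{
    int S[M][N];
    srand(7);
    for (int i = 0; i < M; i++)                /* random (likely infeasible) start */
        for (int k = 0; k < N; k++) S[i][k] = rand() % Q;

    for (int it = 0; it < 1000 && infeas(S) > 0; it++) {
        int w = rand() % M;                    /* random word cw */
        int best = infeas(S), bestpos = -1, bestsym = 0;
        for (int k = 0; k < N; k++)            /* best single-coordinate change */
            for (int q = 0; q < Q; q++) {
                int old = S[w][k];
                S[w][k] = q;
                int v = infeas(S);
                S[w][k] = old;
                if (v < best) { best = v; bestpos = k; bestsym = q; }
            }
        if (bestpos >= 0) S[w][bestpos] = bestsym;
    }
    printf("Inf(S) = %d (0 means feasible)\n", infeas(S));
    return 0;
}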
E. A Variable Neighbourhood Search approach
Variable Neighbourhood Search (VNS) methods have been demonstrated to perform well and to be robust (see [4]). Such algorithms work by applying different local search algorithms one after the other, aiming at differentiating the characteristics of the search-spaces visited (i.e. changing the neighbourhood). The rationale behind the idea is that combining different local search methods, which use different optimisation logics, can lead to an algorithm capable of escaping from the local optima identified by each local search algorithm, with the help of the other local search methods.
In our context, some of the local search methods previously described are applied in turn, starting each time from the best solution retrieved since the beginning (or from an empty solution, in the case of Seed Building). The algorithm is stopped after a given time has elapsed.
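The overall control loop can be sketched in C as follows; the two local searches are stubs standing in for the routines of Sections III-A to III-D, so only the restart-from-best logic is illustrated.

#include <stdio.h>
#include <string.h>

#define MAXW 1024

typedef int (*LocalSearch)(unsigned *code, int size, int time_budget);

/* stub searches: a real implementation would plug in Seed Building,
   Clique Search, Hybrid Search or Iterated Greedy Search here */
static int search_a(unsigned *c, int s, int t) { (void)c; (void)t; return s; }
static int search_b(unsigned *c, int s, int t) { (void)c; (void)t; return s; }

int main(void)
{
    unsigned best[MAXW] = {0}, work[MAXW];
    int best_size = 0;
    LocalSearch nbhd[] = { search_a, search_b };
    const int n_nbhd = 2;

    for (int round = 0; round < 10; round++)
        for (int k = 0; k < n_nbhd; k++) {       /* change neighbourhood        */
            memcpy(work, best, sizeof best);     /* restart from the best code  */
            int s = nbhd[k](work, best_size, 100);
            if (s > best_size) {                 /* keep only improvements      */
                best_size = s;
                memcpy(best, work, sizeof work);
            }
        }
    printf("best code size: %d\n", best_size);
    return 0;
}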
IV. CONSTANT WEIGHT BINARY CODES
A constant weight binary code is a set of binary vectors of length n, weight w and minimum Hamming distance d. The weight of a binary vector (or word) is the number of 1's in the vector. The minimum distance of a code is the minimum Hamming distance between any pair of words. The maximum possible number of words in a constant weight code is referred to as A(n,d,w).
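As an illustration, a set of words can be verified against these constraints with the following C sketch; the example code of four words of length 8, weight 4 and distance 4 is an assumption chosen for brevity.

#include <stdio.h>

static int pc(unsigned x) { int c = 0; while (x) { c += x & 1; x >>= 1; } return c; }

/* check that m bitmask words form an (n,d,w) constant weight binary code */
int is_cw_code(const unsigned *code, int m, int d, int w)
{
    for (int i = 0; i < m; i++) {
        if (pc(code[i]) != w) return 0;              /* constant weight  */
        for (int j = i + 1; j < m; j++)
            if (pc(code[i] ^ code[j]) < d) return 0; /* minimum distance */
    }
    return 1;
}

int main(void)
{
    unsigned c[] = { 0x0F, 0x33, 0x55, 0xF0 };   /* length 8, weight 4 */
    printf("valid (8,4,4) code: %d\n", is_cw_code(c, 4, 4, 4));  /* prints 1 */
    return 0;
}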
Apart from their important role in the theory of error-correcting codes, constant weight codes have also found application in fields as diverse as the design of demultiplexers for nano-scale memories [11] and the construction of frequency hopping lists for use in GSM networks [12].
The Eighth International Conference on Computing and Information Technology IC2IT 2012
128
TABLE I
IMPROVED CONSTANT WEIGHT BINARY CODES.

Problem Old LB New LB    Problem Old LB New LB    Problem Old LB New LB    Problem Old LB New LB    Problem Old LB New LB
A(41,6,5) 755 779a A(48,6,6) 7845 7869a A(31,8,7) 363 375f A(51,10,6) 60 64d A(40,10,8) 318 324d
A(42,6,5) 817 841b A(49,6,6) 8568 8605b A(32,8,7) 403 418f A(52,10,6) 60 67a A(41,10,8) 353 362d
A(43,6,5) 874 910b A(50,6,6) 9348 9380b A(33,8,7) 444 466a A(53,10,6) 63 70d A(42,10,8) 390 398a
A(44,6,5) 941 975b A(51,6,6) 10175 10210b A(34,8,7) 498 516a A(54,10,6) 65 73a A(43,10,8) 432 445f
A(45,6,5) 1009 1030b A(33,8,5) 44 45d A(35,8,7) 555 570a A(55,10,6) 68 73h A(44,10,8) 484 487a
A(46,6,5) 1097 1114a A(38,8,6) 231 236b A(36,8,7) 622 637b A(56,10,6) 70 79d A(45,10,8) 532 544c
A(47,6,5) 1172 1181b A(39,8,6) 252 254b A(37,8,7) 696 718a A(57,10,6) 70 83d A(46,10,8) 590 595a
A(48,6,5) 1254 1269b A(40,8,6) 275 281b A(38,8,7) 785 795b A(58,10,6) 72 85b A(47,10,8) 642 656e
A(49,6,5) 1343 1347c A(41,8,6) 294 297f A(39,8,7) 869 893a A(59,10,6) 77 87c A(48,10,8) 711 720e
A(50,6,5) 1429 1459a A(43,8,6) 343 347f A(40,8,7) 977 999a A(60,10,6) 79 91a A(49,10,8) 776 785e
A(51,6,5) 1517 1543b A(44,8,6) 355 381f A(41,8,7) 1095 1110a A(61,10,6) 83 94a A(50,10,8) 852 858e
A(52,6,5) 1617 1654a A(45,8,6) 381 403f A(42,8,7) 1206 1227b A(62,10,6) 84 98c A(51,10,8) 929 934e
A(53,6,5) 1719 1758c A(46,8,6) 411 432f A(43,8,7) 1347 1365a A(29,10,7) 37 39d A(52,10,8) 1007 1018e
A(54,6,5) 1822 1840f A(47,8,6) 440 463c A(44,8,7) 1478 1503a A(36,10,7) 75 78c A(55,10,8) 1289 1296e
A(55,6,5) 1936 1948b A(48,8,6) 477 494b A(45,8,7) 1639 1653f A(42,10,7) 133 137c A(56,10,8) 1405 1408a
A(32,6,6) 1353 1369a A(49,8,6) 501 527f A(46,8,7) 1795 1813f A(56,10,7) 351 358b A(32,12,7) 9 10
A(33,6,6) 1528 1560a A(50,8,6) 542 567f A(47,8,7) 1987 2001f A(57,10,7) 366 374b A(33,12,7) 10 11
A(34,6,6) 1740 1771b A(51,8,6) 576 606f A(48,8,7) 2173 2197f A(58,10,7) 394 399b A(36,12,7) 15 16
A(35,6,6) 1973 1998b A(52,8,6) 609 640c A(49,8,7) 2376 2399b A(59,10,7) 414 423b A(37,12,7) 16 17
A(36,6,6) 2240 2264b A(53,8,6) 650 687c A(50,8,7) 2603 2615f A(60,10,7) 431 449b A(39,12,7) 19 21
A(37,6,6) 2539 2560f A(54,8,6) 682 726b A(51,8,7) 2839 2866f A(61,10,7) 458 474b A(40,12,7) 20 22b
A(38,6,6) 2836 2860b A(55,8,6) 729 768f A(52,8,7) 3101 3118f A(62,10,7) 486 497b A(37,12,8) 40 42d
A(39,6,6) 3167 3208a A(56,8,6) 766 815f A(53,8,7) 3376 3384f A(63,10,7) 514 526b A(38,12,8) 40 45a
A(40,6,6) 3545 3575a A(57,8,6) 830 866f A(54,8,7) 3651 3667b A(30,10,8) 92 93d A(43,14,8) 10 12
A(41,6,6) 3964 3983a A(58,8,6) 872 912f A(55,8,7) 3941 3989b A(33,10,8) 134 140d A(44,14,8) 12 13
A(42,6,6) 4397 4419b A(59,8,6) 935 965f A(56,8,7) 4270 4318b A(34,10,8) 156 162a A(45,14,8) 12 15
A(43,6,6) 4860 4890b A(60,8,6) 982 1019f A(59,8,7) 5384 5386f A(35,10,8) 176 182d A(46,14,8) 13 17
A(44,6,6) 5378 5414a A(61,8,6) 1028 1077f A(45,10,6) 49 50a A(36,10,8) 198 205d A(47,14,8) 15 18
A(45,6,6) 5933 5959b A(62,8,6) 1079 1130f A(48,10,6) 56 57a A(37,10,8) 223 230d A(48,14,8) 18 19
A(46,6,6) 6521 6552a A(63,8,6) 1143 1195f A(49,10,6) 56 59f A(38,10,8) 249 259a
A(47,6,6) 7160 7194a A(30,8,7) 327 340f A(50,10,6) 56 62f A(39,10,8) 285 291d
a Later improved in [13] by a specific group of automorphisms or a combinatorial construction.
b Later improved in [13] by heuristic polishing of a group code or a combinatorial construction.
c Later improved in [13] by shortening a code of length n+1 and weight w.
d Later improved in [13] by a cyclic group.
e Later improved in [13] by shortening a code of length n+1 and weight w or w+1 and a heuristic improvement.
f Later improved in [13] by an unspecified method.
A VNS algorithm combining Seed Building and Clique Search (see Section III) was proposed in [8], and it was shown to be able to improve best-known results from the literature for many instances with parameter settings appropriate to the frequency hopping applications (29 ≤ n ≤ 63 and 5 ≤ w ≤ 8 with d = 2w − 2, d = 2w − 4 or d = 2w − 6), for which mathematical constructions were not very well developed previously (the interested reader is referred to [8] for a detailed description of parameter tuning and experimental settings). The instances improved with respect to the state-of-the-art are summarised in Table I, where the new lower bounds provided by the VNS method (New LB) are compared with the previously best-known results (Old LB). Many results were improved, especially for large values of n. These were instances for which the previous methods used were not particularly effective. Notice that most of the instances reported in the table were later further improved by other methods, many of which again make use of heuristic criteria. See [13] for full details.
V. QUATERNARY DNA CODES
Quaternary DNA codes are sets of words of fixed length n over the alphabet {A, C, G, T}. The words of a code have to satisfy the following combinatorial constraints. For each pair of words, the Hamming distance has to be at least d (constraint HD); a fixed number (here taken as ⌊n/2⌋) of letters of each word have to be either G or C (constraint GC); the Hamming distance between each word and the Watson-Crick complement (or reverse-complement) of each word has to be at least d (constraint RC), where the Watson-Crick complement of a word x1x2...xn is defined as c(xn)c(xn−1)...c(x1), with c(A) = T, c(T) = A, c(C) = G and c(G) = C. If the number of letters which are G or C in each word is ⌊n/2⌋, then A^GC_4(n,d,⌊n/2⌋) is used to denote the maximum number of words in a code satisfying constraints HD and GC. A^GC,RC_4(n,d,⌊n/2⌋) is used to denote the maximum number of words in a code satisfying constraints HD, GC and RC.
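A small self-contained C sketch of the three constraint checks follows; the example words and parameters are arbitrary assumptions.

#include <stdio.h>

#define N 8   /* word length, arbitrary */
#define D 4   /* minimum distance, arbitrary */

static char wc(char b)   /* Watson-Crick complement of a single base */
{ return b == 'A' ? 'T' : b == 'T' ? 'A' : b == 'C' ? 'G' : 'C'; }

static int dist(const char *x, const char *y)
{ int d = 0; for (int i = 0; i < N; i++) if (x[i] != y[i]) d++; return d; }

static void revcomp(const char *x, char *out)   /* c(xn) c(xn-1) ... c(x1) */
{ for (int i = 0; i < N; i++) out[i] = wc(x[N - 1 - i]); out[N] = '\0'; }

static int gc_content(const char *x)
{ int g = 0; for (int i = 0; i < N; i++) if (x[i] == 'G' || x[i] == 'C') g++; return g; }

int main(void)
{
    const char *x = "ACGTACGT", *y = "AAGGCCTT";
    char rc[N + 1];
    revcomp(y, rc);
    printf("HD: d(x,y) = %d (need >= %d)\n", dist(x, y), D);
    printf("GC: %d letters G or C (need n/2 = %d)\n", gc_content(x), N / 2);
    printf("RC: d(x, revcomp(y)) = %d (need >= %d)\n", dist(x, rc), D);
    return 0;
}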
Quaternary DNA codes have applications to information storage and retrieval in synthetic DNA strands. They are used in DNA computing, as probes in DNA microarray technologies and as molecular bar codes for chemical libraries [5]. Constraints HD and RC are used to make unwanted hybridisations less likely, while constraint GC is imposed to ensure uniform "melting temperatures", where DNA melting is the process by which double-stranded DNA unwinds and separates into single strands through the breaking of hydrogen bonding between the bases. Such constraints have been used, for example, in [2], [14], where more detailed technical motivations for the constraints can be found. Lower bounds for Quaternary DNA codes obtained using different tools such as mathematical constructions, stochastic searches, template-map strategies, genetic algorithms and lexicographic searches have been proposed (see [7], [14], [2], [15], [10], [5], [16], [17] and [18]).
A VNS method embedding all the local search routines described in Section III was implemented in [5], [16]. Experiments were conducted for A^GC_4(n,d,w) and A^GC,RC_4(n,d,w) with 4 ≤ n ≤ 20, 3 ≤ d ≤ n and 21 ≤ n ≤ 30, 13 ≤ d ≤ n. In Table II the new lower bounds (New LB) retrieved by VNS during the experiments that improved the previous state-of-the-art results (Old LB) are summarised. It is interesting to observe how substantial the improvements for this family of codes sometimes are (see, for example, A^GC_4(19,10,9) and A^GC_4(19,11,9)). The reader is referred to [16] for a detailed description of parameter tuning and experimental settings.
VI. PERMUTATION CODES
A permutation code is a set of permutations in the symmetric group Sn of all permutations on n elements. The words are the permutations and the code length is n. The ability of a permutation code to correct errors is related to the minimum Hamming distance of the code. The minimum Hamming distance d is then the minimum distance taken over all pairs of distinct permutations. The maximum number of words in a code of length n with minimum distance d is denoted by M(n,d).
Permutation codes (sometimes called permutation arrays) have been proposed in [19] for use with a specific modulation scheme for powerline communications. An account of the rationale for the choice of permutation codes can be found in [3]. Permutations are used to ensure that power output remains as constant as possible. As well as white Gaussian noise, the codes must combat permanent narrow band noise from electrical equipment or magnetic fields, and impulsive noise.
A central practical question in the theory of permutation codes is the determination of M(n,d), or of good lower bounds for M(n,d). The most complete contribution to this question is in [3]. More recently, different methods, based both on permutation groups and on heuristic algorithms, have been presented in [20]. In this paper a VNS approach involving Clique Search only (basically an Iterated Clique Search method) was introduced among other approaches. In some cases the method was run on cycles of words of length n or n − 1 instead of words. This reduces the complexity of the problem, making it tractable by the VNS approach. Experimental results
TABLE II
IMPROVED QUATERNARY DNA CODES.

Problem Old LB New LB    Problem Old LB New LB
A^GC_4(7,3,3) 280 288    A^GC,RC_4(12,7,6) 83 87
A^GC_4(7,4,3) 72 78g    A^GC,RC_4(12,8,6) 28 29
A^GC_4(8,5,4) 56 63    A^GC,RC_4(12,9,6) 11 12
A^GC_4(8,6,4) 24 28    A^GC,RC_4(13,5,6) 3954 3974
A^GC_4(9,6,4) 40 48    A^GC,RC_4(13,7,6) 205 206
A^GC_4(9,7,4) 16 18    A^GC,RC_4(13,8,6) 61 62
A^GC_4(10,4,5) 1710 2016g    A^GC,RC_4(13,9,6) 22 23
A^GC_4(10,7,5) 32 34    A^GC,RC_4(13,10,6) 9 10
A^GC_4(11,7,5) 72 75    A^GC,RC_4(14,9,7) 46 49
A^GC_4(11,9,5) 10 11    A^GC,RC_4(14,10,7) 16 20
A^GC_4(12,7,6) 179 183    A^GC,RC_4(14,11,7) 7 8
A^GC_4(12,8,6) 68 118    A^GC,RC_4(15,6,7) 6430 6634
A^GC_4(12,9,6) 23 24    A^GC,RC_4(15,8,7) 343 347
A^GC_4(13,9,6) 44 46    A^GC,RC_4(15,9,7) 102 109
A^GC_4(14,11,7) 16 17    A^GC,RC_4(15,10,7) 35 37
A^GC_4(15,9,7) 225 227    A^GC,RC_4(16,9,8) 230 243
A^GC_4(15,11,7) 30 34    A^GC,RC_4(16,10,8) 74 83
A^GC_4(15,12,7) 13 15    A^GC,RC_4(17,9,8) 549 579
A^GC_4(17,13,8) 22 24    A^GC,RC_4(17,10,8) 164 175
A^GC_4(18,11,9) 216 282    A^GC,RC_4(17,11,8) 56 62
A^GC_4(18,13,9) 38 46    A^GC,RC_4(17,13,8) 11 12
A^GC_4(18,14,9) 18 20    A^GC,RC_4(18,9,9) 1403 1459
A^GC_4(19,10,9) 1326 2047    A^GC,RC_4(18,10,9) 387 407
A^GC_4(19,11,9) 431 615    A^GC,RC_4(18,11,9) 104 133
A^GC_4(19,12,9) 163 213    A^GC,RC_4(18,12,9) 43 49
A^GC_4(19,13,9) 71 83    A^GC,RC_4(18,13,9) 19 21
A^GC_4(19,14,9) 33 38    A^GC,RC_4(18,14,9) 9 10
A^GC_4(19,15,9) 15 17    A^GC,RC_4(19,9,9) 3519 3678
A^GC_4(20,13,10) 130 167    A^GC,RC_4(19,10,9) 909 960
A^GC_4(20,14,10) 58 69    A^GC,RC_4(19,11,9) 215 285
A^GC_4(20,15,10) 31 33    A^GC,RC_4(19,12,9) 80 99
A^GC_4(20,16,10) 13 16    A^GC,RC_4(19,13,9) 35 39
A^GC,RC_4(9,6,4) 20 21    A^GC,RC_4(19,14,9) 16 18
A^GC,RC_4(10,5,5) 175 176    A^GC,RC_4(19,15,9) 7 8
A^GC,RC_4(10,7,5) 16 17    A^GC,RC_4(20,13,10) 64 77
A^GC,RC_4(11,7,5) 36 37    A^GC,RC_4(20,14,10) 29 33
A^GC,RC_4(11,8,5) 13 14    A^GC,RC_4(20,15,10) 14 15
A^GC,RC_4(12,5,6) 1369 1381    A^GC,RC_4(20,16,10) 6 7

g Later improved in [16] by a heuristic approach based on an Evolutionary Algorithm.
(see [20] for details on parameter tuning and experimental settings) were discussed for 6 ≤ n ≤ 18 and 4 ≤ d ≤ 18, plus M(19,17) and M(20,19). The new best-known results retrieved by VNS (New LB) are summarised in Table III, where they are compared with the previous state-of-the-art results (Old LB). Superscripts reflect the domain on which the VNS method was run. Besides providing the first non-trivial bound for some of the instances, the algorithm was also able to provide substantial improvements over the previous best-known results (see, for example, M(15,13)).
VII. PERMUTATION CODES WITH SPECIFIED PACKING RADIUS
Using the notation introduced in the previous section, a ball of radius e surrounding a word w ∈ Sn is composed of all the permutations of Sn with Hamming distance from w at most e. Given a permutation code C, the packing radius of C is defined as the maximum value of e such that the balls of radius e centred at words of C do not overlap. The maximum number of permutations of length n with packing radius at least e is denoted by P[n,e].

TABLE III
IMPROVED PERMUTATION CODES.

Problem Old LB New LB    Problem Old LB New LB
M(13,8) - 27132h    M(15,14) - 56
M(13,9) 3588 4810    M(16,13) - 1266
M(13,10) - 906    M(16,14) - 269
M(13,11) - 195i    M(18,17) 54 70
M(14,13) - 52    M(19,17) - 343
M(15,11) - 6076h    M(20,19) - 78
M(15,13) 84 243h

h VNS on cycles of words of length n−1 instead of words.
i VNS on cycles of words of length n instead of words.
From a practical point of view, a permutation code (see Section VI) with d = 2e + 1 or d = 2e + 2 can correct up to e errors. On the other hand, it is known that in an (n,2e) permutation code the balls of radius e surrounding the codewords may all be pairwise disjoint, but usually some overlap. Thus an (n,2e) permutation code is generally unable to correct e errors using nearest neighbour decoding. On the other hand, a permutation code with packing radius e (denoted [n,e]) can always correct e errors. Thus, the packing radius more accurately specifies the requirement for an e-error-correcting permutation code than does the minimum Hamming distance [21].
A basic VNS algorithm involving Clique Search only (Iterated Clique Search) was presented, among other methods, in [21]. The method was tested on instances with 4 ≤ n ≤ 15 and 2 ≤ e ≤ 6 (all parameter tunings and experimental settings are described in the paper). The new best-known lower bounds retrieved by the VNS method (New LB) are summarised in Table IV, comparing them with the previous state-of-the-art bounds (Old LB). Notice that also in this case superscripts reflect the domain on which the VNS method was run. As in Section VI, for complexity reasons, it was sometimes convenient to run the method on cycles of words of length n or n − 1 instead of words. From the results of Table IV it can be observed that the improvements over the previous state-of-the-art are sometimes remarkable (see, for example, P[14,4]).
VIII. CONCLUSIONS
A heuristic framework based on Variable Neighbourhood Search for code design has been described. Experimental results carried out on four different code families, used in different applications, have been presented. Parameter tuning has been carried out for all algorithms used for these applications, and is described in the referenced papers. However, it has been observed that the exact choice of parameters is not particularly critical. From the experiments it is clear that heuristics are a valuable additional tool in the design of new improved codes.
TABLE IV
IMPROVED PERMUTATION CODES WITH A SPECIFIED PACKING RADIUS.

Problem Old LB New LB    Problem Old LB New LB
P[5,2] 5 10    P[12,5] 60 144j
P[6,2] 18 30    P[12,6] - 12
P[6,3] - 6    P[13,4] 4810 15120k
P[7,2] 77 126    P[13,5] 195 612k
P[7,3] 7 22    P[13,6] 13 40
P[8,4] - 8    P[14,4] 6552 110682k
P[9,4] 9 25    P[14,5] 2184 3483
P[10,4] 49 110j    P[14,6] 52 169k
P[10,5] - 10    P[15,6] 243 769
P[11,5] 11 33j

j VNS on cycles of words of length n instead of words.
k VNS on cycles of words of length n−1 instead of words.
ACKNOWLEDGMENT
R. Montemanni and M. Salani acknowledge the support of the Swiss Hasler Foundation through grant 11158: "Heuristics for the design of codes".
REFERENCES
[1] M. K. Gupta, “The quest for error correction in biology,” IEEE Engi-neering in Medicine and Biology Magazine, vol. 25, no. 1, pp. 46–53,2006.
[2] O. D. King, “Bounds for DNA codes with constant GC-content,”Electronic Journal of Combinatorics, vol. 10, #R33, 2003.
[3] W. Chu, C. J. Colbourn and P. Dukes, “Constructions for permutationcodes in powerline communications,” Designs, Codes and Cryptography,vol. 32, pp. 51–64, 2004.
[4] P. Hansen and N. Mladenovic, “Variable neighbourhood search: prin-ciples and applications,” European Journal of Operational Research,vol. 130, pp. 449–467, 2001.
[5] R. Montemanni and D. H. Smith, “Construction of constant GC-contentDNA codes via a variable neighbourhood search algorithm,” Journal ofMathematical Modelling and Algorithms, vol. 7, pp. 311–326, 2008.
[6] A. E. Brouwer, J. B. Shearer, N. J. A. Sloane, and W. D. Smith, “Anew table of constant weight codes,” IEEE Transactions on InformationTheory, vol. 36, pp. 1334–1380, 1990.
[7] Y. M. Chee and S. Ling, “Improved lower bounds for constant GC-content DNA codes,” IEEE Transactions on Information Theory, vol. 54,no. 1, pp. 391–394, 2008.
[8] R. Montemanni and D. H. Smith, “Heuristic algorithms for construct-ing binary constant weight codes,” IEEE Transactions on InformationTheory, vol. 55, no. 10, pp. 4651–4656, 2009.
[9] R. Carraghan and P. Pardalos, “An exact algorithm for the maximumclique problem,” Operations Research Letters, vol. 9, pp. 375–382, 1990.
[10] D. C. Tulpan, H. H. Hoos, and A. E. Condon, “Stochastic local searchalgorithms for DNA word design,” Lectures Notes in Computer Science,Springer, Berlin, vol. 2568, pp. 229–241, 2002.
[11] P. J. Kuekes, W. Robinett, R. M. Roth, G. Seroussi, G. S. Snider, andR. S. Williams,“Resistor-logic demultiplexers for nanoelectronics basedon constant-weight codes” Nanotechnology, vol. 17, pp. 1052–1061,2006.
[12] J. N. J. Moon, L. A. Hughes, and D. H. Smith, “Assignment of frequencylists in frequency hopping networks,” IEEE Transactions on VehicularTechnology, vol. 54, no. 3, pp. 1147–1159, 2005.
[13] A. E. Brouwer, “Bounds for binary constant weight codes.http://www.win.tue.nl/∼aeb/codes/Andw.html.”
[14] P. Gaborit and O. D. King, “Linear construction for DNA codes,”Theoretical Computer Science, vol. 334, pp. 99–113, 2005.
[15] D. C. Tulpan and H. H. Hoos, “Hybrid randomised neighbourhoodsimprove stochastic local search for DNA code design,” Lecture Notes inComputer Science, Springer, Berlin, vol. 2671, pp. 418–433, 2003.
The Eighth International Conference on Computing and Information Technology IC2IT 2012
131
[16] R. Montemanni, D. H. Smith, and N. Koul, “Three metaheuristics for theConstruction of Constant GC-content DNA codes,” in: S. Voss and M.Caserta (eds.), Metaheuristics: Intelligent Decision Making, (OperationsResearch / Computer Science Interface Series). Springer-Verlag NewYork, 2011.
[17] D. H. Smith, N. Aboluion, R. Montemanni, and S. Perkins, “Linear andnonlinear constructions of DNA codes with Hamming distance d andconstant GC-content,” Discrete Mathematics, vol. 311, no. 14, pp. 1207–1219, 2011.
[18] N. Aboluion, D. H. Smith and S. Perkins, “Linear and nonlinearconstructions of DNA codes with Hamming distance d, constant GC-content and a reverse-complement constraint,” Discrete Mathematics,vol. 312, no. 5, pp. 1062–1075, 2012.
[19] N. Pavlidou, A.J. Han Vinck, J. Yazdani and B. Honary, “Power linecommunications: state of the art and future trends,” IEEE Communica-tions Magazine, vol. 41, no. 4, pp. 34–40, 2003.
[20] D. H. Smith and R. Montemanni, “A new table of permutationcodes,” Designs, Codes and Cryptography, Online First, 2011, DOI10.1007/s10623-011-9551-8.
[21] D. H. Smith and R. Montemanni, “Permutation codes with specifiedpacking radius,” Designs, Codes and Cryptography, Online First, 2012,DOI: 10.1007/s10623-012-9623-4.
Spatial Join with R-Tree on Graphics Processing Units
Tongjai Yampaka Department of Computer Engineering
Chulalongkorn University
Bangkok, Thailand
Prabhas Chongstitvatana Department of Computer Engineering
Chulalongkorn University
Bangkok, Thailand
Abstract: Spatial operations such as spatial join combine two objects on spatial predicates. It differs from relational join because objects have multiple dimensions, and spatial join consumes a large amount of execution time. Recently, many studies have tried to find methods to improve the execution time. Parallel spatial join is one such method: comparisons between objects can be done in parallel. Spatial datasets are large, and the R-Tree data structure can improve the performance of spatial join.
In this paper, a parallel spatial join on the Graphics Processing Unit (GPU) is introduced. The capacity of the GPU, which has many processors, to accelerate the computation is exploited. An experiment is carried out to compare the spatial join between a sequential implementation in C on the CPU and a parallel implementation in CUDA C on the GPU. The result shows that the spatial join on the GPU is faster than on a conventional processor.
Keywords: Spatial Join, Spatial Join with R-Tree, Graphics Processing Unit
I. INTRODUCTION
The evolution of Graphic Processing Unit is driven by
the demand for real time, high-definition and 3-D
graphics. The requirement for an efficient and fast
computation has been met by parallel computation [1]. In
addition, GPU architecture that supports parallel
computation is programmable to solve other problems.
This new trend is called General Purpose computing on Graphics Processing Units (GPGPU). Developers can use the capacity of the GPU to solve problems other than graphics and can improve the execution time by parallel computation. In a spatial database, storing and managing complex and large datasets, such as those of Geographic Information Systems (GIS) and Computer-Aided Design (CAD), is time consuming. The characteristics of a spatial database differ from those of a relational database because of the data types. Spatial
data types are point, line and polygon. The type of data
depends on the characteristic of objects, for example a
road is represented by a line or a city is represented by a
polygon. An object shape is created by x, y and z
coordinates. Therefore, spatial operations in a spatial
database are not the same as operations in a relational
database. There are specific techniques for spatial
operations.
Spatial join combines two objects on spatial predicates, for example, finding the intersection between two
objects. It is an expensive operation because spatial
datasets can be complex and very large. Their processing
cost is very high. To solve this problem R-Tree is used to
improve the performance for accessing data in spatial join.
Spatial objects are indexed by spatial indexing [2] [3].
The objects are represented by minimum bounding
rectangles which cover them. An internal node points to
children nodes that are covered by their parents. A leaf
node points to real objects. The join with R-Tree begins
with a minimum bounding rectangle. The test for an
overlap is performed from a root node to a leaf node. It is
possible that there are overlaps in sub-trees too.
The previous work [4] introduces a technique for
spatial join that can be divided into two steps.
• Filter Step: This step computes an approximation of
each spatial object, its minimum bounding rectangle. This
step produces rectangles that cover all objects.
• Refinement Step: In this step, spatial join predicates
are performed over each object.
Recently, spatial join techniques have been proposed
in many works. In a survey [5], many techniques to
improve spatial join are described. One technique shows a
parallel spatial join that improves the execution time for
this operation.
This paper presents a spatial join with R-Tree on
Graphic processing units. The parallel step is executed for
testing an overlap. The paper is organized as follows.
Section 2 explains the background and reviews related
works. Section 3 describes the spatial join with R-Tree on
Graphic processing units. Section 4 explains the
experiment. The results are presented in Section 5.
Section 6 concludes the paper.
II. BACKGROUND AND RELATED WORK
A. Spatial join with R-Tree
Spatial join combines two objects with spatial predicates. Objects are multi-dimensional, so it is important to retrieve data efficiently. In a survey [5],
techniques of spatial join are presented. Indexing data
such as R-Tree is one method which improves I/O time. In
[6], R-Tree is used for spatial join. Before executing a
spatial join predicate in the leaf level, an overlap between
two objects from parent nodes is tested. When parent nodes overlap, the search continues into the sub-trees that are covered by the parents. The sub-trees which are
not overlapped from parent nodes are ignored. The reason
is that the overlapped parent nodes are probably
overlapped with leaf nodes too. In the next step, the overlap test function is called on the sub-trees recursively. The algorithm is shown in Figure 1.
SpatialJoin(R, S):
  For (all ptrS ∈ S) Do
    For (all ptrR ∈ R with ptrR.rect ∩ ptrS.rect ≠ ∅) Do
      If (R is a leaf node) Then
        Output (ptrR, ptrS)
      Else
        Read (ptrR.child); Read (ptrS.child)
        SpatialJoin(ptrR.child, ptrS.child)
    End
  End
End SpatialJoin;
Figure 1 Spatial join with R-Tree
The work [6] presents a spatial join with R-Tree that
improves the execution time. However, this algorithm is
designed for a single-core processor. The proposed
algorithm is based on this work but the implementation is
on Graphics Processing Units.
B. Parallel spatial join with R-Tree
To reduce the execution time of a spatial join, a
parallel algorithm can be employed. The work in [7]
describes a parallel algorithm for a spatial join. A spatial
join has two steps: filter step and refinement step. The
filter step uses an approximation of the spatial objects,
e.g. the minimum bounding rectangle (MBR).
The filter admits only objects that are possible to
satisfy the predicate. A spatial object is defined in the form (MBRi, IDi), where IDi is a key-pointer to the data for the object. The output of this step is the set of pairs [(MBRi, IDi), (MBRj, IDj)] such that MBRi intersects MBRj. Each pair is called a candidate pair. The next step is the refinement step. Pairs of candidate objects are retrieved from disk to perform the join predicate. To
retrieve data, it reads the pointers from IDi and IDj. The
algorithm creates tasks for testing an overlap in the filter
step in parallel. For example in Figure 2, R and S denote
spatial relations.
The set {R1, R2, R3, R4, R5, R6, …, RN} is in the R root and the set {S1, S2, S3, S4, S5, S6, …, SN} is in the S root. In the
algorithm described here the filter step is done in parallel.
Root R, Root S
R root = {R1, R2, R3, R4, R5}
S root = {S1, S2, S3, S4}
Tasks created: Task1(R1,S1), Task2(R1,S2), Task3(R1,S3), Task4(R1,S4), …, TaskN(RN,SN)
Figure 2 Filter task creation and distribution in parallel for R-Tree join
The algorithm is designed for parallel operation on a
CPU. In this paper we use the same idea for the algorithm
but it is implemented on a GPU.
In other research [8], R-Tree is used in parallel search.
The algorithm distributes objects to separate sites and
creates index data objects from leaves to parents. Every
parent has entries to all sites. A search query, such as a window query, can be performed in parallel.
C. Spatial query on GPU
For a parallel operation in GPU, the work in [9]
implements a spatial indexing algorithm to perform a
parallel search. A linear-space search algorithm is
presented that is suitable for the CUDA [1] programming
model. Before the search query begins, a preparation of
data array is required for the R-Tree. This is done on
CPU. Then the data array is loaded into device memory.
The search query is launched on GPU threads. The data
structure has two data arrays represented in bits. The
arithmetic at bit level is exploited. The first array stores the MBR coordinates, that is, the bottom-left and top-right coordinates of each MBR in the index. The second array is an array of R-Tree nodes. R-Tree nodes store the set {MBRi, childNode|t|}, where childNode|t| is an index into the array representing the children of node i. When the search query is called, the GPU kernel creates threads to execute the tasks, and the two data arrays are copied to device memory. Finally the main function on the GPU is called. The
algorithm is shown in Figure 3. The result is copied back
to CPU when the execution on GPU is finished.
Clear memory array (in parallel).
For each thread:
  If Search[i] is set:
    For each child node j of Search[i] that overlaps with the query MBR:
      If the child node j is a leaf, mark it as part of the output.
      If the child node j is not a leaf, mark it in the Next Search array.
Sync Threads.
Copy the Next Search array into Search[i] (in parallel).
Figure 3 R-Tree Searches on GPU
III. IMPLEMENTATION
A. Overview of the algorithm
Most works have focused on the improvement of the
filter step. The first filter step assumes that the computation is done with the MBRs of the spatial objects. In
this paper, this step is performed on CPU and the data set
is assumed to be in the data arrays. The algorithm begins
by parallel filtering objects on GPU. The steps of the
algorithm are as follows.
• Step 1: The data arrays required for the R-Tree are
mapped to the device memory. The data arrays are
prepared on CPU before sending them to device.
• Step 2: Filtering step: a function to find an overlap between two MBR objects is called. Threads are created on the GPU for execution in parallel. The results are the set of MBRs which overlap.
• Step 3: Find leaf nodes: the results of step 2, the set of MBRs, are checked to determine whether they are leaf nodes or not. If they are leaf nodes, the set is returned as the result and sent to the host. If they are not leaf nodes, they are used as input again recursively until leaf nodes are reached.
B. Data Structure in the algorithm
Assume MBR objects are stored in a table or a file. In the join operation there are two relations, denoted R and S. The MBR structures (shown in C syntax) are of the form:

struct MBR_object {
    int min_x, max_x, min_y, max_y; /* x, y coordinates of the object's rectangle */
};

struct MBR_root {
    int min_x, max_x, min_y, max_y; /* x, y coordinates of the root's rectangle */
    int child[numberOfchild];       /* children covered by this root */
};

struct MBR_root rootR[numberOfrootR];       /* array of roots of relation R */
struct MBR_root rootS[numberOfrootS];       /* array of roots of relation S */
struct MBR_object objectR[numberOfobjectR]; /* array of objects of relation R */
struct MBR_object objectS[numberOfobjectS]; /* array of objects of relation S */
C. R-Tree Indexing
An R-Tree is similar to a B-Tree, in which the index is recorded in a leaf node that points to the data object [4]. All minimum bounding rectangles are created from the x, y coordinates of objects. The index of the data is created by the packing R-Tree technique [10]. The technique is divided into three steps:
1) Find the number of objects per pack. The number of children is between a lower bound (m) and an upper bound (M).
2) Sort the data on the x or y coordinates of the rectangles.
3) Assign rectangles from the sorted list to the pack successively, until the pack is full. Then, find min x, y and max x, y for each pack to create the root node.
Figure 4 MBRs before split node R-Tree
An example is shown in Figure 4. It has five
rectangles of objects. The objects are ordered according to
x-coordinate of the rectangle. The sorted list is {A, D, B, E, C}. Define the number of objects per pack as three. The assignments of objects into packs are:
Pack1 = {A, D, B}
Pack2 = {C, E}
In the next step, a root is created. Compute min x,
min y and max x, max y.
Figure 5 MBRs after split node R-Tree
The root node of pack1 is R1 and the root node of
pack2 is R2. R1 points to three objects: A, D and B. R2
points to two objects: C and E. The root coordinates are computed from the min x, min y and max x, max y of all objects that the root covers. In the example, only one relation is shown.
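A minimal C sketch of this packing procedure follows; M = 3 objects per pack mirrors the example above, while the five rectangles and their coordinates are arbitrary assumptions.

#include <stdio.h>
#include <stdlib.h>

#define M 3   /* objects per pack, as in the example */

typedef struct { int min_x, max_x, min_y, max_y; } Rect;

static int by_min_x(const void *a, const void *b)
{ return ((const Rect *)a)->min_x - ((const Rect *)b)->min_x; }

int main(void)
{
    Rect obj[5] = { {0,2,0,2}, {5,7,1,3}, {1,3,4,6}, {8,9,5,7}, {3,4,2,5} };
    const int n = 5;
    qsort(obj, n, sizeof(Rect), by_min_x);            /* step 2: sort on x */

    for (int p = 0; p * M < n; p++) {                 /* step 3: fill packs */
        int end = (p + 1) * M < n ? (p + 1) * M : n;
        Rect root = obj[p * M];                       /* root MBR of the pack */
        for (int i = p * M + 1; i < end; i++) {
            if (obj[i].min_x < root.min_x) root.min_x = obj[i].min_x;
            if (obj[i].max_x > root.max_x) root.max_x = obj[i].max_x;
            if (obj[i].min_y < root.min_y) root.min_y = obj[i].min_y;
            if (obj[i].max_y > root.max_y) root.max_y = obj[i].max_y;
        }
        printf("R%d covers x:[%d,%d] y:[%d,%d]\n",
               p + 1, root.min_x, root.max_x, root.min_y, root.max_y);
    }
    return 0;
}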
R-Tree creation is done on CPU. The difference is in
the spatial join operation. The spatial join on CPU is
sequential and on GPU is parallel.
D. Spatial join on GPU
To parallelize a spatial join, the data preparation is
carried out on CPU, such as MBRs calculation and
splitting R-Tree nodes. In GPU, the overlap function and
the intersection join function are executed in parallel.
1) Algorithm
• Overlap: This step is the filter step for testing the
overlap between root nodes R and S.
1. Load MBR data arrays (R and S) to GPU.
2. Test the overlap Ri and Sj in parallel.
3. The overlap function call is:
Overlap ((Sj.x_min < Ri.x_max)
and (Sj.x_max > Ri.x_min)
and (Sj.y_min < Ri.y_max)
and (Sj.y_max > Ri.y_min))
4. For each Ri that overlaps Sj:
5. Find the children nodes of Ri and Sj.
• Find children: Find children nodes which are covered
by the root Ri and Sj.
a) The information from MBRs indicates the children
that are covered by the root.
b) Load children data and send them to the overlap
function.
The Eighth International Conference on Computing and Information Technology IC2IT 2012
135
• Test intersection: This is the refinement step. Compute
the join predicate on all children of Ri and Sj using the
overlap function above.
2) GPU Program Structure
CUDA C language is used. The language has been
designed to facilitate graphic rendering on Graphics
processing units. A CUDA program has two phases [11]. In the first phase, the program on the CPU, called host code, performs the data initialization and transfers data from host to device or from device to host. In the second phase, the program on the GPU, called the device code,
makes use of the CUDA runtime system to generate
threads for execution of functions. All threads execute the
same code but operate on different data at the same time.
A CUDA function uses the keyword "__global__" to define a kernel function. When the kernel
function is called from the host, CUDA generates a grid of
threads on the device.
In the spatial join, the overlap function is distributed to
different blocks and is executed at the same time with
different data objects. To divide the task, every block has a block identity called blockIdx.
For example:
Objects: Relation R = {Robject0, Robject1, Robject2, ..., RobjectN},
Relation S = {Sobject0, Sobject1, Sobject2, ..., SobjectN}
Overlap function: Compare all objects. Find x and y
coordinates in the intersection predicate.
The sequential program on the CPU executes only one pair of data at a time.
Robject0 compare Sobject0
Robject0 compare Sobject1
Robject0 compare Sobject2
...
RobjectN compare SobjectN (time N)
On the GPU, the CUDA code on the device generates blocks for executing all data on different blocks.
Block0 = Robject0 compare Sobject0
Block1 = Robject0 compare Sobject1
Block2 = Robject0 compare Sobject2
...
BlockN = RobjectN compare SobjectN
Memory is allocated for execution between CPU and GPU. First, memory is allocated for the data structures of the R-Tree roots and the MBRs of objects. Second, memory is allocated for the data arrays that store the results. When the task is done, the data arrays are copied back to the host.
The nested loop is transformed to run in parallel. The rectangles of objects are mapped to 2D blocks on the GPU. The outer loop is mapped to blockIdx.x and the inner loop is mapped to threadIdx.y. The call to the kernel function is:
kernel<<<number of outer loop, number of inner loop>>>
The CUDA kernel generates blocks and threads for execution.
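A minimal CUDA C sketch of this mapping is given below; the struct layout, array names and the tiny example data are illustrative assumptions rather than the authors' exact code.

#include <cstdio>
#include <cuda_runtime.h>

struct MBR { int min_x, max_x, min_y, max_y; };

__global__ void overlapKernel(const MBR *r, const MBR *s, int *flag, int ns)
{
    int i = blockIdx.x;   /* outer loop mapped to blocks  */
    int j = threadIdx.y;  /* inner loop mapped to threads */
    MBR a = r[i], b = s[j];
    flag[i * ns + j] = (b.min_x < a.max_x) && (b.max_x > a.min_x) &&
                       (b.min_y < a.max_y) && (b.max_y > a.min_y);
}

int main()
{
    const int nr = 2, ns = 2;
    MBR hr[nr] = { {0,4,0,4}, {10,12,10,12} };
    MBR hs[ns] = { {3,6,3,6}, {20,22,20,22} };
    int hf[nr * ns];
    MBR *dr, *ds; int *df;
    cudaMalloc(&dr, sizeof hr);                        /* device copies of R, S */
    cudaMalloc(&ds, sizeof hs);
    cudaMalloc(&df, sizeof hf);                        /* result flags */
    cudaMemcpy(dr, hr, sizeof hr, cudaMemcpyHostToDevice);
    cudaMemcpy(ds, hs, sizeof hs, cudaMemcpyHostToDevice);

    overlapKernel<<<nr, dim3(1, ns)>>>(dr, ds, df, ns); /* <<<outer, inner>>> */
    cudaMemcpy(hf, df, sizeof hf, cudaMemcpyDeviceToHost);

    for (int i = 0; i < nr; i++)
        for (int j = 0; j < ns; j++)
            printf("R%d-S%d overlap: %d\n", i, j, hf[i * ns + j]);
    cudaFree(dr); cudaFree(ds); cudaFree(df);
    return 0;
}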
IV. EXPERIMENTATION
A. Platform
The spatial join is coded in C for the sequential version; CUDA C is used for the parallel version. Both versions run on an Intel Core i3 at 2.93 GHz with 2048 MB of DDR3 memory. The GPU is an NVIDIA GT440 at 1092 MHz with 1024 MB of memory and 96 CUDA cores.
B. Dataset
In the experiment, the datasets are retrieved from the R-Tree portal [12]. In the data preparation step the minimum bounding rectangles are pre-computed. The dataset pairs are Rivers join Roads in Greece and Streets join Real roads in Germany.
TABLE I
DATASET IN EXPERIMENTATION

Pair of dataset                     Amount of MBRs    Data size
Greece: Rivers join Roads           47,918            0.7 MB
Germany: Streets join Real roads    67,008            0.6 MB
Table 1 shows the number of MBRs and the size of each dataset. All datasets are in text files. A C function is used to read data from a text file into data arrays.
V. RESULT
Spatial join is tested with the datasets in Table 1 with two functions (the overlap function on root nodes and the intersection function on children nodes). In the experiment, the time to read data from text files and store them in data arrays is ignored. The execution time of the spatial join operation on CPU and on GPU is compared. The generation of the R-Tree is done on CPU in both the sequential and the parallel version. Only the spatial join operations are different.
A. Performance comparison between sequential and parallel
The results are divided into two functions: overlap and intersect.
TABLE II
EXECUTION TIME ON GPU AND CPU

Pair of dataset                     Overlap (ms)      Intersection (ms)   Total (ms)
                                    CPU     GPU       CPU     GPU         CPU     GPU
Greece: Rivers join Roads           18      4         72.67   22.33       90.67   26.33
Germany: Streets join Real roads    5.33    4         74.00   39.67       79.33   43.67
The result in Table 2 shows that the execution time on
GPU is faster than on CPU. For the dataset 1, the overlap
function on GPU is 77.78% faster (4 ms versus 18 ms or
about 4x); the intersection function is 69.27% faster (3x).
The total execution time on GPU is 70.96% faster (3.4x).
For the dataset 2, the overlap function on GPU is 25%
faster (1.3x); the intersection function is 46.40%
The Eighth International Conference on Computing and Information Technology IC2IT 2012
136
faster (1.8x). The total execution time on GPU is 44.96%
faster (1.8x). The speedup depends on the data type as well. If the data has larger numbers, the execution time is longer. In the experiment, dataset 1 is floating point data with six digits per element. Its execution time is higher than that of dataset 2, which has integer data with four digits per element.
The time to transfer data is significant. The data transfer time affects the execution time. The total
running time in Table 2 includes the data transfer time
from host to device and device to host.
Figure 6 Transfer rate for dataset 1 and dataset 2
Figure 6 shows the data transfer rate on GPU. The
dataset 1 has 47,918 records and its size is 0.7 MB. The
data transfer time of this dataset is 59.53% of the
execution time. The dataset 2 has 67,008 records and is
0.6 MB. The data transfer time of this dataset is 76.83%
of the execution time.
VI. CONCLUSION
This paper describes how a spatial join operation with
R-Tree can be implemented on GPU. It uses the multi-
processing units in GPU to accelerate the computation.
The process starts with splitting objects and indexing data in the R-Tree on the host (CPU) and copying them to the device (GPU). The spatial join makes use of the parallel
execution of functions to perform the calculation over
many processing units in GPU.
However, using a Graphics Processing Unit to perform general purpose tasks has limitations. The symbiosis between CPU and GPU is complicated. There is a need to
transfer data back and forth between CPU and GPU and
the data transfer time is significant. Therefore, it may be
the case that the data transfer time will dominate the total
execution time if the task and the data are not carefully
divided.
Future work will be on how to automate and coordinate the tasks between CPU and GPU. There are other database management functions that are suitable for implementation on the GPU too. This is worth investigating as GPUs become ubiquitous.
REFERENCES

[1] NVIDIA CUDA Programming Guide, 2010. Retrieved from http://developer.download.nvidia.com
[2] Y. Manolopoulos, A. Nanopoulos, A. N. Papadopoulos and Y. Theodoridis, R-Trees: Theory and Applications, Springer, 2006.
[3] Xiang Xiao and Tuo Shi, "R-Tree: A Hardware Implementation," Int. Conf. on Computer Design, Las Vegas, USA, July 14-17, 2008.
[4] A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching," ACM SIGMOD Int. Conf., 1984.
[5] E. H. Jacox and H. Samet, "Spatial Join Techniques," ACM Trans. on Database Systems, Vol. V, No. N, November 2006, pp. 1–45.
[6] T. Brinkhoff, H.-P. Kriegel and B. Seeger, "Efficient Processing of Spatial Joins Using R-trees," SIGMOD Conference, 1993, pp. 237–246.
[7] L. Mutenda and M. Kitsuregawa, "Parallel R-tree Spatial Join for a Shared-Nothing Architecture," Int. Sym. on Database Applications in Non-Traditional Environments, Japan, 1999, pp. 423–430.
[8] H. Wei, Z. Wei, Q. Yin, "A New Parallel Spatial Query Algorithm for Distributed Spatial Database," Int. Conf. on Machine Learning and Cybernetics, 2008, Vol. 3, pp. 1570–1574.
[9] M. Kunjir and A. Manthramurthy, "Using Graphics Processing in Spatial Indexing Algorithm," Research report, Indian Institute of Science, 2009.
[10] I. Kamel and C. Faloutsos, "On Packing R-trees," Int. Conf. on Information and Knowledge Management, ACM, USA, 1993, pp. 490–499.
[11] David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.
[12] R-tree Portal. [Online]. http://www.rtreeportal.org
Ontology Driven Conceptual Graph Representation of Natural Language
Supriyo Ghosh
Department of Information Technology
National Institute of Technology, Durgapur
Durgapur, West Bengal, 713209, India
Email: [email protected]

Prajna Devi Upadhyay
Department of Information Technology
National Institute of Technology, Durgapur
Durgapur, West Bengal, 713209, India
Email: [email protected]

Animesh Dutta
Department of Information Technology
National Institute of Technology, Durgapur
Durgapur, West Bengal, 713209, India
Email: [email protected]
Abstract—In this paper we propose a methodology to convert a sentence of natural language into a conceptual graph, which is a graph representation for logic based on the semantic networks of Artificial Intelligence and the existential graph. A human being can express the same meaning in different forms of sentences. Although many natural language interfaces (NLIs) have been developed, they are domain specific and require huge customization for each new domain. With our approach a casual user gets a more flexible interface to communicate with the computer, and less customization is required to shift from one domain to another. First, a parsing tree is generated from the input sentence. From the parsing tree, each lexeme of the sentence is found and the basic concepts matching the ontology are sorted out. Then the relationships between them are found by consulting the domain ontology, and finally the conceptual graph is built.
I. INTRODUCTION
Nowadays it is a challenging task to develop a methodology by which a human being can communicate with a computer. A human communicates in natural language, but a computer can understand only a formalized data structure such as a conceptual graph. So, both can communicate with proper semantics if they share a common vocabulary or ontology and there exists a proper interface which can convert the natural language into a formalized data structure like a conceptual graph and vice versa.
A. Conceptual Graph
A conceptual graph (CG) [1] is a graph representation for logic based on the semantic networks of artificial intelligence and the existential graphs of Charles Sanders Peirce [2]. Many versions of conceptual graphs have been designed and implemented over the last thirty years. In the first published paper on CGs, [3] used them to represent the conceptual schemas used in database systems. The first book on CGs [1] applied them to a wide range of topics in artificial intelligence and computer science. [3] developed a version of conceptual graphs (CGs) as an intermediate language for mapping natural language to a relational database.
A conceptual graph is a bipartite graph in which concept vertices alternate with (conceptual) relation vertices, where edges connect relation vertices to concept vertices [4]. Each concept vertex, drawn as a box and labelled by a pair of a concept type and a concept referent, represents an entity whose type and referent are respectively defined by the concept type and the concept referent in the pair. Each relation vertex, drawn as a circle and labelled by a relation type, represents a relation of the entities represented by the concept vertices connected to it. Concepts connected to a relation are called neighbour concepts of the relation.
B. Ontology
An ontology [5] is a conceptualization of an application domain in a human-understandable and machine-readable form. It is used to reason about the properties of that domain and may be used to define that domain. By definition, an ontology defines the basic terms and relations comprising the vocabulary of a topic area as well as the rules for combining terms and relations to define extensions to the vocabulary [6], [7]. A survey of Web tools [8] showed that extraction ontologies provide resilience and scalability natively, whereas in other approaches for information extraction the problem of resilience and scalability still remains. One serious difficulty in creating an ontology manually is the need for a lot of time and effort, and the result might contain errors. It also requires a high degree of knowledge in both database theory and Perl regular-expression syntax. Professional groups are building metadata vocabularies or ontologies. Large hand-built ontologies exist, for example, for medical and geographic terminology. Researchers are working rapidly to build systems that automate extracting them from huge volumes of text. One more complex problem is that there is no formalized rule to define and build an ontology. In our work we have assumed that all lexemes such as nouns, verbs, and adjectives of our experimental domain are defined as a concept or instance of a concept in the domain ontology.
This paper is structured as follows. The Related Work and the Scope of the Work are presented in Sections II and III. Our proposed system overview and a demonstration through examples are given in Section IV. Case studies for different sentences are given in Section V. Finally, we conclude and draw future directions in Section VI.
II. RELATED WORK
A lot of methodologies have been developed to capture the meaning of a sentence by converting the natural language into its corresponding conceptual graph. But as there is no formalized rule to build an ontology, it is still a challenging task to convert the whole set of natural language into a formalized machine-understandable language.
In [9], the authors built the conceptual graph from natural language. But they have not defined any grammar or rule to generate the parsing tree of a sentence. They also have not provided any idea for keeping the same semantics by building a unique conceptual graph for different sentences with the same meaning.
In [10], [11], [12], the authors proposed a methodology to develop a tool that overcomes the negative effects of paraphrases by converting a complex sentence into a simple format. In this approach a complex-format query of the domain of interest which cannot be recognized by the system is rearranged into a simple machine-understandable format. But they lack the ability to convert different forms of a sentence into a single data structure like a conceptual graph by consulting its domain ontology.
In [13], [4], the authors built a query conceptual graph from a natural sentence by identifying the concepts, which incurs a high computational cost. As they have not parsed the sentence, the searching cost of the proper ontological concept is very high.
In [14], the author's approach for building a conceptual graph from natural language depends only on the semantics of verbs, which is not feasible for all cases. In many existing ontologies, nouns and verbs both play a very important role in capturing the semantics of the sentence.
In [15], [16], a natural language query is converted into the SPARQL query language and, by consulting its domain ontology, the system generates an answer. But this approach cannot always capture the semantics of the question, as the system does not consult the ontology concepts when the SPARQL query is built.
The work presented in [17], [18] is related to our proposed approach. Here, after syntactic parsing of the sentence, the system generates the ontological concepts. For unrecognized concepts the system generates some suggestions, and from the user's selection the system learns the ranking of the suggestions for the future. But this approach gives no notion of building the same conceptual graph from various forms of sentences with the same semantics.
III. SCOPE OF WORK
Nowadays people are trying to use the computer or a software tool like an agent for task delegation. So, the language of the human being must be converted into some formal data structure which can be understood by the computer. A number of methodologies have been developed to convert natural language into a conceptual graph. But the semantics of the language cannot be defined by the conceptual graph unless we use a common vocabulary between user and computer. This common vocabulary can be expressed in the form of an ontology. A casual user can phrase a single sentence in various ways even though all of them have the same semantics. So our approach is to present a methodology which can convert a natural language sentence into its corresponding conceptual graph by consulting its domain ontology. Thus both user and computer can understand the semantics of the conversation. Our approach builds a unique conceptual graph for various sentences in different forms but with the same semantics. Similarly, a single word has a number of synonyms, and all synonyms may not be defined in the domain ontology. So if a particular concept cannot be found in the ontology, our system must check whether any of its synonyms is defined in the domain ontology as a concept. The synonyms are identified from the WordNet [19] ontology.
IV. SYSTEM OVERVIEW
In this work we develop a methodology by which, from a natural language query or sentence, a conceptual graph is generated using the defined concepts and the relationships between the concepts of the domain ontology. Using this approach a casual user and a computer can communicate if they share a common ontology or vocabulary. We develop the methodology of converting a sentence into a conceptual graph in the following four steps:
1. Grammar for accepting natural language.
2. Parsing tree generation.
3. Recognizing the ontological concepts.
4. Creating the conceptual graph by using the ontological concepts.
A. Grammar for accepting natural language
In this section we define a grammar which can recognize simple, compound, and complex sentences. This grammar restricts the user to give the input sentence in a correct grammatical format. We define a context-free grammar G where
G = (VN, Σ, P, S)

where the non-terminals are VN = {S, VP, NP, PP, V, AV, NN, P, ADJ, CONJ, DW, D}. Here,

S = Sentence, VP = Verb Phrase,
NP = Noun Phrase, PP = Preposition Phrase,
AV = Auxiliary Verb, V = Verb,
NN = Noun, P = Preposition,
ADJ = Adjective, CONJ = Conjunction,
D = Determiner, DW = Depending word for a complex sentence.
Terminal: Σ = any kind of valid English word, i.e., noun, verb, auxiliary verb, adjective, preposition, or determiner, and also null (ε).
Production Rule: P is the set of productions of the grammar. Every sentence recognized by this grammar must follow these production rules. P consists of:
1) For simple sentences:
S ⇒ <NP><VP>
VP ⇒ <V><NP>
V ⇒ <AV><V><P> | <V><CONJ><V>
NP ⇒ <D><ADJ><NN><PP>
PP ⇒ <P><NP>
V ⇒ verb | ε
NN ⇒ noun | ε
AV ⇒ auxiliary-verb | ε
P ⇒ preposition | ε
ADJ ⇒ adjective | ε
D ⇒ determiner | ε
2) For compound sentences:
S ⇒ <S><CONJ><S>
as a compound sentence is built by joining two simple sentences with a conjunction.
3) For complex sentences:
S ⇒ <S><DW><S> | <DW><S><,><S>
A complex sentence is formed by a simple independent sentence and a dependent sentence, where every dependent sentence starts with a dependent word. A complex sentence is formed in two ways:
1. If the dependent sentence comes first, the sentence starts with a dependent word and the two sentences must be separated by a comma (,).
2. If the dependent sentence comes last, the dependent word must separate the two sentences.
In both complex and compound sentences, each individual simple sentence (S) follows all the production rules of the simple sentence.
Start symbol: A grammar has a single non-terminal (the start symbol) from which all sentences are derived. All sentences are derived from S by successive replacement using the productions of the grammar.
Null symbol: It is sometimes useful to specify that a symbol can be replaced by nothing at all. To indicate this, we use the null symbol ε, e.g., A ⇒ B | ε. In our defined grammar any non-terminal symbol except S, VP, and NP has a null production.
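For illustration only (this sketch is not taken from the paper), the grammar G can be written down directly as a Python data structure, with EPS standing for the null symbol ε and lowercase names standing for the terminal word categories:

# A sketch of the grammar G = (VN, Sigma, P, S) as plain data.
EPS = ""

productions = {
    "S":   [["NP", "VP"],                                # simple sentence
            ["S", "CONJ", "S"],                          # compound sentence
            ["S", "DW", "S"], ["DW", "S", ",", "S"]],    # complex sentence
    "VP":  [["V", "NP"]],
    "V":   [["AV", "V", "P"], ["V", "CONJ", "V"], ["verb"], [EPS]],
    "NP":  [["D", "ADJ", "NN", "PP"]],
    "PP":  [["P", "NP"]],
    "NN":  [["noun"], [EPS]],
    "AV":  [["auxiliary-verb"], [EPS]],
    "P":   [["preposition"], [EPS]],
    "ADJ": [["adjective"], [EPS]],
    "D":   [["determiner"], [EPS]],
}
START = "S"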
B. Parsing Tree Generation
Whenever a user gives a sentence as input to the system, a parsing tree is generated by using our defined grammar. From this parsing tree we can recognize the nouns, verbs, adjectives, and prepositions of the given input sentence. For example, if the input sentence is "John is going to Boston by bus", a parsing tree like Figure 1 is generated by the production rules shown below:
Fig. 1. Parse tree for simple sentence
Fig. 2. Parse tree for compound sentence
S ⇒ <NP><VP>
S ⇒ <DET><ADJ><NN><PP><VP>
S ⇒ <NN><V><NP>
S ⇒ <NN><AV><V><P><NP>
S ⇒ John is going to <DET><ADJ><NN><PP>
S ⇒ John is going to <NN><P><NP>
S ⇒ John is going to Boston <P><DET><ADJ><NN><PP>
S ⇒ John is going to Boston by <NN>
S ⇒ John is going to Boston by bus

Now if a user gives a compound sentence as an input,
the generation of the parse tree starts from the production for compound sentences, shown as
S ⇒ <S><CONJ><S>
Therefore, if a compound sentence like "The beautiful apple is red but it is rotten." comes as input, the generated parse tree must be like Figure 2, where each simple sentence is generated by the production rules of the simple sentence.
Now if a complex sentence comes as input to the system, the generation of the parse tree follows the production rule of complex sentences, where both the dependent and the independent sentence must follow the production rules of the simple sentence. The production starts from the basic production rule of complex sentences, shown as:
S ⇒ <S><DW><S> | <DW><S><,><S>
So, if a complex sentence like "After the student go to the class, he can give attendance." comes as input to the system, the generated parsing tree must be like Figure 3.
From this parsing tree we can easily identify the type of the sentence and each part of speech (POS), such as nouns, verbs, adjectives, prepositions, determiners, etc. In the next section we deal with these identified lexemes.
Fig. 3. Parse tree for complex sentence
Fig. 4. Flowchart for finding the ontological concepts of each lexeme of a sentence
C. Recognizing the Ontological Concepts
This step involves finding the ontological concepts used in the given input sentence. Our general assumption is that each lexeme in the sentence is represented by a separate concept in the ontology; therefore all nouns, adjectives, verbs, and pronouns are represented by identified concepts, while the determiners, numbers, prepositions, and conjunctions are used as referents of the relevant concepts. Here we define an algorithm (Figure 4) for finding the ontological concepts and instances used in a given input sentence by syntactic mapping of each lexeme to the predefined ontological concepts.
As we have assumed that the nouns, verbs, and adjectives of a particular domain are defined as concepts in its domain ontology, we identify the nouns, verbs, and adjectives from the parsing tree and put them into the identified concept (IC) list. For each identified concept there may be 4 cases:
1. The identified concept is identical to some domain ontology concept (OC).
2. The IC cannot be mapped to any OC, but a synonym of the IC is defined as an OC in the ontology.
3. The IC is defined as an instance or individual of a concept in the ontology.
4. The IC is not in the domain of the experiment, so it cannot be recognized by the domain ontology.
In the next step the identified concepts must be converted into ontological concepts, as the computer can understand only the vocabulary of the ontology. So, for each identified concept of the IC list, a different operation is performed for each of the above 4 cases (see the sketch after this list):
1. If the IC is syntactically equal to an OC, this OC is added to the ontological concepts list.
2. If the IC cannot be syntactically mapped to any OC, the system checks all the synonyms of the IC from WordNet, where WordNet (Fellbaum, 1998) is an English lexical database containing about 120,000 entries of nouns, verbs, adjectives, and adverbs, hierarchically organized in synonym groups (called synsets) and linked with relations such as hypernym, hyponym, holonym, and others. For each synonym concept (SC) of the corresponding IC, the system tries to syntactically map the SC to an OC. If a syntactically mapped OC is found, that OC is added to the ontological concepts list.
3. If the IC is an instance or individual of an ontological concept, the system finds the corresponding OC of the instance and adds that OC to the ontological concepts list.
4. If the IC is not in the domain of the experiment, the system continues the loop with the next IC from the identified concepts list.
So, after obtaining the ontological concepts list, the system builds the conceptual graph as described in the next section.
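The four cases above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: ontology_concepts (a set of concept names) and instance_of (an instance-to-concept map) are hypothetical structures, and synonym lookup goes through NLTK's WordNet corpus interface.

# Sketch of the concept-recognition step over the identified concepts (ICs).
from nltk.corpus import wordnet as wn

def synonyms(word):
    # All WordNet lemma names for the word, lowercased.
    return {l.name().lower() for s in wn.synsets(word) for l in s.lemmas()}

def recognize(identified, ontology_concepts, instance_of):
    concepts = {c.lower(): c for c in ontology_concepts}
    oc_list = []
    for ic in identified:
        key = ic.lower()
        if key in concepts:                     # case 1: direct match
            oc_list.append(concepts[key])
            continue
        syn = next((s for s in synonyms(key) if s in concepts), None)
        if syn:                                 # case 2: synonym match
            oc_list.append(concepts[syn])
        elif key in instance_of:                # case 3: instance of a concept
            oc_list.append(instance_of[key] + ":" + ic)
        # case 4: out of the domain ontology -- skipped
    return oc_list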
D. Creating Conceptual Graph from Ontological Concepts List
In this section we propose an algorithm for generating the conceptual graph from the generated ontological concepts list, which consists of the following four steps:
Step 1: If the same concept occurs twice in the concept list with the same instance name or with no instance, we keep one concept with its instance name and discard the other one. But if the same concept comes with different instance names, we keep both concepts.
Let us consider the sentence "India is a large country" as input. After parsing the sentence, three concepts must be added to the ontological concepts list: 'Country:india', 'large', and 'Country:*'. Then the system should merge the two country concepts into one and update the ontological concepts to 'Country:india' and 'large'.
But if the sentence "John is playing with Bob." comes as input, after parsing the sentence three concepts, 'Person:John', 'Play', and 'Person:Bob', must be added to the ontological concepts list. Though two concepts here have the same name, Person, we keep both, as the two concepts have different instance names.
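Step 1 can be sketched as follows, assuming concepts are written as "Concept" or "Concept:instance" strings, with '*' marking a bare concept (our notation for this sketch, not the paper's):

def merge_concepts(concepts):
    # Group instance names under each concept name.
    instances = {}
    for c in concepts:
        name, _, inst = c.partition(":")
        instances.setdefault(name, set()).add(inst or "*")
    merged = []
    for name, insts in instances.items():
        if len(insts) > 1:
            insts.discard("*")   # bare 'Country:*' absorbed by 'Country:india'
        for inst in sorted(insts):
            merged.append(name if inst == "*" else name + ":" + inst)
    return merged

merge_concepts(["Country:india", "large", "Country:*"])
# -> ['Country:india', 'large']
merge_concepts(["Person:John", "Play", "Person:Bob"])
# -> ['Person:Bob', 'Person:John', 'Play']  (both Person instances kept)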
Fig. 5. Forming of subconceptual graphs: [CAT]→(Agent)→[Sitting] and [Sitting]→(Location)→[MAT]

Fig. 6. Forming of the desired final conceptual graph: [CAT]→(Agent)→[Sitting]→(Location)→[MAT]
Step 2: As a conceptual graph consists of concepts and relationships between the concepts, the system finds in the domain ontology the exact relationship between the two concepts of each consecutive pair of concepts from the ontological concepts list.
As an example, if the sentence "Cat is sitting on the mat." comes as input, by parsing the sentence three concepts, 'CAT', 'Sitting', and 'MAT', must be added to the ontological concepts list. Now in this step the system finds the relationship between each pair of consecutive concepts in the domain ontology. Let the relationships be defined in the ontology such that 'Agent' is the relationship between 'CAT' and 'Sitting', and 'Location' is the relationship between 'Sitting' and 'MAT'.
Step 3: Make a subconceptual graph for each consecutive pair of concepts of the ontological concepts list by connecting the pair with the identified relationship between those concepts defined in the ontology.
So, in our previous example of "Cat is sitting on the mat", there are two pairs of consecutive concepts. Therefore two subconceptual graphs must be formed, as shown in Figure 5.
Step 4: Merge the subconceptual graphs by their common concept names and develop the final desired conceptual graph, which can be recognized by any system that has the common domain ontology.
So, in the previous example of "Cat is sitting on the mat", the two subconceptual graphs must be merged by their common concept 'Sitting' to build the desired final conceptual graph, which is shown in Figure 6.
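Steps 2-4 amount to pairing consecutive concepts, labelling each pair with the relation found in the ontology, and letting shared concept names merge the fragments. A minimal sketch, with the ontology relation lookup stubbed out as a hypothetical rel(c1, c2) function:

def build_cg(concepts, rel):
    # Steps 2-3: one labelled edge per consecutive pair of concepts.
    edges = []
    for c1, c2 in zip(concepts, concepts[1:]):
        r = rel(c1, c2)
        if r is not None:
            edges.append((c1, r, c2))
    # Step 4: shared concept names already join the subgraphs.
    return edges

# Usage for "Cat is sitting on the mat":
relations = {("CAT", "Sitting"): "Agent", ("Sitting", "MAT"): "Location"}
build_cg(["CAT", "Sitting", "MAT"], lambda a, b: relations.get((a, b)))
# -> [('CAT', 'Agent', 'Sitting'), ('Sitting', 'Location', 'MAT')]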
V. CASE STUDY
In this work our basic goal is to develop a conceptual graph from a natural language sentence given by a casual user while keeping the actual semantics intact, as a computer can understand the semantics of a sentence when it is represented as a conceptual graph. Now the main problem is that a single sentence can be expressed in various ways even though all of them have the same semantics. So the conceptual graph for every sentence with a unique semantics must be identical. We present here some examples of this problem with three types of sentences; every time the same conceptual graph is formed.
Fig. 7. Conceptual graph for simple sentence: [APPLE]→(hasColor)→[Color:Red]
1) Case 1: For simple sentences: A casual user can give a simple sentence with the same semantics but in various formats.
1. The apple is red.
2. The color of the apple is red.
3. The apple is of red color.
These three simple sentences have the same semantics but different forms. After parsing the first sentence we identify two concepts, 'Apple' and 'Red', where 'red' is an instance of the ontological concept 'Color' and 'Apple' is another ontological concept. So the OC list contains the concepts 'Apple' and 'Color:red'. For the second and third sentences the identified concepts are 'Color', 'Apple', and 'red', where 'red' is an instance of the 'Color' concept. So the concepts 'Color:*' and 'Color:red' must be merged into a single concept: we take the concept 'Color:red' and discard the 'Color:*' concept. Finally the OC list contains the two concepts 'Apple' and 'Color:red'. So, as the OC lists contain equal elements, the developed conceptual graph is also the same for all three sentences, which is shown in Figure 7.
2) Case 2: For compound sentences: A casual user can give a compound sentence with the same semantics but in different formats.
1. John is happy and he is lucky.
2. John is lucky and he is glad too.
These two compound sentences have the same semantics but different forms. After parsing the first sentence we identify that there are two simple sentences joined by 'and'. From the first simple sentence we identify two concepts, 'John' and 'Happy', where 'John' is an instance of the ontological concept 'Person' and 'Happy' is another ontological concept. So the OC list contains the concepts 'Happy' and 'Person:John'. For the second simple sentence we identify two concepts: 'he', which represents 'John', an instance of the ontological concept 'Person', and another ontological concept, 'Lucky'. So after joining the concepts by their defined relationships, we get the conceptual graph represented in Figure 8. Here the two individual conceptual graphs are joined by the conjunction 'AND'.

For the second sentence, it is also a collection of two simple sentences joined by 'and'. For the first simple sentence, the identified concepts are 'John', which is an instance of the ontological concept 'Person', and another ontological concept, 'Lucky'. For the second simple sentence we identify the concept 'he', which represents John, an instance of the ontological concept 'Person'. But we cannot map the identified concept 'Glad' to any ontological concept. So the system checks WordNet for synonyms of glad and finds that glad is identical to 'Happy', which is a domain ontological concept. So from the ontological concepts list the system builds the conceptual graph, which is also identical to Figure 8.
Fig. 8. Conceptual graph for compound sentence: [Person:John]→(hasAttribute)→[Happy] and [Person:John]→(hasAttribute)→[Lucky], joined by 'AND'
The dotted line represents that the two individual conceptual graphs are joined by the same ontological concept 'Person:John'.
3) Case 3: For complex sentences: A casual user can give a complex sentence with the same semantics but in various formats.
1. Because Ram forgot the time, he missed the test.
2. Ram missed the test as he forgot the time.
These two complex sentences have the same semantics but different forms. After parsing the first sentence, we identify that there are two simple sentences: the first one is dependent and the second one is an independent simple sentence. From the first, dependent sentence we identify three concepts, 'Ram', 'Forget', and 'Time', where 'Ram' is an instance of the ontological concept 'Person' and 'Forget' and 'Time' are other ontological concepts. So the OC list contains the concepts 'Person:Ram', 'Forget', and 'Time'. For the independent simple sentence we identify three concepts: 'he', which represents 'Ram', an instance of the ontological concept 'Person', and the other ontological concepts 'Miss' and 'Exam'. So after joining the concepts by their defined relationships, we get the conceptual graph represented in Figure 9. The dotted line represents that the two individual conceptual graphs are joined by the same ontological concept 'Person:Ram'.

For the second sentence, it is also a collection of two sentences: the first one is an independent sentence and the second is a dependent sentence. For the first, independent simple sentence the identified concepts are 'Ram', which is an instance of the ontological concept 'Person', and the other concepts 'Miss' and 'Test'. Now 'Miss' is a defined ontological concept, but 'Test' cannot be mapped to any concept. So the system checks the synonyms of 'Test' from WordNet. It finds that a synonym of 'Test' is 'Exam' and that it can be mapped to the ontological concept 'Exam'. So 'Exam' must be added to the OC list. The final OC list contains the three concepts 'Person:Ram', 'Miss', and 'Exam'. For the dependent sentence we identify the concept 'he', which represents 'Ram', an instance of the ontological concept 'Person', while 'Forget' and 'Time' are two defined ontological concepts. So from the ontological concepts list the system builds the conceptual graph, which is also identical to Figure 9.
Fig. 9. Conceptual graph for complex sentence: [Person:Ram]→(agent)→[Forget]→(clock)→[Time] and [Person:Ram]→(agent)→[Miss]→(topic)→[Exam]

VI. CONCLUSION

In this work, we have defined a formal methodology to convert a natural language sentence into its corresponding conceptual graph form by consulting its common domain ontology. Thus a casual user and a computer can interact with each other while keeping the semantics of the communication. But this approach cannot yet deal with complex sentences properly. Issues related to complex sentences are how to break them into simple sentences, whether the simple sentences are causally related or not, and, if they are causally related, which sentence must be executed first. So one future prospect of the work is to define a formal method by which the problem of representing complex sentences can be overcome. Another prospect of the work is to define a methodology which can deal with any kind of domain ontology, where all the verbs, nouns, and adjectives of a particular domain may not be defined as a concept or instance of a concept. In other words, a methodology has to be developed which is independent of how the ontology is defined.
ACKNOWLEDGMENT
We are really grateful to the Information Technology department of NIT Durgapur for giving us a perfect environment and all the facilities to do this work.
REFERENCES
[1] J. F. Sowa, Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley, Reading (1984).
[2] F. V. Harmelen, V. Lifschitz, and B. Porter, Handbook of Knowledge Representation, Elsevier, 2008, pp. 213-237.
[3] J. F. Sowa, "Conceptual graphs for a database interface," IBM Journal of Research and Development 20:4, 336-357 (1976).
[4] T. H. Cao, T. D. Cao, and T. L. Tran, "A Robust Ontology-Based Method for Translating Natural Language Queries to Conceptual Graphs," in: Domingue, J., Anutariya, C. (eds.) ASWC 2008, LNCS, vol. 5367, pp. 479-492, Springer, Heidelberg (2008).
[5] C. Snae and M. Brueckner, "Ontology-Driven E-Learning System Based on Roles and Activities for Thai Learning Environment," Interdisciplinary Journal of Knowledge and Learning Objects, Volume 3, 2007.
[6] N. Gibbins, S. Harris, and N. Shadbolt, "Agent-based Semantic Web Services," May 20-24, 2003, Budapest, Hungary, ACM 2003.
[7] A. G. Perez, M. F. Lopez, and O. Corcho, Ontological Engineering, Springer.
[8] A. H. F. Laender, B. A. Ribeiro-Neto, A. S. da Silva, and J. S. Teixeira, "A Brief Survey of Web Data Extraction Tools," ACM SIGMOD Record, v. 31, n. 2, June 2002.
[9] W. Salloum, "A Question Answering System based on Conceptual Graph Formalism," IEEE Computer Society Press, New York, 2009.
[10] D. Mollá and M. van Zaanen, "Learning of Graph Rules for Question Answering," Proc. ALTW05, Sydney, December 2005.
[11] F. Rinaldi, J. Dowdall, K. Kaljurand, M. Hess, and D. Mollá, "Exploiting Paraphrases in a Question Answering System," in Proc. Workshop on Paraphrasing at ACL 2003, 2003.
[12] F. Duclaye, F. Yvon, and O. Collin, "Learning Paraphrases to Improve a Question-Answering System," in Proceedings of the 10th Conference of EACL, Workshop on Natural Language Processing for Question-Answering, 2003.
[13] T. H. Cao and A. H. Mai, "Ontology-Based Understanding of Natural Language Queries Using Nested Conceptual Graphs," Lecture Notes in Computer Science, Springer-Verlag, Volume 6208/2010, pp. 70-83 (2010).
[14] S. Hensman, "Construction of Conceptual Graph Representation of Texts," in Proceedings of the Student Research Workshop at HLT-NAACL, Boston, 2004, pp. 49-54.
[15] D. Damljanovic, V. Tablan, and K. Bontcheva, "A text-based query interface to OWL ontologies," in: 6th Language Resources and Evaluation Conference (LREC), ELRA, Marrakech (May 2008).
[16] V. Tablan, D. Damljanovic, and K. Bontcheva, "A natural language query interface to structured information," in: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008, LNCS, vol. 5021, pp. 361-375, Springer, Heidelberg (2008).
[17] D. Damljanovic, M. Agatonovic, and H. Cunningham, "Natural Language Interfaces to Ontologies: Combining Syntactic Analysis and Ontology-Based Lookup through the User Interaction," in: Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), Lecture Notes in Computer Science, Springer-Verlag, Heraklion, Greece (June 2010).
[18] D. Damljanovic, M. Agatonovic, and H. Cunningham, "Identification of the Question Focus: Combining Syntactic Analysis and Ontology-based Lookup through the User Interaction," in: 7th Language Resources and Evaluation Conference (LREC), ELRA, La Valletta (May 2010).
[19] G. A. Miller, "WordNet: An On-line Lexical Database," International Journal of Lexicography, Vol. 3, No. 4, 1990.
Web Services Privacy Measurement Based on Privacy
Policy and Sensitivity Level of Personal Information
Punyaphat Chaiwongsa and Twittie Senivongse
Computer Science Program, Department of Computer Engineering
Faculty of Engineering, Chulalongkorn University
Bangkok, Thailand
[email protected], [email protected]
Abstract—Web services technology has been in the mainstream of
today’s software development. Software designers can select Web
services with certain functionality and use or compose them in
their applications with ease and flexibility. To distinguish
between different services with similar functionality, the
designers consider quality of service. Privacy is one aspect of
quality that is largely addressed since services may require
service users to reveal personal information. A service should
respect the privacy of the users by requiring only the information
that is necessary for its processing as well as handling personal
information in a correct manner. This paper presents a privacy
measurement model for service users to determine privacy
quality of a Web service. The model combines two aspects of
privacy. That is, it considers the degree of privacy principles
compliance of the service as well as the sensitivity level of user
information which the service requires. The service which
complies with the privacy principles and requires less sensitive
information would be of high quality with regard to privacy. In
addition, the service WSDL can be augmented with semantic
annotation using SAWSDL. The annotation specifies the
semantics of the user information required by the service, and
this can help automate privacy measurement. We also present a
measurement tool and an example of its application.
Keywords-privacy; privacy policy; personal information;
measurement; Web services; ontology
I. INTRODUCTION
Web services technology has been in the mainstream of software development since it allows software designers to use Web services with certain functionality in their applications with ease and flexibility. Software designers study service information that is published on service providers’ Web sites or through service directories and select the services that have the functionality as required by the application requirements. For those with similar functionality, different aspects of quality of service (QoS) are usually considered to distinguish them.
Privacy is one aspect of quality that is largely addressed since Web services may require service users to reveal personal information. An online shopping Web service may ask a user to give personal information such as name, address, phone number, and credit card number when buying products, and a student registration Web service of a university would also ask for students’ personal information to maintain student records. A Web service should respect the privacy of service users by
requiring only the information that is necessary for its processing as well as handling personal information in a correct manner. From a view of a service user, proper handling of the disclosed personal information is highly expected. From a view of a software designer who is developing a service-based application, it is desirable to select a Web service with privacy quality into the application since the privacy quality of the service contributes to that of the application. The application itself should also respect the privacy of the application users.
In this paper, we present a privacy measurement model for service users to determine privacy quality of a Web service. The model combines two aspects of privacy. That is, it considers the degree of privacy principles compliance of the service as well as the sensitivity level of user information which the service requires. The model follows the approach by Yu et al. [1] which assesses if the privacy policy of a Web service complies with a set of privacy principles. We enhance it by also considering sensitivity level of users’ personal information. The approach by Jang and Yoo [2] is adapted to determine sensitivity level of personal information that is exchanged with the service. According to our privacy measurement model, a service which complies with the privacy principles and requires less sensitive information would be of high quality with regard to privacy. In addition, we develop a supporting tool for the model. The tool relies on augmenting WSDL data elements of the service with semantic annotation using the SAWSDL mechanism [3]. The annotation specifies the meaning of WSDL data elements based on personal information ontology, i.e., a semantic term associated with a data element indicates which personal information the data element represents. Semantic annotation is useful for disambiguating user information that may be named differently by different Web services. As a result, it helps automate privacy measurement and facilitates the comparison of privacy quality of different Web services. Combining these two aspects of privacy, the model is considered practical for service users since the assessment is based on the privacy policy and service WSDL which can be easily accessed.
Section II of this paper discusses related work. Section III describes an assessment of privacy policy of a Web service based on privacy principles and Section IV presents measurement of sensitivity level of personal information. The privacy measurement model combining these two aspects of
privacy is proposed in Section V. The supporting tool is described in Section VI and the paper concludes in Section VII.
II. RELATED WORK
W3C has stated in the Web Services Architecture Requirements [4] that Web services architecture must enable privacy protection for service consumers. Web services must express privacy policy statements which comply with the Platform for Privacy Preferences (P3P), and the policy statements must be accessible to service consumers. Service providers generally publish privacy policy statements which follow privacy protection guidelines proposed by governmental or international organizations, and these statements are the basis for privacy protection measurement.
A. Related Work in Privacy Measurement Based on Privacy
Policy
Following Canadian Standards Association Privacy Principles, Yee [5] specifies how to define privacy policy, and a method to measure how well a service protects user privacy based on measurement of violations of the user’s privacy policy. The work is extended to consider compliances between E-service provider privacy policy and user privacy policy using a privacy policy agreement checker [6]. Similarly, Xu et al. [7] provide for a composite service and its user a policy compliance checker which considers sensitivity levels of personal data that flow in the service together with trust levels and data flow permission given to the services in the composition. Tavakolan et al. [8] propose a model for privacy policy and a method to match and rank privacy policies of different services with user’s privacy requirements. We are particularly interested in the work by Yu et al. [1] which follows 10 privacy principles defined in the Australia National Privacy Principles (Privacy Amendment Act 2000). The work proposes a checklist to rate privacy protection of a Web service with regard to each privacy principle. A privacy policy checker which can be plugged into the Web service application is also developed to check for privacy principles compliance.
B. Related Work in Privacy Measurement Based on
Sensitivity Level of Personal Information
Yu et al. [9] present a QoS model to derive privacy risk in service composition. The privacy risk is computed using the percentage of private data the users have to release to the services. The users can define weights that quantify a potential damage if the private data leak. Hewett and Kijsanayothin [10] propose privacy-aware service composition which finds an executable service chain that satisfies a given composite service I/O requirements with minimum number of services and minimum information leakage. To quantify information leakage, sensitivity levels are assigned to different types of personal information that flows in the composition. The composition also complies with users’ privacy preferences and providers’ trust. We are particularly interested in the comprehensive view of privacy sensitivity level of Jang and Yoo [2]. They address four factors of sensitivity, i.e. degree of conjunction, principle of identity, principle of privacy, and value of analogism. They also give a guideline to evaluate these sensitivity factors which we can adapt for the work.
III. ASSESSMENT OF WEB SERVICE PRIVACY POLICY
For the privacy policy aspect, we simply adopt a privacy
principles compliance assessment by Yu et al. [1]. According
to the Australia National Privacy Principles (Privacy
Amendment Act 2000), there are 10 privacy principles for
proper management of personal information. For each
principle, Yu et al. list a number of criteria to rate privacy
compliance of a service. For full detail of the compliance
checklist, see [1]. Here we show a small part of the checklist
through our supporting tool in Fig. 1. For instance, there are 3
criteria that a service has to follow to comply with the
collection principle, i.e., the privacy policy statements must
state (1) the kind of data being collected, (2) the method of
data collection, and (3) the purpose of data collection. The
service user can check with the published privacy policy how
many of these criteria the service satisfies, and then give the
compliance rating score. Thus for the collection principle, the
maximum rating is 3; the rating ranges between 0-3. The
service user can also define a weighted score for each privacy
principle denoting the relative importance of each principle.
The total privacy principle compliance (Pcom) score of a
service is computed by (1) [1]:
Pcom = Σ(i=1..10) (ri / rimax) × pi    (1)

where

ri = rating for principle i assessed by the service user,
rimax = maximum rating for principle i,
pi = weighted score for principle i assigned by the service user, with Σ(i=1..10) pi = 100.

Pcom ranges between 0-100. Instead we will later use a normalized NPcom, as in (2), which ranges between 0-1 in our privacy measurement model in Section V:

NPcom = Pcom / 100 = Σ(i=1..10) (ri / rimax) × pi / 100.    (2)
As an example, a user of a Register service of a university, which registers student information, rates and gives a weight for each privacy principle as in Table I. Pcom of this service then is 87.08 and NPcom is 0.87.
Figure 1. Assessing privacy principles compliance using our tool.
TABLE I. EXAMPLE OF PRIVACY PRINCIPLES COMPLIANCE RATING

No.  Privacy Principle        Rating ri  Max rating rimax  Weight pi  Score (ri/rimax)*pi
 1   Collection                   2             3              20          13.33
 2   Use and Disclosure           2             2              10          10
 3   Data Quality                 2             2               5           5
 4   Data Security                2             2              10          10
 5   Openness                     2             2               5           5
 6   Access and Correction        3             4               5           3.75
 7   Identifiers                  2             2               2           2
 8   Anonymity                    0             1               5           0
 9   Transborder Data Flows       2             2               8           8
10   Sensitive Information        1             1              30          30
     Total                                                    100          Pcom = 87.08, NPcom = 0.87
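As a cross-check of (1) and (2), the Table I figures can be recomputed with a short Python sketch (the rating triples are hard-coded from the table):

# (rating ri, max rating rimax, weight pi) per privacy principle, from Table I.
ratings = [(2, 3, 20), (2, 2, 10), (2, 2, 5), (2, 2, 10), (2, 2, 5),
           (3, 4, 5), (2, 2, 2), (0, 1, 5), (2, 2, 8), (1, 1, 30)]

p_com = sum(r / r_max * p for r, r_max, p in ratings)   # (1): 87.08
np_com = p_com / 100                                    # (2): 0.87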
IV. ASSESSMENT OF SENSITIVITY LEVEL OF PERSONAL INFORMATION
The motivation for assessing the sensitivity level of personal information is that, among different Web services with similar functionality, a service user would prefer one to which disclosure of personal information is limited. It is therefore desirable that fewer personal data items are required by the service and that the data items that are required are also less sensitive. We adapt the approach by Jang and Yoo [2], which analyzes the sensitivity level of personal information based on personal information classification.
A. Formal Concept Analysis and Ontology of Personal Information
Jang and Yoo represent personal information classification using a formal concept analysis (FCA) [11]. The formal definition of a data group, i.e., personal information in this case, is given as
DG = (G, N, R)
where G is a finite set of concepts and can be described as G = {g1, g2, ..., gn},

N is a finite set of attributes which describe the concepts and can be described as N = {n1, n2, ..., nm}, and

R is a binary relation between G and N, i.e., R ⊆ G × N. For example, g1 R n1, or (g1, n1) ∈ R, represents that the concept g1 has an attribute n1.
The formal concepts can also be described using a cross table. We extend the cross table of [2] to create one as shown in Table II. Here personal information is classified into 7 concepts, i.e., G = {Basic, Career, ..., Finance}, and there are 37 personal information attributes, i.e., N = {BirthPlace, BirthDay, ..., CreditcardNumber}. The cross table shows the relation, marked by an x, between each concept and the attributes of the concept. For example, BirthPlace belongs in the Basic and Private concepts, while the Basic concept has 15 attributes, i.e., BirthPlace, BirthDay, ..., DrivingLicenseNumber.
For a Web service, its WSDL interface document defines
what users’ personal information is required for the processing
of the service. However, different services with similar
functionality may name the exchanged data elements
differently. A service, for example, may require a data element
called Address whereas another requires Addr. In order to infer
that the two services require the same personal data, both
Address and Addr elements in the two WSDLs can be
annotated with the same semantic information. To
disambiguate user information that may be named differently
by different services, we augment WSDL data elements of a
service with semantic annotation using the SAWSDL
mechanism [3]. The annotation specifies the meaning of
WSDL data elements based on personal information ontology.
We represent the personal information concepts and attributes
in the cross table (Table II) as an OWL-based personal
information ontology as in Fig. 2. The attribute
sawsdl:modelReference is associated with a data element in
the WSDL document to reference to a semantic term in the
ontology. In the WSDL of the Register service in Fig. 3, the
meaning of the data element called Name is the term
PersonName in the ontology in Fig. 2, etc. Semantic
annotation is useful for automating privacy measurement and
facilitates comparison of privacy quality of different services.
TABLE II. CROSS TABLE OF PERSONAL INFORMATION, ADAPTED FROM [2]
Figure 2. Part of personal information ontology.
Figure 3. Part of semantics-annotated WSDL document.
B. Sensitivity Level of Personal Information
Jang and Yoo [2] address four factors of privacy sensitivity for personal information, i.e. degree of conjunction, principle of identity, principle of privacy, and value of analogism. They also give a guideline to evaluate these sensitivity factors which we can adapt for the work. We define the formula to compute the scores of these factors based on the cross table (Table II) as follows.
1) Degree of conjunction of an attribute (personal data
item) n is derived from the number of concepts which the
attribute n describes. This means n is associated with these
concepts and the disclosure of n may lead to other information
belonging in these concepts. The degree of conjunction of n or
DC(n) is determined by (3):
DC(n) = (number of concepts in which n belongs) / (total number of concepts).    (3)
For example, from Table II, PersonName is associated with 5 out of 7 concepts, i.e., Basic, Career, Health, School, and Finance. Therefore DC(PersonName) = 5/7.
2) Principle of identity of an attribute n indicates that n is
an identity attribute of the concept with which it is associated,
i.e., n is used as a key information to access other attributes in
that concept. Disclosure of n may then lead to more problems
than disclosure of other attributes. The principle of identity of
n or IA(n) is determined by (4):
IA(n) = 0, if n is not an identity attribute;
IA(n) = (number of attributes in the concepts for which n is the identity attribute) / (total number of attributes), if n is an identity attribute.    (4)
For example, from Table II, StudentID is an identity
attribute (i.e., it belongs in the concept Identity) for the concept School. There are 10 attributes associated with School and there are 37 attributes in total. Therefore IA(StudentID) = 10/37. For HomeAddress, it is not an identity attribute and IA(HomeAddress) = 0.
3) Principle of privacy of an attribute n indicates that n is
private information. Note that this is subjective to the service
users, e.g., some users may consider Age as private
information whereas others may not. We let the service users
customize the cross table by specifying which attributes are
considered private, i.e., belong in the concept Private. The
principle of privacy of n or PA(n) is determined by (5):
PA(n) = 0, if n does not belong in the concept Private;
PA(n) = 1, if n belongs in the concept Private.    (5)
For example, from Table II, CellphoneNumber is private and PA(CellphoneNumber) = 1, whereas PersonalEmailAddress is not and PA(PersonalEmailAddress) = 0.
4) Value of analogism of an attribute n indicates that n can
be used to derive other attributes. This means the knowledge
of n can also reveal other personal information. The value of
analogism of n or AA(n) is determined by (6):
AA(n) = 0, if n cannot derive other attributes;
AA(n) = 1, if n can derive other attributes.    (6)
The analogy between attributes has to be defined and
associated with the cross table and the personal information ontology. For example, SocialSecurityNumber can derive other attribute such as BirthPlace, and AA(SocialSecurityNumber) = 1, whereas Age cannot and AA(Age) = 0.
<xs:element name="RegisterRequest">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Name" type="xs:string"
        sawsdl:modelReference="http://localhost/ws/ontology/PI#PersonName"/>
      <xs:element name="Address" type="xs:string"
        sawsdl:modelReference="http://localhost/ws/ontology/PI#HomeAddress"/>
      <xs:element name="MobilephoneNo" type="xs:string"
        sawsdl:modelReference="http://localhost/ws/ontology/PI#CellphoneNumber"/>
      <xs:element name="Email" type="xs:string"
        sawsdl:modelReference="http://localhost/ws/ontology/PI#PersonalEmailAddress"/>
      <xs:element name="StdID" type="xs:string"
        sawsdl:modelReference="http://localhost/ws/ontology/PI#StudentID"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
All four sensitivity factor scores range between 0-1. Based on these scores, Jang and Yoo suggest that the sensitivity level of an attribute n or SL(n) be determined by (7) [2]:
SL(n) = DC(n) + IA(n) + PA(n) + AA(n). (7)
We propose to compute the sensitivity level of all personal information exchanged with a Web service using (8):

SLws = Σ(i=1..k) SLi    (8)

where

k = number of exchanged personal data elements,
SLi = sensitivity level of personal data element i computed by (7).

We will later use a normalized NSLws, as in (9), which ranges between 0-1 in our privacy measurement model in Section V:

NSLws = SLws / 4k = Σ(i=1..k) SLi / 4k.    (9)
As an example, suppose a Register service of a university requires the following personal information: Name, Address, MobilephoneNo, Email, and StdID. In the WSDL in Fig. 3, these data elements are annotated with semantic terms described in the personal information ontology in Fig. 2. We can determine the sensitivity level of each data element by calculating the sensitivity level of the associated semantic term using (7), and the total sensitivity level of all personal data required by the service using (8) and (9) as in Table III.
V. WEB SERVICES PRIVACY MEASUREMENT MODEL
We combine the two privacy aspects in Sections III and IV
into a privacy measurement model. The normalized privacy
principles compliance NPcom of a service is a positive aspect.
A service user would prefer a service with high compliance
rating. The service provider is encouraged to follow privacy
principles, provide proper management of users’ personal
information, and publish a clear privacy policy that can
facilitate compliance rating by the service users. On the
contrary, the normalized sensitivity level NSLws for the service
is a negative aspect. Using a service which exchanges highly
sensitive personal data could mean high risk of privacy
violation if these data are disclosed or not protected properly.
TABLE III. EXAMPLE OF SENSITIVITY LEVEL MEASUREMENT

Data Element   Semantic Annotation n   DC(n) (3)  IA(n) (4)  PA(n) (5)  AA(n) (6)  SL(n) (7)
Name           PersonName                 5/7         0          0          0         0.71
Address        HomeAddress                1/7         0          0          0         0.14
MobilephoneNo  CellphoneNumber            6/7         0          1          0         1.86
Email          PersonalEmailAddress       3/7         0          0          0         0.43
StdID          StudentID                  2/7       10/37        0          0         0.56
Total                                                            SLws = 3.7, NSLws = 3.7/(4*5) = 0.19
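Likewise, Table III and equations (7)-(9) can be cross-checked with a short Python sketch; the (DC, IA, PA, AA) factor tuples are copied from the table:

# (DC, IA, PA, AA) per annotated data element, from Table III.
factors = {
    "PersonName":           (5/7, 0,     0, 0),
    "HomeAddress":          (1/7, 0,     0, 0),
    "CellphoneNumber":      (6/7, 0,     1, 0),
    "PersonalEmailAddress": (3/7, 0,     0, 0),
    "StudentID":            (2/7, 10/37, 0, 0),
}

sl = {n: sum(f) for n, f in factors.items()}    # (7): SL(n) per attribute
sl_ws = sum(sl.values())                        # (8): about 3.70
nsl_ws = sl_ws / (4 * len(factors))             # (9): about 0.19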
The privacy quality P of a service is computed by (10).
The service user can also define weighted scores α and β to
denote relative importance of the two privacy aspects; α and β
are in [0, 1] and α + β = 1. The service which complies with
the privacy principles and requires less sensitive information
would be of high quality with regard to privacy.
P = α × NPcom + β × (1 − NSLws).    (10)
As an example, given equal weights to the two privacy
aspects and the assessment in Tables II and III, the privacy
quality of the Register service is
P = (0.5)(0.87) + (0.5)(1 - 0.19)
= 0.435 + 0.405 = 0.84.
The Register service has high privacy principles
compliance level and requires personal data that are relatively
not so sensitive. It is therefore desirable in terms of privacy.
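For completeness, (10) can be sketched as a small function, with alpha and beta as the user-chosen weights (here the equal weights of the example):

def privacy_quality(np_com, nsl_ws, alpha=0.5, beta=0.5):
    # (10): principles compliance is the positive aspect,
    # information sensitivity the negative one; alpha + beta = 1.
    return alpha * np_com + beta * (1 - nsl_ws)

privacy_quality(0.87, 0.19)   # -> 0.84 for the Register service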
VI. DEVELOPMENT OF SUPPORTING TOOL
Besides the proposed model, we have developed a Web-
based tool called a privacy measurement system to support the
model. To be able to automate privacy measurement, the tool
relies on the service WSDL being annotated with semantic
terms described in the personal information ontology. The
usage scenario of the privacy measurement system is depicted
in Fig. 4 and can be described as follows.
1) The privacy measurement system obtains the cross table
and personal information ontology from a privacy domain
expert. In the prototype of the tool, the cross table in Table II
and a personal information ontology that corresponds to the
cross table are used.
2) A service user specifies the Web service whose privacy is to be measured. Together with the service WSDL URL, the user uses the tool to specify the following:
a) Privacy principles compliance rating ri and weight pi
for each privacy principle; the user will have to check with the
privacy policy of the service in order to rate.
Figure 4. Usage scenario of privacy measurement system.
b) Personal data attributes that are considered private;
these attributes will be associated with the concept Private of
the cross table.
c) Weights α and β for the privacy measurement model.
The users of the tool could be end users of the services or
software designers who are assessing privacy quality of the
services to be aggregated in service-based applications.
Additionally, service providers may use the tool for self-
assessment; the measurement can be used for comparison with
competing services and as a guideline for improving privacy
protection.
3) The tool imports the WSDL document of the service. It
is assumed that the service provider annotates the WSDL
based on the personal information ontology.
4) The tool calculates the privacy score of the service and
informs the user.
As an example, a screenshot reporting privacy
measurements of the Register service is shown in Fig. 5.
Figure 5. Example of measurements screen.
VII. CONCLUSION
This paper presents a privacy measurement model which
combines and enhances existing privacy measurement
approaches. The model considers both privacy principles
compliance and sensitivity level of personal information. The
basis of the measurement is the privacy policy published by
the service provider and user’s personal information that is
exchanged with the service. The model can be applied even in
the absence of any of such information. We present also a
supporting tool which can automate privacy measurement
based on semantic annotation added to WSDL data elements.
Generally a service user can consider the privacy score as
one of the QoS scores to distinguish services with similar
functionality. As discussed earlier, the privacy score is
subjective to the users who assess the service. The score may
vary depending on how the service provider provides a proof
of privacy principles compliance, the expectation of the user
when rating the compliance, and the user’s personal view on
private data. Also, the cross table presented in Table II is an
example but not intended to be exhaustive. A privacy
measurement system can adjust the concepts, attributes, and
their relations within the cross table as well as the
corresponding personal information ontology.
Since the measurement tool makes use of semantics-
enhanced WSDLs, a limitation would be that we require the
service providers to specify semantics. However, semantic
information only helps automate the calculation and the
measurement model itself does not rely on semantic
annotation. The approach can still be followed and the
measurement model can still be used even though WSDL
documents are not semantics-annotated.
At present, we target privacy of single Web services. The
approach can be extended to composite services. We are
planning for an empirical evaluation of the model by service
users and an experiment with real-world Web services as well
as cloud services.
REFERENCES
[1] W. D. Yu, S. Doddapaneni, and S. Murthy, “A privacy assessment approach for service oriented architecture applications,” in Procs. of 2nd IEEE Int. Symp. on Service-Oriented System Engineering (SOSE 2006), 2006, pp. 67-75.
[2] I. Jang and H. S. Yoo, “Personal information classification for privacy negotiation,” in Procs. of 4th Int. Conf. on Computer Sciences and Convergence Information Technology (ICCIT 2009), 2009, pp. 1117-1122.
[3] W3C, Semantic Annotations for WSDL and XML Schema, http://www.w3.org/TR/2007/REC-sawsdl-20070828/, 28 August 2007.
[4] W3C, Web Services Architecture Requirements, http://www.w3.org/TR/wsa-reqs/, 11 February 2004.
[5] G. Yee, “Measuring privacy protection in Web services,” in Procs. of IEEE Int. Conf. on Web Services, 2006, pp.647-654.
[6] G. O. M. Yee, “An automatic privacy policy agreement checker for E-services,” in Procs. of Int. Conf. on Availability, Reliability and Security, 2009, pp. 307-315.
[7] W. Xu, V. N. Venkatakrishnan, R. Sekar, and I. V. Ramakrishnan, "A framework for building privacy-conscious composite Web services," in Procs. of IEEE Int. Conf. on Web Services, 2006, pp. 655-662.
[8] M. Tavakolan, M. Zarreh, and M. A. Azgomi, “An extensible model for improving the privacy of Web services,” in Procs. of Int. Conf. on Security Technology, 2008, pp. 175-179.
[9] T. Yu, Y. Zhang, Y., K. J. Lin, “Modeling and measuring privacy risks in QoS Web services,” in Procs. of 8th IEEE Int. Conf. on E-Commerce Technology and 3rd IEEE Int. Conf. on Enterprise Computing, E-Commerce, and E-Services, 2006.
[10] R. Hewett and P. Kijsanayothin, “On securing privacy in composite web service transactions,” in Procs. of 5th Int. Conf. for Internet Technology and Secured Transactions (ICITST’09), 2009, pp. 1-6.
[11] Uta Priss, “Formal Concept Analysis,” http://www.upriss.org.uk /fca/ fca.html/, Last accessed: 24 February 2012.
Measuring Granularity of Web Services with
Semantic Annotation
Nuttida Muchalintamolee and Twittie Senivongse
Computer Science Program, Department of Computer Engineering
Faculty of Engineering, Chulalongkorn University
Bangkok, Thailand
[email protected], [email protected]
Abstract— Web services technology has been one of the
mainstream technologies for software development since Web
services can be reused and composed into new applications or
used to integrate software systems. Granularity or size of a
service refers to the functional scope or the amount of detail
associated with service design and it has an impact on the ability
to reuse or compose the service in different contexts. Designing a
service with the right granularity is a challenging issue for service
designers and mostly relies on designers’ judgment. This paper
presents a granularity measurement model for a Web service
with semantics-annotated WSDL. The model supports different
types of service design granularity, and semantic annotation
helps with the analysis of the functional scope and amount of
detail associated with the service. Based on granularity
measurement, we then develop a measurement model for service
reusability and composability. The measurements can assist in
service design and the development of service-based applications.
Keywords-service granularity; measurement; reusability;
composability; semantic Web services; ontology
I. INTRODUCTION
Web Services technology has been one of the mainstream technologies for software development since it enables rapid, flexible development and integration of software systems. The basic building blocks are Web services, which are software units providing certain functionalities over the Web and involving a set of interface and protocol standards, e.g., Web Services Description Language (WSDL) as a service contract, SOAP as a messaging protocol, and Business Process Execution Language (WS-BPEL) as a flow-based language for service composition [1]. The technology promotes service reuse and service composition, as the functionalities provided by a service should be reusable or composable in different contexts of use. The granularity of a service impacts its reusability and composability.
Erl [1] defines granularity in the context of service design as “the level of (or absence of) detail associated with service design.” The service contract or service interface is the primary concern in service design since it represents what the service is designed to do and gives detail about the scope or size of it. Erl classifies four types of service design granularity: (1) Service granularity refers to the functional scope or the quantity of potential logic the service could encapsulate based on its
context. (2) Capability granularity refers to the functional scope of a specific capability (or operation). (3) Data granularity is the amount of data to be exchanged in order to carry out a capability. (4) Constraint granularity is the amount of validation constraints associated with the information exchanged by a capability.
Different types of granularity impact service reusability and composability in different ways. Erl differentiates between these two terms: reusability is the ability to express agnostic logic and be positioned as a reusable enterprise resource, whereas composability is the ability to participate in multiple service compositions [1]. A coarse-grained service with a broad functional context should be reusable in different situations, while a fine-grained service capability can be composable in many service assemblies. Coarse-grained data exchanged by a capability could be a sign that the capability has a large scope of work and should be good for reuse, while a capability with very fine-grained (detailed) data validation constraints should be more difficult to reuse or compose in different contexts with different data formats. Inappropriate granularity design affects not only reusability and composability but also the performance of the service. Fine-grained capabilities, for example, may incur invocation overheads since many calls have to be made to perform a task [2]. Designing a service with the right granularity is a challenging issue for service designers and mostly relies on designers' judgment.
To help determine service design granularity, we present a granularity measurement model for a Web service with semantics-annotated WSDL. The model supports all four types of granularity, and semantic annotation is based on the domain ontology of the service, which is expressed in OWL [3]. The motivation is that semantic annotation should give more information about the functional scope of the service and other details which help determine granularity more precisely. Semantic concepts from the domain ontology can be annotated to different parts of a WSDL document using Semantic Annotations for WSDL and XML Schema (SAWSDL) [4]. Based on granularity measurement, we then develop a measurement model for service reusability and composability.
Section II of the paper discusses related work. Section III introduces a Web service example which will be used throughout the paper. The granularity measurement model and
the reusability and composability measurement models are presented in Sections IV and V. Section VI gives an evaluation of the models and the paper concludes in Section VII.
II. RELATED WORK
Several research efforts have addressed the importance of granularity to service-oriented systems. Haesen et al. [5] propose a classification of service granularity types which consists of data granularity, functionality granularity, and business value granularity. The impact of these types on architectural issues, e.g., reusability, performance, and flexibility, is discussed. In their approach, the term "service" refers more to an operation than to a service with a collection of capabilities as defined by Erl. Feuerlicht [6] argues that service reuse is difficult to achieve and uses composability as a measure of service reuse. He argues that the granularity of services and the compatibility of service interfaces are important to composability, and presents a process of decomposing coarse-grained services into fine-grained services (operations) with normalized interfaces to facilitate service composition.
On granularity measurement, Shim et al. [7] propose a design quality model for SOA systems. The work is based on a layered model of design quality assessment. Mappings are defined between design metrics, which measure service artifacts, and design properties (e.g., coupling, cohesion, complexity), and between design properties and high-level quality attributes (e.g., effectiveness, understandability, reusability). Service granularity and parameter granularity are among the design properties. Service granularity considers the number of operations in the service system and the similarity between them (based on similarity of their messages). Parameter granularity considers the ratio of the number of coarse-grained parameter operations to the number of operations in the system. Our approach is inspired by this work but we focus only on granularity measurement for a single Web service, not on system-wide design quality, and will link granularity to reusability and composability attributes. We notice that their granularity measurement relies on the designer’s judgment, e.g., to determine if an operation has fine-grained or coarse-grained parameters. We thus use semantic annotation to better understand the service. Another approach to granularity measurement is by Alahmari et al. [8]. They propose metrics for data granularity, functionality granularity, and service granularity. The approach considers not only the number of data and operations but also their types which indicate whether the data and operations involve complicated logic. The impact on service operation complexity, cohesion, and coupling is discussed. Khoshkbarforoushha et al. [9] measure reusability of BPEL composite services. The metric is based on analyzing description mismatch and logic mismatch between a BPEL service and requirements from different contexts of use.
III. EXAMPLE
An online booking Web service will be used to demonstrate our idea. It provides booking for any kind of product and includes several functions such as viewing product information and creating and managing bookings. Fig. 1 shows the WSDL 2.0 document of the service. Suppose the WSDL is enhanced with semantic descriptions. The figure shows the use of SAWSDL tags [4] to reference the semantic concepts in a service domain ontology to which different parts of the WSDL correspond. Here the meaning of the data type named productInfo is the term ProductInfo in the domain ontology OnlineBooking in Fig. 2, and the meaning of the operation named viewProduct is the term SearchProductDetail.
IV. GRANULARITY MEASUREMENT MODEL
Granularity measurement considers the schema and semantics of the WSDL description. Semantic granularity is determined first and then applied to different granularity types.
Figure 1. WSDL of online booking Web service with SAWSDL annotation.
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:description
    targetNamespace="http://localhost:8101/GranularityMeasurement/wsdl/OnlineBooking#"
    xmlns="http://localhost:8101/GranularityMeasurement/wsdl/OnlineBooking#"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:wsdl="http://www.w3.org/ns/wsdl"
    xmlns:sawsdl="http://www.w3.org/ns/sawsdl">
  <wsdl:types>
    <xs:schema targetNamespace="http://localhost:8101/GranularityMeasurement/wsdl/OnlineBooking#"
        elementFormDefault="qualified">
      <xs:element name="viewProductReq" type="productId"/>
      <xs:element name="viewProductRes" type="productInfo"/>
      …
      <xs:simpleType name="productId">
        <xs:restriction base="xs:string">
          <xs:pattern value="[0-9]4"/>
        </xs:restriction>
      </xs:simpleType>
      <xs:complexType name="productInfo"
          sawsdl:modelReference="http://localhost:8101/GranularityMeasurement/ontology/OnlineBooking#ProductInfo">
        <xs:sequence>
          <xs:element name="productName" type="xs:string"/>
          <xs:element name="productType" type="productType"/>
          <xs:element name="description" type="xs:string"/>
          <xs:element name="unitPrice" type="xs:float"/>
        </xs:sequence>
      </xs:complexType>
      <xs:simpleType name="productType">
        <xs:restriction base="xs:string">
          <xs:pattern value="[A-Z]"/>
        </xs:restriction>
      </xs:simpleType>
      …
    </xs:schema>
  </wsdl:types>
  <wsdl:interface name="OnlineBookingWSService"
      sawsdl:modelReference="http://localhost:8101/GranularityMeasurement/ontology/OnlineBooking#OrderManagement">
    <wsdl:operation name="viewProduct" pattern="http://www.w3.org/ns/wsdl/in-out"
        sawsdl:modelReference="http://localhost:8101/GranularityMeasurement/ontology/OnlineBooking#SearchProductDetail">
      <wsdl:input element="viewProductReq"/>
      <wsdl:output element="viewProductRes"/>
    </wsdl:operation>
    …
  </wsdl:interface>
</wsdl:description>
Figure 2. A part of domain ontology for online booking (in OWL).
A. Semantic Granularity
When a part of WSDL is annotated with a semantic term, we determine the functional scope and amount of detail associated with that WSDL part through the semantic information that can be derived from the annotation. Class-subclass and whole-part property are semantic relations that are considered. Class-subclass is a built-in relation in OWL but whole-part is not. We define an ObjectProperty part (see Fig. 2) to represent the whole-part relation, and any whole-part relation between classes will be defined as a subPropertyOf part. Then, semantic granularity of a term t which is in a class-subclass/whole-part relation is computed by (1):
SemanticGranularity(t) = the number of terms under t in either a class-subclass or a whole-part relation, including t itself    (1)

Figure 3. Semantic granularity of ProductInfo and related terms.
Using (1), Fig. 3 shows semantic granularity of the semantic term ProductInfo and its related terms with respect to class-subclass and whole-part property relations. When an ontology term is annotated to a WSDL part, it transfers its semantic granularity to the WSDL part.
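As a sketch of how (1) could be automated (the authors' tool uses Java and Jena; the dict-based ontology below is a stand-in, and apart from HotelInfo and the four parts taken from Fig. 2, the class names are placeholders):

    def semantic_granularity(term, children):
        """Count the terms under `term` in one relation (class-subclass OR
        whole-part), including `term` itself -- equation (1)."""
        seen, stack = set(), [term]
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(children.get(t, ()))
        return len(seen)

    # ProductInfo has 3 direct and 3 indirect subclasses per Fig. 3; only
    # HotelInfo appears in Fig. 2, the other names here are hypothetical.
    subclasses = {"ProductInfo": ["HotelInfo", "FlightInfo", "CarInfo"],
                  "HotelInfo": ["ResortInfo"], "FlightInfo": ["CharterInfo"],
                  "CarInfo": ["RentalCarInfo"]}
    parts = {"ProductInfo": ["ID", "ProductName", "Price", "Type"]}  # Fig. 2
    ac = semantic_granularity("ProductInfo", subclasses)  # -> 7
    ap = semantic_granularity("ProductInfo", parts)       # -> 5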
B. Constraint Granularity
A service capability (or operation) needs to operate on correct input and output data, so constraints are put on the exchanged data for validation purposes. Constraint granularity considers the number of control attributes and restrictions (excluding defaults) that are assigned to the schema of WSDL data, e.g.,
• attributes of <xs:element/> such as "fixed", "nillable", "maxOccurs", and "minOccurs"
• <xs:restriction/>, which contains a restriction on the element content.
Constraint granularity R_o of a capability o is computed by (2):

R_o = Σ_{i=1..n} Σ_{j=1..mi} Constraint_ij    (2)

where n = the number of parameters of the operation o,
mi = the number of elements/attributes of the ith parameter, and
Constraint_ij = the number of constraints of the jth element/attribute of the ith parameter.
In Fig. 1, the operation viewProduct has two constraints on two out of five input/output data elements, i.e., constraints on productId and productType. So its constraint granularity is 2.
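As a rough sketch (not the authors' tool), this count can be obtained by walking the XSD inside the WSDL with the standard library. For brevity the sketch counts over the whole document rather than per operation and treats each <xs:restriction> as one constraint; a full implementation would restrict the count to the types referenced by the operation's parameters.

    import xml.etree.ElementTree as ET

    XS = "{http://www.w3.org/2001/XMLSchema}"
    CONTROL_ATTRS = {"fixed", "nillable", "minOccurs", "maxOccurs"}

    def constraint_granularity(wsdl_file):
        root = ET.parse(wsdl_file).getroot()
        # control attributes on element declarations
        count = sum(1 for e in root.iter(XS + "element")
                      for a in e.attrib if a in CONTROL_ATTRS)
        # each <xs:restriction> on a type counts as one constraint
        count += sum(1 for _ in root.iter(XS + "restriction"))
        return count

    # On Fig. 1 this finds the two restrictions (productId, productType),
    # giving R_o = 2 for viewProduct.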
C. Data Granularity
A WSDL document normally describes the detail of the data elements exchanged by a service capability using the XML schema in its <types> tag.
(Figure 2 content: a part of the domain ontology for online booking, in OWL)

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" … >
  <owl:Ontology/>
  <owl:ObjectProperty rdf:ID="part"/>
  …
  <owl:Class rdf:ID="OrderManagement"/>
  …
  <owl:Class rdf:ID="ProductInfo"/>
  <owl:Class rdf:ID="HotelInfo">
    <rdfs:subClassOf rdf:resource="#ProductInfo"/>
  </owl:Class>
  …
  <owl:Class rdf:ID="ProductName">
    <rdfs:subClassOf rdf:resource="#Name"/>
  </owl:Class>
  <owl:FunctionalProperty rdf:ID="hasProductID">
    <rdfs:subPropertyOf rdf:resource="#part"/>
    <rdfs:domain rdf:resource="#ProductInfo"/>
    <rdfs:range rdf:resource="#ID"/>
    <rdf:type rdf:resource="&owl;ObjectProperty"/>
  </owl:FunctionalProperty>
  <owl:FunctionalProperty rdf:ID="hasProductName">
    <rdfs:subPropertyOf rdf:resource="#part"/>
    <rdfs:domain rdf:resource="#ProductInfo"/>
    <rdfs:range rdf:resource="#ProductName"/>
    <rdf:type rdf:resource="&owl;ObjectProperty"/>
  </owl:FunctionalProperty>
  <owl:FunctionalProperty rdf:ID="hasProductPrice">
    <rdfs:subPropertyOf rdf:resource="#part"/>
    <rdfs:domain rdf:resource="#ProductInfo"/>
    <rdfs:range rdf:resource="#Price"/>
    <rdf:type rdf:resource="&owl;ObjectProperty"/>
  </owl:FunctionalProperty>
  <owl:FunctionalProperty rdf:ID="hasProductType">
    <rdfs:subPropertyOf rdf:resource="#part"/>
    <rdfs:domain rdf:resource="#ProductInfo"/>
    <rdfs:range rdf:resource="#Type"/>
    <rdf:type rdf:resource="&owl;ObjectProperty"/>
  </owl:FunctionalProperty>
  …
  <owl:Class rdf:ID="SearchProductDetail"/>
  <owl:Class rdf:ID="SearchProductInfo">
    <rdfs:subClassOf rdf:resource="#SearchProductDetail"/>
  </owl:Class>
  <owl:Class rdf:ID="SearchRelatedProductInfo">
    <rdfs:subClassOf rdf:resource="#SearchProductDetail"/>
  </owl:Class>
  <owl:Class rdf:ID="GetProductUpdate"/>
  <owl:Class rdf:ID="GetProductPriceUpdate"/>
  <owl:FunctionalProperty rdf:ID="hasGetProductUpdate">
    <rdfs:subPropertyOf rdf:resource="#part"/>
    <rdfs:domain rdf:resource="#SearchProductDetail"/>
    <rdfs:range rdf:resource="#GetProductUpdate"/>
    <rdf:type rdf:resource="&owl;ObjectProperty"/>
  </owl:FunctionalProperty>
  <owl:FunctionalProperty rdf:ID="hasGetProductPriceUpdate">
    <rdfs:subPropertyOf rdf:resource="#part"/>
    <rdfs:domain rdf:resource="#SearchProductDetail"/>
    <rdfs:range rdf:resource="#GetProductPriceUpdate"/>
    <rdf:type rdf:resource="&owl;ObjectProperty"/>
  </owl:FunctionalProperty>
  …
</rdf:RDF>
With semantic annotation to a data element, semantic detail is additionally described. If the semantic term is defined in a class-subclass relation (i.e., it has subclasses), the term transfers its generalization, encapsulating several specialized concepts, to the data element it annotates. If the semantic term is defined in a whole-part relation (i.e., it has parts), it transfers its whole concept, encapsulating different parts, to the data element it annotates.
For a data element with no sub-elements (i.e., a lowest-level element), we determine its granularity DG_LE from its class-subclass and whole-part relations. For whole-part, if the element has associated whole-part semantics, we determine the parts from the semantic term; otherwise the number of parts is 1, denoting the lowest-level element itself (see (3)). For a data element with sub-elements, we compute its granularity DG_E as the sum of the data granularity of all its immediate sub-elements DG_SE together with the semantic granularity of the element itself (see (4)). Note that (4) is recursive. Finally, the data granularity D_o of a capability o is the sum of the data granularity of all parameter elements (see (5)).
DG_LE = ac_p + max(1, ap_p)    (3)

DG_E = Σ_{j=1..m} DG_SEj + ac_p + ap_p    (4)

D_o = Σ_{i=1..n} DG_Ei    (5)

where n = the number of parameters of the operation o,
DG_E = data granularity of an element with sub-elements/attributes,
m = the number of sub-elements/attributes of an element,
DG_SE = data granularity of an immediate sub-element/attribute of an element,
DG_LE = data granularity of a lowest-level element/attribute,
ac_p = semantic granularity in the class-subclass relation of an element/attribute, computed by (1), and
ap_p = semantic granularity in the whole-part property relation of an element/attribute, computed by (1).
In Fig. 1, the input viewProductReq of the operation viewProduct has no sub-elements or semantic annotation, so its granularity as a DG_LE is 1 (0 + max(1, 0)). In contrast, the output viewProductRes is of type productInfo, which is annotated with the ontology term ProductInfo. From the schema in Fig. 1, this output has four sub-elements (productName, productType, description, unitPrice). Each sub-element has no further sub-elements or semantic annotation, so its granularity as a DG_LE is 1 as well. In Fig. 3, the semantic term ProductInfo has three direct subclasses and three indirect subclasses as well as four parts. The granularity of the output viewProductRes as a DG_E is therefore 16 (i.e., (1+1+1+1) + 7 + 5). The data granularity D_o of the operation viewProduct is then 17 (1 + 16).
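A minimal sketch of (3)-(5), assuming the WSDL element tree and the two semantic granularities per element (ac_p, ap_p) have already been extracted; the Element class is illustrative, not part of the paper:

    class Element:
        def __init__(self, name, children=(), ac=0, ap=0):
            self.name, self.children = name, list(children)
            self.ac, self.ap = ac, ap   # semantic granularities, 0 if unannotated

    def data_granularity(elem):
        if not elem.children:                        # lowest-level element, eq. (3)
            return elem.ac + max(1, elem.ap)
        # element with sub-elements, eq. (4) (recursive)
        return sum(data_granularity(c) for c in elem.children) + elem.ac + elem.ap

    # viewProduct in Fig. 1: plain input; output annotated with ProductInfo
    # (ac = 7, ap = 5) and four plain sub-elements.
    req = Element("viewProductReq")
    res = Element("viewProductRes", ac=7, ap=5, children=[
        Element("productName"), Element("productType"),
        Element("description"), Element("unitPrice")])
    d_o = data_granularity(req) + data_granularity(res)   # eq. (5)
    print(d_o)  # -> 17, matching the worked example (1 + 16)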
D. Capability Granularity
The functional scope of a service capability can be derived from data granularity and semantic annotation. If large data are exchanged by the capability, it can be inferred that the capability involves a big task in the processing of such data. We can additionally infer that the capability is broad in scope if its semantics involves other specialized functions (i.e., having a class-subclass relation) or other sub-tasks (i.e., having a whole-part relation). Capability granularity C_o of a capability o is then computed by (6):

C_o = D_o + ac_o + ap_o    (6)

where D_o = data granularity of the operation o,
ac_o = semantic granularity in the class-subclass relation of the operation o, computed by (1), and
ap_o = semantic granularity in the whole-part property relation of the operation o, computed by (1).
From the previous calculation, data granularity of the operation viewProduct in Fig. 1 is 17. This operation is annotated with the semantic term SearchProductDetail. In Fig. 2, this semantic term is a generalization of two concepts SearchProductInfo and SearchRelatedProductInfo, so the capability viewProduct encapsulates these two specialized tasks. The semantic term SearchProductDetail also comprises two sub-tasks GetProductUpdate and GetProductPriceUpdate in a whole-part relation. Therefore capability granularity of viewProduct is 23 (17+3+3).
E. Service Granularity
The functional scope of a service is determined by all of its capabilities together with semantic annotation, which describes the scope of use of the service semantically. Service granularity S_w of a service w is computed by (7):

S_w = Σ_{i=1..k} C_oi + ac_w + ap_w    (7)

where k = the number of operations of the service w,
C_o = capability granularity of an operation o,
ac_w = semantic granularity in the class-subclass relation of the service w, computed by (1), and
ap_w = semantic granularity in the whole-part property relation of the service w, computed by (1).
In Fig. 1, the online booking service is associated with the semantic term OrderManagement. Suppose the term OrderManagement has no subclasses but comprises eight concepts (i.e., parts) in a whole-part property relation. Its service granularity is then the sum of the capability granularity of the operation viewProduct (i.e., 23), the capability granularity of all other operations, and the semantic granularity in the class-subclass and whole-part property relations (i.e., 1 + 9).
It is seen from the granularity measurement model that semantic annotation helps complement granularity measurement. For the case of the operation viewProduct, for example, the granularity of its capability can only be inferred from the granularity of its data if the operation has no semantic annotation. However, by annotating this operation with the generalized term SearchProductDetail, we gain knowledge about its broad scope such that its capability encapsulates both specialized SearchProductInfo and SearchRelatedProductInfo tasks. The additional information refines the measurement.
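Equations (6) and (7) are then simple sums on top of data granularity. In the sketch below the viewProduct figures come from the worked example, while the other operations' capability granularities are placeholders, not values from the paper:

    def capability_granularity(d_o, ac_o, ap_o):
        """Equation (6): C_o = D_o + ac_o + ap_o."""
        return d_o + ac_o + ap_o

    def service_granularity(cap_granularities, ac_w, ap_w):
        """Equation (7): sum of C_o over all operations + ac_w + ap_w."""
        return sum(cap_granularities) + ac_w + ap_w

    # viewProduct: D_o = 17; SearchProductDetail has two subclasses (ac_o = 3)
    # and two parts (ap_o = 3), each count including the term itself.
    c_view = capability_granularity(17, 3, 3)        # -> 23
    # OrderManagement: no subclasses (ac_w = 1), eight parts (ap_w = 9);
    # c_others below are hypothetical placeholders.
    c_others = [18, 22, 17, 18, 20]
    s_w = service_granularity([c_view] + c_others, ac_w=1, ap_w=9)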
V. REUSABILITY AND COMPOSABILITY MEASUREMENT
MODELS
As mentioned in Section I, reusability is the ability to express agnostic logic and be positioned as a reusable enterprise resource, whereas composability is the ability to participate in multiple service compositions. We see that reusability is concerned with putting a service as a whole to use in different contexts. Composability is a mechanism for reuse, but it focuses on the assembly of functions, i.e., it touches reuse at the operation level rather than the service level. We follow the method in [7] to first identify the impact that granularity has on the reusability and composability attributes and then derive measurement models for them. Table I presents the impact of granularity.
For reusability, a coarse-grained service with a broad functional context providing several functionalities should be reused well, as it can do many tasks serving many purposes. Coarse-grained data exchanged by an operation could be a sign that the operation has a large scope of work and should be good for reuse as well. So we define a positive impact on reusability for coarse-grained data, capabilities, and services. For composability, we focus at the service operation level, and service granularity is not considered. A small operation doing a small task and exchanging small data should be easier to include in a composition, since it neither does too much work nor exchanges more data than different contexts of use may require or can provide. So we define a negative impact on composability for coarse-grained capabilities and data. For constraints on data elements, a bigger number of constraints means finer-grained restrictions are put on the data; they make the data more specific and harder to reuse, hence a negative impact on both attributes.
TABLE I. IMPACT OF GRANULARITY ON REUSE

Granularity Type       | Reusability | Composability
Service Granularity    | ↑           | -
Capability Granularity | ↑           | ↓
Data Granularity       | ↑           | ↓
Constraint Granularity | ↓           | ↓
A. Reusability Model
Reusability measurement is derived from the impact of granularity. It can be seen that different types of granularity measurement relate to each other. That is, service granularity is built on capability granularity which in turn is built on data granularity, and they all have a positive impact. So we consider only service granularity in the model since the effects of data granularity and capability granularity are already part of service granularity. The negative impact of constraint granularity is incorporated in the model (8):
Reusability = S_w − Σ_{i=1..k} R_oi    (8)

where S_w = service granularity of the service w,
R_o = constraint granularity of an operation o, and
k = the number of operations of the service w.
A coarse-grained service with small data constraints has high reusability.
B. Composability Model
In a similar manner, we consider only capability granularity and constraint granularity in the composability model because the effects of data granularity are already part of capability granularity. Since they all have a negative impact, we represent the composability measure with the opposite meaning. We define the term "uncomposability" to represent the inability of a service operation to be composed in a service assembly (9):
Uncomposability = C_o + R_o    (9)

where C_o = capability granularity of the operation o, and
R_o = constraint granularity of the operation o.
A fine-grained capability with small data constraints has low uncomposability, i.e. high composability.
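Both scores reduce to simple arithmetic once the granularities are known; a sketch, using the figures reported later in Tables III and IV:

    def reusability(s_w, constraint_granularities):
        """Equation (8): S_w minus the sum of per-operation R_o."""
        return s_w - sum(constraint_granularities)

    def uncomposability(c_o, r_o):
        """Equation (9): C_o + R_o."""
        return c_o + r_o

    # OnlineBookingWSService: S_w = 194, sum of R_o = 48 (passed presummed).
    print(reusability(194, [48]))     # -> 146
    # editOrderItem: C_o = 22, R_o = 3.
    print(uncomposability(22, 3))     # -> 25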
VI. EVALUATION
We apply the measurement models to two Web services. The first one is the online booking Web service which we have used to demonstrate the idea. It is a general service comprising a large number of small data elements and operations; its scope covers viewing, managing, and booking products. The other Web service is an online order service which has only a booking-related function. The two Web services are annotated with semantic terms from the online booking ontology, which describes details about processes and data in the online booking domain. Table II shows details of some operations of the two services, including their capabilities, data, and semantic annotation.
For the evaluation, a granularity measurement tool is developed to automatically measure granularity of Web services. It is implemented using Java and Jena [10] which helps with ontology processing and inference of relations.
Table III presents granularity measurements and reusability scores. The online booking service is coarser and has higher reusability: it is a bigger service with a wider range of functions, exchanging more data, and having a number of data constraints. It is likely that the online booking service can be put to use in various contexts. On the other hand, the online order service is finer-grained, focusing on order management. The two services are annotated with semantic terms of the same ontology, and the additional semantic detail helps refine their measurements.
Table IV presents granularity measurements and uncomposability of the operations annotated with the semantic term UpdateOrder. The operation editOrderItem of the online order service has coarser data and capability compared to the three finer-grained operations of the online booking service, and therefore it is less composable.
VII. CONCLUSION
This paper explores the application of semantics-annotated WSDL to measuring design granularity of Web services. Four types of granularity are considered together with semantic granularity. The models for reusability and composability (represented by uncomposability) are also introduced.
As explained in the example, semantic annotation can help us derive the functional contexts and concepts that the service, its capabilities, and its data elements encapsulate. Granularity measurement, which is traditionally done by analyzing the size of capabilities and data described in standard WSDL and XML schema documents, can thus be refined and better automated.
TABLE II. PART OF SERVICE DETAIL AND SEMANTIC ANNOTATION

Operation (Name / Annotation)                | Input Data (Name / Annotation)  | Output Data (Name / Annotation)
Online booking web service:
newCart / InsertOrder                        | userId / ID                     | orderId / ID
addProductToCart / UpdateOrder               | addProduct / OrderItem          | processResult / Status
deleteProductFromCart / UpdateOrder          | deleteProduct / OrderItem       | processResult / Status
editProductQuantityInCart / UpdateOrder      | editProductQuantity / OrderItem | processResult / Status
viewProductInCart / SearchOrderItemByOrderID | orderId / ID                    | orderItemList / -
reservation / EditOrder                      | reservedOrder / ID              | processResult / Status
Online order web service:
createOrder / CreateOrder                    | orderRequest / Order            | orderResponse / Status
editOrderItem / UpdateOrder                  | editOrderItemInfo / Order       | orderItemResponse / Status
submitOrder / EditOrder                      | orderId / ID                    | orderResponse / Status
TABLE III. GRANULARITY AND REUSABILITY

Service Name           | ΣR_o | ΣD_o | ΣC_o | S_w | Reusability (S_w − ΣR_o)
OnlineBookingWSService | 48   | 143  | 184  | 194 | 146
OnlineOrderWSService   | 10   | 47   | 62   | 72  | 62
TABLE IV. SERVICE GRANULARITY AND UNCOMPOSABILITY OF OPERATIONS ANNOTATED WITH UPDATEORDER

Service Name           | Operation Name            | R_o | D_o | C_o | S_w | Uncomposability (C_o + R_o)
OnlineBookingWSService | addProductToCart          | 4   | 15  | 18  | -   | 22
OnlineBookingWSService | deleteProductFromCart     | 3   | 14  | 17  | -   | 20
OnlineBookingWSService | editProductQuantityInCart | 4   | 15  | 18  | -   | 22
OnlineOrderWSService   | editOrderItem             | 3   | 19  | 22  | -   | 25
For future work, we aim to refine the domain ontology and WSDL annotation. It would be interesting to see the effect of annotation on granularity, reusability, and composability when the WSDL contains a lot of annotations compared to when it is less annotated. Since annotation can be made to different parts of WSDL, the location of annotations can also affect granularity scores. Additionally we will try the models with Web services in business organizations and extend the models to apply to composite services.
REFERENCES
[1] T. Erl, SOA: Principles of Service Design, Prentice Hall, 2007.
[2] T. Senivongse, N. Phacharintanakul, C. Ngamnitiporn, and M. Tangtrongchit, “A capability granularity analysis on Web service invocations,” in Procs. of World Congress on Engineering and Computer Science 2010 (WCECS 2010), 2010, pp. 400-405.
[3] W3C (2004, February 10) OWL Web Ontology Language Overview [Online]. Available: http://www.w3.org/TR/2004/REC-owl-features-20040210/
[4] W3C (2007, August 28) Semantic Annotations for WSDL and XML Schema [Online]. Available: http://www.w3.org/TR/2007/REC-sawsdl-20070828/
[5] R. Haesen, M. Snoeck, W. Lemahieu and S. Poelmans, “On the definition of service granularity and its architectural impact,” in Procs. of 20th Int. Conf. on Advanced Information Systems Engineering (CAiSE 2008), LNCS 5074, 2008, pp. 375-389.
[6] G. Feuerlicht, “Design of composable services,” in Procs. of 6th Int. Conf. on Service Oriented Computing (ICSOC 2008), LNCS 5472, 2008, pp. 15-27.
[7] B. Shim, S. Choue, S. Kim and S. Park, “A design quality model for service-oriented architecture,” in Procs. of 15th Asia-Pacific Software Engineering Conference (APSEC 2008), 2008, pp. 403-410.
[8] S. Alahmari, E. Zaluska, and D. C. De Roure, "A metrics framework for evaluating SOA service granularity," in Procs. of IEEE Int. Conf. on Services Computing (SCC 2011), 2011, pp. 512-519.
[9] A. Khoshkbarforoushha, P. Jamshidi, F. Shams, “A metric for composite service reusability analysis,” in Procs. of the 2010 ICSE Workshop on Emerging Trends in Software Metrics (WETSoM 2010), 2010, pp. 67-74.
[10] Apache Jena [online]. Available: http://incubator.apache.org/jena/, Last accessed: January 30, 2012.
Decomposing ontology in Description Logics
by graph partitioning
Thi Anh Le PHAM
Faculty of Information Technology
Hanoi National University of
Education
Hanoi, Vietnam
Nhan LE-THANH
Laboratory I3S
Nice Sophia-Antipolis University
Nice, France
Minh Quang NGUYEN
Faculty of Information Technology
Hanoi National University of
Education
Hanoi, Vietnam
Abstract— In this paper, we investigate the problem of
decomposing an ontology in Description Logics (DLs)
based on graph partitioning algorithms. Also, we focus on
syntax features of axioms in a given ontology. Our
approach aims at decomposing the ontology into sub-ontologies that are as distinct as possible. We analyze the
algorithms and exploit parameters of partitioning that
influence the efficiency of computation and reasoning.
These parameters are the number of concepts and roles
shared by a pair of sub-ontologies, the size (the number of
axioms) of each sub-ontology, and the topology of
decomposition. We provide two concrete approaches for automatically decomposing the ontology: one is called minimal-separator-based partitioning, and the other is eigenvector- and eigenvalue-based segmentation. We also tested on parts of the TBoxes used in the FaCT system, such as Vedaall and tambis, and report experimental results.
Keywords- Graph partitioning; ontology decomposition;
image segmentation
I. INTRODUCTION
Previous studies on DL-based ontologies have focused on tasks such as ontology design, ontology integration, and ontology deployment. Starting from the need to reason effectively with a large ontology, we examine ontology decomposition instead of ontology integration. There have been some investigations into the decomposition of DL ontologies, such as decomposition-based module extraction [3] or decomposition based on the syntactic structure of the ontology [1].
Our previous paper [8] presented reasoning procedures under the supposition that there exists an ontology (TBox) decomposition, called overlap decomposition, which preserves the semantics and inference results of the original TBox. Our aim is to establish theoretical foundations for decomposition methods that improve the efficiency of reasoning and guarantee the properties proposed in [7]. The automatic decomposition of a given ontology is an important step in ontology design that is supported by graph theory; graph theory provides the "good properties" that meet the requirements of our decomposition.
Our computational analysis of reasoning algorithms leads us to the following decomposition parameters: the number of concepts and roles included in the semantic mappings between partitions, the size of each component ontology (the number of axioms in each component), and the topology of the decomposition graph. There are two decomposition approaches based on two ways of presenting the ontology: one represents the ontology by a symbol graph and implements decomposition by minimal separators, while the other uses an axiom graph and corresponds to the image segmentation method.
The rest of the paper is organized as follows. Section 2 proposes a definition of G-decomposition methodology that is based on graph and summarizes some principal steps. In this section, we also recall the criteria for a good decomposition. Sections 3 and 4 describe two ways for transforming an ontology into an undirected graph (symbol graph or weighted graph), as well as two partitioning algorithms of the obtained graph. Section 5 presents some evaluations of the effects of the decomposition algorithms and experimental results. Finally, we provide some conclusions and future work in section 6.
II. G-DECOMPOSITION METHODOLOGY
In this paper, ontology decomposition is considered only at the terminological level (TBox). We investigate methods that decompose a given TBox into several sub-TBoxes. For simplicity, a TBox is represented by its single set of axioms A, so we will represent the set of axioms by a graph.
Our goal is to eliminate general concept inclusions (GCIs), a general type of axiom, as much as possible from a general ontology (presented by a TBox) by decomposing the set of GCIs of this ontology into several subsets of GCIs (presented by a distributed TBox). In this paper, we only consider the syntax approach based on the structures of GCIs. We recall some criteria of a good decomposition [8]:
- All the concepts, roles and axioms of the original ontology are kept through the decomposition.
- The numbers of axioms in the sub-TBoxes are roughly equal.
As a result, we propose two graph-based decomposition techniques. G-decomposition is an ontology decomposition method that applies graph partitioning techniques. The graph decomposition is represented by an intersection graph (decomposition graph) in which each vertex is a sub-graph and the edges represent the connections between pairs of vertices. In [8] we defined an overlap decomposition of a TBox; it is represented by a distributed TBox (decomposed TBox) that consists of a set of sub-TBoxes and a set of links between these sub-TBoxes (semantic mappings). We assume that readers are familiar with basic concepts in graph theory.
Consequently, we propose an ontology decomposition method as a process containing three principal phases (illustrated in Table 1): transform a TBox into a graph, decompose the graph into sub-graphs, and transform these sub-graphs into a distributed TBox. We present a general algorithm of G-decomposition.
TABLE 1. A GENERAL ALGORITHM OF G-DECOMPOSITION

PROCEDURE DECOMP-TBOX(T = (C, R, A))
T = (C, R, A) is a TBox, with the set of concepts C, the set of roles R, and the set of axioms A
(1) TRANS-TBOX-GRAPH(T = (C, R, A))
    Build a graph G = (V, E) of this TBox, where each vertex v ∈ V is a concept in C or a role in R (or an axiom in A), and each edge e = (u, v) ∈ E exists if u and v appear in the same axiom (or if u and v share at least one concept or role).
(2) DECOMP-GRAPH(G = (V, E))
    Decompose the graph G = (V, E) obtained from TRANS-TBOX-GRAPH into an intersection graph G0 = (V0, E0), where each vertex v ∈ V0 is a sub-graph and each edge e = (u, v) ∈ E0 exists if u and v are linked.
(3) TRANS-GRAPH-TBOX(G0 = (V0, E0))
    Transform the graph G0 = (V0, E0) into a distributed TBox: each vertex (sub-graph) corresponds to a sub-TBox, and the edges of E0 correspond to semantic mappings.
In the next sections, we introduce the detailed techniques for steps (1) and (2).
III. DECOMPOSITION BASED ON MINIMAL SEPARATOR
A. Symbol graph
Given a set of axioms A of a TBox T, let Ex(A) be the set of concepts and roles that appear in the expressions of A. For simplicity, we use the notion of a symbol instead of a concept (role), i.e., a symbol is a concept or role in the TBox. A graph representing the TBox is defined as follows:
Definition 1 (symbol graph): A graph G = (V, E), where V is a set of vertices and E is a set of edges, is called a symbol graph of T(A) if each vertex v ∈ V is a symbol of Ex(A) and each edge e = (u, v) ∈ E exists if u and v are in the same axiom of A.
So, given a set of axioms A, we can build a symbol graph G = (V, E) by taking each symbol in Ex(A) as a vertex and connecting two vertices by an edge if their symbols are in the same axiom of A. Following this method, each axiom is represented as a clique in the symbol graph.
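This construction is direct to implement; a minimal sketch, assuming each axiom is already given as its symbol set Ex(A) (the toy axioms below are illustrative, not Tfam):

    from itertools import combinations

    def symbol_graph(axioms):
        """axioms: list of sets of symbols. Returns (V, E); every axiom
        induces a clique over its symbols."""
        V = set().union(*axioms)
        E = set()
        for ex in axioms:
            E.update(frozenset(p) for p in combinations(sorted(ex), 2))
        return V, E

    # Two axioms sharing the symbol X yield two cliques glued at X.
    V, E = symbol_graph([{"C1", "C2", "X"}, {"X", "C4", "C5"}])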
Example 1: Given a TBox as follows (figure 1):
Figure 1. TBox Tfam
The set of primitive concepts and roles of Tfam is Ex(Tfam) = {C1, C2, C3, C4, C5, C6, X, Y, T, H}. Figure 2 presents the symbol graph for Tfam.
Figure 2. Symbol graph presenting TBox Tfam
The result of this decomposition is represented by a labeled graph (intersection graph or decomposition graph) Gp = (Vp, Ep). Assume that the graph G representing a TBox T is divided into n sub-graphs Gi, i ≤ n; then a decomposition graph is defined as follows:
Definition 2 (decomposition graph) [4]: A decomposition graph is a labeled graph Gp = (Vp, Ep) in which each vertex v ∈ Vp is a partition (sub-graph) Gi, and each edge eij = (vi, vj) ∈ Ep is labeled by the set of symbols shared by Gi and Gj, where i ≠ j and i, j ≤ n.
Definition 3 ((a, b)-minimal vertex separator) [3]: A set of vertices S is called an (a, b)-vertex separator if a, b ∈ V\S and all paths connecting a and b in G pass through at least one vertex in S. If S is an (a, b)-vertex separator and does not contain another (a, b)-vertex separator, then S is an (a, b)-minimal vertex separator.
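As a rough illustration of this definition (not the listing algorithm of [6]), networkx can compute one minimum, and hence minimal, (a, b)-vertex separator; the toy graph below is illustrative:

    import networkx as nx

    # Toy symbol graph: two axiom cliques glued at X, plus a tail through Y.
    G = nx.Graph()
    G.add_edges_from([("C1", "C2"), ("C1", "X"), ("C2", "X"),
                      ("X", "C4"), ("X", "C5"), ("C4", "C5"),
                      ("C5", "Y"), ("Y", "C6")])
    S = nx.minimum_node_cut(G, s="C1", t="C6")   # a minimum C1-C6 separator
    print(S)                                     # e.g. {"X"} or {"Y"}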
B. Algorithm
We present a recursive algorithm using the Even algorithm [4] to find the sets of vertices that divide a graph into partitions. It takes a symbol graph G = (V, E) (representing a TBox) as input and returns a decomposition graph and the set of separated sub-graphs. The idea of the algorithm is to look for a connected part of the graph (a cut), compute the minimal separator of the graph, and then carve the graph by this separator. Initially, the TBox T is considered as one large partition, and it is cut into two parts in each recursive iteration. The steps of the algorithm are summarized as follows:
TABLE 2. AN ALGORITHM FOR TRANSFORMING THE TBOX INTO A GRAPH

Input: TBox T(A); M, the limit on the number of symbols in a part (a sub-TBox Ti).
Output: Gp = (Vp, Ep) and the Ti.
PROCEDURE DIVISION-TBOX(A, M)
(1) Transform A into a symbol graph G(V, E) with V = Ex(A) and E = {(l1, l2) | ∃A ∈ A, l1, l2 ∈ Ex(A)}, where A is an axiom in A.
(2) Let Gp = (Vp, Ep) be an undirected graph with Vp = {V} and Ep = ∅.
(3) Call DIVISION-GRAPH(G, M, nil, nil).
(4) For each v ∈ Vp, let Tv = {A ∈ A | Ex(A) ⊂ v}. Return {Tv | v ∈ Vp} and Gp.
The procedure DIVISION-GRAPH takes as input a symbol graph G = (V, E) of T, a limit parameter M, and two vertices a, b that are initially set to nil. The procedure updates the global variable Gp representing the decomposition process. In each recursive call, it finds a minimal separator of the vertices a, b in G. If one of a, b is nil, or both are, it finds the global minimal separator between all vertices and the non-nil vertex (or between all vertices). This separator cuts the graph G into two parts G1, G2, and the process continues recursively on these parts.
TABLE 3. AN ALGORITHM FOR PARTITIONING THE SYMBOL GRAPH

Input: G = (V, E)
Output: connection graph Gp = (Vp, Ep)
PROCEDURE DIVISION-GRAPH(G, M, a, b)
(1) Find the set of minimal vertex separators of G:
    - Select an arbitrary pair of non-adjacent vertices (a, b) and compute the set of (a, b)-minimal separators
    - Repeat this process on every pair of non-adjacent vertices x, y
(2) Find the global minimal vertex separator S* between all vertices of G
(3) Decompose G by S* into two sub-graphs G1, G2, where S* is contained in both G1 and G2
(4) Generate an undirected graph Gp = (Vp, Ep), where Vp = {G1, G2} and Ep = {S*}.
A method to list all (a, b)-minimal vertex separators for a pair of non-adjacent vertices using a best-first search technique can be found in [6].
Tfam (Figure 1) can be represented by an undirected adjacency graph (Figure 2), where the vertices correspond to the symbols and an edge connects two vertices whose symbols occur in the same axiom. Therefore each axiom is represented as a clique.
Figure 3. Decomposition result of the symbol graph of Tfam with minimal separators S* = {X} and S*' = {Y}

If the criterion is based on balancing the number of TBox axioms between components, then S* = {X} and S*' = {Y}. Using S* and S*' to decompose the symbol graph, we obtain three symbol groups {C1, C2, C3, X}, {C4, C5, X, Y}, and {C6, H, Y, T}. So we get three corresponding TBoxes: T1 = {A1, A2, A7, A8}, T2 = {A3, A4, A9, A10}, and T3 = {A5, A6}. The number of symbols in S* and S*' is 1 (|S*| = |{X}| = 1, |S*'| = |{Y}| = 1). The cardinalities of the three TBoxes are respectively N1 = 4, N2 = 4, and N3 = 2. In this case, the numbers of symbols in the TBoxes are also roughly equal.
The symbol graph of Tfam after decomposition is shown in Figure 3. The resulting TBoxes T1, T2, and T3 preserve all the concepts, roles, and axioms of the original Tfam. In addition, T1 and T2 satisfy the proposed criteria for decomposition.
We have implemented the graph partitioning algorithm based on minimal separators. This method returns results that satisfy the given properties: all concepts, roles, and axioms are preserved through the decomposition, and the relations between them are represented by the edges of the symbol intersection graph. The method minimizes the number of symbols shared between component TBoxes, ensuring the independence of the sub-TBoxes. However, to obtain the resulting TBoxes, the obtained sub-graphs must be transformed back into sets of axioms for the corresponding TBoxes.
IV. DECOMPOSITION BASED ON NORMALIZED CUT
A. Axiom graph
In this section, we propose another decomposition technique based on an axiom graph, which is defined as follows:
Definition 4 (axiom graph): A weighted undirected graph G = (V, E), where V is a set of vertices and E is a set of edges with weight values, is called an axiom graph if each vertex v ∈ V is an axiom in the TBox T, each edge e = (u, v) ∈ E exists if u, v ∈ V and there is at least one shared symbol between u and v, and the weight on e(u, v) is a value representing the similarity between the vertices u and v.
By using only the common symbols between each pair of axioms, we can simply define a weight function w: V × V → R that maps a pair of vertices to a real number. In particular, each edge (i, j) is assigned a value wij describing the connection (association) between axioms Ai and Aj as wij = nij/(ni + nj), where i, j = 1,…,m, i ≠ j, m is the number of axioms in T (m = |A|), ni and nj are the numbers of symbols of Ai and Aj respectively, and nij is the number of symbols shared by Ai and Aj (nij = |Ex(Ai) ∩ Ex(Aj)|).
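A small sketch of this graph construction, assuming each axiom's symbol set Ex(A) has been extracted. Both the nij/(ni + nj) weight above and the intersection-over-union weight used later in the normalized-cut procedure are shown:

    from itertools import combinations

    def axiom_graph(ex):
        """ex: dict mapping axiom id -> Ex(A), the set of symbols in the
        axiom. Returns weighted edges w_ij = n_ij / (n_i + n_j) between
        axioms sharing at least one symbol."""
        edges = {}
        for i, j in combinations(sorted(ex), 2):
            n_ij = len(ex[i] & ex[j])
            if n_ij:
                edges[(i, j)] = n_ij / (len(ex[i]) + len(ex[j]))
        return edges

    def jaccard_weight(u, v):
        """Alternative weighting used in Section IV.C."""
        return len(u & v) / len(u | v)

    edges = axiom_graph({"A1": {"C1", "C2", "X"}, "A2": {"X", "C4", "C5"}})
    # -> {("A1", "A2"): 1/6}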
B. Normalized cut
The ontology decomposition algorithm based on image segmentation is a grouping method using eigenvectors.

Assume that G = (V, E) is divided into two disjoint sets A and B (A ∪ B = V and A ∩ B = ∅) by removing the edges connecting the two parts from the original graph. The association between these parts is the total weight of the removed edges; in the language of graph theory, it is called the cut:

cut(A, B) = Σ_{u∈A, v∈B} w(u, v)    (1)

i.e., the total weight of the connections between the nodes of A and the nodes of B. An optimal decomposition of the graph should not only minimize this disassociation but also maximize the association within every partition. The normalized cut, denoted Ncut, is used to measure the disassociation:

Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V)    (2)

where assoc(A, V) = Σ_{u∈A, t∈V} w(u, t) is the total weight of connections from the nodes of A to all nodes of V. Similarly, the normalized association Nassoc is defined by:

Nassoc(A, B) = assoc(A, A)/assoc(A, V) + assoc(B, B)/assoc(B, V)    (3)

where assoc(A, A) and assoc(B, B) are the total weights of the edges within A and within B respectively. The optimal division of the graph thus reduces to minimizing Ncut while maximizing Nassoc within the partitions.
It is easy to see that Ncut(A, B) = 2 − Nassoc(A, B). This is an important property of the decomposition: the two criteria obtained from (2) and (3), minimizing the disassociation between the parts and maximizing the association within each part, are in fact identical and can be satisfied simultaneously.
Unfortunately, minimizing the normalized cut exactly is NP-complete, even for the special case of graphs on grids. However, the authors of [5] showed that if the normalized cut problem is relaxed to the real value domain, an approximate solution can be found efficiently.
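Before turning to the relaxation, the definitions (1)-(3) translate directly into a few lines over a weight matrix; a sketch:

    import numpy as np

    def ncut(W, A, B):
        """Eq. (2): W is the symmetric weight matrix; A and B are disjoint
        lists of vertex indices covering all vertices."""
        cut = W[np.ix_(A, B)].sum()          # eq. (1)
        assoc_A = W[A, :].sum()              # assoc(A, V)
        assoc_B = W[B, :].sum()              # assoc(B, V)
        return cut / assoc_A + cut / assoc_B # note: Nassoc = 2 - Ncut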
C. Algorithm
Given an N-dimensional vector x, N = |V|, where xi = 1 if node i is in A and xi = −1 otherwise. Let di = Σj w(i, j) be the total weight of connections from i to all other nodes. Let D be an N × N diagonal matrix with d on its main diagonal, and W an N × N symmetric matrix with W(i, j) = wij.
The ontology decomposition algorithm based on image segmentation consists of the following steps:
1) Transform the set of axioms A into an axiom graph G = (V, E) with V = {v | v ∈ A} and E = {(u, v) | u, v ∈ V}, where w(u, v) = |Ex(u) ∩ Ex(v)| / |Ex(u) ∪ Ex(v)|.
2) Find the minimum value of NCut by solving (D − W)x = λDx for the eigenvectors corresponding to the smallest eigenvalues.
3) Use the eigenvector with the second smallest eigenvalue to decompose the graph into two parts. In the ideal case, this eigenvector takes only two values, and the signs of the values indicate the graph decomposition.
4) Run the algorithm recursively on the two decomposed sub-graphs.
The TBox decomposition algorithm based on the normalized cut [5] is illustrated by the procedure DIVISION-TBOX-NC (Table 4). This procedure takes a TBox T with the set of axioms A as input. It transforms A into an axiom graph G = (V, E), where each axiom Ai of A is a vertex i ∈ V, and each edge (i, j) ∈ E is assigned the weight w(i, j) = |Ex(Ai) ∩ Ex(Aj)| / |Ex(Ai) ∪ Ex(Aj)|. Then the process proceeds as in the procedure shown in Table 4.
DIVISION-TBOX-NC uses the DIVISION-GRAPH-A procedure (Table 5) for dividing the axiom graph representing T. This procedure takes the axiom graph G as input and computes the matrices W and D: W is an N × N weight matrix with entries w(i, j) as above, and D is an N × N diagonal matrix with the values d(i) = Σj w(i, j) on its diagonal. Then we solve the equation (D − W)y = λDy under the constraints yᵀDe = 0 and yi ∈ {2, −2b}, where e is an N × 1 vector of all ones, to find the smallest eigenvalues. The second smallest eigenvalue is chosen; it gives the minimal value of NCut. We take the eigenvector corresponding to this eigenvalue to divide G into two parts G1 and G2. Finally, DIVISION-GRAPH-A updates the variable Gp as in the method based on the minimal separator. The procedure can be applied recursively: each recursive call on Gi finds an eigenvector with the second smallest eigenvalue and the process continues on Gi.
TABLE 4. AN ALGORITHM FOR TRANSFORMING THE TBOX INTO AN AXIOM GRAPH

Input: the TBox T with the set of axioms A
Output: the decomposition graph Gp = (Vp, Ep) and T.
PROCEDURE DIVISION-TBOX-NC(A)
(1) Transform the set of axioms A into an axiom graph G = (V, E) with V = {v | v ∈ A} and E = {(u, v) | u, v ∈ V, w(u, v) = |Ex(u) ∩ Ex(v)| / |Ex(u) ∪ Ex(v)|}.
(2) Let Gp = (Vp, Ep) be an undirected graph with Vp = {V} and Ep = ∅.
(3) Execute DIVISION-GRAPH-A(G = (V, E)).
(4) For each v ∈ Vp, take Tv = {A ∈ A | A ∈ v}. Return {Tv | v ∈ Vp} and Gp.
TABLE 5. AN ALGORITHM FOR DECOMPOSING THE AXIOM GRAPH

Input: the axiom graph G = (V, E)
Output: the decomposition graph Gp = (Vp, Ep)
PROCEDURE DIVISION-GRAPH-A(G(V, E))
(1) Find the minimal value of NCut by resolving the equation (D − W)x = λDx for eigenvectors with the smallest eigenvalues.
(2) Use the eigenvector with the second smallest eigenvalue to decompose the graph into two sub-graphs G1, G2.
(3) Let Vp ← Vp \ {V} ∪ {V1, V2} and Ep ← Ep ∪ {(V1, V2)}. The edges connecting to V are changed into links to one of V1, V2.
(4) After the graph is divided into two parts, recurse: DIVISION-GRAPH-A(G1), DIVISION-GRAPH-A(G2).
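A compact sketch of the core step of DIVISION-GRAPH-A following Shi and Malik [5], using scipy's generalized symmetric eigensolver; the discrete constraints are relaxed, so this computes the approximate real-valued solution discussed above (it assumes every vertex has at least one edge, so that D is positive definite):

    import numpy as np
    from scipy.linalg import eigh

    def ncut_bipartition(W):
        """Split vertices by the sign of the eigenvector for the second
        smallest eigenvalue of (D - W) y = lambda D y."""
        d = W.sum(axis=1)
        D = np.diag(d)
        vals, vecs = eigh(D - W, D)   # eigenvalues in ascending order;
                                      # vals[1] approximates the minimal NCut
        y = vecs[:, 1]                # second-smallest eigenvalue's eigenvector
        return np.where(y >= 0)[0], np.where(y < 0)[0]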
V. EXPERIMENT AND EVALUATION
The two graph decomposition algorithms, based on the minimal separator and on image segmentation of the ontology, have been implemented and return results satisfying the decomposition criteria.
The first algorithm minimizes the number of shared symbols (|S*| → minimum) and attempts to balance the number of axioms between parts. After the decomposition, axioms are identified in the obtained graph as cliques; however, some cliques do not actually correspond to any axiom, so a mechanism to recover the axioms is needed.
The second algorithm has the advantage of preserving axioms. Through the value of NCut, we can measure the independence between parts and the cohesion among elements within each part. However, to implement an efficient algorithm, a weighting function for the edges connecting the axiom nodes must be provided.
The choice of decomposition algorithm depends on the structure of the original ontology: for example, the second algorithm is used for an ontology presented with many symbols, while the first algorithm is more suitable for an ontology consisting of many axioms.
We applied the two decomposition algorithms, based on the minimal separator and on the normalized cut, to divide a TBox. In this section, we summarize the principal modules implemented in our experiments. To illustrate our results, we take a TBox extracted from the file "tambis.xml" in the FaCT system. This TBox, called Tambis1, consists of 30 axioms.
- Transform the ontology into a symbol graph: this module reads a file presenting a TBox in XML and transforms it into a symbol graph. Figure 4 shows the symbol graph of the Tambis1 TBox, with vertices labeled by the names of concepts and roles.
- Transform the ontology into an axiom graph: this module performs the same function as the above module but produces an axiom graph. Figure 5 shows the axiom graph of the Tambis1 TBox, with vertices labeled by the symbols Ai (i = 0, …, 29).
- Decompose the ontology based on the minimal separator: decomposes the symbol graph into a tree whose leaf nodes are axioms. Figure 6 presents this decomposition for Tambis1.
- Decompose the ontology based on the normalized cut: decomposes the axiom graph into a tree whose leaf nodes are axioms. Figure 7 presents this decomposition for Tambis1.
These two methods return results that satisfy the proposed properties of our decomposition. All the concepts, roles, and axioms are preserved after the decomposition, and the axioms and their relationships are well expressed by the symbol graph and the axiom graph. The axioms of the original TBox are distributed evenly among the sub-TBoxes. The decomposition techniques focus on finding a good decomposition. The method based on the minimal separator minimizes the number of shared symbols between the components and tries to equalize the number of axioms in these parts. We need to recover the axioms after decomposing; this is possible because the axioms were encoded as cliques in the symbol graph. However, in practice the difficulty is that some cliques of the symbol graph and of the intersection graph do not correspond exactly to axioms.
A possible advantage of the decomposition method based on the normalized cut is that it conserves the axioms: after decomposing, we can directly find the axioms in the components. Furthermore, the NCut measure is normalized; it expresses the dissociation between the different parts and the association within each part of the decomposition. However, the effectiveness of this method depends on choosing appropriate parameters for calculating the similarity relation between two axioms.
We tested on the TBoxes in the FaCT system, such as Vedaall, modkit, people, platt, and tambis. The results show that for axioms whose expressions are more complex, the normalized cut method is much more effective (e.g., Vedaall, modkit), while the minimal separator method performs better on simple axioms (e.g., platt, tambis).
VI. CONCLUSION
In this paper we have presented two techniques for decomposing ontologies in Description Logics (at the TBox level). Our decomposition methods aim to reduce the number of GCIs [8], one of the main factors behind the complexity of reasoning algorithms. The TBox separation method based on minimal separators considers axioms only from a syntactic point of view. We examine the simplest case, where concept and role atoms are treated as equivalent symbols in the axioms. However, in reality they have different meanings: for example, the concept descriptions C ⊔ D and C ⊓ D are represented by the same symbol graph, although their meanings differ. Therefore, we will continue developing ontology separation methods that take into account the dependence between symbols based on linking elements and the semantics of the axioms. In addition, we will examine query processing on decomposed ontologies.
REFERENCES
[1] B. Konev, C. Lutz, D. Ponomaryov, and F. Wolter, "Decomposing description logic ontologies," in Procs. of KR 2010, 2010.
[2] C. Del Vescovo, D. D. G. Gessler, P. Klinov, and B. Parsia, "Decomposition and modular structure of BioPortal ontologies," in Procs. of the 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011.
[3] D. Jungnickel, Graphs, Networks and Algorithms, Springer, 1999.
[4] E. Amir and S. McIlraith, "Partition-based logical reasoning for first-order and propositional theories," Artificial Intelligence, Volume 162, February 2005, pp. 49-88.
[5] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905, August 2000.
[6] K. Shoikhet and D. Geiger, "Finding optimal triangulations via minimal vertex separators," in Procs. of the 3rd International Conference, pp. 270-281, Cambridge, MA, October 1992.
[7] T. A. L. Pham and N. Le-Thanh, "Some approaches of ontology decomposing in description logics," in Procs. of the 14th ISPE International Conference on Concurrent Engineering: Research and Applications, pp. 543-542, Brazil, July 2007.
[8] T. A. L. Pham, N. Le-Thanh, and P. Sander, "Decomposition-based reasoning for large knowledge bases in description logics," Integrated Computer-Aided Engineering, Volume 15, Issue 1, 2008, pp. 53-70.
[9] T. Kloks and D. Kratsch, "Listing all minimal separators of a graph," in Procs. of the 11th Annual Symposium on Theoretical Aspects of Computer Science, Springer, Lecture Notes in Computer Science, 775, pp. 759-768.
Figure 4. Symbol graph of Tambis1
Figure 5. Axiom graph of Tambis1
Figure 6. Decomposition graph based on minimal separator of Tambis1
Figure 7. Decomposition graph based on normalized cut of Tambis1
An Ontological Analysis of Common Research Interest for Researchers
Nawarat Kamsiang and Twittie Senivongse Computer Science Program, Department of Computer Engineering
Faculty of Engineering, Chulalongkorn University Bangkok, Thailand
[email protected], [email protected]
Abstract—This paper explores a methodology and develops a tool to analyze common research interest for researchers. The analysis can be useful for researchers in establishing further collaboration, as it can help to identify the areas and degree of interest that any two researchers share. Using keywords from the publications indexed by ISI Web of Knowledge, we build ontological research profiles for the researchers. Our methodology builds upon existing approaches to ontology building and ontology matching, in which the comparison between research profiles is based on name similarity and linguistic similarity between the terms in the two profiles. In addition, we add the concept of depth weights to ontology matching. A depth weight of a pair of matched terms is determined by the distance of the terms from the root of their ontologies. We see that more attention should be paid to matched pairs located near the bottom of the ontologies than to those near the root, since the former represent more specialized areas of interest. A comparison between our methodology and an existing ontology matching approach, using the OAEI 2011 benchmark, shows that the concept of depth weights gives better precision but lower recall.
Keywords-ontology building; ontology matching; profile matching
I. INTRODUCTION
Internet technology has become a major tool that enriches the way people interact, express ideas, and share knowledge. Through different means such as personal Web sites, social networking applications, blogging, and discussion boards, people express their opinions, interest, and knowledge in particular matters, from which a connection or relationship can be drawn. A community of practice [1] can also be formed among a group of people who share a common interest or a profession so that they can learn from and collaborate with each other. For academics and researchers, it is useful to know who does what, as well as who shares common interest, for the purpose of potential research collaboration.
Different approaches have been taken to draw associations between researchers. One is the use of bibliometrics to evaluate research activities, performance of researchers, and research specialization [2]. It is based on the enumeration and statistical analysis of scientific output such as publications, citations, and patents. The main bibliometric indicators are activity indicators and relational indicators. Activity indicators include the number of papers/patents, the number of citations, and the number of co-signers, indicating cooperation at the national and international level. Relational indicators include, for example, co-publication, which indicates cooperation between institutions; co-citation, which indicates the impact of two papers that are cited together; and scientific links measured by citations, which record who cites whom and who is cited by whom, in order to trace the influence between different communities. Another common approach is the analysis of research profiles. Such profiles can be constructed by gathering or mining information from electronic sources such as Web sites, publications, blogs, personal and research project documents, etc. This is followed by discovering researcher expertise as well as semantic correspondences between researcher profiles.
We are interested in the latter approach and use ontology as a means to describe research profiles. The idea that we explore is building ontological research profiles and using an ontology matching algorithm to compare similarity between profiles. To build an ontological research profile, we obtain keywords from the researcher's publications that are indexed by ISI Web of Knowledge [3] and apply the Obtaining Domain Ontology (ODO) algorithm by An et al. [4] to build an ontology of terms that are related to the keywords. Terms in the profile are discovered by using the WordNet lexical database [5]. To compare two research profiles, we adopt an algorithm called the Multi-level Matching Algorithm with the neighbor search algorithm (MLMA+) proposed by Alasoud et al. [6], [7]. The algorithm considers name similarity and linguistic similarity between terms in the profiles. In addition, we add the concept of depth weights to ontology matching. A depth weight of a pair of matched terms is determined by the distance of the terms from the root of their ontologies. The motivation behind this is that we would pay more attention to similar matched pairs located near the bottom of the ontologies than to those near the root, since the terms at the bottom are considered more specialized areas of interest. A comparison between our methodology and MLMA+ is conducted using the OAEI 2011 benchmark [8].
Section II of this paper discusses related work. Section III describes the algorithm for building ontological research profiles and a supporting tool. Section IV describes matching of the profiles. An evaluation of the methodology is presented
in Section V and the paper concludes in Section VI with future outlook.
II. RELATED WORK
Many research efforts analyze vast pools of information to find people with particular expertise, connections between these people, and shared interest among people. Some of these apply to research and academia. Tang et al. [9] present ArnetMiner, which can automatically extract researcher profiles from the Web and integrate the profiles with publication information crawled from several digital libraries. The schema of the profiles is an extension of the Friend-of-a-Friend (FOAF) ontology. They model the academic network using an author-conference-topic model to support searching for expert authors, papers, and conferences for a given query. Zhang et al. [10] construct an expertise network from posting-replying threads in an online community. A user's expertise level can be inferred from the number of replies the user has posted to help others and from whom the user has helped. Punnarut and Sriharee [11] use publication and research project information from Thai conferences and the research council to build ontological research profiles of researchers. They use the ACM computing classification system as a basis for expertise scoring, matching, and ranking. Trigo [12] extracts researcher information from Web pages of research units and publications from the online database DBLP. Text mining is used to find terms that represent each publication, and then similarity between researchers with regard to their publications is computed. For further visualization of the data, clustering and social network analysis are applied. Yang et al. [13] analyze the personal interest in the online profile of a researcher and metadata of publications such as keywords, conference themes, and co-authors of the papers. By measuring similarity between such researcher data, the social network of a researcher is constructed.
It is seen that in the approaches above, various mining techniques are used in extracting information and discovering knowledge about researchers and their relationships, and the major source of researcher information is bibliographic information in online libraries. We are interested in trying a different and more lightweight approach to finding similar interest between researchers and their degree of similarity. We focus on using an ontology building algorithm to create research profiles followed by an ontology matching algorithm to find similarity between the profiles.
III. BUILDING RESEARCH PROFILES
In this section and the next, we describe our methodology together with a supporting tool that has been developed. The first part of the methodology is building research profiles for researchers. Like other related work, keywords from researchers’ publications are used to represent research interest.
A. Researcher Information
We retrieve researchers' publication information over a ten-year period (2002-2011), i.e., author names, keywords, subject area, and year published, from ISI Web of Knowledge [3] and store it in a MySQL database for processing by the Web-based tool, which is developed in PHP. Using the tool (Fig. 1), we can specify a pair of authors, a subject area, and the year published, and the tool will retrieve the corresponding keywords from the database. The tool lists the keywords by frequency of occurrence, and from the list we can select the ones that will be used for building the profiles. In the figure, we use an example of two authors named B. Kijsirikul and C. A. Ratanamahatana under the Computer Science area. The five top keywords are selected as starting terms for building their profiles.
B. Research Profile Building Algorithm
In this step, we build a research profile as an ontology. We follow the Obtaining Domain Ontology (ODO) algorithm proposed by An et al. [4] since it is intuitive and can automatically derive a domain-specific ontology from any items of descriptive information (i.e., keywords, in this case). The general idea is to augment the starting keywords with terms and hypernym (i.e., parent) relations from WordNet [5] to construct ontology fragments as Directed Acyclic Graphs. The iterative process of weaving in WordNet terms and joining identical terms ties the ontology fragments into one ontology representing research interest. Fig. 2 and Fig. 3 are Kijsirikul's and Ratanamahatana's profiles built from their top five keywords. The steps of the algorithm, and the enhancements we make to tailor it to ISI keywords, are as follows.
1) Select starting keywords: Select keywords as starting terms. For Kijsirikul, they are Dimensionality reduction, Semi-supervised learning, Transductive learning, Spectral methods, and Manifold learning. For Ratanamahatana, they are Kolmogorov complexity, dynamic time warping, parameter-free data mining, anomaly detection, and clustering.
Figure 1. Specifying authors for profile building.
Figure 2. Kijsirikul’s ontological profile.
Figure 3. Ratanamahatana’s ontological profile.
2) Find hypernyms in WordNet: For each term, look it up in WordNet for its hypernyms. Since a term may have several hypernyms, for simplicity we select the one with the maximum tag count, which denotes that the hypernym of a particular sense (or meaning) is the most frequently used and tagged in various semantic concordance texts. In Fig. 3, the starting term clustering has the term agglomeration as its hypernym.
If the term does not exist in WordNet but may be in a plural form (i.e., it ends with "ches", "shes", "sses", "ies", "ses", "xes", "zes", or "s"), change it to a singular form before looking up its hypernyms again. It is possible that one starting keyword may be found to be a hypernym of another. It is also possible that no hypernym is found for the term. If so, follow step 3).
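A minimal sketch of this plural-to-singular fallback follows. The suffix list mirrors the one in the text, while the stemming rules (drop "es" or "s", map "ies" to "y") are our own rough assumptions; irregular plurals would need WordNet's own morphological processing.

import java.util.Optional;

public final class PluralFallback {

    /** Returns a singular candidate for a term that was not found in WordNet. */
    public static Optional<String> singularize(String term) {
        if (term.endsWith("ies"))                       // "ontologies" -> "ontology"
            return Optional.of(term.substring(0, term.length() - 3) + "y");
        for (String suf : new String[] {"ches", "shes", "sses", "xes", "zes", "ses"})
            if (term.endsWith(suf))                     // "churches" -> "church"
                return Optional.of(term.substring(0, term.length() - 2));
        if (term.endsWith("s"))                         // "methods" -> "method"
            return Optional.of(term.substring(0, term.length() - 1));
        return Optional.empty();                        // not in plural form; give up
    }
}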
3) Define hypernyms: If the term does not exist in WordNet, do any of the following.
a) Use subject area as hypernym: If the term is a single word or an acronym, use the subject area of the author as its hypernym. Some ISI subject areas contain “&”, so in this case the words before and after “&” become hypernyms. For example, if the subject area is Science & Technology, Science and Technology become two parents of the term.
b) Use the generalized form of the term as hypernym: This is in accordance with the lexico-syntactic pattern technique in [14] which considers syntactic patterns of
sentences to discover hypernym relations, but here we consider the pattern of the term. That is, if the term is a noun phrase consisting of a head noun and modifier(s), generalize the term by removing one modifier at a time and look up in WordNet. If found, use that generalized form as the hypernym. For example, in Fig. 2, the term Dimensionality reduction has reduction as the head noun and Dimensionality as the modifier, removing the modifier leaves us with the head noun reduction which can be found in WordNet, so reduction becomes the parent. In Fig. 3, the term parameter-free data mining has mining as the head noun, and parameter-free and data as modifiers. Removing parameter-free leaves us with the more generalized term data mining which can be found in WordNet and hence it becomes the parent. In the case that none of the generalized forms of the term are in WordNet, use the subject area as the hypernym.
Some ISI keywords comprise a main term and an acronym in different formats, e.g., finite element method (FEM) or PTPC (percutaneous transhepatic portal catheterization). We consider the main term and apply the technique above. Therefore, the parent of finite element method (FEM) is method and the parent of PTPC (percutaneous transhepatic portal catheterization) is catheterization.
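The generalization in step 3b) can be sketched as follows, assuming the WordNet lookup is passed in as a predicate. The class and method names are illustrative, and the leftmost-modifier-first order follows the parameter-free data mining example above.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

public final class TermGeneralizer {

    /** e.g. "parameter-free data mining" -> tries "data mining", then "mining". */
    public static Optional<String> generalize(String term, Predicate<String> inWordNet) {
        List<String> words = new ArrayList<>(Arrays.asList(term.split("\\s+")));
        while (words.size() > 1) {
            words.remove(0);                          // drop the leftmost modifier
            String candidate = String.join(" ", words);
            if (inWordNet.test(candidate)) return Optional.of(candidate);
        }
        return Optional.empty();                      // caller falls back to the subject area
    }
}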
4) Build up ontology: Several parent-child relations that result from finding hypernyms become ontology fragments. Repeat steps 2) and 3) to further interweave hypernym terms until no more hypernyms can be found.
5) Merge ontology fragments: The final step is to merge the ontology fragments. If a term is found in two ontology fragments, the fragments are joined. At a joined node, if there are several upward paths from the node to the roots (from different ontology fragments), we pick the shortest path for simplicity. In Fig. 2, the five ontology fragments, each with a starting keyword as its terminal node, merge at the learning, knowledge, and psychological feature nodes. Since merging at psychological feature results in one single ontology, the parents of psychological feature (i.e., abstraction -> entity) are dropped.
Another example that will be discussed in the next section is the profile of an author named A. Sudsang under Robotics area (Fig. 4). Five starting keywords are Grasping, grasp heuristic, Caging, positive span, and capture regions.
Figure 4. Sudsang’s ontological profile.
IV. MATCHING RESEARCH PROFILES
In this step, we match two ontological profiles. We adopt an effective algorithm called Multi-level Matching Algorithm with the neighbor search algorithm (MLMA+) proposed by Alasoud et al. [6], [7] since it uses different similarity measures to determine similarity between terms in the ontologies and also considers matching n terms in one ontology with m terms in another at the same time.
A. MLMA+
The original MLMA+ algorithm for ontology matching is shown in Fig. 5. It has three phases.
1) Initialization phase: First, preliminary matching techniques are applied to determine similarity between terms in the two ontologies, S and T. Similarity measures that are used are name similarity (Levenshtein distance) and linguistic similarity (WordNet). Levenshtein distance determines the minimal number of insertions, deletions, and substitutions to make two strings equal [15]. For linguistic similarity, we determine semantic similarity between a pair of terms using a Perl module in WordNet::Similarity package [16]. Given Kijsirikul’s ontology S comprising n terms and Ratanamahatana’s ontology T comprising m terms, we compute a similarity matrix L(i, j) of size n x m which includes values in the range [0,1] called similarity coefficients, denoting the degree of similarity between the terms si in S and tj in T. A similarity coefficient is computed as an average of name similarity and linguistic similarity. For example, if Levenshtein distance between the terms s10 (change) and t23 (damage) is 0.2 and semantic similarity is 0.933, the similarity coefficient of these two terms is 0.567. The similarity matrix L for Kijsirikul and Ratanamahatana is shown in Fig. 6.
Then, a user-defined threshold th is applied to the matrix L to create a binary matrix Map0-1. The similarity coefficient that is less than the threshold becomes 0 in Map0-1, otherwise it is 1. In other words, the threshold determines which pairs of terms are considered similar or matched by the user. Fig. 6 also shows Map0-1 for Kijsirikul and Ratanamahatana with th = 0.5. It represents the state that s2 is matched to t14, s10 is matched to t14 and t23 etc. This Map0-1 becomes the initial state St0 for the neighbor search algorithm.
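The initialization phase can be sketched as follows, under our own naming assumptions: name similarity is one minus the normalized Levenshtein distance, linguistic similarity is assumed to be supplied externally by a WordNet::Similarity-style oracle, and the threshold converts L into the binary Map0-1.

public final class InitPhase {

    /** L(i, j) = average of name similarity and linguistic similarity. */
    public static double[][] similarityMatrix(String[] s, String[] t,
                                              double[][] linguisticSim) {
        double[][] l = new double[s.length][t.length];
        for (int i = 0; i < s.length; i++)
            for (int j = 0; j < t.length; j++)
                l[i][j] = (nameSim(s[i], t[j]) + linguisticSim[i][j]) / 2.0;
        return l;
    }

    /** Map0-1(i, j) = 1 iff L(i, j) reaches the user-defined threshold th. */
    public static int[][] threshold(double[][] l, double th) {
        int[][] map = new int[l.length][l[0].length];
        for (int i = 0; i < l.length; i++)
            for (int j = 0; j < l[0].length; j++)
                map[i][j] = l[i][j] >= th ? 1 : 0;
        return map;
    }

    /** Name similarity = 1 - (Levenshtein distance / max string length). */
    static double nameSim(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);   // insert/delete/substitute
            }
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) d[a.length()][b.length()] / max;
    }
}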
2) Neighbor search and evaluation phases: In this step, we search in the neighborhood of the initial state St0. Each neighbor Stn is computed by toggling a bit of St0, so the total number of neighbor states is n*m. An example of a neighbor state is in Fig. 7. The initial state and all neighbor states are evaluated using the matching score function v (1) of [6], [7]:
v(Map0-1, L) = (1/k) Σ(i=1..n) Σ(j=1..m) Map0-1(i, j) × L(i, j)    (1)
where k is the number of matched pairs and Map0-1 is Stn . The state with the maximum score is the answer to the matching; it indicates which terms in S and T are matched and the score represents the degree of similarity between S and T.
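The evaluation of neighbor states can be sketched as follows, assuming the score function (1) as reconstructed above; toggling a single bit of the current 0-1 map yields each of the n*m neighbors just described. Names are illustrative only.

public final class NeighborSearch {

    /** Matching score v of equation (1); k = number of matched pairs. */
    static double score(int[][] map, double[][] l) {
        double sum = 0.0;
        int k = 0;
        for (int i = 0; i < map.length; i++)
            for (int j = 0; j < map[0].length; j++)
                if (map[i][j] == 1) { sum += l[i][j]; k++; }
        return k == 0 ? 0.0 : sum / k;
    }

    /** Evaluates the initial state and all n*m one-bit neighbors, keeping the best. */
    static int[][] bestState(int[][] st0, double[][] l) {
        int[][] best = st0;
        double bestScore = score(st0, l);
        for (int i = 0; i < st0.length; i++)
            for (int j = 0; j < st0[0].length; j++) {
                int[][] cand = copy(st0);
                cand[i][j] ^= 1;                      // toggle one bit of Map0-1
                double s = score(cand, l);
                if (s > bestScore) { bestScore = s; best = cand; }
            }
        return best;
    }

    static int[][] copy(int[][] m) {
        int[][] c = new int[m.length][];
        for (int i = 0; i < m.length; i++) c[i] = m[i].clone();
        return c;
    }
}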
Figure 5. Ontology matching algorithm MLMA+ [6], [7].
Figure 6. L and initial Map0-1 based on MLMA+.
Figure 7. Example of a neighbor state of initial Map0-1 in Fig. 6.
B. Modification to MLMA+
We make a change to the initialization phase of MLMA+ by adding the concept of depth weights, which is inspired by [17]. A depth weight of a pair of matched terms is determined by the distance of the terms from the root of their ontologies. The motivation behind this is that we would pay more attention to similar matched pairs located near the bottom of the ontologies than to those near the root, since the terms near the bottom are considered more specialized areas of interest. From Fig. 6, consider s2 = event and t14 = occurrence. The two terms have similarity coefficient = 0.51. They are relatively more generalized terms in the profiles compared to the pair s10 = change and t23 = damage with similarity coefficient = 0.567. But both pairs are equally considered as matched interest. We are in favor of the matched pairs that are relatively more specialized, and are motivated to decrease the degree of similarity between generalized matched pairs by using a depth weight function w (2):
Algorithm Match (S, T)
begin
  /* Initialization phase */
  K ← 0;
  St0 ← preliminary_matching_techniques(S, T);
  Stf ← St0;
  /* Neighbor Search phase */
  St ← All_Neighbors(St0);
  While (K++ < Max_iteration) do
    /* Evaluation phase */
    If score(Stn) > score(Stf) then
      Stf ← Stn;
    end if
    Pick the next neighbor Stn ∈ St;
    St ← St − {Stn};
    If St = ∅ then return Stf;
  end
  Return Stf;
end
wij = (rdepth(si) + rdepth(tj) ) / 2 ; wij is in (0,1] (2)
where rdepth(t) = relative distance of the term t from the root of its ontology
                = (depth of the term t in its ontology) / (height of the ontology of t).
This depth weight will be multiplied with the similarity coefficient between si and tj to obtain a weighted similarity coefficient. Therefore the similarity matrix L(i, j) would change to include weighted similarity coefficients between the terms si and tj instead.
For s2 = event and t14 = occurrence in Fig. 2 and Fig. 3, rdepth(s2) = 2/8 and rdepth(t14) = 5/10. Their depth weight w would be 0.375 and hence their weighted similarity coefficient would change from 0.51 to 0.191 (0.375*0.51). But for s10 = change and t23 = damage, rdepth(s10) = 5/8 and rdepth(t23) = 7/10. Their depth weight w would be 0.663 and hence their weighted similarity coefficient would change from 0.567 to 0.376 (0.663*0.567). It is seen that the more generalized the matched terms, the more they are "penalized" by the depth weight. Any matched terms that are both terminal nodes of their ontologies would not be penalized (i.e., w = 1). Fig. 8 shows the new similarity matrix L, with weighted similarity coefficients, and the new initial Map0-1 for Kijsirikul and Ratanamahatana, where th = 0.35. Note that the pair s2 = event and t14 = occurrence and the pair s10 = change and t14 = occurrence are considered matched in Fig. 6 but are relatively too generalized and are considered unmatched in Fig. 8. The pair s10 = change and t23 = damage survives the penalty and is considered matched in both figures.
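A small sketch reproducing the arithmetic of this worked example, under the definitions of equation (2); the numbers are the ones given above.

public final class DepthWeight {

    /** rdepth(t) = depth of the term / height of its ontology. */
    static double rdepth(int depth, int ontologyHeight) {
        return (double) depth / ontologyHeight;
    }

    /** w_ij = (rdepth(s_i) + rdepth(t_j)) / 2, in (0, 1]. */
    static double weight(int depthS, int heightS, int depthT, int heightT) {
        return (rdepth(depthS, heightS) + rdepth(depthT, heightT)) / 2.0;
    }

    public static void main(String[] args) {
        double wGeneral = weight(2, 8, 5, 10);      // (event, occurrence): 0.375
        double wSpecial = weight(5, 8, 7, 10);      // (change, damage): 0.6625
        System.out.println(0.51  * wGeneral);       // ~0.191, filtered out at th = 0.35
        System.out.println(0.567 * wSpecial);       // ~0.376, survives the threshold
    }
}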
C. Matching Results of the Example
Table I shows the matching results of the example when the original MLMA+ and its modification are used. Both algorithms agree that Kijsirikul's profile (Machine Learning) is more similar to Ratanamahatana's (Data Mining) than to Sudsang's (Robotics). Matched pairs between Kijsirikul and Ratanamahatana are listed in Table II. MLMA+ gives a long list of matched pairs, including very generalized terms, while depth weights filter some out, giving a more useful list.
Figure 8. L and initial Map0-1 based on MLMA+ with depth weights.
TABLE I. MATCHING SCORES
Algorithm                   Author 1     Author 2         Matching Score
MLMA+                       Kijsirikul   Ratanamahatana   0.627
MLMA+                       Kijsirikul   Sudsang          0.581
MLMA+ with depth weights    Kijsirikul   Ratanamahatana   0.411
MLMA+ with depth weights    Kijsirikul   Sudsang          0.372
TABLE II. MATCHED PAIRS
MLMA+: (psychological feature, psychological feature), (event, event), (event, occurrence), (event, change), (knowledge, process), (power, process), (power, event), (power, quality), (process, process), (process, processing), (act, process), (act, event), (act, change), (action, process), (action, change), (action, detection), (basic cognitive process, basic cognitive process), (change, event), (change, occurrence), (change, change), (change, damage), (change, deformation), (change of magnitude, change), (reduction, change), (knowledge, perception)
MLMA+ with depth weights: (basic cognitive process, basic cognitive process), (change, change), (change, damage), (change, deformation), (change, warping), (reduction, change), (reduction, detection), (reduction, damage), (reduction, deformation), (change of magnitude, deformation)
V. EVALUATION AND DISCUSSION
Our ontology matching algorithm is evaluated using the OAEI
2011 benchmark test sample suite [8]. The benchmark provides a number of test sets in a bibliographic domain, each comprising a test ontology in OWL language and a reference alignment. Each test ontology is a modification to the reference ontology #101 and is to be aligned with the reference ontology. Each reference alignment lists expected alignments. So in the test set #101, the reference ontology is matched to itself, and in the test set #n, the test ontology #n is matched to the reference ontology. The quality indicators we use are precision (3), recall (4), and F-measure (5).
Precision = (no. of expected alignments found as matched by the algorithm) / (no. of matched pairs found by the algorithm)   (3)

Recall = (no. of expected alignments found as matched by the algorithm) / (no. of expected alignments)   (4)

F-measure = (2 × Precision × Recall) / (Precision + Recall)   (5)
Table III shows the evaluation results with th = 0.5. We group the test sets into four groups. Test sets #101-104 contain test ontologies that are more generalized or restricted than the reference ontology, obtained by removing or replacing OWL constructs that make the concepts in the reference ontology generalized or restricted. Test sets #221-247 contain test ontologies with structural changes such as no specialization, a flattened hierarchy, an expanded hierarchy, no instances, or no properties. The quality of both algorithms with respect to these two groups is quite similar, since these modifications do not affect the string-based and linguistic similarities which are the basis of both algorithms. Test sets #201-210 contain test ontologies that involve changes of names in the reference ontology, such as renaming with random strings, misspelling, synonyms, a certain naming convention, or translation into a foreign language. Both algorithms are more sensitive to this group. Test sets #301-304 contain test ontologies which are actual bibliographic ontologies.
According to the average F-measure, MLMA+ with depth weights is of about the same quality as MLMA+, as it gives better precision but lower recall. MLMA+ discovers a large number
of matched pairs, whereas depth weights decrease this number and hence precision is higher. But at the same time, recall is affected. This is because the reference alignments only list pairs of terms that are expected to match. That is, for example, if the test ontology and the reference ontology contain the same term, the algorithm should be able to discover a match. But MLMA+ with depth weights considers the presence of the terms in the ontologies as well as their location in the ontologies. So an expected alignment in a reference alignment may be considered unmatched if its terms are near the roots of the ontologies and are penalized by the algorithm.
The user-defined threshold th in the initialization phase of MLMA+ is a parameter that affects precision and recall. If th is too high, only identical terms from the two ontologies are considered matched pairs (e.g., (psychological feature, psychological feature)), and these identical pairs are mostly located near the roots of the ontologies. We see that discovering only identical matched pairs is not very interesting, given that the benefit of using WordNet and linguistic similarity between non-identical terms would not show in the matching result. On the contrary, if th is too low, matched pairs proliferate because, even if a matched pair is penalized by its depth weight, its weighted similarity coefficient remains greater than the low th. The value of th that we use for the data set in the experiment trades off these two aspects; it is the highest threshold at which the matching result contains both identical and non-identical matched pairs.
The complexity of the ODO algorithm for building an ontology S depends on the number of terms in S and the size of the search space when joining any identical terms in S into single nodes, i.e., O(n²), where the number of ontology terms n = number of starting keywords * depth of S, given that, in the worst case, all starting keywords have the same depth. For MLMA+ and MLMA+ with depth weights, the complexity depends on the size of the search space when matching two ontologies S and T, i.e., O((n*m)²), where n and m are the sizes of S and T respectively.
VI. CONCLUSION
This work presents an ontology-based methodology and a supporting Web-based tool for (1) building research profiles from ISI keywords and WordNet terms by applying the ODO algorithm, and (2) finding similarity between the profiles using MLMA+ with depth weights. An evaluation using the OAEI 2011 benchmark shows that depth weights can give good precision but lower recall.
TABLE III. EVALUATION RESULTS
Test Set    MLMA+ (Prec. / Rec. / F-measure)    MLMA+ with Depth Weights (Prec. / Rec. / F-measure)
#101-104    0.74 / 1.0  / 0.85                  0.93 / 0.84 / 0.88
#201-210    0.35 / 0.24 / 0.26                  0.68 / 0.18 / 0.27
#221-247    0.71 / 0.99 / 0.82                  0.94 / 0.66 / 0.75
#301-304    0.56 / 0.75 / 0.64                  0.90 / 0.57 / 0.68
Average     0.59 / 0.74 / 0.64                  0.86 / 0.56 / 0.64
For future work, further evaluation using a larger corpus and evaluation on performance of the algorithms are expected. An experience report on practical use of the methodology will be presented. It is also possible to adjust the ontology matching step so that the structure of the ontologies and the context of the terms are considered. In addition, we expect to explore if the methodology can be useful for discovering potential cross-field collaboration.
REFERENCES
[1] A. Cox, "What are communities of practice? A comparative review of four seminal works," J. of Information Science, vol. 31, no. 6, pp. 527-540, December 2005.
[2] Y. Okubo, Bibliometric Indicators and Analysis of Research Systems: Methods and Examples, Paris: OECD Publishing, 1997.
[3] ISI Web of Knowledge, http://www.isiknowledge.com, Last accessed: January 24, 2012.
[4] Y. J. An, J. Geller, Y. Wu, and S. A. Chun, "Automatic generation of ontology from the deep Web," in Procs. of 18th Int. Workshop on Database and Expert Systems Applications (DEXA'07), 2007, pp. 470-474.
[5] WordNet, http://wordnet.princeton.edu/, Last accessed: January 24, 2012.
[6] A. Alasoud, V. Haarslev, and N. Shiri, "An empirical comparison of ontology matching techniques," J. of Information Science, vol. 35, pp. 379-397, March 2009.
[7] A. Alasoud, V. Haarslev, and N. Shiri, "An effective ontology matching technique," in Procs. of 17th Int. Conf. on Foundations of Intelligent Systems, 2008, pp. 585-590.
[8] Ontology Alignment Evaluation Initiative 2011 Campaign, http://oaei.ontologymatching.org/2011/, Last accessed: January 24, 2012.
[9] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, "ArnetMiner: Extraction and mining of academic social networks," in Procs. of 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2008), 2008, pp. 990-998.
[10] J. Zhang, M. Ackerman, and L. Adamic, "Expertise networks in online communities: structure and algorithms," in Procs. of 16th Int. World Wide Web Conf. (WWW 2007), 2007, pp. 221-230.
[11] R. Punnarut and G. Sriharee, "A researcher expertise search system using ontology-based data mining," in Procs. of 7th Asia-Pacific Conference on Conceptual Modelling (APCCM 2010), 2010, pp. 71-78.
[12] L. Trigo, "Studying researcher communities using text mining on online bibliographic databases," in Procs. of 15th Portuguese Conf. on Artificial Intelligence, 2011, pp. 845-857.
[13] Y. Yang, C. A. Yueng, M. J. Weal, and H. C. Davis, "The researcher social network: A social network based on metadata of scientific publications," in Procs. of Web Science Conf. 2009 (WebSci 2009), 2009.
[14] M. A. Hearst, "Automated discovery of WordNet relations," in WordNet: An Electronic Lexical Database and Some of its Applications, Cambridge, MA: MIT Press, 1998, pp. 132-152.
[15] G. Navarro, "A guided tour to approximate string matching," ACM Computing Surveys, vol. 33, pp. 31-88, March 2001.
[16] WordNet::Similarity, http://sourceforge.net/projects/wn-similarity, Last accessed: January 24, 2012.
[17] H. Yang, S. Liu, P. Fu, H. Qin, and L. Gu, "A semantic distance measure for matching Web services," in Procs. of Int. Conf. on Computational Intelligence and Software Engineering (CiSE), 2009, pp. 1-3.
Automated Software Development Methodology:
An agent oriented approach
Sudipta Acharya
Dept. of Information Technology
National Institute of Technology
Durgapur, India
[email protected]
Prajna Devi Upadhyay
Dept. of Information Technology
National Institute of Technology
Durgapur, India
Animesh Dutta
Dept. of Information Technology
National Institute of Technology
Durgapur, India
Abstract— In this paper, we propose an automated software development methodology. The methodology is conceptualized with the notion of agents, which are autonomous goal-driven software entities. They coordinate and cooperate with each other, like humans in a society, to achieve goals by performing a set of tasks in the system. Initially, the requirements of the newly proposed system are captured from stakeholders and then analyzed in a goal-oriented model. Finally, the requirements are specified in the form of a goal graph, which is the input to the automated system. The automated system then generates the MAS (Multi Agent System) architecture and the coordination of the agent society needed to satisfy the set of requirements by consulting the domain ontology of the system.
Keywords-Agent; Multi Agent System; Agent Oriented Software Engineering; Domain Ontology; MAS Architecture; MAS Coordination; Goal Graph.
I. INTRODUCTION
A. Agent and Multi-Agent System
An agent [1, 2] is a computer system or software that can act autonomously in any environment; it makes its own decisions about what activities to do, when to do them, what type of information should be communicated and to whom, and how to assimilate the information received. Multi-agent systems (MAS) [1, 2] are computational systems in which two or more agents interact or work together to perform a set of tasks or to satisfy a set of goals.
B. Agent Oriented Software Engineering
The advancement from assembly-level programming to procedures and functions, and finally to objects, took place in order to model computing in the way we interpret the world. But there are inherent limitations in an object that make it incapable of modeling a real-world entity. It was for this reason that we moved to agents and multi-agent systems, which model a real-world entity in a better way. As agent technology has become
more accepted, agent oriented software engineering (AOSE)
also has become an important topic for software developers
who wish to develop reliable and robust agent-based software
systems [3, 4, 5]. Methodologies for AOSE attempt to provide
a method for engineering practical multi agent systems.
Recently, transformation systems based on formal models to
support agent system synthesis are emerging fields of
research. There are currently few AOSE methodologies for
multi agent systems, and many of those are still under
development.
II. RELATED WORK
Recent work has focused on applying formal methods to
develop a transformation system to support agent system
synthesis. Formal transformation systems [6, 7, 8] provide
automated support to system development, giving the designer
increased confidence that the resulting system will operate
correctly, despite its complexity. In [9], the authors propose a goal-oriented language, GRL, and a scenario-oriented architectural notation, UCM, to help visualize the incremental refinement of an architecture from an initially abstract description. But the proposed methodology is informal, and because of this the architecture will vary from developer to developer. In [10, 11]
a methodology for multi agent system development based on
goal model is proposed. Here, the MADE (Multi Agent Development Environment) tool has been developed to reduce the gap between design and implementation. The tool takes the
agent design as input and generates the code for
implementation. The agent design has to be provided
manually. Automation has not been shown for generation of
design from user requirements. A procedure to map the
requirements to agent architecture is proposed in [12]. The
TROPOS methodology for building agent oriented software
system is introduced in [13]. But the methodologies proposed
in both [12] and [13] are informal approaches.
III. SCOPE OF WORK
There are very few AOSE methodologies for automated design of a system from user requirements, and most of the work follows an informal approach, due to which the system design may not totally satisfy the user requirements. Also, the system design varies from developer to developer. There is a
need to reduce the gap between the requirements specification
and agent design and to develop a standard methodology
which can generate the design from user requirements
irrespective of the developers. In this work we have
concentrated on developing a standard methodology by which
we can generate the design of software from user requirements
which will be developer independent.
In this paper, we develop an automated system which takes
the user requirements as input and generates the MAS
architecture and coordination with the help of domain
knowledge. The basic requirements are analyzed in a goal
oriented fashion [14] and represented in the form of goal graph
while the domain knowledge is represented with the help of
ontology [15]. The output of the developed system is MAS
architecture which consists of a number of agents and their
capabilities and MAS coordination represented through Task
Petri Nets. The Task Petri Nets tool can model the
coordination among the agents to maintain the inherent
dependencies between the tasks.
IV. PROPOSED METHODOLOGY
Fig. 1 represents the architecture of our proposed automated system. The basic requirements are taken from the user as input. Since Requirements Analysis is an informal process, the input requirements can be captured from the user in the form of a text file or any other desirable format. These requirements are further analyzed and represented in the form of a Goal Graph. The domain knowledge is also an input and is represented in the form of ontology. The automated system returns the MAS Architecture and MAS Coordination as output. The MAS Coordination is represented in the form of Task Petri Nets. Thus, the automated system takes the requirements and the domain knowledge as input and generates the MAS Architecture and MAS Coordination as output. So, we can say,
Figure 1. Architecture of the proposed automated system
MAS Architecture = f(Requirements, Domain Ontology)
MAS Coordination = f(Requirements, Domain Ontology, MAS Architecture)
The architecture of the proposed automated system is described in the following sub-sections.
A. Requirements represented by Goal Graph
Agents in a MAS are goal-oriented, i.e., all agents work collaboratively to achieve some goal. The concept of goal has been used in many areas of Computer Science for quite some time. In AI, goals have been used in planning to describe desirable states of the world since the 60s. More recently, goals have been used in Software Engineering to model requirements and non-functional requirements for a software system. Formally, we can define a goal graph as G = (V, E), consisting of:
A set of nodes V = {V1, V2, …, Vn}, where each Vi is a goal to be achieved in the system, 1 <= i <= n.
A set of edges E. There are two types of edges, subgoal edges and happened-before edges, drawn with different arrow styles in the goal graph.
A function subgoal: (V × V) → Bool; subgoal(Vi, Vj) = true if Vj is an immediate sub-goal of Vi.
A function hb: (V × V) → Bool; hb(Vi, Vj) = true if the user specifies that the goal represented by Vi should be satisfied before the goal represented by Vj is satisfied.
A subgoal edge exists between two vertices Vi and Vj if subgoal(Vi, Vj) = true, Vi, Vj ∈ V.
A happened-before edge exists between two vertices Vi and Vj if hb(Vi, Vj) = true, Vi, Vj ∈ V.
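A minimal sketch of this goal graph definition, keeping subgoal and happened-before edges in separate adjacency maps; all names are illustrative rather than taken from the paper's implementation.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class GoalGraph {
    private final Map<String, Set<String>> subgoalEdges = new HashMap<>();
    private final Map<String, Set<String>> hbEdges = new HashMap<>();

    public void addSubgoal(String parent, String child) {
        subgoalEdges.computeIfAbsent(parent, k -> new HashSet<>()).add(child);
    }

    public void addHappenedBefore(String first, String second) {
        hbEdges.computeIfAbsent(first, k -> new HashSet<>()).add(second);
    }

    /** subgoal(Vi, Vj) = true iff Vj is an immediate sub-goal of Vi. */
    public boolean subgoal(String vi, String vj) {
        return subgoalEdges.getOrDefault(vi, Set.of()).contains(vj);
    }

    /** hb(Vi, Vj) = true iff Vi must be satisfied before Vj. */
    public boolean hb(String vi, String vj) {
        return hbEdges.getOrDefault(vi, Set.of()).contains(vj);
    }

    /** Leaf-node sub-goals are what the automated system takes as input. */
    public boolean isLeaf(String v) {
        return subgoalEdges.getOrDefault(v, Set.of()).isEmpty();
    }
}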
B. Domain Knowledge represented by Ontology
A domain ontology [15] defined on a domain M is a tuple O = (TCO, TRO, I, conf, ≤, B, IR), where we have extended the tuple definition by adding another function IR as per our requirements.
TCO = {c1, c2, …, ca} is the set of concept types defined in domain M. Here, TCO = {task, goal}. In the diagram, a concept is drawn with its own node shape.
TRO = {consists_of, happened_before} is the set of relation types. In the diagram, a relation is drawn with a distinct node shape.
I is the set of instances of TCO, from the domain M.
conf: I → TCO associates each instance with a unique concept type.
≤: (TC × TC) ∪ (TR × TR) → {true, false}; ≤(c, d) = true indicates that c is a subtype of d.
B: TR → ℘(TC), where ∀ r ∈ TR, B(r) = {c1, …, cn}, where n is a variable associated with r. The set {c1, …, cn} is an ordered set and may contain duplicate elements. Each ci is called an argument of r. The number of elements of the set is called the arity (or valence) of the relation. B(consists_of) = {goal, …, goal, task, …, task}.
IR: TRO → ℘(I), where ∀ ai ∈ ℘(I), if ai is the ith element of ℘(I), then conf(ai) = ti, where ti is the ith element of B(TRO).
C. Semantic Mapping from Requirements to Ontology Concepts
The process by which the basic keywords of a leaf-node sub-goal are mapped to concepts in the ontology is called semantic mapping. In this paper, the aim of semantic mapping is to find, from the domain ontology, the tasks required to be performed to achieve a sub-goal given as input from the goal graph. Let there be a set of task concepts T = {t1, t2, …, tn} associated with a consists_of relation in an ontology. Let there also be a set of goal concepts G = {g1, g2, …, gm} associated with that consists_of relation. Now suppose requirements come from the user side and, after Requirements Analysis, the goal graph contains a set of leaf-node sub-goals G0 = {G1, G2, G3, …, Gp}. Let ky be a function that maps a sub-goal to its set of keywords; the set of keywords for sub-goal Gi ∈ G0 can be represented by ky(Gi) = {ky1, ky2, …, kyd}. Now the set of tasks T will be performed to achieve sub-goal Gi iff either the mapping
f : ky(Gi) → G is a bijective mapping, where Gi ∈ G0, or
there exists a subset of G0, {Gi, Gj, …, Gk} ⊆ G0, such that
f : ky(Gi) ∪ ky(Gj) ∪ … ∪ ky(Gk) → G is a bijective mapping.
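The bijection condition can be sketched as follows, simplifying the keyword-to-concept correspondence to string equality; in the paper the mapping is semantic, so this is only an illustrative approximation with hypothetical names.

import java.util.Set;

public final class SemanticMapping {

    /** Under string equality, a bijection between finite sets exists iff they are equal. */
    public static boolean bijective(Set<String> keywords, Set<String> goalConcepts) {
        return keywords.equals(goalConcepts);
    }

    /** If this holds, every task t1..tn of the consists_of relation must be performed. */
    public static boolean tasksApply(Set<String> keywordUnion, Set<String> goalConcepts) {
        return bijective(keywordUnion, goalConcepts);
    }
}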
D. MAS Architecture
The MAS architecture consists of a set of agents with their capability sets, i.e., the sets of tasks that the agents can perform. Formally, we can define an agent's architecture as
<AgentID, capability set>
where AgentID is the unique identification number of the agent, and the capability set is the set of tasks {t1, t2, …, tn} that the corresponding agent is able to perform. The MAS architecture can be defined as the set of agents with their corresponding architectures.
E. MAS Coordination represented by Task Petri Nets
A Task Petri Net is an extended Petri net that can model the MAS coordination. It is a six-tuple, TPN = (P, TR, I, O, TOK, Fn), where
P is a finite set of places. There are 8 types of places, P= Pt ∪ Ph ∪ Pc ∪ Pe ∪ Pf ∪ Pr ∪ Pa∪ Pd. Places Ph, Pc, Pe, Pf exist for each task already identified by the interface agent. The description of the different types of places is:
1. Ph: A token in this place indicates that the task represented by this place can run, i.e. all the tasks that were required to be completed for this task to run are completed.
2. Pc: A token in this place indicates that an agent has been assigned for this task.
3. Pe: A token in this place indicates agent and resources have been allocated for the task represented by the place and the task is under execution by the allocated agent.
4. Pf: A token in this place indicates that the task represented by this place has finished execution.
5. Pr: such a place exists for each type of resource in the system: ∀ ri ∈ R, ∃ Pri, 1 ≤ i ≤ q.
6. Pa: such a place exists for each instance of an agent in the system: ∀ ai ∈ A, ∃ Pai, 1 ≤ i ≤ p.
7. Pt: it is the place where the tasks identified by the interface agent initially reside.
8. Pd: such place is created dynamically after the agent has been assigned for the task and the agent decides to divide the tasks into subtasks. For each subtask, a new place is created.
TR is the set of transitions. There are 5 types of transitions, TR = th ∪ tc ∪ te ∪ tf ∪ td, where th, te, and tf exist for every task identified by the interface agent.
1. th: This transition fires if the task it represents is enabled i.e. all the tasks which should be completed for the task to start are complete.
2. tc: This transition fires if the task it represents is assigned an agent which is capable of performing it.
3. te: This transition fires if the all resources required by the task it represents are allocated to it.
4. tf: This transition fires if the task represented by the transition is complete.
5. td: This transition is dynamically created when the agent assigned for the task it represents decides to split the task further into sub-tasks. The subnet that is formed dynamically consists of places and transitions all of which are categorized as Pd or td respectively.
I is the set of input arcs, which are of the following types
1. I1 = Pt × th: task checked for dependency
2. I2 = Pr × te: request for resources
3. I3 = Pe × tf: task completed
4. I4 = Pf × th: interrupt to successor task
5. I5 = (Pc × td) ∪ (Pa × td) ∪ (Pr × td) ∪ (Pd × tf): input arcs of the subnet formed dynamically
O is the set of output arcs, which are of the following types:
1. O1 = th × Ph: task not dependent on any other task
2. O2 = tc × Pc: agent assigned
3. O3 = te × Pe: resources allocated
4. O4 = tf × Pr: resources released
5. O5 = tf × Pf: task completed by agent
6. O6 = tf × Pa: agent released
7. O7 = td × Pd: output arcs of the subnet formed dynamically
TOK is the set of colored tokens present in the system, TOK = {TOK1, TOK2, …, TOKx}, where each TOKi, 1 ≤ i ≤ x, is associated with a function assi_tok defined as:
assi_tok: TOK → Category × Type × N, where Category = the set of all categories of tokens in the system = {T, R, A}; Type = the set of all types of each categoryi ∈ Category, i.e., Type = T ∪ R ∪ A; and N is the set of natural numbers. Let assi_tok(TOKi) = (categoryi,
typei, ni). The function assi_tok satisfies the following constraints:
∀ TOKi: (categoryi = R) → (typei ∈ R) ∧ (1 ≤ ni ≤ inst_R(typei))
∀ TOKi: (categoryi = A) → (typei ∈ A) ∧ (1 ≤ ni ≤ inst_A(typei))
∀ TOKi: (categoryi = T) → (typei ∈ T) ∧ (ni = 1)
assi_tok defines the category, type, and number of instances of each token.
Fn is a function associated with each place and token. It is defined as:
Fn: P × TOK → ℘(TIME × TIME). For a token TOKk ∈ TOK, 1 ≤ k ≤ x, and a place Pl ∈ P, (ai, aj) ∈ Fn(Pl, TOKk), where ai is the entry time of TOKk to place Pl and aj is the exit time of TOKk from place Pl. For a token entering and exiting a place multiple times, |Fn(Pl, TOKk)| = the number of times TOKk entered the place Pl.
The process by which MAS architecture and MAS coordination is generated from requirements is shown as a flowchart in Fig. 2.
Figure 2. Flowchart of the proposed methodology
V. CASE STUDY
Let us start with the case study by applying our proposed methodology. We take “Library system” as our case study application. Fig. 3 shows the ontology of a Library System. The ontology consists of some concepts and relations. There is a TASK concept type in the ontology which describes the task that should be performed to achieve some goal. There are other concepts that collectively describe some sub-goal to be achieved in the library system. For e.g. the concept types “Check”, “Validity”, “Member”, collectively describe the sub-goal “Check the membership validity”. There are two types of relations i) consists_of and ii) happened_before. Here we denote “happened_before” relationship by “H-B”. The “consists_of” relation exists between some set of concepts describing a sub- goal and a set of instances of TASK concept. The “happened_before” relation exists between two instances of TASK concept. In the figure, the “consists_of” relation has incoming arcs from the concepts types “Check”, “Validity, “Member” and outgoing arcs to TASK concepts “Get library identity card of member”, “Check for validity of that ID card”. It means that these two tasks have to be performed to achieve sub-goal described by concepts “Check”, “Validity”, “Member” i.e. tasks “Get library identity card of member” and “Check for validity of that ID card” have to be performed to
achieve sub-goal “Check the membership validity”. The happened_before relationship exists between these two tasks which means that task “Check for validity of that ID card” cannot start until task “Get library identity card of member” is completed. Now consider from user side requirements come as “Delete account of member with member id <i> and book with book id <j> from database.”It is the main goal.
Step 1: We perform a goal-oriented Requirements Analysis of the main requirements. It is an informal process performed by the Requirements Analysts, and after the analysis the requirements are represented by the goal graph shown in Fig. 4.
Step 2: The leaf-node sub-goals are given as input to the automated system. By semantic mapping [16, 17], the system maps each basic keyword of the leaf sub-goals to the goal concepts of the ontology, and finds the set of tasks required to be performed to achieve those sub-goals. This is shown in Fig. 5.
Step 3: The tasks obtained from step 2 are used to form a task graph. The dependencies between these tasks are known from the ontology. The task graph is shown in Fig. 6, where task A implies "Check that the requirement for book id <j> is < threshold; if yes then continue, else stop."
Figure 3. Ontology Diagram of Library System
Figure 4. Goal Graph representation of basic requirements
Figure 5. Procedure for Semantic Mapping
Task B implies "Check in both the book and member databases whether member id <i> has any unreturned book, and whether any copy of book <j> is still unreturned by some member," i.e., task B involves two checking operations.
Task C implies “Delete member id <i> account.”
Task D implies “Remove book id <j> from library”
Task E implies “Delete entry of book id <j> from database”
Figure 6. Task Graph for the set of tasks found from Semantic Mapping
Step 4: Using the task graph of Fig. 6, we find the number of agents and their capability sets, following the methodology shown as a flowchart in Fig. 2; a sketch of this step is given below. The maximum number of concurrent agents at any level is 2, so we create 2 agents, A1 and A2. Let C be assigned to A1's capability set and D to A2's capability set: <A1, {C}>, <A2, {D}>. Both C and D have a single predecessor, B, so B is added to the capability set of either A1 or A2. Let it be added to the capability set of A1, so we have <A1, {B, C}>. Now, B has a single predecessor, A, so A is added to the capability set of A1, giving <A1, {A, B, C}>. There are no other predecessors at a level higher than A. D has a single successor, E, so E is added to the capability set of A2, giving <A2, {D, E}>. The total number of agents deployed is 2 and the MAS architecture is {<A1, {A, B, C}>, <A2, {D, E}>}.
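The walkthrough above suggests the following sketch: one agent is created per maximally concurrent task, and tasks with a single predecessor or successor are folded into the capability set of the agent that owns their neighbor. This mirrors the worked example only; the full procedure is the flowchart of Fig. 2, and all names here are illustrative.

import java.util.*;

public final class CapabilityAssignment {

    public static Map<String, Set<String>> assign(List<String> concurrent,
                                                  Map<String, String> singlePredecessor,
                                                  Map<String, String> singleSuccessor) {
        Map<String, Set<String>> agents = new LinkedHashMap<>();
        Map<String, String> owner = new HashMap<>();     // task -> agent id

        int id = 1;
        for (String task : concurrent) {                 // e.g. C -> A1, D -> A2
            String agent = "A" + id++;
            agents.put(agent, new LinkedHashSet<>(List.of(task)));
            owner.put(task, agent);
        }
        // Fold single predecessors upward: B joins A1, then A joins A1.
        for (String task : new ArrayList<>(owner.keySet())) {
            String cur = task;
            while (singlePredecessor.containsKey(cur)) {
                String pred = singlePredecessor.get(cur);
                if (owner.containsKey(pred)) break;      // already assigned elsewhere
                String agent = owner.get(cur);
                agents.get(agent).add(pred);
                owner.put(pred, agent);
                cur = pred;
            }
        }
        // Fold single successors downward: E joins A2.
        for (String task : new ArrayList<>(owner.keySet())) {
            String cur = task;
            while (singleSuccessor.containsKey(cur)) {
                String succ = singleSuccessor.get(cur);
                if (owner.containsKey(succ)) break;
                String agent = owner.get(cur);
                agents.get(agent).add(succ);
                owner.put(succ, agent);
                cur = succ;
            }
        }
        return agents;   // e.g. {A1=[C, B, A], A2=[D, E]}
    }
}

For the case study, assign(List.of("C", "D"), Map.of("C", "B", "D", "B", "B", "A"), Map.of("D", "E", "A", "B")) reproduces the architecture {<A1, {C, B, A}>, <A2, {D, E}>} derived above.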
Step 5: Using the task graph of Fig. 6 and the MAS architecture developed in step 4, the MAS coordination is formed, i.e., how the set of required agents (A1, A2) will collaboratively perform the set of required tasks (A, B, C, D, E) to satisfy the user requirements; this is represented by the Task Petri Net shown in Fig. 7.
Figure 7. Task Petri Nets representation of MAS coordination
VI. CONCLUSION
In this paper, we have developed an automated system to generate the MAS architecture and coordination from the user requirements and domain knowledge. It is a formal methodology which is developer-independent, i.e., it produces the same MAS architecture and coordination for the same set of requirements and domain knowledge. Future work is to include a verification module to check whether the developed architecture satisfies the requirements. The module can work at two levels: firstly, after Requirements Analysis, it can check whether the analysis satisfies the main requirements, and secondly, it can verify whether the MAS coordination satisfies the main requirements.
REFERENCES
[1] G. Weiss, Ed., Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, MIT Press, 1999.
[2] M. J. Wooldridge, An Introduction to Multiagent Systems, John Wiley & Sons, Inc., 2001.
[3] N. Jennings, "On agent-based software engineering," Artificial Intelligence, 117 (2000), pp. 277-296.
[4] J. Lind, "Issues in agent-oriented software engineering," in P. Ciancarini and M. Wooldridge (eds.), Agent-Oriented Software Engineering: First International Workshop, AOSE 2000, Lecture Notes in Artificial Intelligence, vol. 1957, Springer-Verlag, Berlin.
[5] M. Wooldridge and P. Ciancarini, "Agent-oriented software engineering: the state of the art," in P. Ciancarini and M. Wooldridge (eds.), Agent-Oriented Software Engineering: First International Workshop, AOSE 2000, Lecture Notes in Artificial Intelligence, vol. 1957, Springer-Verlag, Berlin Heidelberg (2001), pp. 1-28.
[6] C. Green, D. Luckham, R. Balzer, et al., "Report on a knowledge-based software assistant," in C. Rich and R. C. Waters (eds.), Readings in Artificial Intelligence and Software Engineering, Morgan Kaufmann, San Mateo, California (1986), pp. 377-428.
[7] T. C. Hartrum and R. Graham, "The AFIT wide spectrum object modeling environment: An AWESOME beginning," in Procs. of the National Aerospace and Electronics Conference, IEEE (2000), pp. 35-42.
[8] R. Balzer, T. E. Cheatham, Jr., and C. Green, "Software technology in the 1990's: Using a new paradigm," Computer, pp. 39-45, Nov. 1983.
[9] L. Liu and E. Yu, "From requirements to architectural design - using goals and scenarios."
[10] Z. Shen, C. Miao, R. Gay, and D. Li, "Goal-oriented methodology for agent system development," IEICE Trans. Inf. & Syst., vol. E89-D, no. 4, April 2006.
[11] S. Zhiqi, "Goal-oriented modelling for intelligent agents and their applications," Ph.D. Thesis, Nanyang Technological University, Singapore, 2003.
[12] C. H. Sparkman, S. A. DeLoach, and A. L. Self, "Automated derivation of complex agent architectures from analysis specifications," in Procs. of the Second International Workshop on Agent-Oriented Software Engineering (AOSE-2001), Montreal, Canada, May 29, 2001.
[13] P. Bresciani, A. Perini, P. Giorgini, F. Giunchiglia, and J. Mylopoulos, "Tropos: An agent-oriented software development methodology," Autonomous Agents and Multi-Agent Systems, 8, pp. 203-236, 2004.
[14] P. Giorgini, J. Mylopoulos, and R. Sebastiani, "Goal-oriented requirements analysis and reasoning in the Tropos methodology," Engineering Applications of Artificial Intelligence, vol. 18, pp. 159-171, 2005.
[15] P. H. P. Nguyen and D. Corbett, "A basic mathematical framework for conceptual graphs," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 2, 2005.
[16] H. Kaiya and M. Saeki, "Using domain ontology as domain knowledge for requirements elicitation," in 14th IEEE International Requirements Engineering Conference (RE'06).
[17] M. Shibaoka, H. Kaiya, and M. Saeki, "GOORE: Goal-oriented and ontology driven requirements elicitation method," in J.-L. Hainaut et al. (Eds.), ER Workshops 2007, LNCS 4802, pp. 225-234, 2007.
Agent Based Computing Environment for Accessing Privileged Services
Navin Agarwal
Dept. of Information Technology
National Institute of Technology, Durgapur
Animesh Dutta
Dept. of Information Technology
National Institute of Technology, Durgapur
Abstract— In this paper we propose an application, deployed on the JADE (Java Agent Development Framework) platform, for accessing privileged services on the web. Many organizations/institutes have subscribed to certain services inside their network, and these are not accessible to people who are part of the organization/institute when they are outside that network (for example, at home). We have therefore developed two software agents: the person requests the Client Agent (which resides outside the privileged network) for access to the privileged services, and the Client Agent interacts with the Server Agent (which resides inside the network that is subscribed to the privileged services), which processes the request and sends the desired result back to the Client Agent.
I. INTRODUCTION
Many organizations/institutes have subscriptions to certain services inside their network; for example, here at NIT Durgapur there are subscriptions to IEEE and ACM. When outside the network, these services cannot be accessed. We plan to address this problem and also to automate the whole process so that human effort is reduced.
To solve this problem we will build an agent-based system in which multiple agents interact with each other. When we talk about multiple interacting agents, the system becomes a Multi-Agent System, described below.
A. Agent
An agent is a computer system or software that can act autonomously in any environment. Agent autonomy relates to an agent’s ability to make its own decisions about what activities to do, when to do, what type of information should be communicated and to whom, and how to assimilate the information received. An agent in the system is considered a locus of problem-solving activity; it operates asynchronously with respect to other agents. Thus, an intelligent agent inhabits an environment and is capable of conducting autonomous actions in order to satisfy its design objective [1-5]. Generally speaking, the environment is the aggregate of surrounding things, conditions, or influences with which the agent is interacting. Data/information is “sensed” by the agent. This data/information is typically called “percepts”.
The agent operates on the percepts in some fashion and generates "actions" that can affect the environment. This general flow of activities, i.e., sensing the environment, processing the sensed data/information, and generating actions that can affect the environment, characterizes the general behavior of all agents.
B. MAS (Multi-Agent System)
Multi-agent systems (MASs) [2, 5] are computational systems in which two or more agents interact or work together to perform a set of tasks or to achieve some common goals [5-8]. Agents of a MAS need to interact with one another toward their common objective or their individual benefit. A multi-agent system can be studied as a computer system that is concurrent, asynchronous, stochastic, and distributed. A MAS permits the coordination of the behavior of agents, interacting and communicating in an environment, to perform tasks or solve problems. It allows the decomposition of a complex task into simpler sub-tasks, which facilitates development, testing, and updating.
The client agent, which sits outside the subscribing network, needs to interact with an agent residing inside the network that does the work on behalf of the user and sends back the result. In this paper we propose the architecture of such a system and describe how the different agents interact with each other.
To develop the MAS we use JADE, a software framework fully implemented in the Java language. It simplifies the implementation of multi-agent systems through a middleware that complies with the FIPA specifications and through a set of tools that support the debugging and deployment phases. The agent platform can be distributed across machines (which need not even share the same OS), and the configuration can be controlled via a remote GUI. The configuration can even be changed at run time by creating new agents and moving agents from one machine to another as required. The only system requirement is a Java Runtime Environment, version 5 or later.
The communication architecture offers flexible and efficient messaging, where JADE creates and manages a queue of incoming ACL messages, private to each agent. Agents can access their queue via a combination of several
modes: blocking, polling, timeout, and pattern-matching based. The full FIPA communication model has been implemented, and its components are clearly distinct yet fully integrated: interaction protocols, envelope, ACL, content languages, encoding schemes, ontologies and, finally, transport protocols. The transport mechanism, in particular, adapts like a chameleon to each situation by transparently choosing the best available protocol. Most of the interaction protocols defined by FIPA are already available and can be instantiated after defining the application-dependent behavior of each state of the protocol. SL and the agent-management ontology have already been implemented, as has support for user-defined content languages and ontologies, which can be implemented, registered with agents, and automatically used by the framework.
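To make the messaging model concrete, the following minimal sketch (ours, not taken from the paper) shows a JADE agent servicing its private ACL message queue in the receive-then-block style described above; the agent class name and reply content are illustrative.

```java
import jade.core.Agent;
import jade.core.behaviours.CyclicBehaviour;
import jade.lang.acl.ACLMessage;

// Minimal JADE agent that polls its private message queue and answers
// each incoming message with an INFORM reply. When the queue is empty,
// block() suspends the behaviour until the next message arrives.
public class EchoServerAgent extends Agent {
    @Override
    protected void setup() {
        addBehaviour(new CyclicBehaviour(this) {
            @Override
            public void action() {
                ACLMessage msg = myAgent.receive();        // non-blocking poll
                if (msg != null) {
                    ACLMessage reply = msg.createReply();
                    reply.setPerformative(ACLMessage.INFORM);
                    reply.setContent("received: " + msg.getContent());
                    myAgent.send(reply);
                } else {
                    block();                               // wait for the next message
                }
            }
        });
    }
}
```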
II. RELATED WORK
Agent-based models have been used since the mid-1990s to solve a variety of business and technology problems. Examples of applications include supply chain optimization [9] and logistics [10], distributed computing [11], workforce management [12], and portfolio management [13]. They have also been used to analyze traffic congestion [14]. In these and other applications, the system of interest is simulated by capturing the behavior of individual agents and their interconnections.
In [15], a framework for constructing applications in a mobile computing environment is proposed. An application is partitioned into two pieces, one running on a mobile computer and the other on a stationary computer; they are constructed by composing small objects, with the stationary computer doing the work for the mobile computer. That system is based on a service proxy and is not autonomous, whereas our system is built on agents, which adds flexibility and autonomy. A multi-agent system [16] for accessing remote energy meters of an electricity board is also related to this work: the server (the host) is located at the electricity board, and all the customers are clients connected to it. That MAS automates the task and thus replaces human agents, much as we automate the task of downloading papers from the IEEE/ACM sites, replacing the person who would otherwise have to perform it from inside the privileged network. In [17], an architecture is proposed for secure and simplified access to home appliances using iris recognition, adding an additional layer of security and preventing unauthorized access. That model is likewise based on a server-client approach: the server resides inside the home and the client outside it, sending requests to the server to perform tasks on behalf of the user. An advanced method for downloading web pages from the Internet is proposed in [18]; we use several of its concepts to improve the Server Agent and make better use of bandwidth.
III. PROBLEM OVERVIEW
Many networks have privileged access to certain sites and servers. For example, inside NIT Durgapur no authentication is required to download papers and other documents from the IEEE and ACM sites.
A user who has the right to access such a privileged network cannot use it while outside that network. An institute or organization may pay for services that are accessible only inside its network; in such situations the user must either be inside the network to enjoy those services or reach the network from outside, for instance through a proxy server (among other possibilities).
IV. SCOPE OF WORK
The aim of this work is to automate this whole process: to make the user's work easy and let him take advantage of the services he is entitled to as a member of the institute or organization. Developing the agents in JADE also allows us to target mobile devices, since the only requirement for running a JADE agent is a Java Runtime Environment, which most systems and mobile devices have.
In this project, the user only needs to send a keyword, and all related documents matching that keyword are downloaded and sent to him. The user need not wait for the whole process to finish; he just sends the request, and the multi-agent system performs the task on his behalf.
The main purpose of technology is to ease human work so that effort can be devoted to more useful tasks. This project targets that specific purpose, with some added benefits to the user.
V. MULTI-AGENT BASED ARCHITECTURE
The agent system is divided into two parts:
A single agent, called the Server Agent, which serves the requests of multiple users.
Multiple Client Agents, which send requests to the Server Agent in the form of a keyword.
A. Server Agent
This agent runs autonomously inside the network that has privileged access, or has been authorized to use a service, and is always ready to accept requests from clients. A received keyword is submitted to a search engine, and the source code of the result page is downloaded using Java. That web page contains many links as well as some documents. If a link is found while parsing the source code of the page, its source code is downloaded as well. This can be visualized as a graph, as shown in Figure 1; building this graph lets us avoid searching the same links again. While parsing the source code of a web page, whenever a link is found, the Java code for downloading the source is called again and executed in
a different thread. Whenever a document is found, the Java code for downloading files from the web is called and executed in a different thread. This process could continue forever, so we restrict the depth of the graph from the starting page: when a page at the maximum depth from the starting page is parsed, only the documents on that page are downloaded and its links are not followed further (a sketch of this crawl follows Figure 1).
Figure 1. Graph of the links and documents.
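The sketch below shows one plausible shape of this depth-limited, multi-threaded crawl; it is our illustration, not the authors' code. The visited set stands in for the link graph (so no page is fetched twice) and MAX_DEPTH bounds the traversal; the class name, link regex, thread-pool size, and .pdf test are all assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Depth-limited crawl: pages are fetched in worker threads, links are
// followed up to MAX_DEPTH, and any document found is handed to a
// separate downloader (stubbed out here).
public class PageCrawler {
    private static final int MAX_DEPTH = 2;                  // assumed depth limit
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public void crawl(String url, int depth) {
        if (depth > MAX_DEPTH || !visited.add(url)) return;  // prune revisits
        pool.submit(() -> {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                StringBuilder page = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) page.append(line);
                Matcher m = LINK.matcher(page);
                while (m.find()) {
                    String link = m.group(1);
                    if (link.endsWith(".pdf")) downloadDocument(link); // document found
                    else crawl(link, depth + 1);                       // follow the link
                }
            } catch (Exception e) {
                // unreachable or malformed pages are simply skipped
            }
        });
    }

    private void downloadDocument(String link) { /* fetch and save in another thread */ }
}
```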
B. Client Agent
This is a very simple agent; it performs two tasks:
Authentication of the user: it sends a request with the user's credentials to the server, and if the user is authenticated he can proceed with his task.
Providing a simple GUI through which the user sends the keyword for which he requires documents (or papers). The user can also send the link of a document or an IEEE page directly; in that case the keyword is not searched in a search engine, and the Server Agent performs the next step directly.
Figure 2 shows the interactions among the Client Agent, the Server Agent, and the Java code on the server. The first step is the authentication process, in which the client sends its credentials to the Server Agent for verification; if the credentials are verified, the client is granted access. The client then sends the search keyword to the Server Agent, which checks whether the keyword is valid. If it is, the Server Agent calls the Java code for downloading source code, which searches all the links starting from the main search page and proceeds as shown in figure 3. The source-code downloader sends the list of all documents found back to the Server Agent. The agent passes this list to the document-downloader code, which downloads all the documents and saves them in a zipped folder ready to be sent to the client (a sketch of this packing step follows Figure 2). The downloader then notifies the Server Agent that downloading is complete, and finally the Server Agent sends the zipped folder to the client.
Figure 2. Interaction between the Server Agent, the Client Agent, and the Java code on the server.
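As a sketch of the packing step, the helper below zips the downloaded files and attaches the archive to an ACL reply as binary content. The class name, the helper signature, and the choice of setByteSequenceContent as the transfer mechanism are our assumptions rather than details stated in the paper.

```java
import jade.lang.acl.ACLMessage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Packs the downloaded documents into an in-memory zip archive and
// attaches the bytes to the reply message sent back to the Client Agent.
public final class ResultPacker {
    public static void attachZip(ACLMessage reply, List<String> paths) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf)) {
            for (String path : paths) {
                zip.putNextEntry(new ZipEntry(new File(path).getName()));
                try (FileInputStream in = new FileInputStream(path)) {
                    byte[] chunk = new byte[4096];
                    int n;
                    while ((n = in.read(chunk)) > 0) zip.write(chunk, 0, n); // copy file bytes
                }
                zip.closeEntry();
            }
        }
        reply.setByteSequenceContent(buf.toByteArray()); // ship the archive in the message
    }
}
```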
VI. PROTOTYPE DESIGN
Figure 3 represents the main JADE prototype elements. An application based on JADE is made of a set of components called agents, each having a unique name. Agents execute tasks and interact by exchanging messages. Agents live on top of a platform that provides them with basic services such as message delivery. A platform is composed of one or more containers. Containers can be executed on different hosts, thus achieving a distributed platform; each container can hold zero or more agents.
A special container called the Main Container exists in the platform. The Main Container is itself a container and can therefore hold agents, but it differs from other containers in two respects:
It must be the first container to start in the platform, and all other containers register with it at bootstrap time.
It includes two special agents: the AMS, which represents the authority in the platform and is the only agent able to perform platform-management actions such as starting and killing agents or shutting down the whole platform (normal agents can request such actions from the AMS); and the DF, which provides the Yellow Pages service where agents can publish the services they provide and find other agents providing the services they need (a registration sketch follows this list).
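To make the DF's role concrete, the sketch below (ours) shows how a server-side agent could publish its service in the Yellow Pages via JADE's DFService; the service type and name are illustrative assumptions.

```java
import jade.core.Agent;
import jade.domain.DFService;
import jade.domain.FIPAException;
import jade.domain.FIPAAgentManagement.DFAgentDescription;
import jade.domain.FIPAAgentManagement.ServiceDescription;

// Registers this agent's service with the DF so that other agents can
// discover it through a Yellow Pages lookup.
public class PublishingServerAgent extends Agent {
    @Override
    protected void setup() {
        DFAgentDescription dfd = new DFAgentDescription();
        dfd.setName(getAID());
        ServiceDescription sd = new ServiceDescription();
        sd.setType("privileged-download");   // assumed service type
        sd.setName("ieee-acm-downloader");   // assumed service name
        dfd.addServices(sd);
        try {
            DFService.register(this, dfd);   // publish in the Yellow Pages
        } catch (FIPAException e) {
            e.printStackTrace();
        }
    }
}
```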
Agents can communicate transparently regardless of whether they live in the same container, in different containers (on the same or different hosts) belonging to the same platform, or in different platforms. Communication is based on an asynchronous message-passing paradigm. The message format is defined by the ACL language specified by FIPA [19], an international organization that has issued a set of specifications for agent interoperability. An ACL message contains a number of fields, including:
The sender.
The receiver(s).
The communicative act (also called the performative), which represents the intention of the sender. For instance, when an agent sends an INFORM message it wishes the receiver(s) to become aware of a fact (e.g., (INFORM "today it's raining")); when an agent sends a REQUEST message it wishes the receiver(s) to perform an action. FIPA defined 22 communicative acts, each with well-defined semantics, intended to cover the vast majority of situations; in most cases the formal semantics can be set aside and the acts used for their intuitive meaning.
The content, i.e., the actual information conveyed by the message (the fact the receiver should become aware of in the case of an INFORM message, or the action the receiver is expected to perform in the case of a REQUEST message).
Figure 3 shows three agents: two clients and one server. Every system that runs a JADE platform has a main container where all of its agents run. The two clients send the server a request containing the search query or the link of the paper/document to be downloaded.
Host 3 is the server on which the JADE platform runs. It has a single container, the Main Container, where the AMS and DF agents run alongside the Server Agent. The Server Agent is named B@Platform2 and its address is http://host3:7778/acc. For client agents to communicate with the Server Agent remotely, host3 must be a fully qualified domain name.
Hosts 1 and 2 are the clients; each runs its own JADE platform with one Main Container, in which the AMS and DF agents run alongside the Client Agent. When a user wants to send a request, the Client Agent sends a message whose receiver is the agent named B@Platform2 at the address http://host3:7778/acc, along with the other necessary details; a sketch of composing such a request follows Figure 3.
Figure 3. Two Client Agents sending messages to the Server Agent.
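The sketch below shows how a Client Agent could compose such a request, addressing the Server Agent by the name and MTP address given above; the keyword payload is illustrative.

```java
import jade.core.AID;
import jade.core.Agent;
import jade.lang.acl.ACLMessage;

// Builds a REQUEST addressed to the remote Server Agent B@Platform2,
// reachable through the MTP address of Host 3, and sends it.
public class RequestingClientAgent extends Agent {
    @Override
    protected void setup() {
        AID server = new AID("B@Platform2", AID.ISGUID);  // globally unique agent name
        server.addAddresses("http://host3:7778/acc");     // Server Agent's MTP address
        ACLMessage request = new ACLMessage(ACLMessage.REQUEST);
        request.addReceiver(server);
        request.setContent("multi-agent systems");        // the search keyword
        send(request);
    }
}
```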
VII. CONCLUSION
In this work, we have developed an agent-based system for remotely accessing privileged services in a network. The service in this scenario is the subscription to the IEEE and ACM sites, which require no authentication from inside the network. We have also automated the process of downloading papers/documents from the web that match a search keyword. As a first implementation of its type, the application leaves considerable scope for improvement: we plan to improve the search and give better results by considering the semantics of the search keyword. This work addresses one such privileged service; the model can serve as a base and be expanded to cover many more such services, providing automation wherever possible.
REFERENCES
[1] C. A. Rouff, M. Hinchey, J. Rash, W. Truszkowski, and D. Gordon-Spears (Eds.), Agent Technology from a Formal Perspective, Springer-Verlag London Limited, 2006.
[2] G. Weiss (Ed.), Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, MIT Press, 1999.
[3] N. J. Nilsson, Artificial Intelligence: A New Synthesis, Morgan Kaufmann Publishers Inc., 1998.
[4] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Pearson Education, 2003.
[5] M. J. Wooldridge, Introduction to Multiagent Systems, John Wiley & Sons, Inc., 2001.
[6] A. Idani, "B/UML: Setting in Relation of B Specification and UML Description for Help of External Validation of Formal Development in B", Doctoral thesis, Grenoble University, November 2005.
[7] G. W. Brams, Petri Nets: Theory and Practice, Vol. 1-2, MASSON, Paris, 1982.
[8] M.-J. Yoo, "A Componential Approach for Modeling of Cooperative Agents and Its Validation", Doctoral thesis, Paris 6 University, 1999.
[9] J.-Y. Shiau and X. Li, "Modeling the supply chain based on multi-agent conflicts", IEEE/INFORMS International Conference on Service Operations, Logistics and Informatics (SOLI '09), 2009, pp. 394-399.
[10] Y. Wang, Y. Guo, and J. Zeng, "A Study of Logistics System Model Based on Multi-Agent", IEEE International Conference on Service Operations and Logistics, and Informatics (SOLI '06), 2006, pp. 829-832.
[11] R. Al-Khannak, B. Bifzer, and Hezron, "Grid computing by using multi agent system technology in distributed power generator", 42nd International Universities Power Engineering Conference (UPEC 2007), 2007, pp. 62-67.
[12] [Online]. Available: http://en.wikipedia.org/wiki/Workforce_management
[13] V. Krishna and V. Ramesh, "Portfolio management using cyberagents", IEEE International Conference on Systems, Man, and Cybernetics, 11-14 Oct. 1998, Vol. 5, pp. 4860-4865.
[14] Application of Agent Technology to Traffic Simulation, United States Department of Transportation, May 15, 2007.
[15] A. Hokimoto, K. Kurihara, and T. Nakajima, "An approach for constructing mobile applications using service proxies", Proceedings of the 16th International Conference on Distributed Computing Systems, 27-30 May 1996, pp. 726-733.
[16] C. Suriyakala and P. E. Sankaranarayanan, "Smart Multiagent Architecture for Congestion Control to Access Remote Energy Meters", 13-15 Dec. 2007, Vol. 4, pp. 24-28.
[17] A. Mondal, K. Roy, and P. Bhattacharya, "Secure and Simplified Access to Home Appliances using Iris Recognition", IEEE Workshop on Computational Intelligence in Biometrics: Theory, Algorithms, and Applications (CIB 2009), March 30-April 2, 2009, pp. 22-29.
[18] A. Kundu, A. R. Pal, T. Sarkar, M. Banerjee, S. Mandal, R. Dattagupta, and D. Mukhopadhyay, "An Alternate Downloading Methodology of Webpages", Seventh Mexican International Conference on Artificial Intelligence (MICAI '08), 27-31 Oct. 2008, pp. 393-398.
[19] FIPA. [Online]. Available: http://www.fipa.org
An Interactive Multi-touch Teaching Innovation
for Preschool Mathematical Skills
Suparawadee Trongtortam
Technopreneurship and Innovation Management Program
Chulalongkorn University
Bangkok, Thailand

Peraphon Sophatsathit and Achara Chandrachai
Department of Mathematics and Computer Science, Faculty of Science
Chulalongkorn University
Bangkok, Thailand
[email protected]/[email protected]
Abstract - The paper proposes a teaching medium suitable for preschool children and teachers to develop basic mathematical skills. The research applies the bases of Multi-touch and Multi-point media technologies to innovate an interactive teaching technique. By utilizing Multi-touch and the connectivity structure of Multi-point to create a technology that facilitates simultaneous interaction from child learners, the teacher can better adjust and adapt the lessons accordingly. The benefit of this innovation is the amalgamation of technology and new ideas to support teaching-media development that permits teachers and students to interact with each other directly, and permits students to learn by themselves.
Keywords-Multi-touch; Multi-point; preschool mathematical skills; interactive teaching technique.
I. INTRODUCTION
Preschool learning is the first step of education, supporting child learners in all aspects, e.g., physical, intellectual, professional, and societal knowledge. One of the most urgent and important instruments for building their learning is teaching media, owing to their significant role in disseminating knowledge, experience, and other skills to children. There are numerous teaching media at the preschool level, ranging from conventional paper-based media and transparencies to audio, video, and computer-based media. The last has become the principal teaching vehicle owing to its usefulness and convenience: children can learn by themselves [1, 2], independent of the classroom environment.
This research aims at using the connectivity of the Multi-point technique and the Multi-touch approach as the platform and underlying research process to develop proper stimulating media for preschool children to learn basic mathematics. The paper is organized as follows. Sections 2 and 3 briefly explain Multi-touch and Multi-point technologies. Section 4 describes the proposed approach, followed by the experiments in Section 5. The results are summarized in Section 6. Section 7 concludes with the benefits and some final thoughts.
II. MULTI-TOUCH
Multi-touch [3] is a technology that supports several inputs at the same time to create interaction between the user and the computer. The system responds to finger movements as commands issued by the user, e.g., select, scroll, zoom or expand, etc. Fig. 1 shows multiple fingers touching several areas of the screen simultaneously, thereby mimicking the interactivity of real learning and keeping learners alert.
Figure 1. Multi-touch display and finger movement control
III. MULTI-POINT
Multi-point [4, 5] is a multiple-computer connection structure developed for Windows [6] for educational institutes and learning centers. It uses one host to support a multi-user interface, permitting simultaneous user responses. The underlying configuration differs from the conventional client-server (C-S) model in that communication exchange in C-S takes place between client and server in a pair-wise manner; any exchange among clients is implicitly routed through the server. Multi-point, on the other hand, is a simulcast among peers, where everyone can see one another simultaneously and interactively, as shown in Fig. 2. The result of this connectivity scheme is lower expenditure and power consumption and easier management, which is ideal for a classroom environment.
Figure 2. Connectivity of Multi-point scheme
This research applies both technologies by connecting several teaching aids with the help of interactive teaching media. The media in turn facilitate simultaneous teacher involvement and children's interaction. Teachers can teach and observe the students, while the students can react to the lesson promptly. Thus, lessons and practical exercises can be explained, worked out, and corrected on the spot. As such, the teacher can design the lesson and accompanying exercises in a manner that is unobtrusive and unbounded by physical means. Conventional preschool teaching employs Computer Assisted Instruction (CAI) [7], which provides media in sentences, images, graphics, charts, graphs, videos, movies, and audio to present the contents, lessons, and exercises in the form of conventional classroom learning. Teaching by CAI can only create interaction between the learner and the computer. The proposed approach, on the other hand, instigates and collects responses from several children. The children collectively learn, collaborate, express individual opinions, and react as they proceed. This in turn stimulates their interest and thought processes for better understanding and knowledge acquisition.
Figure 3. Device connection
Fig. 3 shows the inter-connection of electronic devices for basic preschool mathematics, which consists of a Web server controlled by the teacher to observe individual child learning. The exercises are designed and broadcast via duplex wireless means that allow students and teacher to interact back and forth collectively at the same time.
IV. INTERACTIVE MEDIA TEACHING
Numerous educational media for creating learning lessons are prevalent in this digital age. CAI is perhaps the predominant technique, adopted at all levels of teaching. Unfortunately, the state of the practice falls short of conveying "effective" teaching that inspires learning toward knowledge. The limitations of CAI technology preclude the teacher and students from interacting with one another simultaneously, so spontaneous thinking and feedback can never be motivated and learned systematically. We shall explore the principal functionality of an interactive teaching innovation.
Figure 4. Flow of interactive media teaching
Fig. 4 illustrates the flow of media set-up for interactive teaching. We exploit the Multi-point principle to attain greater children's interaction through the latest electronic devices. By strategically creating exercises in the form of interactive games that sense multiple touching fingers and the children's thought processes, while stimulating their interest through game playing, the teacher can observe the children's behavior from her own screen, giving faster and easier access and response to the development of each child. Thus, she can promptly monitor, instruct, or sharpen the skills of an individual child or the whole group, without having to repeatedly recite the same instruction to every child as in the conventional classroom setting. Some of the benefits precipitated by the Multi-point principle are:
1. Instant children-teacher interaction through easily understood media of instruction.
2. Flexibility in creating or enhancing teaching media to motivate children's interest, thereby lessening learning boredom.
3. Strengthened early-childhood skills with the help of drawing and graphical illustrations.
4. Increased speed of cognitive learning in children, so as to facilitate subsequent skill-development evaluation.
We will elaborate how the proposed scheme works in the sections that follow.
A. Teacher Preparation Configuration
Instructional aids are prepared via our tool, which permits customized display formats through simple set-up configurations. The teacher can prepare her lessons and companion exercises off-line and upload or post them to the system database. The children have access to all the materials once they are uploaded or posted; any un-posted instructions, lessons, and exercises are not accessible from the learners' display devices. The process flow is depicted in Fig. 5.
B. Student Learning Process
The process begins with the student's sign-in to identify himself. He then selects the lesson or exercise set to work on. All activities are monitored from the teacher's console, where the results are made available instantly. The process flow is depicted in Fig. 6.
Figure 5. Preparation process of the teacher. Figure 6. Flow of student learning process.
Fig. 5 illustrates the teacher preparation process, which proceeds as follows:
1. Select a topic to prepare the lesson.
2. Add or modify the exercises if they were already prepared in an earlier session.
3. Upload/post the materials to the database.
In the meantime, the teacher can monitor the students' behavior during the lesson as follows:
1. Select the child to be monitored from the list.
2. Observe their work.
3. Assess the results to analyze their behavior and development.
C. Skill test by Bloom’s Taxonomy
Learning evaluation is carried out based on Bloom's Taxonomy [8] in the following aspects:
Media skills test
Subject comprehension from doing exercises
Self-practice
The evaluation adopts three basic indicators given in Table I, namely Knowledge, Comprehension, and Application, to measure the effectiveness of the proposed interactive teaching innovation. This is accomplished in an actual preschool class setting by means of the CIPP model, described in the next section.
TABLE I. THREE LEVELS OF EVALUATION BY BLOOM'S TAXONOMY

Level           Evaluation
Knowledge       Able to tell the meaning of positive or negative signs, matching, and shapes
Comprehension   Knows how to complete arithmetic operations
Application     Does exercises by themselves
D. Learning evaluation by CIPP model
This research makes use of the CIPP model [9] to evaluate the class performance with respect to the following criteria: score, learning time, degree of satisfaction, and the ratio of learning per expense. The evaluation is performed in accordance with the CIPP capabilities as follows:
Context: all required class materials from the course syllabus are divided into individual topics and subtopics successively. Each subtopic is further broken down into stories so that subject contents can be presented. The corresponding companion exercises are either embedded or added at the end to furnish as many hands-on drills as possible.
Input: the above multimedia lessons are measured to test/monitor the children's skill development, particularly multi-touch drills. The indirect benefits precipitated by this design are duration of work and satisfaction.
Process: a number of evaluations are applied through Multi-point and Multi-touch technologies, for example, the time spent on exercise creation and modification, session evaluation, and cost ratio, as well as interactive monitoring, collaboration and assistance, instant display of results (upon their availability), and information transfer to/from the server. The savings so obtained are the utmost achievement of this innovative approach.
Product: the instantaneous interaction between children and teacher, and the rate of self-learning reflected in score improvement, result in tremendous skill improvement and experience with new technology. Thus, both scores and user satisfaction improve considerably.
V. EXPERIMENTAL RESULTS
The experiment was run on a Windows-based server that supports two iPad display devices (to be used by a preschool class). The proposed approach focused on a preschool mathematics class, where children learned basic arithmetic operations through interactive visual lessons and exercises. Students retrieved their lesson and corresponding exercises from the Multi-point teaching media system. As the learning progressed, they collaboratively worked on the lessons, exercises, and other activities via the multi-touch system. Their responses were recorded interactively (including corrections, reworks, etc.). The results were instantly processed and made available in the teaching archive. The process is shown in Fig. 7.
Figure 7. Flow of preschool mathematics exercise
Figure 8. Flow of design, modification, and monitoring exercise
Figure 9. Sample math exercises
Fig. 8 shows the flow of lesson and exercise creation, modification, and interactive monitoring of the students' activity through the teaching media system. An individual student's screen can be selectively monitored, assistance can be given to correct errors or when help is needed, and performance can be observed and reviewed via summaries of score, frequency of attempts, reworks, etc., all supported by the Multi-point technique. Fig. 9 illustrates sample mathematics exercises.
We conducted student-performance and teacher-productivity evaluations to measure the accomplishments of both parties under the proposed system in comparison with a conventional CAI system. The evaluations measured the two instructional media on the same and on different lessons. From the students' standpoint, this was designed to observe how students learn by drawing analogies within the same lesson and accumulate skills across different lessons. From the teacher's standpoint, it gauged how productively the teacher performed on the same and on different lessons.
Several measures were collected and categorized according to student and teacher, namely, exercise score (D), duration of work (E), and degree of satisfaction (F), as shown in Table II, and time spent on creating an exercise (M), time spent on one session evaluation (N), and ratio of learning per expense (P), as shown in Table III. For example, the exercise score obtained by students learning the same lesson using CAI is 5 out of 10, as opposed to 8 out of 10 problems via Multi-point. In learning different lessons, the exercise score drops to 1 out of 10 with CAI, but remains a decent 4 out of 10 problems with Multi-point. Similarly, Multi-point outperforms CAI by one hour in the time the teacher spends creating an exercise, in both cases. The same outcome holds for learning per expense, where more teachers agree on the effectiveness of the Multi-point than of the CAI approach. The corresponding plots are depicted in Figs. 10-13, respectively.
TABLE II. STUDENT PERFORMANCE EVALUATION

                                Same Lesson              Different Lessons
Detail                          CAI       Multi-Point    CAI       Multi-Point
D (exercise score)              5/10      8/10           1/10      4/10
E (duration of work)            20 min    13 min         45 min    30 min
F (degree of satisfaction)      9/15      12/15          5/15      9/15
Figure 10. Students' performance on the same lesson. Figure 11. Students' performance on different lessons.
TABLE III. TEACHER PRODUCTIVITY EVALUATION

                                      Same Lesson             Different Lessons
Detail                                CAI       Multi-Point   CAI       Multi-Point
M (time to create exercise)           4 hr      3 hr          4 hr      3 hr
N (time per session evaluation)       60 min    20 min        85 min    25 min
P (learning per expense)              7/15      13/15         4/15      9/15
Figure 12. Teacher's performance on the same lesson. Figure 13. Teacher's performance on different lessons.
From the overall comparative evaluation, it is apparent that the use of Multi-point and Multi-touch technologies is more effective than the conventional CAI approach from both the student's and the teacher's standpoint. The obvious initial investment is fully offset by better scores, less time, and higher satisfaction on the student's part, and by greater productivity and cost effectiveness on the teacher's part. The percentage of agreeable opinions on electronic-media adoption is illustrated in Fig. 14.
Figure 14. Percentage of electronic teaching media adoption
VI. CONCLUSION
We have proposed an interactive teaching innovation for preschool children to improve their mathematical skills. The contributions are twofold: (1) the teacher can instruct and monitor preschool children's development in real time, promptly obtaining class evaluations and delivering lessons, and becomes more economically productive than with the conventional CAI approach; and (2) preschool children can improve their mathematical skills, or knowledge in general, by interactive means. They become more enthusiastic to explore new ideas, express themselves, and gain confidence and self-esteem as they progress. The proposed approach is simple and straightforward to realize. The underlying configuration exploits Multi-point to connect students with the teacher simultaneously, while interactively furnishing spontaneous communication among them. In the meantime, students can collaboratively work on the exercises to enhance their learning skills via Multi-touch technology. The resulting amalgamation is an innovative scheme, subsequently implemented as a teaching tool.
We targeted the development of mathematical skills to gauge how the overall configuration would work out. The comparative summaries against conventional CAI turned out to be superior and satisfactory in many regards.
We envision that the proposed system can be further extended to operate at a larger network scale, whereby a wider student audience can be reached.
ACKNOWLEDGMENTS
We would like to express our special appreciation to the teachers and students of Samsen Nok School and Phacharard Kindergarten School for their courteous cooperation and invaluable time for this research.
REFERENCES
[1] National Education Act, B.E. 2542, Rajchakichanubaksa (Royal Gazette), Vol. 116, Part 74a, 19 August 1999.
[2] Division of Academic and Education Standards, Office of the Elementary Education Commission, Ministry of Education, Handbook of Preschool Education, Age 3-5 Years, B.E. 2546 (2003).
[3] Wisit Wongvilai, Software Technology of the Future. [Online], 2008. Available: http://www.nectec.or.th [accessed 8 July 2010].
[4] Suphada Jaidee, Electronic Learning Media. [Online], 2007. Available: http://www.microsoft.com/thailand/press/nov07/partners-in-learning.aspx [accessed 12 July 2010].
[5] P. González Villanueva, R. Tesoriero, and J. A. Gallud, "Multi-pointer and collaborative system for mobile devices", Proceedings of the 12th International Conference on Human Computer Interaction with Mobile Devices and Services, pp. 435-438, 2010.
[6] Windows MultiPoint Server 2011. [Online]. Available: http://www.microsoft.com/thailand/windows/multipoint/default.aspx [accessed 12 August 2011].
[7] D. L. Kalmey and M. J. Niccolai, "A Model for a CAI Learning System", ACM SIGCSE Bulletin, Proceedings of the 12th SIGCSE Symposium on Computer Science Education, Vol. 13, Issue 1, pp. 74-77, February 1981.
[8] B. S. Bloom et al., Taxonomy of Educational Objectives: The Classification of Educational Goals, Handbook I: Cognitive Domain, New York: David McKay, 1972.
[9] D. L. Stufflebeam, "The CIPP Model for program evaluation", in G. F. Madaus, M. Scriven, and D. L. Stufflebeam (Eds.), Evaluation Models: Viewpoints on Human Services Evaluation, Boston: Kluwer-Nijhoff Publishing, 1989.
AUTHOR INDEX

Agarwal, Navin 176
Aditya, Narayan Hati 116
Acharya, Sudipta 169
Bernard, Thibault 14
Bui, Alain 14
Chandrachai, Achara 181
Chaiwongsa, Punyaphat 145
Fung, Chun Che 24, 42
Chongstitwattana, Prabhas 133
Chen, Ting-Yu 30
Dutta, Animesh 169, 176, 138
Upadhyay, Prajna Devi 169, 138
Tran, Hung Dang 75
Smith, Derek H. 127
Hunt, Francis 127
Grachangpun, Rugpong 70
Getta, Janusz 121
Ghosh, Supriyo 138
Haruechaiyasak, Choochart 58, 70
Hiransakolwong, Nualsawat 81
Johannes, Fliege 98
Sil, Jaya 116
Jana, Nanda Dulal 116
Kamsiang, Nawarat 163
Kajornrit, Jesada 24, 42
Wong, Kok Wai 24
Keeratiwintakorn, Phongsak 19
Kongsakun, Kanokwan 42
Kubek, Mario 104
Leelawatcharamas, Tunyathorn 48
Le, Pham Thi Anh 157, 75
Li, Yuefeng 92
Lin, Yung-Chang 92
Minh, Quang Nguyen 75, 157
Muchalintamolee, Nuttida 151
Dewan, Mohammed 109
Quaddus, Mohammed 109
Mehta, Kinjal 87
Salani, Matteo 127
Mingkhwan, Anirach 64
Mandal, Sayantan 116
Meesad, Phayung 36
Montemanni, Roberto 127
Nitsuwat, Supot 54
Nakmaetee, Narisara 58
Nhan, Le Thanh 157
Ouedraogo, Boukary 14
Paoin, Wansa 54
Pattaranantakul, Montida 8
Sangwongngam, Paramin 8
Sangsongfa, Adisak 36
Senivongse, Twittie 48, 151, 163
Sheth, Ravi 87
Sodanil, Maleerat 58, 70
Sophatsathit, Peraphon 181
Sripimanwat, Keattisak 8
Tansriwong, Kitipong 19
Trongtortam, Suparawadee 181
Unger, Herwig 104
Waijanya, Sajjaporn 64
Wolfgang, Benn 98
Wu, Ming-Che 30
Wu, Sheng-Tang 92
Yampaka, Tongjai 133
Yawai, Wiyada 81
Zimniak, Marcin 98
The 9th International Conference on Computing and Information Technology
10-11 May 2013
Faculty of Information Technology, King Mongkut's University of Technology North Bangkok, Thailand
www.ic2it.org

Faculty of Information Technology
King Mongkut's University of Technology North Bangkok
www.it.kmutnb.ac.th