
IBIS – Interoperability in Business Information Systems

ISSN: 1862-6378

Issue 1 (1) 2006

IBIS

International Journal of Interoperability in Business Information Systems

http://www.ibis-journal.net ISSN:1862-6378

-2- © IBIS – Issue 1 (1), 2006

IBIS – Issue 1 (1), 2006

Publisher: University of Oldenburg, Department of Business Information Systems, D-26111 Oldenburg, Germany

Tel.: +49 441 798 4481

Fax: +49 441 798 4472

[email protected]

http://www.ibis-journal.net

ISSN: 1862-6378

License: All our articles are published under the Digital Peer Publishing License. This ensures free distribution and also guarantees the authors' rights that no content is modified or reused without citing the names of authors and holders of rights and the bibliographical information used. Additionally, the rights to use in physical form, particularly the rights to distribute the work in printed form or on storage media, are retained by the authors or other rights holders and are not covered by this license. The full license text may be found at http://www.ibis-journal.net

Scope: The capability to efficiently interact, collaborate and exchange information with business partners and within a company is one of the most important challenges of each enterprise, a challenge intensified by global markets and the resulting competition. Today, many software systems are completely isolated and not integrated into a homogeneous structure. This makes it hard to exchange information and to keep business information in sync. Interoperability can be defined as the ability of enterprise software and applications to interact. In Europe, between 30% and 40% of total IT budgets is spent on issues tied to interoperability. This journal aims at exchanging and presenting research activities in the area of creating interoperability in business information systems. The ambition of this journal is to provide an overview of current research activities as well as to offer a broad discussion in selected areas of the interoperability of heterogeneous information systems. It is intended to connect research experts from this domain and to exchange ideas and approaches. It is our goal to connect the latest research results with real-world scenarios in order to increase interoperability in business information systems.



Review Board: Submitted articles of this journal are reviewed by the members of the IBIS review board. At the date of this issue, the review board contains the following persons in alphabetical order:

Sven Abels, University of Oldenburg, Germany

Antonia Albani, Delft University of Technology, Netherlands

Bernhard Bauer, University of Augsburg, Germany

Arne-J. Berre, SINTEF, Distributed Information Systems, Norway

Jean Bézivin, University of Nantes, Brittany, France

Flavio Bonfatti, University of Modena, Italy

Frank-Dieter Dorloff, University of Duisburg-Essen, Germany

Michael Goedicke, University of Duisburg-Essen, Germany

Jan Goossenaerts, University of Eindhoven, Netherlands

Norbert Gronau, University of Potsdam, Germany

Axel Hahn, University of Oldenburg, Germany

Wilhelm Hasselbring, University of Oldenburg, Germany

Martin Hepp, Digital Enterprise Research Institute (DERI), Germany

Willem-Jan van den Heuvel, Tilburg University, Netherlands

Paul Johannesson, Royal Institute of Technology, Sweden

Michele Missikoff, IASI, National Research Council (CNR), Italy

Andreas Lothe Opdahl, University of Bergen, Norway

Mathias Uslar, OFFIS e.V., Germany

Martin Zelm, CIMOSA Association e.V., Germany

The editors thank all reviewers for their help. This IBIS issue would not have been possible without the continuous support of the reviewers.





Content

Editorial

Maturity Assessment Framework for Business Dimension of Software Product Family

Formulation Schema Matching Problem for Combinatorial Optimization Problem

The Connection, Communication, Consolidation, Collaboration Interoperability Framework (C4IF) For Information Systems Interoperability

Ontology Mapping for Web-Based Educational Systems Interoperability

Call for Articles





Editorial

Dear reader,

we would like to welcome you to the first issue of the International Journal of Interoperability in Business Information Systems (IBIS).

The capability to efficiently interact, collaborate and exchange information with business partners and within a company is one of the most important challenges of each enterprise, a challenge intensified by global markets and the resulting competition. Today, many software systems are completely isolated and not integrated into a homogeneous structure. This makes it hard to exchange information and to keep business information in sync. Interoperability can be defined as the ability of enterprise software and applications to interact. In Europe, between 30% and 40% of total IT budgets is spent on issues tied to interoperability.

This journal aims at exchanging and presenting research activities in the area of creating interoperability in business information systems. The ambition of this journal is to provide an overview of current research activities as well as to offer a broad discussion in selected areas of the interoperability of heterogeneous information systems. It is intended to connect research experts from this domain and to exchange ideas and approaches. It is our goal to connect the latest research results with real-world scenarios in order to increase interoperability in business information systems.

IBIS does not want to compete with existing journals. Rather, it wants to offer a way of quickly publishing high-quality research results. The possibility to react quickly to results is possibly the most important advantage of an online journal. It is our purpose to use this journal as a catalyst for fostering collaboration among researchers of the interoperability domain, independent of their location or language.

As far as we know, IBIS is the first online journal dealing exclusively with the interoperability of business information systems. The IBIS journal aims at distributing the latest research results of the interoperability domain free of charge. Hence, all issues of the journal can be downloaded for free at the journal website. All accepted articles are peer reviewed by at least three independent members of the review board. In order to assure excellent quality, the journal only accepts highly rated articles. Accepted submissions are published in the journal and are citable. The journal has an international ISSN, which allows unique identification of papers. The International Journal of Interoperability in Business Information Systems releases several issues a year.



In the first issue, we are proud to present a number of interesting publications that focus on different areas of the interoperability problem. The first issue contains four accepted submissions. Since the success of IBIS depends on the readers' interests as well as on the number of high-quality submissions, we would be glad to receive recommendations or comments in order to improve the journal. If you want to contact an author of an article directly, you may send them an email, or you may contact the IBIS management board so that we can forward your comments. We hope that you will appreciate the journal and the articles selected for this issue.

Best regards,

your editors

Axel Hahn, Sven Abels & Mathias Uslar



Maturity Assessment Framework for Business Dimension of Software Product Family

Faheem Ahmed, Luiz Fernando Capretz

Department of Electrical & Computer Engineering University of Western Ontario, London Ontario, Canada, N6A 5B9

[email protected], [email protected]

Abstract: The software product family approach aims at curtailing the concept of “reinventing the wheel” in the software development process. The business has been highlighted as one of the critical dimensions in the process of software product family. This work presents an assessment framework for evaluating the business dimension of software product family process. Additionally, a software product family business evaluation tool has been designed and implemented on the basis of the presented framework. The tool preprocesses the data of key business factors, and it evaluates the overall business maturity of an organization. To demonstrate the application of the framework, and to determine the current software product family business performance, we conducted a case study of an organization actively involved in the business of software product family. The framework and the tool provide direct mechanisms to evaluate the current maturity level of software product family business of an organization. This research is a contribution towards establishing a comprehensive and unified strategy for a process evaluation of the software product family.

Introduction

The software product family has become one of the most promising practices with the potential to substantially increase the productivity of the software development process. It has emerged as an attractive phenomenon within organizations dealing with software development. A software product family is a collection of software systems built from a common underlying architecture and a set of software assets in order to address the needs of a particular market segment. Other corresponding terms for software product family have been widely used in Europe and North America, for example “product population”, “system families”, and “software product line”. Ommering [1] introduced the term “product population”, which is a collection of related systems based on similar technology but having many differences among them. The software product line is a comprehensive model for an organization building applications that are based on a common architecture and core assets [2]. Clements [3] defines the term “software product line” as a set of software systems sharing a common, managed set of features that satisfy the specific needs of a particular market segment, and that are developed from a common set of core assets in a prescribed way. The economic potential of the software product line has long been recognized in the software industry [4][5]. Clements et al. [6] reported that software product line engineering is a growing software engineering sub-discipline, and many organizations, including Philips, Hewlett-Packard, Nokia, Raytheon, and Cummins, are using it to achieve extraordinary gains in productivity, time to market, and product quality.



In today’s digitized economy, every organization strives to capture a major portion of the market segment in order to sustain profitable business. Many global organizations dealing in wide areas of operations such as consumer electronics, telecommunication, avionics, and information technology, perceive the software product family as being the future of software development in order to achieve cost reduction, short development time, and improved quality. The business of the software product family requires improvements over time in order to maintain an advantage over competitors. It is very difficult to organize an efficient and effective improvement plan unless it is based on the results of a comprehensive assessment exercise. Business assessment determines the current status of the business maturity of an organization, and it identifies the areas that need improvements. This work presents the business maturity assessment framework for the organizations dealing with software product family practice.

Related Work: Process Maturity Evaluation of Software Product Family

Software product family process assessment is a relatively new area of research where not much work has been done. Currently, researchers from both academia and industry are working to develop a prescribed and systematic way of measuring the maturity of a software product family process. Jones and Soule [7] discuss the relationships between the software product line process and the Capability Maturity Model Integration (CMMI). They observe that the software engineering process areas specified in CMMI provide an important foundation for software product line practice. They compare the software engineering process areas of the software product line and CMMI and find some similarities, but conclude that there is still a need to establish a comprehensive strategy for process assessment of the software product line. The Software Engineering Institute (SEI) proposed the Product Line Technical Probe (PLTP) [8]. The objective of PLTP is to discover the ability of an organization to adapt and succeed with the software product line approach. PLTP is based on the framework for software product line practice proposed at SEI, and it divides the overall engineering activities of software product line engineering into three categories: product development, core assets development, and management. However, PLTP does not set forth any methodology to evaluate the maturity of the software product line process. Ahmed and Capretz [9] propose a set of rules for developing and managing a software product line within an organization. On the basis of the proposed rules, a fuzzy logic-based software product line process assessment tool was designed and implemented. The tool provides an opportunity to evaluate the maturity of the software product line process within an organization. A number of case studies were conducted on the industrial software process data from reputable software development organizations.
The results of the study were compared with the existing CMMI levels of the organizations in order to compare the assessment produced by two different approaches. One of the conclusions of their work also suggests that there is still a need to establish a unified and comprehensive strategy for process assessment of the software product line.



In Europe, the acronym BAPO [5] (Business-Architecture-Process-Organization) is very popular for defining process concerns associated with software product family, as illustrated in Figure 1. The “Business” in BAPO is considered critical because it deals with the way the products resulting from software product family make profits. van der Linden et al. [10] propose a four-dimensional software product family maturity evaluation framework primarily based on the BAPO concept of operations. This provides an early foundation for a systematic and comprehensive strategy for process maturity evaluation of software product family. Figure 2 illustrates the conceptual layout of this approach. The four dimensions of the framework are business, architecture, process and organization. van der Linden et al. [10] identify maturity scales of up to five levels in ascending order for each dimension of BAPO, as illustrated in Table 1. In the case of software product family, this results in separate values for each of the four dimensions. However, the conceptual model of software product family maturity evaluation shown in Figure 2 does not address a number of key steps involved (shown with dashed rectangles), including:

The definition of a maturity scale for the overall software product family process.

The frameworks to evaluate the four dimensions of business, architecture, process, and organization.

The methodology to evaluate the overall maturity profile of an organization once the assessment results of the individual dimensions, such as business, architecture, process and organization, have been obtained. The circle with cross (in Figure 1) represents this stage.

Figure 1: Business-Architecture-Process-Organization Concept of Operations of Software Product Family



Level | Business | Architecture | Process | Organization
Level 1 | Reactive | Independent Product Development | Initial | Unit Oriented
Level 2 | Awareness | Standardized Infrastructure | Managed | Business Lines Oriented
Level 3 | Extrapolate | Software Platform | Defined | Business Group/Division
Level 4 | Proactive | Software Product Family | Quantitatively Managed | Inter Division/Companies
Level 5 | Strategic | Configurable Product Base | Optimizing | Open Business

Table 1: Maturity Levels of Four Dimensions in BAPO Model

The main contribution of the research presented in this paper is a maturity assessment framework for measuring the business dimension of software product family, an area where, to the best of our knowledge, no work has been done yet. The gray shaded rectangle in Figure 1 highlights the scope of this work within the conceptual layout of software product family maturity assessment. This work is one of the steps in the BAPO-based framework of software product family maturity assessment. This research contributes towards establishing a comprehensive and unified strategy for process maturity assessment of software product family.
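As an illustration only, the maturity levels of Table 1 can be held in a small lookup structure, with one separate level recorded per BAPO dimension. The following Python sketch is not part of the framework or of the tool described in this paper; the names BAPO_LEVELS, level_name and BapoProfile are hypothetical, and the example profile values are made up. No overall score is computed, because the aggregation step is one of the steps that the conceptual model leaves open.

```python
from dataclasses import dataclass

# Level names per dimension, transcribed from Table 1 (index 0 = Level 1).
BAPO_LEVELS = {
    "business": ["Reactive", "Awareness", "Extrapolate", "Proactive", "Strategic"],
    "architecture": [
        "Independent Product Development",
        "Standardized Infrastructure",
        "Software Platform",
        "Software Product Family",
        "Configurable Product Base",
    ],
    "process": ["Initial", "Managed", "Defined", "Quantitatively Managed", "Optimizing"],
    "organization": [
        "Unit Oriented",
        "Business Lines Oriented",
        "Business Group/Division",
        "Inter Division/Companies",
        "Open Business",
    ],
}


def level_name(dimension: str, level: int) -> str:
    """Return the Table 1 name for a 1-based maturity level in a dimension."""
    names = BAPO_LEVELS[dimension]
    if not 1 <= level <= len(names):
        raise ValueError(f"level must be between 1 and {len(names)}, got {level}")
    return names[level - 1]


@dataclass(frozen=True)
class BapoProfile:
    """Hypothetical container: one separate maturity level per BAPO dimension."""

    business: int
    architecture: int
    process: int
    organization: int

    def describe(self) -> dict:
        # Map each dimension to its named level. No overall maturity value is
        # derived here, since that aggregation is not defined in the text.
        return {dim: level_name(dim, getattr(self, dim)) for dim in BAPO_LEVELS}


# Made-up example values, for demonstration only.
profile = BapoProfile(business=2, architecture=3, process=1, organization=2)
print(profile.describe())
```

Keeping the four values separate mirrors the observation above that the BAPO scales yield one value per dimension rather than a single combined figure.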

Figure 2: Software Product Family Process Maturity Assessment Approach



Related Work: Software Product Family and Business Dimension

At the Fraunhofer Institute for Experimental Software Engineering (IESE), Bayer et al. [11] developed a methodology called PuLSE (Product Line Software Engineering) for the purpose of enabling the conception and deployment of software product family within a large variety of enterprise contexts. PuLSE-Eco is the part of the PuLSE methodology that deals with defining the scope of software product family in terms of business factors. PuLSE-Eco identifies various activities that directly address the business needs of software product family, such as system information, stakeholder information, business objectives and benefit analysis. van der Linden et al. [10] identify some main factors in evaluating the business dimension of software product family, such as identity, vision, objectives, and strategic planning. Clements and Northrop [8] highlight customer interface management, market analysis, funding, and business case engineering as important activities from the perspective of organizational management. Kang et al. [12] present a marketing plan for software product family that includes market analysis and marketing strategy. The market analysis covers need analysis, user profiling, business opportunity, time to market and product pricing. The marketing strategy discusses product delivery methods. Toft et al. [13] propose the “Owen molecule model”, which consists of three dimensions: social, technology and business. The business dimension deals with setting up business goals and analyzing the commercial environment. Fritsch and Hahn [14] introduce Product Line Potential Analysis (PLPA), which aims at examining the product line potential of a business unit through discussions with managers of the business unit because, in their opinion, business managers know the market requirements, product information, and business goals of the organization.
Schmid and Verlage [15] discuss the successful case study of setting up a software product family at Market Maker, and they highlight some significantly important activities from the business aspects of the software product family process, such as market and competitor analysis and a vision of potential market segments and products. Ebert and Smouts [16] rank marketing as one of the major external success factors of the product line approach and further conclude that forecasting, ways to influence the market, and strong coordination between marketing and engineering activities are required to gain benefits from the product line approach. The summary of the related work presented in this sub-section highlights some key business factors such as strategic planning, innovation, market orientation, business vision, order of entry, and customer orientation. We used these key business factors as the basis of the framework presented in this paper to evaluate the business maturity of software product family of an organization.

The Business Dimension of Software Product Family

Business is perhaps the most crucial dimension in the software product family process, mainly due to the necessities of long-term strategic planning, initial investment, a longer payback period, and retention of the market presence. Business assessment is an essential activity for improving the overall software product family process because it provides in-depth information about the status of the business. The business requires improvements over time, mainly due to external and internal forces of change. It is very difficult to develop an efficient and effective improvement plan unless it is based on the results of a comprehensive assessment exercise. Business assessment determines the current status of the business performance of an organization and identifies the areas that require improvement. A comprehensive methodology is proposed in this paper for the business assessment of organizations dealing with software product family.

The business process consists of a certain set of activities that cover various aspects of the business. In this paper we term those sets of activities “key business factors” and use them to evaluate the business maturity of an organization. These key factors, which constitute the overall business strategy and the operations of the organization, largely determine the success or failure of the business endeavors of an organization. The key business factors used in this framework are market orientation, strategic planning, order of entry, brand name strategy, innovation, relationships management, assets management, business vision and financial management. The choice of these key business factors for evaluating the business maturity of an organization is based on a literature survey of research in software engineering, software product family, business, organization and technical management. Short descriptions of these key business factors, along with their aspects related to software product family, are provided in the next sub-sections.
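To make the structure concrete, the nine key business factors and the answers to their assessment questionnaires could be organized as in the Python sketch below. This is only an illustration under stated assumptions: the text up to this point does not specify how answers are scored, so the yes/no answer format, the helper names affirmative_share and summarize, and the example responses are all hypothetical and do not represent the evaluation method of the framework or of the tool.

```python
from enum import Enum


class KeyBusinessFactor(Enum):
    """The nine key business factors named in the framework."""

    MARKET_ORIENTATION = "market orientation"
    STRATEGIC_PLANNING = "strategic planning"
    ORDER_OF_ENTRY = "order of entry"
    BRAND_NAME_STRATEGY = "brand name strategy"
    INNOVATION = "innovation"
    RELATIONSHIPS_MANAGEMENT = "relationships management"
    ASSETS_MANAGEMENT = "assets management"
    BUSINESS_VISION = "business vision"
    FINANCIAL_MANAGEMENT = "financial management"


def affirmative_share(answers):
    """Fraction of questions answered 'yes' (True) in one questionnaire.

    Assumption for illustration only: answers are booleans, one per question.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    return sum(1 for a in answers if a) / len(answers)


def summarize(responses):
    """Map each answered factor to its share of affirmative answers."""
    return {factor: affirmative_share(ans) for factor, ans in responses.items()}


# Made-up responses to two ten-question questionnaires.
responses = {
    KeyBusinessFactor.MARKET_ORIENTATION: [True] * 7 + [False] * 3,
    KeyBusinessFactor.ORDER_OF_ENTRY: [True] * 4 + [False] * 6,
}
for factor, share in summarize(responses).items():
    print(f"{factor.value}: {share:.0%} affirmative")
```

An enumeration keeps the factor names fixed in one place, so that questionnaire data cannot be recorded under a misspelled or unknown factor.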

Market Orientation

Market orientation deals with the acquisition, sharing, interpretation and utilization of information about customers and competitors. According to Kohli and Jaworski [17], in market orientation the organization collects market intelligence about the current and future needs of customers and disseminates this intelligence across various entities within the organization for decision-making purposes. The software product family deals with developing a considerable number of products to capture a share in the market. Market orientation provides essential information about the concerns and requirements of customers, which needs to be accommodated in the successive products from a product line. Customer orientation enables an organization to develop customer-centered products. This information assists in the domain and application engineering activities of the software product family process. Information regarding the competitors is used to exploit product functionalities in order to attract new customers. The orientation of customers and competitors determines the schedules for the delivery of software products into the market at an appropriate time. Table 2 illustrates the market orientation assessment questions that are part of the software product family business assessment framework. They are designed to receive feedback from organizations in order to evaluate how effective their market orientation is.



1 Does the organization use feedback from customers to improve the quality of products and services?

2 Does the organization use feedback from customers to develop new products or services?

3 Does the organization have adequate knowledge about its customers and competitors?

4 In making decisions about new products, does the organization give consideration to the complaints and issues of its customers?

5 Does the organization have adequate resources and skills to gather information about the market?

6 Has the organization established a defined inter-communication protocol among external and internal entities for the dissemination of market intelligence?

7 Does the organization successfully respond to the actions of competitors and is it able to decrease the number of competitors over a period of time?

8 Does the organization regularly collect and analyze data from the consumer market to identify opportunities for new market segments?

9 While engaging in strategic market planning, does the organization explicitly consider competitors as its top priority?

10 Is the organization able to increase its targeted market size over time?

Table 2: Market Orientation Assessment Questionnaire

1 Does the organization have fast and accurate means to access the required information in order to facilitate responses to the queries of customers about different products and services?

2 Does the organization have a well-established system to quickly extract, manipulate and produce data for profitability analysis, customer profiling, and retention modeling?

3 Does the organization attract new and existing customers through personalized communication and innovative targeting methods?

4 Does the organization have an established promotions strategy to attract new customers and retain existing ones?

5 Does the organization simplify its business processes regularly in order to enhance the experience and satisfaction of customers?

6 Is the organization able to retain its customers over a long period of time?

7 Do the competitors perceive the software product family of the organization as a direct threat to their business?

8 Is the software product family able to respond quickly to actions of the competitors?

9 Regarding customers and competitors, has the organization established efficient resources for market intelligence?

10 Has the organization established a balance in customer and product-centered approaches in product development?

Table 3: Relationships Management Assessment Questionnaire

Relationships Management

Wilson [18] observes that relationships management is concerned with the development and maintenance of close, long-term, and mutually beneficial and satisfying relationships between individuals or organizations. Crosby et al. [19] consider relationships management as the extent to which parties have the orientation or behavioral tendency to actively cultivate and maintain close working relationships. Relationships management plays a significant role in a successful software product family business. Software products generally require assistance from the seller with installation and customer training so that customers can use the product effectively. An excellent customer support service enhances the satisfaction of the customers with the product. Customer profiling suggests new features in successive products from the software product family. Promotional strategies such as incentives for purchasing new products further increase sales and provide the justification for the product family infrastructure. Table 3 illustrates assessment questions of relationships management that are part of the software product family business assessment framework. This assessment questionnaire is designed to measure the effectiveness of the relationships management of an organization.

Order of Entry

There are three observable categories in a firm’s order of entry in the market: pioneers, early followers, and late movers [20] [21]. The benefits of being the first in the market have long been recognized in the business sector. The pioneers can gain a sustainable competitive advantage over followers because, initially, they are the only solution providers in the market. The timing with which technology-based products enter the market is critical to capturing a large share of the market. The timing of the launch of a software product is even more essential for software development organizations. The software product family produces successive products having controlled variability and commonality. The new products from the software product family share a common architecture and essentially have features common to their predecessors. In order to capture major shares of the market, timing is essential in launching a new product from the software product family. The order of entry into the market determines the delivery schedule for the software product family and provides guidelines to the developers about development schedules. Table 4 illustrates order of entry assessment questions that are part of the software product family business assessment framework.

1 Do the products developed from the software product family enter into the market at the appropriate time?

2 Does the organization have the potential of being first in the market?

3 Is the organization regarded as a pioneer in product development or is it perceived as a follower?

4 Does the software product family allow the organization to take advantage of being first in the market?

5 Are the products developed from the software product family a response to the actions of competitors?

6 Is the software product family able to increase the market presence of the organization?

7 Do the successive products of the software product family help in retaining current customers and have the tendency to attract new customers?

8 Is the software product family able to meet the demands of the delivery schedule of the customers?

9 Does the organization regularly conduct market reviews and update the development and delivery schedule of the software product family, keeping in view the market trends and needs?

10 Are the customers satisfied with the timing of a new product launch?

Table 4: Order of Entry Assessment Questionnaire



Brand Name Strategy

Organizations consider brand name a crucial catalyst of business success. A brand is regarded as both a promise of quality to customers and a point of comparison with other products or services. Bennett [22] defines a brand as a name, term, sign, symbol, design, or any combination of these concepts, used to identify the goods and services of a seller. Brand name products generally have a higher potential for increasing the business of an organization. Bergstrom [23] observes that, given the proliferation of competitors and of products that are easily duplicated or replaced, brands become an important means of simplifying the decision-making process for buyers or users. Software product family business is even more inclined towards a brand name strategy, because it envisages the business growing with a stream of products having commonality and variability among them. The brand name strategy in the software product family has a twofold advantage. First, it expands the market for profitable business, and second, it acts as a guide for new business cases, which serve as an extension of current products. Table 5 illustrates brand name strategy assessment questions that are part of the software product family business assessment framework and are designed to get feedback from organizations in order to evaluate how effective the brand name strategy of the organization is.

1 Is the organization involved in a direct or indirect brand name strategy of the software product family?

2 How is the software product family of the organization unique or different from the products of other competitors?

3 Are the new products from the software product family consistent with the current brand extension?

4 Does the organization continuously monitor the performance of the brand in the market?

5 Is the brand of the software product family aligned with the strategic plans of the organization?

6 Are the new products from the software product family attracting the customers, and are they considered as an extension or even an improved version of the predecessor?

7 How important does the organization consider brand name strategy for the software product family?

8 Does the business vision of the organization foresee a brand name for the software product family?

9 Is the software product family in direct one-to-one competition with the competitors in the market?

10 Are the decisions of the customers influenced by the brand name of the software product family?

Table 5: Brand Name Strategy Assessment Questionnaire

1 Does the organization have a well-documented business vision statement?

2 Is the business vision of the organization communicated to all members of the organization?

3 Does the business vision statement clearly state where the organization is going in the future?

4 Is the software product family a part of the business vision of the organization?

5 Is the business vision statement regularly reviewed and updated?

6 Do the employees understand the importance of the software product family in the business vision and feel that the organization can realistically achieve its targets?

7 Does the software product family play a significant role in the business vision of the organization?

8 Is the software product family development essential for the organization to reach future goals?

9 Does the business vision of the software product family aim at retaining current customers and attracting future ones?

10 Does the software product family play a major role in achieving future financial goals?

Table 6: Business Vision Assessment Questionnaire



Business Vision

In practice, a business vision is a statement that is prepared by top management and communicated to all members of the organization. The statement identifies a desired future and establishes a clear connection between that future and the present state. A successful business vision plan requires all the employees within the organization to participate and to clearly understand the vision statement. The business vision describes the commitment of the organization to achieving a goal. The software product family plays a significant role in the business vision because it tends to produce long-term benefits for the organization. The software product family is part of the strategic assets of an organization, which can be mobilized to establish a connection between the present and future goals. Assessing the importance of the software product family in an organization requires answering two questions: how the organization fits the software product family into the business vision, and how important the software product family is in its future plans. Table 6 lists the business vision assessment questions that are part of the software product family business assessment framework. This assessment is designed to obtain feedback from organizations in order to evaluate the importance of the software product family in the business vision of an organization.

1 Does the organizational strategic planning treat the software product family as an important consideration?

2 Is the software product family aligned with the strategic plans of the organization?

3 Does the strategic planning allocate sufficient resources for software product family development?

4 Do the strategic plans highlight an evolution in the software product family under changing business conditions?

5 Does the software product family play a significant role in achieving the strategic objectives of the organization?

6 Do the strategic plans define how an organization will achieve the technological capability to successfully adopt the concept of the software product family development?

7 Does the strategic planning identify key market segments for the software product family business?

8 Does the management have strategic plans about the order of entry of software products into the market?

9 Do the strategic plans envision new products from the software product family?

10 Do the strategic plans create a roadmap aligned with the business vision of the organization?

Table 7: Strategic Planning Assessment Questionnaire

Strategic Planning

A strategic plan of an organization specifies a set of activities performed to accomplish the desired level of achievement in a particular area. Strategic planning starts with elaborating strategic objectives. Harrison [24] asserts that objectives indicate what management expects to accomplish, whereas planning sets forth how, when, where and by whom the objectives will be attained. Strategic planning is a continuous process within an organization. It determines business goals, evaluates the obstacles, and defines approaches to deal with those obstacles. It outlines the definite tasks that individuals, groups, and the entire organization need to carry out to accomplish these goals. In order to set clear objectives, and to align organizational resources to match opportunities and counter threats, software product family development requires consideration in the strategic planning of the organization. The future directions of the business must accommodate the software product family as an integral asset. The software product family process needs resources that must be allocated in strategic plans. In order to gain competitive advantages, capture market segments, and achieve strategic targets, strategic planning must clearly outline what is to be developed from the software product family. This planning ensures that decisions made to allocate and commit resources reflect the relative significance of the software product family in achieving the long-range business goals. Table 7 lists the strategic planning assessment questions that are part of the software product family business assessment framework. This questionnaire is designed to obtain information about the maturity of the strategic planning of an organization dealing with a software product family.

Assets Management

Assets management outlines action plans for the creation, acquisition, maintenance, replacement and disposal of assets to provide an agreed-upon level of cost-effective and sustainable development. Assets management has a direct impact on the performance and success of the business. Chen [25] concludes that assets management of computing resources is a process that helps in managing hardware and software procurement, usage, and updates; it tracks inventory, enables change, and improves overall efficiency in software development. The notion of the software product family is conceptually aligned with assets management. The software assets repository establishes a production capability for the software product family. A strategic goal of assets management in the software product family is the optimal use of computing resources during product development. Assets management for the software product family process provides a way of managing the infrastructure and of understanding the production needs of the software development. Because reuse is inherent in the software product family development process, software assets management yields benefits when a family of similar products is developed. Table 8 lists the assets management assessment questions that are part of the software product family business assessment framework.



1 Does the organization have a defined policy of managing assets for the software product family?

2 Is the information about core assets well communicated to all personnel involved in development related activities?

3 Are the assets of the software product family dynamic, and do they continuously grow as the production proceeds?

4 Are all the assets in the repository consistent with the scope of the software product family?

5 Does the organization maintain information about assets, as well as versions and utilization history during product development?

6 Is the assets management of the organization aligned with the strategic planning?

7 Have the software assets significantly reduced the development cycle of the software product family?

8 Are the software assets consistent with the production constraints and the production plan of the software product family?

9 Does the software assets management activity satisfy the cost-to-benefits ratio for the organization?

10 Has the organization allocated sufficient resources for managing software assets?

Table 8: Assets Management Assessment Questionnaire

Innovation

One of the keys to a successful business in today’s competitive environment is innovation. Organizations are continuously adopting innovations in major areas of business operations such as technology, administration and production process. Innovation is regarded as a by-product of research and development. Martensen and Dahlgaard [26] conclude that innovation should be closely linked to the vision of the company and its overall business strategy. Innovation and continuous improvements in processes and products illustrate the capability of the organization to be creative and to be pioneers in product development. The success of the software product family is largely dependent on innovative ways of identifying potential business cases. Business cases that offer additional features with innovative ideas embedded in them have a greater potential of success in capturing the attention of new and existing customers. Software product family development not only requires research and development to enhance the process methodology and the industrialization of this concept, but it also needs innovative measures for selecting, developing and launching business cases. New ideas in market orientation and in relationships management are the true goals of the software product family in capturing a major market share. The questionnaire shown in Table 9 illustrates innovation assessment questions that form part of the software product family business assessment framework.



1 Has the organization defined a road map for research and development in software product family?

2 Does the organization successfully employ innovations in the software product family development?

3 Does the organizational culture support innovation in the software product family?

4 Does the organization use any specific guidelines or process model that represent the macro elements of the software product family innovation process?

5 Do the employees have opportunities to participate in problem solving and idea generation activities for the software product family?

6 Are the innovations in the software product family aligned with the existing business goals?

7 Does the management support reactive and proactive innovations in the software product family process?

8 Does the organization allocate sufficient resources to research and development in the software product family?

9 Does the organization’s past research improve the development and management processes of the software product family?

10 Does the organization believe that investment in R&D can yield positive results in the near future?

Table 9: Innovation Assessment Questionnaire

Financial Management

Financial management deals with making decisions about fiscal matters within an organization. A financially strong organization envisions business progress, especially in terms of income, balance, and cash flow. Effective financial policies lead to successful businesses. The financial strength of an organization has a major impact on software product family development and management. Some of the financial indicators generally used in monitoring the performance of the business, found in [27], are as follows:

Current Ratio: is the ratio between all current assets and all current liabilities, i.e. $\frac{\text{Current Assets}}{\text{Current Liabilities}}$; a ratio of more than 1 is favorable for an organization.

Debt to Equity: shows the ratio between capital invested by the organization and the funds provided by lenders, i.e. $\frac{\text{Debt}}{\text{Equity}}$; a lower value shows the financial strength of an organization.

Debt Coverage Ratio: indicates how well cash flow covers debt and the capacity of the business to acquire additional debt, i.e. $\frac{\text{Net Profit} + \text{Expenses}}{\text{Total Debt}}$; a higher value indicates that the organization is earning well and can pay back its liabilities.

Sales Growth: a percentage increase (or decrease) in sales between two time periods, i.e. $\frac{\text{Current Year Sale} - \text{Last Year Sale}}{\text{Last Year Sale}} \times 100$; a higher value shows a growth in sales.

Net Profit Margin: indicates how much profit comes from sales, i.e. $\frac{\text{Net Profit}}{\text{Total Sale}}$; an improvement in this ratio shows how effectively an organization is growing its sales.

Return on Assets: is a measure of how effectively assets are used to generate a return, i.e. $\frac{\text{Net Profit}}{\text{Total Assets}}$; a higher value indicates that assets are being used effectively for return.

Return on Investment: is a measure of net benefits from a given investment, i.e. $\frac{\text{Net Profit}}{\text{Total Investment}}$; a higher value shows the financial strength of the organization.

Payback Period: is the number of years required for covering the cost of an investment, i.e. $\frac{\text{Total Investment}}{\text{Periodic Savings}}$; a lower value depicts the ability of an organization to cover the market.
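The indicators listed above are simple ratios, so they can be computed directly from figures in the financial statements. The following is a minimal sketch, not part of SPF_BET: the function and parameter names are illustrative choices, and the figures in the usage lines are hypothetical.

```python
# Minimal sketch of the financial indicators listed above.
# Function names, parameter names and sample figures are illustrative assumptions.

def current_ratio(current_assets, current_liabilities):
    """Current assets over current liabilities; above 1 is favorable."""
    return current_assets / current_liabilities

def debt_to_equity(debt, equity):
    """Debt over equity; a lower value indicates financial strength."""
    return debt / equity

def debt_coverage_ratio(net_profit, expenses, total_debt):
    """(Net profit + expenses) over total debt; higher is better."""
    return (net_profit + expenses) / total_debt

def sales_growth(current_year_sale, last_year_sale):
    """Percentage change in sales between two periods."""
    return (current_year_sale - last_year_sale) / last_year_sale * 100

def net_profit_margin(net_profit, total_sale):
    """Net profit over total sales."""
    return net_profit / total_sale

def return_on_assets(net_profit, total_assets):
    """Net profit over total assets."""
    return net_profit / total_assets

def return_on_investment(net_profit, total_investment):
    """Net profit over total investment."""
    return net_profit / total_investment

def payback_period(total_investment, periodic_savings):
    """Number of periods needed to recover the investment."""
    return total_investment / periodic_savings

# Hypothetical figures, for illustration only.
print("Current ratio:", current_ratio(500_000, 250_000))
print("Sales growth (%):", sales_growth(1_200_000, 1_000_000))
print("Payback period (years):", payback_period(300_000, 100_000))
```

Tracking these values over successive periods, rather than at a single point in time, is what the trend-oriented questions of the financial management questionnaire ask about.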

1 Is the current ratio of current assets to current liabilities higher than one?

2 Is the ratio of total debt to total capital decreasing over a period of time?

3 Is the organization able to reduce its debt?

4 Do the sales grow over a period of time?

5 Does the net profit margin increase over a period of time?

6 Does the return on assets increase over a period of time?

7 Does the return on investment increase over a period of time?

8 Does the payback period decrease over a period of time?

9 Does the software product family fit into the financial model of the organization?

10 Is the software product family contributing towards strengthening the financial position of the organization?

Table 10: Financial Management Assessment Questionnaire

Financial management revolves around the software product family. A successful software product family plays a key role in achieving the desired financial strength of an organization. Some of the financial indicators, such as current ratio, debt to equity and debt coverage ratio, highlight an organization’s ability to invest in the software product family. Sales growth and net profit margin depict how successfully the software product family contributes to business growth. Return on assets, return on investment and payback period indicate the potential of the software product family to achieve the long-term financial goals of an organization. Table 10 lists the financial management assessment questions, which form part of the software product family business assessment framework.



Software Product Family Business Evaluation Tool (SPF_BET)

The business assessment of the software product family of an organization requires input from the organization about the status of the various activities that contribute to the performance of the overall business process. The questionnaires presented in Tables 2 to 10 serve as an initial point of contact for receiving feedback from an organization. There are 10 questions for each key business factor, 90 questions altogether. A fuzzy logic-based tool was designed and implemented on the basis of the questionnaires shown in Tables 2 to 10. This tool is intended to measure the business performance of an organization by processing the data of the key business factors. A detailed discussion of the fuzzy logic approach and its methodology is beyond the scope of this paper. The overall processing sequence of the tool, shown in Figure 3, illustrates that:

Individual key business factors, such as market orientation, relationships management, and order of entry, are assessed by using the respective questionnaires as an input to a fuzzy logic system.

Overall business performance is evaluated by applying the assessment of individual key business factors to the next stage fuzzy logic system.

A fuzzy logic system [28] [29] requires certain inputs to process. In a fuzzy logic system, the term “crisp value” is used to represent any precise numerical value such as 2, –3, or 7.34. The questionnaires shown in Tables 2 to 10 are used to obtain inputs in the form of crisp values. The crisp input to the fuzzy logic system depends on the values entered for each question. In order to measure the extent to which each of the questions in the questionnaires about key business factors was practiced in the organization, we used multi-item, five-point Likert scales that ranged from “Strongly Disagree” (1) to “Strongly Agree” (5). Figure 4 illustrates the two-variable fuzzy logic system used for processing the key business factors data. The system requires the input of two variables, which can be any combination of two questions presented in Tables 2 to 10. These two variables undergo a fuzzification process, which converts the crisp input into a fuzzy membership mapping. This mapping is applied to the inference engine, which in turn interacts with the rule base to select the applicable rules based on the input variable values. The fuzzy output is then defuzzified to retrieve a crisp output. The decision to use a two-variable fuzzy logic system is based on an associative property of fuzzy sets, which allows a larger set of inputs to be processed in stages of two variables each. Since the questions presented in Tables 2 to 10 can be extended to cover other possible aspects of the software product family, this design choice can easily accommodate further expansion of the input to the system.



Figure 3: Processing Sequence of SPF_BET

Figure 4: Two-Variable Fuzzy Logic System Architecture



Figure 5: Triangular Fuzzy Set

Figure 6: Triangular Model

The crisp input and output of the system are selected to fall in the range of 1 to 5. The crisp input values are divided into five linguistic variables: “Strongly disagree”, “Disagree”, “Neither agree nor disagree”, “Agree” and “Strongly agree”. The crisp output values are divided into five linguistic variables: “Reactive”, “Awareness”, “Extrapolate”, “Proactive” and “Strategic”. These are the same maturity scales for the business dimension that are put forward by van der Linden et al. [10]. The input and output variables are represented by a triangular function. The graphical representation and mathematical equation of the triangular functions used to portray the linguistic variables of the input and output are shown in Figures 5 and 6. The triangular function attains the highest fuzzy membership value of “1” at a certain required point. The variables “a”, “b”, and “c” construct the shape of the triangle. The variables “a” and “c” represent the lower left and right points of the triangle, where the fuzzy membership mapping has its minimum of 0, whereas the variable “b” marks the point with the highest fuzzy membership mapping of 1. The choice of variables a, b, and c to represent the triangular function for all five linguistic variables of input and output is given in Table 11.
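The equation shown in Figure 6 is not reproduced in the text. As a reference point, the standard triangular membership function with parameters a ≤ b ≤ c can be written as below. Reading Figure 6 as this function is an assumption. The first case is listed separately so that the shoulder shapes in Table 11, where a = b or b = c, still reach a membership of 1 at b.

```latex
\mu(x;\, a, b, c) =
\begin{cases}
1, & x = b, \\[4pt]
\dfrac{x - a}{b - a}, & a < x < b, \\[8pt]
\dfrac{c - x}{c - b}, & b < x < c, \\[8pt]
0, & \text{otherwise.}
\end{cases}
```

For example, with (a, b, c) = (3, 4, 5) a crisp value of 3.5 has membership (3.5 − 3)/(4 − 3) = 0.5.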

| Input Linguistic Variable | Output Linguistic Variable | Crisp Value Range | Triangular Function Variable Values For Fuzzy Membership Mapping: a | b | c |
|---|---|---|---|---|---|
| Strongly disagree | Reactive | 1 to 2 | 1 | 1 | 2 |
| Disagree | Awareness | 1 to 3 | 1 | 2 | 3 |
| Neither agree nor disagree | Extrapolate | 2 to 4 | 2 | 3 | 4 |
| Agree | Proactive | 3 to 5 | 3 | 4 | 5 |
| Strongly agree | Strategic | 4 to 5 | 4 | 5 | 5 |

Table 11: Input (Likert Scale) and Output Linguistic Variables and Fuzzy Membership Mapping
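The fuzzification step can be illustrated with the parameter values of Table 11. The sketch below is not taken from SPF_BET. It assumes the standard triangular membership function, and the names `triangular`, `INPUT_SETS` and `fuzzify` are hypothetical.

```python
# Sketch of fuzzification with the (a, b, c) values of Table 11.
# Assumes the standard triangular membership function; names are hypothetical.

def triangular(x, a, b, c):
    """Membership of crisp value x in the triangle with corners a, b, c."""
    if x == b:
        return 1.0  # peak; also covers the shoulders where a == b or b == c
    if a < x < b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0

# (a, b, c) for each input linguistic variable, copied from Table 11.
INPUT_SETS = {
    "Strongly disagree": (1, 1, 2),
    "Disagree": (1, 2, 3),
    "Neither agree nor disagree": (2, 3, 4),
    "Agree": (3, 4, 5),
    "Strongly agree": (4, 5, 5),
}

def fuzzify(x):
    """Map a crisp value in the range 1 to 5 to a membership per label."""
    return {label: triangular(x, *abc) for label, abc in INPUT_SETS.items()}

# A Likert answer of 4 belongs fully to "Agree".
print(fuzzify(4))
# An intermediate value of 3.5 is shared between two neighbouring labels.
print(fuzzify(3.5))
```

With these parameters, an integer Likert answer always has full membership in exactly one label, while intermediate values produced by earlier processing stages are split between two adjacent labels.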

The fuzzy logic rule base contains the fuzzy logic rules used for fuzzy reasoning in the software product family business evaluation tool. The rules were created through discussions with experts in various organizations actively involved in the software product family business. Each rule maps a combination of the two inputs to the respective output. The variables defined as input_1 and input_2 can be any combination of questions presented in the questionnaires. There are fifteen rules for the software product family business evaluation tool, one for each unordered pair of the five input linguistic variables. Table 12 shows the truth table of the fuzzy rule base.



| S.No | Input_1 | Input_2 | Output |
|---|---|---|---|
| 1 | Strongly disagree | Strongly disagree | Reactive |
| 2 | Strongly disagree | Disagree | Awareness |
| 3 | Strongly disagree | Neither agree nor disagree | Awareness |
| 4 | Strongly disagree | Agree | Extrapolate |
| 5 | Strongly disagree | Strongly agree | Extrapolate |
| 6 | Disagree | Disagree | Awareness |
| 7 | Disagree | Neither agree nor disagree | Extrapolate |
| 8 | Disagree | Agree | Extrapolate |
| 9 | Disagree | Strongly agree | Extrapolate |
| 10 | Neither agree nor disagree | Neither agree nor disagree | Extrapolate |
| 11 | Neither agree nor disagree | Agree | Proactive |
| 12 | Neither agree nor disagree | Strongly agree | Proactive |
| 13 | Agree | Agree | Proactive |
| 14 | Agree | Strongly agree | Strategic |
| 15 | Strongly agree | Strongly agree | Strategic |

Table 12: Truth Table of Fuzzy Rule Base
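To make the interplay of fuzzification, rule base and defuzzification concrete, the following sketch evaluates one two-variable stage using the rules of Table 12 and the triangular parameters of Table 11. The paper does not state which inference and defuzzification methods SPF_BET uses, so several parts are assumptions made only for illustration: the firing strength of a rule is the minimum of the two input memberships, strengths for the same output label are combined with the maximum, the crisp output is the weighted average of the output peaks, and each rule is applied regardless of the order of the two inputs. All names are hypothetical.

```python
from itertools import product

# One two-variable fuzzy stage. Rules and (a, b, c) values come from Tables 11
# and 12; min/max inference and weighted-average defuzzification are assumptions.

SD, D, N, A, SA = LABELS = [
    "Strongly disagree", "Disagree", "Neither agree nor disagree",
    "Agree", "Strongly agree",
]
SETS = dict(zip(LABELS, [(1, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 5)]))
OUTPUT_PEAK = {"Reactive": 1, "Awareness": 2, "Extrapolate": 3,
               "Proactive": 4, "Strategic": 5}

RULES = {  # Table 12, keyed by (Input_1, Input_2)
    (SD, SD): "Reactive", (SD, D): "Awareness", (SD, N): "Awareness",
    (SD, A): "Extrapolate", (SD, SA): "Extrapolate",
    (D, D): "Awareness", (D, N): "Extrapolate", (D, A): "Extrapolate",
    (D, SA): "Extrapolate",
    (N, N): "Extrapolate", (N, A): "Proactive", (N, SA): "Proactive",
    (A, A): "Proactive", (A, SA): "Strategic", (SA, SA): "Strategic",
}

def triangular(x, a, b, c):
    """Standard triangular membership; x == b covers the shoulder shapes."""
    if x == b:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0

def rule_output(label_1, label_2):
    """Look up Table 12, treating the input pair as unordered (assumption)."""
    return RULES[tuple(sorted((label_1, label_2), key=LABELS.index))]

def evaluate(x1, x2):
    """Return the crisp output and the firing strength per output label."""
    if not (1 <= x1 <= 5 and 1 <= x2 <= 5):
        raise ValueError("crisp inputs must lie in the range 1 to 5")
    strengths = {}
    for l1, l2 in product(LABELS, repeat=2):
        w = min(triangular(x1, *SETS[l1]), triangular(x2, *SETS[l2]))
        if w > 0:
            out = rule_output(l1, l2)
            strengths[out] = max(strengths.get(out, 0.0), w)
    total = sum(strengths.values())
    crisp = sum(OUTPUT_PEAK[o] * w for o, w in strengths.items()) / total
    return crisp, strengths

print(evaluate(2, 1))    # "Disagree" with "Strongly disagree"
print(evaluate(1.5, 1))  # an intermediate value fires two rules
```

Under these assumptions, two integer answers always fire exactly one rule, so the crisp output is simply the peak of that rule's output label. Non-integer inputs fire several rules and yield a blended value.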

Figure 7: Input Screen Shot of Software Product Family Business Evaluation Tool



Figure 8: Output Screen Shot of Software Product Family Business Evaluation Tool

Figures 7 and 8 are input and output screen shots of the software product family business evaluation tool.

Case Study & Assessment Approach

Using the framework presented in this work, we conducted eight case studies in order to perform the business assessment of organizations actively involved in the software product family process. The input questionnaires shown in Tables 2 to 10 were distributed to the organizations in order to obtain actual data regarding the status of the software product family business within those organizations. The major sources of data, i.e., documents, plans, models and actors, were identified after discussions with the organizations in order to reduce the chances of over- and underestimation by human judgment in filling in the questionnaires and to increase the reliability of the approach. Table 13 lists some of the sources of data and the actors involved in acquiring the data of the key business factors of an organization. The organizations were requested to respond to each question in the questionnaires and to provide values in the range of 1 to 5 best reflecting their current process. The value “1” corresponds to a low rating (Strongly disagree), whereas the value “5” indicates a high rating (Strongly agree). After the questionnaires were received from the organizations, the data values were processed using SPF_BET. The maturity of each individual key business factor and the overall business performance of the organizations were then evaluated. To demonstrate the application of the framework, we present the case study of only one organization in this paper, mainly because of the length of the paper.

| Business Key Process Area | Sources of Data | Department or Actor Title |
|---|---|---|
| Market Orientation | Market Analysis, Competitors Information Survey, Strategic Marketing Plans, Sales Mission Statement, Business Model, Advertising, Strategies, Competition and Buying Patterns, Sales Forecast, Product Portfolio, Domain Model | Sales Force, Marketing Strategist, Business Analyst, Portfolio Analyst, Domain Engineer |
| Relationships Management | Sales Data, Customer Profiling and History, Customers Complaint Log, Product Promotions Plans and Effects, Product Advertising Plans, Public Relations, Procedures of Sales and Distribution, Customer Inquiries and Satisfactions Ratio | Customer Relation Officer, Sales Force, Customer Support Representative, Product Developers, Requirements Engineer |
| Order of Entry | Business Model, Competition and Buying Patterns, Product Launch Timings, Business Case Evaluation, Sales Projections, Sales Data, Market Trend Analysis, Domain Model | Sales Force, Business Analyst, Marketing Strategist, Senior Management, Production Team, Domain Engineer, Application Engineer |
| Brand Name Strategy | Business Model, Brand Strength, Sales and Distribution Procedures, Competition and Buying Patterns, Brand Competitors Threat Analysis, Product Portfolio, Domain Model | Sales Force, Business Analyst, Marketing Strategist, Senior Top Management |
| Business Vision | Business Vision Statement | Senior Top Management |
| Strategic Planning | Strategic Planning Document, Strategic Plans Reviews, Strategic Planning Change Requests, Strategic Plans Implementation Guidelines, Organizational Communications Procedures | Senior Top Management, Middle Management, Supervisory Staff, Product Developers |
| Assets Management | Core Assets Repository, Assets Utilization History, Product Log, Commonality Management, Product Features, Variability Management, Requirements Engineering Documents | Developers, System Analyst, Requirements Engineer, Assets Management Team |
| Innovation | Research Plans, Product Innovative Features, Research Financial Model, Competitors Product Analysis, Domain Model | Research Staff, Senior Top Management, Middle Management |
| Financial Management | Balance Sheet, Financial Statement, Projected Profit-Cost Analysis, Cash Flow, Sales Forecast | Financial Controller, Senior Top Management |

Table 13: Sources of Data of Business Evaluation Framework

Case Study

Organization “A” has been actively involved in the business of telecommunications and is one of the largest organizations in the mobile phone industry. The data provided by Organization “A”, shown in Table 14, describes the current business status of its software product family. Table 15 shows the results produced by SPF_BET from the data provided by the organization. Four key business factors, namely market orientation, relationships management, order of entry, and assets management, are at the “Extrapolate” level. Brand name strategy, business vision, innovation, and financial management are at the “Proactive” level. The organization has also achieved the “Proactive” level in the area of strategic planning, and with a maturity value of 4.22 it is moving towards the “Strategic” level. The overall business maturity of Organization “A” is at the “Proactive” level. This relatively high level of business performance reflects the commitment and ability of the organization to adopt the software product family business successfully. However, Organization “A” can further increase its business performance by incorporating improvements in the categories of market orientation, relationships management, order of entry, assets management and innovation.

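| Business Key Factors / Question Number of Questionnaires | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Market Orientation | 5 | 4 | 3 | 4 | 3 | 3 | 2 | 2 | 2 | 2 |
| Relationships Management | 4 | 1 | 3 | 3 | 2 | 4 | 4 | 2 | 2 | 2 |
| Order of Entry | 2 | 1 | 2 | 1 | 3 | 4 | 4 | 4 | 2 | 2 |
| Brand Name Strategy | 4 | 4 | 4 | 4 | 5 | 4 | 4 | 4 | 2 | 4 |
| Business Vision | 5 | 5 | 5 | 2 | 4 | 2 | 3 | 4 | 5 | 3 |
| Strategic Planning | 4 | 4 | 2 | 4 | 4 | 4 | 4 | 4 | 4 | 5 |
| Assets Management | 2 | 2 | 3 | 4 | 2 | 2 | 2 | 2 | 2 | 2 |
| Innovation | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 3 | 4 | 4 |
| Financial Management | 5 | 4 | 5 | 4 | 3 | 4 | 4 | 4 | 3 | 4 |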

Table 14: Business Assessment Input Data of Organization “A”

| Business Key Factors | Maturity Value | Maturity Scale |
|---|---|---|
| Market Orientation | 3 | Extrapolate |
| Relationships Management | 3 | Extrapolate |
| Order of Entry | 3 | Extrapolate |
| Brand Name Strategy | 4 | Proactive |
| Business Vision | 4 | Proactive |
| Strategic Planning | 4.22 | Proactive to Strategic |
| Assets Management | 3 | Extrapolate |
| Innovation | 4 | Proactive |
| Financial Management | 4 | Proactive |
| Overall Business Evaluation | 4 | Proactive |

Table 15: Business Performance of the Case Study

Figure 9 shows the processing sequence and the intermediate results collected at each stage of the two-variable fuzzy logic systems during the evaluation of the key business factor “order of entry” for this case study. The same structure and architecture are used for preprocessing and evaluating all the other key business factors. Figure 10 illustrates how the key business factors are processed to produce the overall business assessment of the case study.

Figure 9: Order of Entry Intermediate Processing Sequence and Results Using SPF_BET



Figure 10: Intermediate Processing Sequence and Results of Business Assessment of Case Study using SPF_BET

Final Remarks & Future Work

This research contributes towards establishing a comprehensive and unified strategy for the maturity assessment of the software product family process. An assessment framework for measuring the business dimension of the software product family process has been put forward in this paper. The software product family business evaluation tool presented in this work can be used to preprocess the key business factors data and to evaluate the overall business maturity of an organization. The framework and tool provide direct mechanisms for measuring the current maturity level of the software product family business of an organization. The case study presented in this research shows the performance of an organization in the business of software product family and demonstrates the application of the framework. We are currently developing a Process Maturity Model for the process assessment of software product families. This business assessment framework is a part of that research.



References

[1] R.V. Ommering, Beyond product families: building a product population, in: Proceedings of the Conference on Software Architectures for Product Families, (2000), pp. 187-198.

[2] T. Wappler, Remember the basics: key success factors for launching and institutionalizing a software product line, in: Proceedings of the 1st International Conference on Software Product Lines, (2000), pp. 73-84.

[3] P.C. Clements, On the importance of product line scope, in: Proceedings of the 4th International Workshop on Software Product Family Engineering, (2001), pp. 69-77.

[4] G. Buckle, P.C. Clements, J.D. McGregor, D. Muthig, K. Schmid, Calculating ROI for software product lines, IEEE Software 21(3) (2004) 23-31.

[5] F. van der Linden, Software product families in Europe: the ESAPS & Café projects, IEEE Software 19(4) (2002) 41-49.

[6] P.C. Clements, L.G. Jones, L.M. Northrop, J.D. McGregor, Project management in a software product line organization, IEEE Software 22(5) (2005) 54-62.

[7] L. Jones, A. Soule, Software process improvement and product line practice: CMMI and the framework for software product line practice, Software Engineering Institute, Available from: http://www.sei.cmu.edu/pub/documents/02.reports/pdf/02tn012.pdf, (2002).

[8] P.C. Clements, L.M. Northrop, Software product lines: practices and patterns, Addison Wesley, (2002).

[9] F. Ahmed, L.F. Capretz, A framework for software product line process assessment, Journal of Information Technology Theory and Application 7(1) (2005) 135-157.

[10] F. van der Linden, J. Bosch, E. Kamsties, K. Känsälä, H. Obbink, Software product family evaluation, in: Proceedings of the 3rd International Conference on Software Product Lines, (2004), pp. 110-129.

[11] J. Bayer, O. Flege, P. Knauber, R. Laqua, D. Muthig, K. Schmid, T. Widen, J.M. DeBaud, PuLSE: a methodology to develop software product lines, in: Proceedings of the 5th ACM SIGSOFT Symposium on Software Reusability, (1999), pp. 122-131.

[12] K.C. Kang, P. Donohoe, E. Koh, J. Lee, K. Lee, Using a marketing and product plan as a key driver for product line asset development, in: Proceedings of the 2nd International Conference on Software Product Lines, (2002), pp. 366-382.

[13] P. Toft, D. Coleman, J. Ohta, A cooperative model for cross-divisional product development for a software product line, in: Proceedings of the 1st International Conference on Software Product Lines, (2000), pp. 111-132.

[14] C. Fritsch, R. Hahn, Product line potential analysis, in: Proceedings of the 3rd International Conference on Software Product Lines, (2004), pp. 228-237.

[15] K. Schmid, M. Verlage, The economic impact of product line adoption and evolution, IEEE Software 19(4) (2002) 50-57.

[16] C. Ebert, M. Smouts, Tricks and traps of initiating a product line concept in existing products, in: Proceedings of the 25th International Conference on Software Engineering, (2003), pp. 520-525.

[17] A. Kohli, B. Jaworski, Market orientation: the construct, research propositions, and managerial implications, Journal of Marketing 54 (1990) 1-18.

[18] D. T. Wilson, An integrated model of buyer-seller relationships, Journal of the Academy of Marketing Science 23 (1995) 335-345.

[19] L. Crosby, K. Evans, D. Cowles, Relationship quality in services and selling: an interpersonal influence perspective, Journal of Marketing 54 (1990) 68-81.

[20] H. Ansoff, J. Stewart, Strategies for technology-based business, Harvard Business Review 43 (1967) 71-83.

[21] W. Robinson, C. Fornell, M. Sullivan, Are market pioneers intrinsically better than later entrants? Strategic Management Journal 13 (1992) 609-624.

[22] P.D. Bennett, Dictionary of Marketing Terms, American Marketing Association, (1988).



[23] A. Bergstrom, Cyber branding: leveraging your brand on the internet, Strategy and Leadership 28(4) (2000) 10-15.

[24] E. F. Harrison, Strategic planning maturities, Management Decisions 33(2) (1995) 48-55.

[25] J. C. Chen, Enterprise computing assets management: A case study, Industrial Management and Data System 102(2) (2002) 80-88.

[26] A. Martensen, J. J. Dahlgaard, Strategy and planning for innovation management: a business excellence approach, International Journal of Quality and Reliability Management 16(8) (1999) 734-755.

[27] L.J. Gitman, S.M. Hennessey, Principles of managerial finance, Addison Wesley, (2004).

[28] L. A. Zadeh, The concept of a linguistic variable and its applications to approximate reasoning-I, Information Sciences 8 (1975) 199-249.

[29] L. A. Zadeh, The concept of a linguistic variable and its applications to approximate reasoning-II, Information Sciences 8 (1975) 301-357.

Bibliography

Faheem Ahmed is a Ph.D. candidate at the University of Western Ontario, Canada, where he received his Masters degree in Electrical Engineering with emphasis in Software Engineering. He received his M.Sc. degree in Electronics from Quaid-e-Azam University, Islamabad, Pakistan. Before joining Western as a graduate student, he worked in the software industry for 10 years. During his professional career he was actively involved in requirements engineering, design, development, and testing of software products. His current research interests are software engineering, software product line process modeling and process assessment, CASE tools, fuzzy logic, object-oriented design and programming languages. He is a student member of IEEE.

Luiz F. Capretz has over 20 years of experience in the software engineering field as a practitioner, manager and educator. Before joining the University of Western Ontario, in Canada, he worked from 1981 at both the technical and managerial levels, and taught and carried out research on the engineering of software in Brazil, Argentina, England, and Japan. He was the Director of Informatics and Coordinator of the computer science program at two universities (UMC and COC) in the State of Sao Paulo/Brazil. He has authored and co-authored over 50 peer-reviewed research papers on software engineering in leading international journals and conference proceedings, and has co-authored the book, Object-Oriented Software: Design and Maintenance, published by World Scientific. His current research interests are software engineering (SE), human factors in SE, software estimation, software product lines, and software engineering education. Dr. Capretz received his Ph.D. from the University of Newcastle upon Tyne (U.K.), M.Sc. from the National Institute for Space Research (INPE-Brazil), and B.Sc. from UNICAMP (Brazil). He is a senior member of IEEE.



Formulation Schema Matching Problem for Combinatorial Optimization Problem

Zhi Zhang 1*, Haoyang Che 2, Pengfei Shi 1, Yong Sun 3, Jun Gu 3

1 Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, China

2 Institute of Software, The Chinese Academy of Sciences, Beijing 100080, China

3 Department of Computer Science, Science & Technology University of Hong Kong, Hong Kong, China

Abstract: Schema matching is the task of finding semantic correspondences between the elements of two schemas, and it plays a key role in many database applications. In this paper, we cast the schema matching problem (SMP) into a multi-labeled graph matching problem. First, we propose an internal schema model, the multi-labeled graph model, and transform schemas into multi-labeled graphs. SMP therefore reduces to labeled graph matching, which is a classic combinatorial problem. Secondly, we study a generic graph similarity measure based on the Contrast Model and propose a versatile optimization function to compare two multi-labeled graphs. We can then design an optimization algorithm to solve the multi-labeled graph matching problem. Starting from the matching result obtained by greedy matching, we implement a fast hybrid search algorithm to find feasible matching results. Finally, we use several schemas to test the hybrid search algorithm. The experimental results confirm that the algorithmic model and the hybrid algorithm are effective.

Introduction

The goal of schema matching is to find semantic correspondences between the elements of two schemas. It plays a key role in many database applications such as schema integration, data warehousing, e-business, XML message mapping, and semantic query processing [19]. However, schema matching remains a largely manual, labor-intensive, and expensive process. Problem formulation is an extremely important part of problem solving: the choice of a good formulation can yield order-of-magnitude savings in solving cost. In this paper, we study how to cast the schema matching problem (SMP) into a multi-labeled graph matching problem. Multi-labeled graph matching is a kind of graph matching problem, and graph matching is well known as a classic combinatorial optimization problem for which many approaches exist. Therefore, within the framework of graph matching, we can design a heuristic approach to attack schema matching. First, we propose a meta-model, the multi-labeled graph model, to represent various kinds of schemas. We extract the elements of a schema as the vertices of a graph and the properties of the elements as the labels of the vertices, so that each vertex and edge can be associated with a set of labels describing its properties. For labeled graph matching, we want to obtain the correspondences between the vertices of two graphs. Therefore, we discuss a generic graph similarity measure based on the Contrast Model and propose an optimization function based on multi-labeled graph similarity. In this way, we transform SMP into a multi-labeled graph matching problem, which is a classic combinatorial problem, and develop the algorithmic model for SMP. Finally, we implement a hybrid search algorithm to find feasible matching correspondences.

The paper is organized as follows. Section 2 discusses related work on schema matching. Section 3 presents a meta-model of schemas, the multi-labeled graph. Section 4 introduces the definition of SMP based on the multi-labeled graph. We call the result of SMP a multivalent matching, which is composed of multivalent correspondences, and we then formulate SMP as a multi-labeled graph matching problem. Section 5 investigates a generic graph similarity measure based on the Contrast Model and proposes an objective function for schema matching. Section 6 studies a hybrid search algorithm in detail. In Section 7, we use some experiments to evaluate our approach. Section 8 makes some concluding remarks and discusses our future work.

* Corresponding author. E-mail address: [email protected] (Z. Zhang).

Related Work

Numerous solutions to SMP have been proposed for specific applications. Madhavan et al. [13, 18] implemented the Cupid system for semi-automatic schema matching, which uses a hybrid matching algorithm comprising linguistic and structural schema matching techniques and computes similarity coefficients with the assistance of a precompiled thesaurus. Machine learning is a promising technique, especially for evaluating data instances to predict element similarity. The LSD system [10] uses machine-learning techniques to match a pair of schemas, and the accuracy of its predictions depends on suitable training. The predictions of individual matchers are combined by a so-called meta-learner, which weights the predictions of a matcher according to the accuracy it showed during the training phase. Berlin and Motro [3] devised the Automatch system for database schema matching, which also uses machine learning techniques, based primarily on Bayesian learning. Automatch acquires probabilistic knowledge from examples of schemas that have been “mapped” by domain experts into a knowledge base of database attributes called the attribute dictionary, and then uses the attribute dictionary to find an optimal matching. Melnik et al. [14, 15] used a graph matching algorithm, Similarity Flooding, to achieve schema matching by measuring the similarity between vertices of two schemas. The similarity between pairs of vertices, described by a nonnegative vector, is computed iteratively until it converges to a fixed point. Bouquet [5] viewed each semantic schema as a context and proposed an algorithm based on a SAT solver to match two schemas. Building on [5], Giunchiglia et al. [11] developed the S-Match algorithm, a schema-based schema/ontology matching system that implements a semantic matching approach.
It takes two graph-like structures (e.g., database schemas or ontologies) as input and returns as output the semantic relations between the nodes of the graphs that correspond semantically to each other. They used five semantic relations to represent the matching relationships between two elements: equivalence, more general, less general, mismatch, and overlapping. Miller proposed a semi-automated mapping tool, Clio, to obtain mappings between a given target schema and a new schema [16]. The algorithm regards schema mapping as



query discovery, which uses a query search method to match the schemas. Do and Rahm [8] devised the COMA schema matching system, which follows a composite approach: it provides an extensible library of different matchers and supports various ways of combining match results. For details of SMP, we refer to two surveys of schema matching [9, 19]. Graphs are versatile representation tools that have been used in schema matching [13, 14, 15]. In [24], Zhang et al. proposed a meta-meta structure based on universal algebra, which is named the multi-labeled schema. In [25], they used a multi-labeled graph model as the internal schema model, which is an instance of the multi-labeled schema. As a result, SMP can be reduced to a graph matching problem. The graph matching problem (i.e., graph homomorphism) is one of the classic combinatorial optimization problems. To retrieve similar cases in a CBR system, Champin and Solnon [6] proposed a generic similarity measure model to compare multi-labeled graphs based on the Contrast Model [21]. The Contrast Model was proposed by Tversky; in it, similarity is determined by matching the features of the compared entities. Based on their work, Zhang et al. [25] used the labeled graph similarity model to design a greedy matching algorithm. In this paper, we formulate the schema matching problem as a multi-labeled graph matching problem. Then, we discuss the similarity measure of multi-labeled graphs based on the Contrast Model and define the best matching result based on the features of two schemas. Finally, we design a hybrid search algorithm to solve this combinatorial optimization problem.

Multi-labeled Graph Model

Multi-labeled Schema

There are many kinds of schemas, such as relational model, object-oriented model, ER model, conceptual graph, DTD, XML schema, etc. In [24], Zhang et al. proposed a meta-meta model of schema: multi-labeled schema, which views schemas as finite structures over the specific signatures.

Definition 1. (Schema) A schema $S$ is a finite structure over a signature $\sigma$. It consists of an individual set $I^S$, a label collection $Lab^S$, a function set $F^S$, and a relation set $R^S$, written as a 4-tuple $S = (I^S, Lab^S, F^S, R^S)$, where:

1. $\sigma$ is a finite collection composed of individual symbols, label symbols, function symbols, and relation symbols, where each function symbol $f$ or relation symbol $R$ comes associated with an arity, $ar(f)$ or $ar(R)$ respectively, which is a non-negative integer.

2. $I^S = \{s_1, s_2, \ldots, s_n\}$ is a finite nonempty set of individuals, which denote the prepared-matching objects. Each of them is uniquely identified by an object identifier (OID).

3. $Lab^S = \{Lab^S_1, Lab^S_2, \ldots, Lab^S_i\}$ is a finite constant collection that includes the label sets for individuals. The labels are strings describing the properties of individuals.

4. $F^S = \{f^S_1, f^S_2, \ldots, f^S_j\}$ is a finite set of labeling functions, which are partial functions. The domain of each function is the individual set and its codomain is the label collection.

5. $R^S = \{R^S_1, R^S_2, \ldots, R^S_k\}$ is a finite nonempty set of relations between individuals. If $R$ is a $b$-ary relation, then $R \subseteq (I^S)^b$.

6. The size of schema $S$ is the number of its individuals and is denoted by $|I^S|$.

Multi-labeled Graph Model

Based on multi-labeled schema, Zhang et al. [25] proposed a multi-labeled graph model, which is an instance of multi-labeled schema, to describe various schemas, where each vertex and edge can be associated with a set of labels describing its properties. Such a multi-labelling could be very useful to describe schemas more accurately.

Definition 2. A schema $S$ can be represented by a labeled graph structure $S = (V^S, E^S, Lab^S, r_V^S, r_E^S)$.

1. $V^S$ is the finite set of vertices. Vertices are prepared-matching objects, and each of them is uniquely identified by an object identifier (OID).

2. $E^S \subseteq V^S \times V^S$ is the finite set of edges. Each edge denotes a relation between two vertices.

3. $Lab^S = \{Lab_V^S, Lab_E^S\}$ is the finite constant collection of labels. The labels are strings describing the properties of vertices and edges. $Lab_V^S$ is the finite collection of vertex labels; $Lab_E^S$ is the finite collection of edge labels.

4. $r_V^S \subseteq V^S \times Lab_V^S$ is a relation associating labels with vertices, i.e., $r_V^S$ is the set of couples $(v_i, l)$ such that vertex $v_i$ is labeled by $l$. $r_V^S$ is called the vertex feature of $S$.

5. $r_E^S \subseteq E^S \times Lab_E^S$ is a relation associating labels with edges, i.e., $r_E^S$ is the set of triples $(v_i, v_j, l)$ such that edge $(v_i, v_j)$ is labeled by $l$. $r_E^S$ is called the edge feature of $S$.

6. $descr(S) = r_V^S \cup r_E^S$ is the set of all vertex and edge features of the schema $S$; it completely describes $S$.

7. $|V^S| = n$ is the cardinality of schema $S$.

In Table 1, we show the correspondences between multi-labeled schema and multi-labeled graph:

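Multi-labeled schema | Multi-labeled graph | domain | codomain
$I^S$ | $V^S$ | - | -
$Lab^S$ | $Lab^S$ | - | -
$R^S$ | $E^S$ | - | -
$F^S$: $f(V^S) \to Lab_V^S$ | $r_V^S \subseteq V^S \times Lab_V^S$ | $V^S$ | $Lab_V^S$
$F^S$: $f(V^S \times V^S) \to Lab_E^S$ | $r_E^S \subseteq E^S \times Lab_E^S$ | $V^S \times V^S$ | $Lab_E^S$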

Table 1. The correspondences between multi-labeled schema and multi-labeled graph

Encode schemas into labeled graphs

Encoding rules

For encoding relational schemas, XML schemas, SQL views, etc. as multi-labeled graphs, we use the following rules:

1. A vertex of the graph represents a prepared matching object of the schema. V is the vertex set, which comprises all prepared matching objects of the schema;

2. The labels of a vertex are the properties of the corresponding prepared matching object;

3. An edge represents a relation between two prepared matching objects of the schema. E is the edge set, which comprises all relations of the schema ($E \subseteq V \times V$);

4. The labels of an edge describe the relation between its two prepared matching objects, such as is-a, part-of, etc.
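The encoding rules can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the nested-dictionary input format and the names `MultiLabeledGraph` and `encode_schema` are assumptions made for this example, and every edge is given the single label part-of.

```python
from dataclasses import dataclass, field


@dataclass
class MultiLabeledGraph:
    """Hypothetical container for the parts of Definition 2."""
    vertices: list = field(default_factory=list)  # OIDs (V)
    edges: set = field(default_factory=set)       # (parent OID, child OID) (E)
    r_v: set = field(default_factory=set)         # (OID, label) couples
    r_e: set = field(default_factory=set)         # (OID, OID, label) triples

    def descr(self):
        """All vertex and edge features: descr(S) = r_V union r_E."""
        return self.r_v | self.r_e


def encode_schema(tree, prefix):
    """Encode a nested schema description as a multi-labeled graph.

    `tree` uses an assumed input format:
    {"labels": [...], "children": [...]}.
    """
    graph = MultiLabeledGraph()

    def visit(node, parent_oid):
        oid = f"{prefix}{len(graph.vertices) + 1}"
        graph.vertices.append(oid)                 # rule 1: one vertex per object
        for label in node["labels"]:
            graph.r_v.add((oid, label))            # rule 2: properties as labels
        if parent_oid is not None:
            graph.edges.add((parent_oid, oid))     # rule 3: relation as edge
            graph.r_e.add((parent_oid, oid, "part-of"))  # rule 4: edge label
        for child in node.get("children", []):
            visit(child, oid)

    visit(tree, None)
    return graph


address = {
    "labels": ["Address", "address", "ElementType"],
    "children": [
        {"labels": ["Street", "street", "element"]},
        {"labels": ["City", "city", "element"]},
    ],
}
g = encode_schema(address, "s")
print(g.vertices)
print(sorted(g.r_e))
```

Vertices are numbered in pre-order here; any scheme that gives each prepared matching object a unique OID would serve equally well.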

Motivating Scenario

SMP is a critical problem for interoperability among heterogeneous information sources and plays a key role in many database applications. In this section, we introduce a real-life e-business scenario to illustrate our algorithmic framework.



<Schema name="Schema S" xmlns="urn:schemas-microsoft-com:xml-data">
  <ElementType name="AccountOwner">
    <element type="Name"/>
    <element type="Address"/>
    <element type="Birthdate"/>
    <element type="TaxExempt"/>
  </ElementType>
  <ElementType name="Address">
    <element type="Street"/>
    <element type="City"/>
    <element type="State"/>
    <element type="ZIP"/>
  </ElementType>
</Schema>

<Schema name="Schema T" xmlns="urn:schemas-microsoft-com:xml-data">
  <ElementType name="Customer">
    <element type="CFname"/>
    <element type="CLname"/>
    <element type="CAddress"/>
  </ElementType>
  <ElementType name="CustomerAddress">
    <element type="Street"/>
    <element type="City"/>
    <element type="Province"/>
    <element type="PostalCode"/>
  </ElementType>
</Schema>

Consider a multinational company with two subsidiaries located in different countries (company A in area S and company B in area T) that want to share and interoperate their customer information through Web services. The XML description schemas are deployed on each subsidiary's own XML Web services. However, the XML schemas used by a company undergo periodic changes due to the dynamic nature of its business, and matching the schemas manually is tiresome and costly. Moreover, suppose company A changes the structure of its customer information database and changes its XML schema accordingly, but does not notify company B to update its Web service. For interoperation to continue successfully, the two Web agents then have to match their schemas again automatically, without manual intervention. Automatic schema matching can thus improve the reliability and usability of Web services. The two XML schemas are shown in Fig.1; they are based on the BizTalk Schema specification, where Schema S is used by company A and Schema T is deployed by company B.

Fig. 1 Two XML schemas (BizTalk)

First, from Fig.1, we can obtain the vertices and edges of the two schemas, which are shown in Fig.2, where $V^S = \{s_1, s_2, \ldots, s_{11}\}$ and $V^T = \{t_1, t_2, \ldots, t_{10}\}$.

Fig. 2 Vertices and edges of S and T

Then, by Definition 2, Table 2 shows the labels of the vertices. $Lab_V$ includes the name set $Lab_{Vname}$, the concept set $Lab_{Vconcept}$, and the type set $Lab_{Vtype}$. In the same way, $Lab_E$ contains the labels of the edges. Here, $Lab_E$ = {part-of}.

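[Fig. 2 tree diagrams. In S, s1 is the root with children s2 and s7; s2 has children s3, s4, s5, s6 and s7 has children s8, s9, s10, s11. In T, t1 is the root with children t2 and t6; t2 has children t3, t4, t5 and t6 has children t7, t8, t9, t10.]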



OID | $Lab^S_{Vname}$ | $Lab^S_{Vconcept}$ | $Lab^S_{Vtype}$ | $Lab^T_{Vname}$ | $Lab^T_{Vconcept}$ | $Lab^T_{Vtype}$
1 | Schema S | Schema | Schema | Schema T | schema | Schema
2 | AccountOwner | account + owner | ElementType | Customer | customer | ElementType
3 | Name | name | element | CFname | first name | element
4 | Address | address | element | CLname | last name | element
5 | Birthdate | birthdate | element | CAddress | address | element
6 | TaxExempt | tax-exempt | element | CustomerAddress | address | ElementType
7 | Address | address | ElementType | Street | street | element
8 | Street | street | element | City | city | element
9 | City | city | element | Province | province | element
10 | State | state | element | PostalCode | postal code | element
11 | ZIP | ZIP | element | - | - | -

(The first three label columns are the labels of schema S; the last three are the labels of schema T.)

Table 2. The labels of vertices

Multi-labeled Graph Matching

Schema Matching

The goal of schema matching is to find the semantic correspondences between the elements of two schemas. We describe SMP informally as follows:

Problem 1. SMP

Instance: Given two schemas $S = (V^S, E^S, Lab^S, r_V^S, r_E^S)$ and $T = (V^T, E^T, Lab^T, r_V^T, r_E^T)$, where S is a source schema and T is a target schema.

Question: Find the semantic correspondences between the vertices in $V^S$ and $V^T$.

In [24, 26], Zhang et al. investigated a formal framework for SMP and proposed the concept of individual matching: if one or more labels of a vertex s in S are semantically related to the corresponding labels of a vertex t in T, or the relations of s and the relations of t are semantically equivalent, then s and t are said to be matched. Based on the definition of individual matching, they presented an important notion: multivalent matching.

Multivalent Matching

Multivalent matching: a vertex of one schema may be associated with a set of vertices of another schema, which can characterize the many-to-many matching between two schemas [25].

Definition 3. If S is the source schema and T is the target schema, the matching result of the two schemas is a set $m \subseteq V^S \times V^T$ that contains every matched couple $(s, t) \in V^S \times V^T$.

The matching couples are called multivalent correspondences. They are binary relationships that establish many-to-many correspondences between the vertices of two schemas.

Matching State and Matching Space

By Definition 3, the matching result can be represented as a relation set m. We introduce an equivalent concept, the matching matrix or matching state, to denote a solution of schema matching.

Definition 4. (Matching Matrix) Let S be the source schema and T the target schema, where $V^S = \{s_1, s_2, \ldots, s_n\}$ and $V^T = \{t_1, t_2, \ldots, t_k\}$ denote the vertex sets of S and T respectively. A matching state m is an $n \times k$ 0-1 matrix:

$$
m = \begin{bmatrix}
m_{1,1} & m_{1,2} & \cdots & m_{1,k} \\
m_{2,1} & m_{2,2} & \cdots & m_{2,k} \\
\vdots & \vdots & \ddots & \vdots \\
m_{n,1} & m_{n,2} & \cdots & m_{n,k}
\end{bmatrix},
\quad m_{i,j} \in \{0, 1\},\quad n = |V^S|,\ k = |V^T| \qquad (1)
$$

where $m_{i,j} = 1$ denotes that $s_i$ and $t_j$ are matched and $m_{i,j} = 0$ denotes that $s_i$ and $t_j$ are unmatched. All the matching couples together compose the result of schema matching.

Given an assignment to a $|V^S| \times |V^T|$ matrix, we obtain a possible matching result of two schemas. All of these matching states constitute the matching space.

Definition 5. (Matching Space) Let S be the source schema and T the target schema, where $V^S = \{s_1, s_2, \ldots, s_n\}$ and $V^T = \{t_1, t_2, \ldots, t_k\}$ denote the vertex sets of S and T respectively. All the assignments of a $|V^S| \times |V^T|$ (i.e., $n \times k$) matrix constitute the matching space M, where

$$
M = \{\, m : m_{i,j} \in \{0, 1\},\ i \in \{1, 2, \ldots, n\},\ j \in \{1, 2, \ldots, k\} \,\} \qquad (2)
$$

The scale of the matching space is the number of matching states: $|M| = 2^{n \times k}$.
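To make Definitions 3 to 5 concrete, the following Python sketch converts between the set form and the matrix form of a matching state and enumerates a tiny matching space. The function names and the three-couple matching are illustrative assumptions, not part of the paper.

```python
from itertools import product


def matching_to_matrix(m, source, target):
    """Turn a set of matched couples (s, t) into an n x k 0-1 matrix."""
    return [[1 if (s, t) in m else 0 for t in target] for s in source]


def matrix_to_matching(matrix, source, target):
    """Inverse direction: read the matched couples off the matrix."""
    return {(s, t)
            for i, s in enumerate(source)
            for j, t in enumerate(target)
            if matrix[i][j] == 1}


def matching_space(n, k):
    """Enumerate every n x k 0-1 matrix (only feasible for tiny n and k)."""
    for bits in product((0, 1), repeat=n * k):
        yield [list(bits[i * k:(i + 1) * k]) for i in range(n)]


source = ["s1", "s2"]
target = ["t1", "t2", "t3"]
m = {("s1", "t1"), ("s2", "t2"), ("s2", "t3")}  # s2 is matched twice

matrix = matching_to_matrix(m, source, target)
print(matrix)                                  # [[1, 0, 0], [0, 1, 1]]
print(sum(1 for _ in matching_space(2, 3)))    # 2 ** (2 * 3) = 64
print(2 ** (11 * 10))                          # |M| for n = 11, k = 10
```

The last line shows why exhaustive enumeration is hopeless even for the small schemas of Fig. 1, where n = 11 and k = 10.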

Schema Homomorphism

Zhang et al. [24] prove that two schemas S and T are matched iff there exists a semantic homomorphism from S to T, $S \to T$. Table 1 shows the correspondences between the multi-labeled schema and the multi-labeled graph; here, we present the definition of schema homomorphism based on the multi-labeled graph:



Definition 6. A schema homomorphism (SHOM) $\varphi: S \to T$ from the source schema S to the target schema T is a mapping $\varphi: V^S \to V^T$, which is a set of multivalent correspondences such that:

Condition 1. There exists a labeling function symbol f of arity n such that

$$
f^S(s_1, \ldots, s_n) = l_n^S \ \Rightarrow\ f^T(\varphi(s_1), \ldots, \varphi(s_n)) = \varphi(l_n^S), \quad \text{for } s_1, \ldots, s_n \in V^S,\ l_n^S \in Lab^S
$$

Condition 2. There exists a semantic relation symbol R of arity m such that

$$
R^S(s_1, \ldots, s_m) \text{ holds} \ \Rightarrow\ R^T(\varphi(s_1), \ldots, \varphi(s_m)) \text{ holds}, \quad \text{for } s_1, \ldots, s_m \in V^S
$$

Because we use multi-labeled graphs to represent the schemas, the schema homomorphism problem is reduced to a graph homomorphism problem, also called the graph matching problem. We now formulate SMP as a multi-labeled graph matching problem, which is an NP-hard problem [2, 6]. Based on SHOM and the multi-labeled graph model, solving SMP means finding a semantic homomorphism between two multi-labeled graphs; the homomorphic mapping contains the matching correspondences between the two graphs S and T. In the remaining sections, we develop a practical matching algorithm to solve this intractable problem.

Similarity of Multi-labeled Graphs Based on Contrast Model

There are many methods to compare two graphs, such as graph isomorphism, subgraph isomorphism, graph edit distance, maximum common subgraph, and iterative methods [2, 4, 20, 25]. In this paper, we use the multi-labeled graph as the meta-model for schemas, so we study how to achieve schema matching by multi-labeled graph matching. Champin and Solnon [6] proposed a generic method to measure the similarity of two directed multi-labeled graphs, which is based on the features of vertices and edges, i.e., on the Contrast Model of Tversky [21]. Based on the Contrast Model, we investigate a schema matching approach that relies on the common features of two multi-labeled graphs.

Similarity Measure: Contrast Model

In [21], Tversky proposed a similarity approach: Contrast Model. In this model, entities are represented as a collection of features (e.g., object a can be represented by a feature set A), and similarity between objects a and b can be computed by:

$$
sim_{Tversky}(a, b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A) \qquad (3)
$$



The similarity of A to B is expressed as a linear combination of the measures of the common and distinctive features. The term $A \cap B$ represents the features that A and B have in common, $(A - B)$ represents the features that A has but B does not, and $(B - A)$ represents the features of B that are not in A. $\theta$, $\alpha$, and $\beta$ are weights for the common and distinctive components, and the function f is often simply assumed to be additive. A feature may be any property, characteristic or aspect of an object [12]. Fig.3 shows the basic principle of the Contrast Model.

[Fig. 3 diagram: two overlapping feature sets, Object a (A) and Object b (B), with regions labeled "Unique to a", "common", and "Unique to b".]

Fig.3 Representation of two objects that each contains its own unique features and also contains common features. An important aspect of Tversky's model is that similarity depends not only on the proportion of features common to the two objects but also on their unique features (i.e., the differences between two objects). Each letter here represents a feature.

A number of models are similar to the Contrast Model in that they base similarity on features and use some combination of $A \cap B$, $(A - B)$, and $(B - A)$. Sjoberg proposes that similarity is defined as $f(A \cap B) / f(A \cup B)$, Eisler and Ekman claim that similarity is proportional to $f(A \cap B) / (f(A) + f(B))$, and Bush and Mosteller define similarity as $f(A \cap B) / f(A)$. These models differ from the Contrast Model by applying a ratio function as opposed to a linear contrast of common and distinctive features [12]. All three can be considered specializations of the general equation:

$$
sim_{Tversky}(a, b) = \frac{f(A \cap B)}{f(A \cup B) - \alpha f(A - B) - \beta f(B - A)} \qquad (4)
$$
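A small Python sketch may help to compare Eq.3 and Eq.4. It assumes that f is plain set cardinality and uses made-up feature sets; the function names are hypothetical. Under that assumption, Eq.4 gives Sjoberg's ratio for α = β = 0, Bush and Mosteller's ratio for α = 0 and β = 1, and twice Eisler and Ekman's ratio for α = β = 1/2.

```python
def contrast_similarity(a, b, theta=1.0, alpha=1.0, beta=1.0):
    """Eq. 3 with f taken as set cardinality (an assumption)."""
    return theta * len(a & b) - alpha * len(a - b) - beta * len(b - a)


def ratio_similarity(a, b, alpha=0.0, beta=0.0):
    """Eq. 4 with f taken as set cardinality (an assumption).

    Raises ZeroDivisionError if the denominator vanishes.
    """
    denominator = len(a | b) - alpha * len(a - b) - beta * len(b - a)
    return len(a & b) / denominator


A = {"c", "d", "e", "f"}   # features of object a
B = {"e", "f", "g"}        # features of object b

print(contrast_similarity(A, B))          # 2 - 2 - 1 = -1.0
print(ratio_similarity(A, B))             # |A n B| / |A u B| = 2/5
print(ratio_similarity(A, B, beta=1.0))   # |A n B| / |A| = 2/4
print(ratio_similarity(A, B, 0.5, 0.5))   # 2|A n B| / (|A| + |B|) = 4/7
```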

Features of Schema

The features of a schema are composed of all the properties of the schema. In Section 3, we presented the meta-model of schemas, i.e., the multi-labeled graph model. A schema can be represented by a graph in which each vertex and edge can be associated with a set of labels describing its properties. Therefore, the feature set of a schema consists of the vertices and edges of the schema together with their labels (by Definition 2, a schema S is described by the set descr(S) of all its vertex and edge features). For example, the feature sets of schemas S and T are as follows:

1. Features of S: $descr(S) = r_V^S \cup r_E^S$

$r_V^S$ = {(s1, Schema S), (s1, schema), (s1, Schema); (s2, AccountOwner), (s2, account + owner), (s2, ElementType); (s3, AccountOwner.Name), (s3, name), (s3, element); …; (s11, Address.ZIP), (s11, ZIP), (s11, element)}

$r_E^S$ = {(s1, s2, part-of), (s2, s3, part-of), …, (s7, s11, part-of)}

2. Features of T: $descr(T) = r_V^T \cup r_E^T$

$r_V^T$ = {(t1, Schema T), (t1, schema), (t1, Schema); (t2, Customer), (t2, customer), (t2, ElementType); …; (t10, PostalCode), (t10, postal code), (t10, element)}

$r_E^T$ = {(t1, t2, part-of), (t2, t3, part-of), …, (t6, t10, part-of)}

Common features with respect to a matching state

Based on the Contrast Model, the similarity of two different schemas S and T depends on the common features of descr(S) and descr(T). Given a matching state $m \in M$, we can calculate the common features of descr(S) and descr(T):

$$
\begin{aligned}
descr(S) \cap_m descr(T) ={}
& \{ (s_a, l) \in r_V^S \mid \exists\, m_{a,j} = 1,\ (t_j, l) \in r_V^T,\ a \in \{1, 2, \ldots, n\},\ j \in \{1, 2, \ldots, k\} \} \\
\cup\ & \{ (t_b, l) \in r_V^T \mid \exists\, m_{i,b} = 1,\ (s_i, l) \in r_V^S,\ b \in \{1, 2, \ldots, k\},\ i \in \{1, 2, \ldots, n\} \} \\
\cup\ & \{ (s_c, s_{c'}, l) \in r_E^S \mid \exists\, m_{c,j} = 1,\ \exists\, m_{c',j'} = 1,\ (t_j, t_{j'}, l) \in r_E^T,\ c, c' \in \{1, 2, \ldots, n\},\ c \neq c' \} \\
\cup\ & \{ (t_c, t_{c'}, l) \in r_E^T \mid \exists\, m_{i,c} = 1,\ \exists\, m_{i',c'} = 1,\ (s_i, s_{i'}, l) \in r_E^S,\ c, c' \in \{1, 2, \ldots, k\},\ c \neq c' \}
\end{aligned}
\qquad (5)
$$
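As an illustration of Eq.5, the following Python sketch computes the common features for a matching state given as a set of couples. It is a direct, unoptimized reading of the definition; the function name and the small fragment of S and T (labels taken from Table 2) are choices made for this example.

```python
def common_features(m, r_v_s, r_e_s, r_v_t, r_e_t):
    """Common features of descr(S) and descr(T) under a matching m (Eq. 5).

    m is a set of matched couples (s, t); r_v_* are sets of (vertex, label)
    couples and r_e_* are sets of (vertex, vertex, label) triples.
    """
    common = set()
    # Vertex features: a label carried by both vertices of a matched couple.
    for s, t in m:
        labels_s = {lab for v, lab in r_v_s if v == s}
        labels_t = {lab for v, lab in r_v_t if v == t}
        for lab in labels_s & labels_t:
            common.add((s, lab))
            common.add((t, lab))
    # Edge features: equally labeled edges whose endpoints are matched pairwise.
    for s1, s2, lab in r_e_s:
        for t1, t2, lab_t in r_e_t:
            if lab == lab_t and (s1, t1) in m and (s2, t2) in m:
                common.add((s1, s2, lab))
                common.add((t1, t2, lab_t))
    return common


# Fragment of S (Address, Street) and of T (CustomerAddress, Street).
r_v_s = {("s7", "Address"), ("s7", "address"), ("s7", "ElementType"),
         ("s8", "Street"), ("s8", "street"), ("s8", "element")}
r_e_s = {("s7", "s8", "part-of")}
r_v_t = {("t6", "CustomerAddress"), ("t6", "address"), ("t6", "ElementType"),
         ("t7", "Street"), ("t7", "street"), ("t7", "element")}
r_e_t = {("t6", "t7", "part-of")}

m = {("s7", "t6"), ("s8", "t7")}
common = common_features(m, r_v_s, r_e_s, r_v_t, r_e_t)
print(len(common))   # 10 vertex features + 2 edge features = 12
```

Only the name labels Address and CustomerAddress fail to become common features under this matching, because they differ as strings.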

Splits with respect to a matching state

For SMP, we allow a vertex of S to be matched with a set of vertices of T. Therefore, given a multivalent mapping m, we also have to identify the set of split vertices, i.e., the set of vertices that are mapped to more than one vertex, each split vertex v being associated with the set $p_v$ of its mapped vertices:

$$
\begin{aligned}
splits(m) ={}
& \{ (s, p_s) \mid s \in V^S,\ |p_s| \geq 2,\ p_s = \{ t \in V^T : m_{s,t} = 1 \},\ t \in \{1, 2, \ldots, k\} \} \\
\cup\ & \{ (t, p_t) \mid t \in V^T,\ |p_t| \geq 2,\ p_t = \{ s \in V^S : m_{s,t} = 1 \},\ s \in \{1, 2, \ldots, n\} \}
\end{aligned}
\qquad (6)
$$

A more detailed discussion can be found in [6, 25], and Section 5.7.2 gives an example that explains why the set of split vertices has to be identified.
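Eq.6 can be read as the following Python sketch. It assumes that the vertex identifiers of S and T are disjoint (as with the s and t prefixes used here), and the matching in which one vertex of S is matched to two vertices of T is invented purely for illustration.

```python
from collections import defaultdict


def splits(m):
    """Split vertices of a multivalent matching m (Eq. 6).

    Returns {vertex: frozenset of its matched vertices} for every vertex of
    either schema that is matched to two or more vertices. Assumes the
    vertex identifiers of the two schemas do not overlap.
    """
    partners_of_source = defaultdict(set)
    partners_of_target = defaultdict(set)
    for s, t in m:
        partners_of_source[s].add(t)
        partners_of_target[t].add(s)
    result = {}
    for table in (partners_of_source, partners_of_target):
        for vertex, partners in table.items():
            if len(partners) >= 2:
                result[vertex] = frozenset(partners)
    return result


# Assumed matching: s3 is matched to both t3 and t4.
m = {("s2", "t2"), ("s3", "t3"), ("s3", "t4")}
print(splits(m))
```

Here only s3 is reported, together with the set containing t3 and t4; a one-to-one matching yields an empty result.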



Similarity function

By Eq.4, letting $\alpha = 0$ and $\beta = 0$, the similarity of S and T with respect to a matching state m is defined by:

$$
sim_m(S, T) = \frac{f(descr(S) \cap_m descr(T)) - g(splits(m))}{f(descr(S) \cup descr(T))} \qquad (7)
$$

where f and g are two functions that are introduced to weigh features and splits, depending on the desired application. Indeed, we can design different similarity functions to compare two schemas based on the variants of the Contrast Model, i.e., we can obtain other similarity measures from Eq.4. Here, f and g are weighted cardinality functions:

$$
f(descr(S)) = w_{name} \cdot |r_{Vname}^S| + w_{concept} \cdot |r_{Vconcept}^S| + w_{type} \cdot |r_{Vtype}^S| + w_E \cdot |r_E^S| \qquad (8)
$$

$$
g(splits(m)) = w' \cdot |splits(m)| \qquad (9)
$$

Finally, the maximal similarity sim(S, T) of two schemas S and T is the greatest similarity with respect to all possible matching states:

$$
sim(S, T) = \max_{m \in M} \frac{f(descr(S) \cap_m descr(T)) - g(splits(m))}{f(descr(S) \cup descr(T))} \qquad (10)
$$

The denominator f(descr(S) ∪ descr(T)) of Eq.7 does not depend on the matching state; it is introduced to normalize the similarity value to the 0-1 range [6]. Hence, to compute the maximum similarity between two graphs S and T, one has to find the matching state m that maximizes the score function:

$$\mathit{score}_m(S,T) = f(\mathit{descr}(S) \cap_m \mathit{descr}(T)) - g(\mathit{splits}(m)) \tag{11}$$
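As a minimal sketch of Eqs.7-11, the weighted cardinality functions can be written in Python as follows. The `Weights` class, the function names, and the representation of a feature set by four counts (name, concept, type, edge) are assumptions made for this illustration; the default weights and the counts in the usage lines are those of the worked example in this section.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    name: float = 1.0
    concept: float = 4.0
    type: float = 1.0
    edge: float = 1.0
    split: float = 1.0  # w' in Eq.9


def f(counts, w):
    """Weighted feature count (Eq.8); counts = (name, concept, type, edge)."""
    n_name, n_concept, n_type, n_edge = counts
    return (w.name * n_name + w.concept * n_concept
            + w.type * n_type + w.edge * n_edge)


def g(n_splits, w):
    """Weighted split cardinality (Eq.9)."""
    return w.split * n_splits


def score(common, n_splits, w):
    """Score of a matching state (Eq.11)."""
    return f(common, w) - g(n_splits, w)


def sim(common, union, n_splits, w):
    """Similarity with respect to a matching state (Eq.7)."""
    return score(common, n_splits, w) / f(union, w)


w = Weights()
print(f((21, 21, 21, 19), w))  # features of descr(S) ∪ descr(T): 145.0
print(f((15, 19, 19, 17), w))  # common features under m1: 127.0
```

Because the denominator of Eq.7 is the same for every matching state, ranking states by `score` and ranking them by `sim` give the same order.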

The Best Matching State Based on Contrast Model

According to the similarity principles of the Contrast Model, for multi-labeled graphs, the more common features a matching state m has, the better the matching state is. Therefore, we propose the concept of the best matching state:

Definition 7. (The best matching state) Suppose that $S = (V_S, E_S, Lab_S, r_{V_S}, r_{E_S})$ is the source schema and $T = (V_T, E_T, Lab_T, r_{V_T}, r_{E_T})$ is the target schema. Then there exists a best matching state m such that:

$$\mathit{sim}_m(S,T) \ge \mathit{sim}_{m'}(S,T), \quad m \in M,\ m' \in M,\ m' \ne m$$

where $\mathit{sim}_m(S,T)$ and $\mathit{sim}_{m'}(S,T)$ can be computed by Eq.7.

In other words, the best matching state is one that maximizes the common features and minimizes the distinctive features.



The algorithmic model of SMP

Eq.10 gives the optimization function for multi-labeled graph matching. Based on it, we present the algorithmic model for SMP in Fig. 4. The purpose of the algorithm is to find the best matching state between two schemas.

Input: Schemas S and T
Objective: Find a matching state that maximizes the similarity of graphs G1 and G2, namely, find the best matching state
Output: The semantic correspondences between S and T
1. G1 = Multi_labeled_Graph(S); G2 = Multi_labeled_Graph(T);
2. Iteratively search the matching space to find a matching state m such that similarity_m(G1, G2) is the maximum among the matching states (Eq.10);
3. m is the matching result of G1 and G2.

Fig. 4 Algorithm model of SMP based on Multi-labeled graph matching

Examples

To illustrate the similarity computation, we take one matching state as an example. For the two schemas in Fig. 1, based on Fig. 2 and Table 2, we can obtain all the features of the two schemas:

$$\mathit{descr}(S) \cup \mathit{descr}(T) = \mathit{descr}(S) \cup \mathit{descr}(T)|_{concept} \ \cup\ \mathit{descr}(S) \cup \mathit{descr}(T)|_{name} \ \cup\ \mathit{descr}(S) \cup \mathit{descr}(T)|_{type} \ \cup\ \mathit{descr}(S) \cup \mathit{descr}(T)|_{edge}$$

$$f(\mathit{descr}(S) \cup \mathit{descr}(T)) = r^{name}_{V} + 4 \cdot r^{concept}_{V} + r^{type}_{V} + r_{E} = 21 + 4 \times 21 + 21 + 19 = 145,$$

where $w_{concept} = 4$, $w_{name} = 1$, $w_{type} = 1$, $w_E = 1$.

Remark: For Eq.8, $w_{concept} = 4$, because the semantic matching is stronger than the other label matchings [24]. Suppose that there is a mapping state of S and T:

$$m_1 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$

i.e., m1 = {(s1, t1), (s2, t2), (s3, t3), (s3, t4), (s4, t5), (s7, t6), (s8, t7), (s9, t8), (s10, t9), (s11, t10)}.

Similarity of Matching State

1. Common Features of Matching State



First, we calculate the common features of the two schemas based on m1.

a. We compare the name labels of the schemas to obtain the common name features. There are many methods to measure the similarity of names [7, 13, 19]. For example, using the Levenshtein distance (i.e., edit distance) to compare the name strings of vertices [7], we get sim_name(s1, t1) = 0.875. Suppose the threshold of name matching is th_name = 0.4; then the common name features of descr(S) and descr(T) are:

descr(S) ∩_{m1} descr(T)|_name = {(s1, Schema S), (s2, AccountOwner), (s3, AccountOwner.Name), (s4, AccountOwner.Address), (s7, Address), (s8, Address.Street), (s9, Address.City), (t1, Schema T), (t2, Customer), (t3, Customer.Cfname), (t4, Customer.Clname), (t5, Customer.CAddress), (t6, CustomerAddress), (t7, CustomerAddress.Street), (t8, CustomerAddress.City)}
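The name similarity value quoted above can be reproduced with a short Python sketch. The text does not state how the edit distance is turned into a similarity; the normalization used here (one minus the distance divided by the longer string length) is an assumption, chosen because it yields the reported value 0.875 for the names "Schema S" and "Schema T". The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer name."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


print(name_similarity("Schema S", "Schema T"))  # 0.875
```

With the threshold th_name = 0.4, a couple contributes its name features to the common features only if `name_similarity` is at least 0.4.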

b. To obtain the common concept features, we need to compare the concept labels of two vertices. We can use semantic distances such as hso, wup, res, lin, and jcn [17] to compare the similarity of two concepts. By wup [22], sim_concept(s2, t2) = 0.67. If th_concept = 0.55, then for m1 the common concept features of descr(S) and descr(T) are:

descr(S) ∩_{m1} descr(T)|_concept = {(s1, schema), (s2, account + owner), (s3, name), (s4, address), (s7, address), (s8, street), (s9, city), (s10, state), (s11, ZIP), (t1, schema), (t2, customer), (t3, first name), (t4, last name), (t5, address), (t6, address), (t7, street), (t8, city), (t9, province), (t10, postal code)}

Unlike the semantic matching method proposed by Giunchiglia et al. [11], our method is based on the structural relations of two concepts in WordNet. The semantic matching result of two elements is a real value between 0 and 1.

c. The common type features of two schemas:

descr(S) ∩_{m1} descr(T)|_type = {(s1, Schema), (s2, ElementType), (s3, element), (s4, element), (s7, ElementType), (s8, element), (s9, element), (s10, element), (s11, element), (t1, Schema), (t2, ElementType), (t3, element), (t4, element), (t5, element), (t6, ElementType), (t7, element), (t8, element), (t9, element), (t10, element)}

d. The common edge features of descr(S) and descr(T ):

descr(S) ∩_{m1} descr(T)|_edge = {(s1, s2, part-of), (s2, s3, part-of), (s2, s4, part-of), (s1, s7, part-of), (s7, s8, part-of), (s7, s9, part-of), (s7, s10, part-of), (s7, s11, part-of), (t1, t2, part-of), (t2, t3, part-of), (t2, t4, part-of), (t2, t5, part-of), (t1, t6, part-of), (t6, t7, part-of), (t6, t8, part-of), (t6, t9, part-of), (t6, t10, part-of)}

e. If we use all of the vertex and edge features of schemas together, we can get the intersection features of descr(S) and descr(T ) as follows:

$$\mathit{descr}(S) \cap_{m_1} \mathit{descr}(T) = \mathit{descr}(S) \cap_{m_1} \mathit{descr}(T)|_{concept} \ \cup\ \mathit{descr}(S) \cap_{m_1} \mathit{descr}(T)|_{name} \ \cup\ \mathit{descr}(S) \cap_{m_1} \mathit{descr}(T)|_{type} \ \cup\ \mathit{descr}(S) \cap_{m_1} \mathit{descr}(T)|_{edge}$$

$$f(\mathit{descr}(S) \cap_{m_1} \mathit{descr}(T)) = 15 + 4 \times 19 + 19 + 17 = 127$$

2. Splits of Matching State

The matching state m1 has the following splits:

$$\mathit{splits}(m_1) = \{(s_3, \{t_3, t_4\})\}, \qquad g(\mathit{splits}(m_1)) = |\mathit{splits}(m_1)| = 3$$

Here the cardinality counts the split vertex s3 together with its two mapped vertices t3 and t4.

3. Score and Similarity of Matching State

By Eq.11, the score of S and T based on m1 is:

$$\mathit{score}_{m_1}(S,T) = 127 - 3 = 124$$

Then, by Eq.7, the similarity of the two schemas S and T based on m1 is:

$$\mathit{sim}_{m_1}(S,T) = (127 - 3)/145 = 0.855$$

Comparison of Two Matching States

We now explain why the splits are calculated. Suppose that we add the matching couple (s7, t5) to m1; we then obtain the matching state m2 = {(s1, t1), (s2, t2), (s3, t3), (s3, t4), (s4, t5), (s7, t5), (s7, t6), (s8, t7), (s9, t8), (s10, t9), (s11, t10)}. If we do not consider the splits of m2, m2 has more common features than m1:

$$f(\mathit{descr}(S) \cap_{m_2} \mathit{descr}(T)) = 16 + 4 \times 20 + 19 + 17 = 132$$

Without the splits, the similarity of m2 would therefore be higher than the similarity of m1. However, if we compute the splits of the matching states, we get:

$$\mathit{splits}(m_2) = \{(s_3, \{t_3, t_4\}),\ (s_7, \{t_5, t_6\}),\ (t_5, \{s_4, s_7\})\}$$

$$\mathit{sim}_{m_2}(S,T) = (132 - 9)/145 = 0.848$$

We can see that, although m2 has more common features than m1, the similarity of m2 is lower than that of m1, because m2 has more splits. Hence the matching state m1 is better than m2.
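The comparison of m1 and m2 can be checked with a short Python sketch. It assumes, consistently with the values 3 and 9 used above, that each split vertex is counted together with the vertices it is mapped to; the helper name `split_size` is hypothetical, and the feature counts 127, 132 and 145 are taken directly from the text rather than recomputed.

```python
def split_size(pairs):
    """Split cardinality as used in the example: each split vertex counts
    1 plus the number of vertices it is mapped to."""
    by_s, by_t = {}, {}
    for s, t in pairs:
        by_s.setdefault(s, set()).add(t)
        by_t.setdefault(t, set()).add(s)
    groups = list(by_s.values()) + list(by_t.values())
    return sum(1 + len(group) for group in groups if len(group) >= 2)


m1 = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s3", "t4"), ("s4", "t5"),
      ("s7", "t6"), ("s8", "t7"), ("s9", "t8"), ("s10", "t9"), ("s11", "t10")}
m2 = m1 | {("s7", "t5")}

F_UNION = 145                   # f(descr(S) ∪ descr(T)) from the text
F_COMMON = {"m1": 127, "m2": 132}  # common-feature values from the text

sim1 = (F_COMMON["m1"] - split_size(m1)) / F_UNION
sim2 = (F_COMMON["m2"] - split_size(m2)) / F_UNION
print(split_size(m1), round(sim1, 3))  # 3 0.855
print(split_size(m2), round(sim2, 3))  # 9 0.848
```

Adding (s7, t5) gains 5 common features but creates two new splits worth 6, so the score drops from 124 to 123.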

Hybrid Search algorithm for Multi-labeled Graph Matching

Complete Search

Computing the maximum similarity of labeled graphs is highly combinatorial and is an NP-hard problem [2, 20]. It can be explored exhaustively with a “branch and bound” approach, an algorithmic technique that finds the optimal solution by keeping the best solution found so far: if a partial solution cannot improve on the best, it is abandoned. Such a complete approach is actually tractable if there exists a “good” bounding function that can detect as early as possible when a node can be pruned [6], i.e., when the score of all the matching states that can be constructed from the current state is worse than the best score found so far.

In Definition 5, we presented the concept of the matching matrix; these matching matrices are the matching states that constitute the matching space. Suppose the current matching state is m. All the matching states m' such that m' − m ≥ 0 are the potential successors of m, and m' is a superset of m, m' ⊇ m. The score function (Eq.11) is not monotonic with respect to set inclusion, i.e., the score of a mapping may either increase or decrease when one adds a new couple to it [6]. Indeed, this score is defined as a difference between a function of the common features and a function of the splits, and both sides of this difference may increase when a couple is added to a mapping. In [6], Champin and Solnon study the following bounding function: for every matching state m' ⊇ m,

$$\mathit{score}_{m'}(S,T) = f(\mathit{descr}(S) \cap_{m'} \mathit{descr}(T)) - g(\mathit{splits}(m')) \le f(\mathit{descr}(S) \cup \mathit{descr}(T)) - g(\mathit{splits}(m))$$

If f(descr(S) ∪ descr(T)) − g(splits(m)) is smaller than or equal to the score of the best matching state m_best found so far, then the search path can be pruned. In other words, the supersets of m will not be explored, as their score cannot be higher than the best score found so far.

Although the branch and bound approach can find the best matching state, it is a costly search process. Therefore, we look for approximate methods to solve schema matching. Local search has been applied to solve NP-hard optimization problems [1, 20, 25]. The principle of local search is to refine a given initial solution point in the solution space by searching through the neighborhood of that point. However, the performance of local search relies on the initial state, so we need a good initial state for local search. We therefore first use a greedy matching algorithm to find one.
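The pruning test of the branch and bound approach described above reduces to one comparison, sketched here in Python. The function names are hypothetical, and the numbers in the usage lines reuse the values of the worked example (f of the union is 145, best score found so far is 124) purely for illustration.

```python
def upper_bound(f_union, g_splits_m):
    """Bound on the score of every superset m' of the current state m:
    f(descr(S) ∪ descr(T)) - g(splits(m))."""
    return f_union - g_splits_m


def can_prune(f_union, g_splits_m, best_score):
    """True if no superset of m can score higher than the best state so far."""
    return upper_bound(f_union, g_splits_m) <= best_score


print(can_prune(145, 9, 124))   # False: the bound 136 still exceeds 124
print(can_prune(145, 30, 124))  # True: the bound 115 cannot beat 124
```

The bound is loose while few splits exist, which is one reason the complete search remains costly in practice.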

Greedy Matching Algorithm

The greedy strategy is a fundamental technique for solving NP-hard problems [23]. Based on [6], Zhang et al. [25] designed a greedy algorithm to solve SMP: it iteratively picks the couple that most increases the score function and has the greatest looked-ahead common edge features. The algorithm stops iterating when every couple neither directly increases the score function nor has looked-ahead common edge features. We now discuss the computation of the incremental score and the greedy strategies in detail. Here, the functions f and g are the cardinality functions, so we define the increment of common features as follows:

$$\Delta\mathit{descr}_m(s_i, t_j) = f(\mathit{descr}(S) \cap_{m \cup \{(s_i, t_j)\}} \mathit{descr}(T)) - f(\mathit{descr}(S) \cap_m \mathit{descr}(T)) \tag{12}$$



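$$\Delta\mathit{descr}_m(s_i, t_j) \le \begin{cases}
e + e, & \sum_{c=1}^{k} m_{i,c} = 1,\ \sum_{c=1}^{n} m_{c,j} = 1 \\
0 + e, & \sum_{c=1}^{k} m_{i,c} \ge 2,\ \sum_{c=1}^{n} m_{c,j} = 1 \\
e + 0, & \sum_{c=1}^{k} m_{i,c} = 1,\ \sum_{c=1}^{n} m_{c,j} \ge 2 \\
0 + 0, & \sum_{c=1}^{k} m_{i,c} \ge 2,\ \sum_{c=1}^{n} m_{c,j} \ge 2
\end{cases} \tag{13}$$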

where e is the number of vertex and edge features (i.e., the number of vertex and edge labels), $e = w_{name} + w_{concept} + w_{type} + w_E$. For the example in Section 5.7, e = 1 + 4 + 1 + 1 = 7. If all the features of s_i and t_j are matched, then $\Delta\mathit{descr}_m(s_i, t_j) = e + e$; if only part of the features of s_i and t_j are matched, then $\Delta\mathit{descr}_m(s_i, t_j) \le e + e$.

At the same time, the splits may increase when a couple $(s_i, t_j) \in V_S \times V_T$ enters m. We evaluate the increment of splits as follows:

$$\Delta\mathit{splits}_m(s_i, t_j) = g(\mathit{splits}(m \cup \{(s_i, t_j)\})) - g(\mathit{splits}(m)) \tag{14}$$

Since g is the cardinality function, we show the evaluation of splits in Eq.15:

$$\Delta\mathit{splits}_m(s_i, t_j) = \begin{cases}
0, & \sum_{c=1}^{k} m_{i,c} = 1,\ \sum_{c=1}^{n} m_{c,j} = 1,\ m_{i,j} = 1 \\
3, & \sum_{c=1}^{k} m_{i,c} = 2,\ \sum_{c=1}^{n} m_{c,j} = 1,\ m_{i,j} = 1 \\
3, & \sum_{c=1}^{k} m_{i,c} = 1,\ \sum_{c=1}^{n} m_{c,j} = 2,\ m_{i,j} = 1 \\
1, & \sum_{c=1}^{k} m_{i,c} > 2,\ \sum_{c=1}^{n} m_{c,j} = 1,\ m_{i,j} = 1 \\
1, & \sum_{c=1}^{k} m_{i,c} = 1,\ \sum_{c=1}^{n} m_{c,j} > 2,\ m_{i,j} = 1 \\
2, & \sum_{c=1}^{k} m_{i,c} > 2,\ \sum_{c=1}^{n} m_{c,j} > 2,\ m_{i,j} = 1
\end{cases} \tag{15}$$
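The case analysis of Eq.15 can be sketched in Python and checked against a direct recomputation of the split cardinality (Eq.14 with w' = 1). The sketch assumes a 0/1 matrix representation and an additive rule per side (0, 3 or 1 depending on the row or column sum after the couple is added). This additive rule reproduces the six listed cases; applying it to combinations that Eq.15 does not list (for example both sums equal to 2) is an assumption of this sketch. All function names are hypothetical.

```python
def split_size(m):
    """Split cardinality of a 0/1 matrix: every row or column with two or
    more 1s counts 1 (the split vertex) plus its number of 1s."""
    degrees = [sum(row) for row in m] + [sum(col) for col in zip(*m)]
    return sum(1 + d for d in degrees if d >= 2)


def side_delta(degree_after):
    """Increment contributed by one vertex, given its number of couples
    after the new couple has been added."""
    if degree_after <= 1:
        return 0  # still matched to a single vertex: no split
    if degree_after == 2:
        return 3  # a new split: the vertex plus its two mapped vertices
    return 1      # an existing split grows by one mapped vertex


def delta_splits(m, i, j):
    """Increment of the split cardinality when couple (s_i, t_j) enters m."""
    if m[i][j] != 0:
        raise ValueError("couple is already in the matching state")
    row_after = sum(m[i]) + 1
    col_after = sum(row[j] for row in m) + 1
    return side_delta(row_after) + side_delta(col_after)


m = [[1, 0, 0],
     [0, 0, 0]]
print(delta_splits(m, 0, 1))  # 3: s1 becomes a split vertex
print(delta_splits(m, 1, 1))  # 0: a plain one-to-one couple
```

Computing the increment from the row and column sums avoids rebuilding the whole split set for every candidate couple.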

Based on Eq.9 and Eq.11, we get the incremental score:

$$\Delta\mathit{score}_m(s_i, t_j) = \Delta\mathit{descr}_m(s_i, t_j) - \Delta\mathit{splits}_m(s_i, t_j) \tag{16}$$

As the complete search methods are not feasible, we use Eq.16 to design the following greedy search strategies:

1. The matching candidates are the zero elements of the current matching state m. If $m_{i,j} = 0$ and $\Delta\mathit{score}_m(s_i, t_j)$ is the maximum among the matching candidates, we set $m_{i,j} = 1$, i.e., $(s_i, t_j)$ enters m.

2. In addition, if several matching candidates have the maximum score value, we choose the one that has more potential common edge features, i.e., the one whose $\mathit{look\_ahead}(s_i, t_j)$ is maximal.



$$\begin{aligned}
\mathit{look\_ahead}(s_i, t_j) = {} & \{(s_i, s, l) \in r_{E_S} \mid \exists t \in V_T,\ (t_j, t, l) \in r_{E_T}\} \\
& \cup \{(t_j, t, l) \in r_{E_T} \mid \exists s \in V_S,\ (s_i, s, l) \in r_{E_S}\} \\
& \cup \{(s, s_i, l) \in r_{E_S} \mid \exists t \in V_T,\ (t, t_j, l) \in r_{E_T}\} \\
& \cup \{(t, t_j, l) \in r_{E_T} \mid \exists s \in V_S,\ (s, s_i, l) \in r_{E_S}\} \\
& - \mathit{descr}(S) \cap_{m \cup \{(s_i, t_j)\}} \mathit{descr}(T)
\end{aligned} \tag{17}$$

The look_ahead value indicates that this matching couple will increase the common edge features; therefore, we pick the couple that has the greatest look-ahead value.

3. If several matching couples have both the maximum score and the maximum look_ahead value, the algorithm picks one of these matching candidates at random.

In [25], the authors used an example to present the greedy matching process in detail. For this greedy algorithm, the computation of the function f has a polynomial time complexity of O((|V_S| × |V_T|)^2); the function g has a linear time complexity with respect to the size of the schemas, O(max(|V_S|, |V_T|)); the computation of the “look ahead” sets has a polynomial time complexity of O(|V_S| × |V_T|) and can be done incrementally [6]. Therefore, the greedy algorithm has a polynomial time complexity of O((|V_S| × |V_T|)^2). As with the greedy algorithm for the knapsack problem, at each step of the search process the greedy algorithm only selects the matching couple that maximizes the score function. The algorithm tracks only one search path and does not compare its result with the matching results obtained along other search paths. Therefore, the greedy algorithm is not a complete algorithm. Based on [25], we design a local search algorithm to improve the matching result.
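The greedy strategy described above can be sketched in Python as follows. This is a simplified illustration, not the algorithm of [25]: it applies only the first strategy (pick the couple with the largest score increment), omits the look-ahead and random tie-breaking, and stops when no couple increases the score. The score function is passed in as a callback; the toy gain matrix `GAIN` and all function names are hypothetical.

```python
def split_size(m):
    """Split cardinality of a 0/1 matching matrix (1 + degree per split)."""
    degrees = [sum(row) for row in m] + [sum(col) for col in zip(*m)]
    return sum(1 + d for d in degrees if d >= 2)


def greedy_match(n, k, score):
    """Start from the empty state and repeatedly add the couple that most
    increases score(m); stop when no couple gives a positive increment."""
    m = [[0] * k for _ in range(n)]
    current = score(m)
    while True:
        best_gain, best_cell = 0, None
        for i in range(n):
            for j in range(k):
                if m[i][j] == 0:          # candidates are the zero entries
                    m[i][j] = 1
                    gain = score(m) - current
                    m[i][j] = 0
                    if gain > best_gain:  # first maximum wins on ties
                        best_gain, best_cell = gain, (i, j)
        if best_cell is None:
            return m
        i, j = best_cell
        m[i][j] = 1
        current += best_gain


# Toy data: GAIN[i][j] is an assumed common-feature gain of couple (s_i, t_j).
GAIN = [[7, 0],
        [0, 7],
        [0, 5]]


def toy_score(m):
    common = sum(GAIN[i][j] * m[i][j] for i in range(3) for j in range(2))
    return common - split_size(m)


print(greedy_match(3, 2, toy_score))  # [[1, 0], [0, 1], [0, 1]]
```

In the toy run the third couple is accepted although it creates a split, because its gain of 5 outweighs the split increment of 3; a larger split weight would make the result one-to-one.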

Local Search for Multi-labeled Graph Matching

Local search is a class of effective approximation algorithms for combinatorial optimization [1], which tries to improve a solution by locally exploring its neighborhood. For every m ∈ M, the neighbors of m are the mapping states that can be obtained by adding one couple of vertices to m or removing one from it. The size of N(m) is $C^{1}_{n \times k} = n \times k$.

$$N(m) = \left\{ m' : |m' - m| = \sum_{i=1}^{n} \sum_{j=1}^{k} |m'_{i,j} - m_{i,j}| \le 1,\ m' \in M \right\} \tag{18}$$

If we allow l elements of the current state m to be changed at the same time, we obtain the neighborhood:

$$N_l(m) = \left\{ m' : |m' - m| = \sum_{i=1}^{n} \sum_{j=1}^{k} |m'_{i,j} - m_{i,j}| \le l,\ m' \in M \right\} \tag{19}$$



Therefore, the size of $N_l(m)$ is $C^{1}_{n \times k} + C^{2}_{n \times k} + \cdots + C^{l}_{n \times k}$. Fig. 5 shows the local search algorithm, which randomly selects a matching state in N(m).

function Local_Search(S, T)
begin
    k ← 0; m ← Greedy(S, T); M ← N(m); m_best ← m;
    while (k++ < Max_iteration) do /* search */
        if sim_m(S, T) > sim_m_best(S, T) then
            m_best ← m;
        end if
        choose randomly m ∈ M;
        M ← M − m;
        if M = ∅ then return m_best;
    end
    return m_best;
end

Fig. 5 Local Search for Schema Matching
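The procedure of Fig. 5 can be sketched in Python under a few assumptions: matching states are 0/1 matrices, the neighborhood contains the states that differ from the start state in exactly one entry (so its size is n × k), and the similarity is supplied as a callback. The names `neighbors`, `neighborhood_size`, `local_search` and `toy_sim`, as well as the fixed random seed, are hypothetical choices for this illustration.

```python
import math
import random


def neighbors(m):
    """All states that differ from m in exactly one entry, i.e. with one
    couple added or removed."""
    result = []
    for i in range(len(m)):
        for j in range(len(m[0])):
            flipped = [row[:] for row in m]
            flipped[i][j] = 1 - flipped[i][j]
            result.append(flipped)
    return result


def neighborhood_size(n, k, l=1):
    """Number of states reachable by changing 1..l entries of an n x k matrix."""
    return sum(math.comb(n * k, c) for c in range(1, l + 1))


def local_search(start, sim, max_iteration=5000, seed=0):
    """Draw neighbors of the start state at random without replacement and
    keep the best state seen, as in Fig. 5."""
    rng = random.Random(seed)
    pool = neighbors(start)
    best, best_sim = start, sim(start)
    for _ in range(max_iteration):
        if not pool:
            break
        candidate = pool.pop(rng.randrange(len(pool)))
        candidate_sim = sim(candidate)
        if candidate_sim > best_sim:
            best, best_sim = candidate, candidate_sim
    return best


def toy_sim(m):
    """Toy similarity: diagonal couples count +1, all other couples -1."""
    return sum(v if i == j else -v
               for i, row in enumerate(m) for j, v in enumerate(row))


start = [[0, 0], [0, 0]]
best = local_search(start, toy_sim)
print(toy_sim(best), len(neighbors(start)), neighborhood_size(2, 2, 2))  # 1 4 10
```

As in Fig. 5, the neighborhood is built once around the start state; re-centering the neighborhood on each improved state would be a different (iterated) variant.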

If we want to offer users all the feasible mapping states, we can set a threshold of schema similarity and keep the matching states whose similarity is greater than or equal to it. For instance, given a matching state m, if sim_m(S, T) ≥ th_sim, then m is a possible matching result.

Evaluation of Algorithm

Experimental Design

We have carried out some experiments to evaluate our approach. We tested the hybrid algorithm on seven samples: Biztalk (Fig. 1), Library (XML) [14], University (XML) [14], Property Listing (XML) [14], Purchase order (relational & XML) [8], Financial (XML) [27], Student (XML) [27]. The seven schemas are classified into three different kinds:

1. matching of XML schemas (Biztalk, Library, Property Listing, Financial, Student)

2. matching of XML schemas using XML data instances (University)

3. matching of relational schema and XML schemas (Purchase order)

Table 3 shows the scale of these samples, including the numbers of vertices and edges and the depth of the schema structure:



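| Sample | S: vertex | S: edge | S: depth | T: vertex | T: edge | T: depth |
|---|---|---|---|---|---|---|
| Biztalk | 11 | 10 | 3 | 10 | 9 | 3 |
| Library | 15 | 14 | 3 | 16 | 15 | 3 |
| University | 10 | 9 | 3 | 7 | 6 | 3 |
| Property | 12 | 11 | 4 | 13 | 12 | 4 |
| Purchase order | 13 | 12 | 2 | 9 | 8 | 3 |
| Financial | 14 | 13 | 3 | 14 | 13 | 5 |
| Student | 18 | 17 | 4 | 15 | 14 | 6 |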

Table 3 The number of vertex and edge

Parameter Tuning

Because schema matching is a heuristic operation, we should use some meta-strategies to adjust the parameters of the similarity evaluation function during the search process. The optimization function (Eq.10) is central to achieving the optimal matching result. To obtain the optimal matching results, we adjust the functions f and g, i.e., tune the weights of Eq.8 and Eq.9: the weight of splits w', the weights of the vertex features w_name, w_concept and w_type, and the weight of the edge features w_E.

First, for function f, the concept feature has a greater weight than the name and type features. If the concepts of two vertices are matchable, then the matching probability of the two vertices is higher than when only the names or types are matchable. For example, if w_concept = 4 and w' = 1, the greedy algorithm will obtain the couples (s3, t3) and (s3, t4); however, if w_concept ≤ 2 and w' = 1, the algorithm cannot obtain both (s3, t3) and (s3, t4), and will obtain (s5, t4).

Second, the function g determines the number of multivalent mappings: the greater the weight of g, the more difficult it is to obtain a multivalent mapping. For example, if w_concept = 4 and w' = 3, the algorithm cannot obtain many-to-many matchings and will obtain a one-to-one mapping result. By choosing different weights, we obtain a reasonable proportion between the functions f and g, and can then obtain the desired multivalent correspondences. Table 4 shows the weights of f and g that are used in the greedy and local search algorithms.

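| Schema | w_name | w_concept | w_type | w_E | w' |
|---|---|---|---|---|---|
| Biztalk/Purchase order | 1 | 4 | 1 | 1 | 1 |
| Library/Financial | 1 | 4 | 2 | 1 | 1.5 |
| University/Property | 1 | 4 | 1 | 2 | 1.5 |
| Student | 1 | 4 | 2 | 1 | 1 |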

Table 4. The weights of function f and g

In this paper, we consider only three vertex features of schemas (i.e., name, concept, and type). In the near future, we can consider more features of schemas for matching; we would then modify the functions f and g and adjust the weights of the different features.



The similarity measure of schemas based on Contrast Model and the multi-labeled graph is an open framework for schema matching. In light of different applications, we can choose appropriate features to describe schemas and encode these features as the labels of multi-labeled graph, then, we can use the hybrid matching method to obtain the desired matching result.

Experimental Results

We evaluate the “accuracy” of the algorithm by counting the number of needed adjustments; therefore, in this paper, we assume that the best matching state of two schemas is fixed in the matching tests. Under these conditions, the hybrid search algorithm performs very well. Fig. 6 shows the average Precision of the seven matching samples obtained by the hybrid algorithm; the total average Precision is 87%.

Fig. 6 Average Precision of matching samples by hybrid search

Fig. 7 Average Recall of matching samples by hybrid search

Moreover, the algorithm achieves a total average Recall of nearly 97.2% (see Fig. 7) and a total average Overall of 83.6%. Fig. 8 presents the average Precision, Recall, and Overall of the samples obtained by the hybrid algorithm.
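The quality measures reported above are not restated in this section. The following Python sketch assumes the definitions commonly used in schema matching evaluations (see, e.g., [9]): Precision and Recall are computed against a fixed reference matching, and Overall equals Recall × (2 − 1/Precision), which simplifies to (correct − wrong) / |reference|. The function name and the use of m1 as the reference matching are hypothetical choices for illustration, not results from the experiments.

```python
def match_quality(found, reference):
    """Return (precision, recall, overall) of a found set of couples with
    respect to a reference set of couples."""
    found, reference = set(found), set(reference)
    if not found or not reference:
        raise ValueError("both matchings must be non-empty")
    correct = len(found & reference)
    wrong = len(found - reference)
    precision = correct / len(found)
    recall = correct / len(reference)
    # Equivalent to recall * (2 - 1 / precision), without dividing by zero.
    overall = (correct - wrong) / len(reference)
    return precision, recall, overall


reference = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s3", "t4"),
             ("s4", "t5"), ("s7", "t6"), ("s8", "t7"), ("s9", "t8"),
             ("s10", "t9"), ("s11", "t10")}
found = reference | {("s7", "t5")}  # one superfluous couple

precision, recall, overall = match_quality(found, reference)
print(round(precision, 3), recall, overall)  # 0.909 1.0 0.9
```

Overall penalizes every wrong couple as much as it rewards a correct one, so it can become negative when Precision falls below 0.5.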



Fig. 8 Average quality of matching samples by hybrid search

The hybrid algorithm has been implemented in Visual C++. Experiment settings: P4 2.4G, 224M DDR RAM. On these tests, the algorithm is very fast and obtains feasible matching results. For the seven samples, Table 5 shows the number of iterations and the average running time of the greedy algorithm.

| | Biztalk | Library | University | Property | Purchase order | Financial | Student |
|---|---|---|---|---|---|---|---|
| Iteration | 12 | 20 | 10 | 21 | 9 | 18 | 25 |
| Time (s) | 0.310 | 0.712 | 0.045 | 0.325 | 0.312 | 0.452 | 0.798 |

Table 5. The average iterations of greedy algorithm

Table 6 shows the total average running times of the hybrid algorithm, where the maximum number of iterations of the local search is 5000, i.e., Max_iteration = 5000 (see Fig. 5).

| | Biztalk | Library | University | Property | Purchase order | Financial | Student |
|---|---|---|---|---|---|---|---|
| Time (s) | 0.680 | 1.846 | 0.357 | 1.003 | 0.640 | 1.738 | 1.873 |

Table 6. The total average times of hybrid search algorithm

The similarity measure of the multi-labeled graph can combine all the properties and features of two schemas; in particular, the matching method considers the edge features of the two schemas, so the matching performance is better than that of existing prototypes. Compared with Similarity Flooding [14], Automatch [3], and LSD [10], the average “accuracy” is higher. In particular, for Biztalk, Library, University, and Property Listing, the matching results are better than those of Similarity Flooding. In addition, the time cost of our algorithm is lower than that of the other algorithms. With multi-labeled graph matching, the algorithm model can implement not only instance-level matching (University) but also schema-level matching (Biztalk, etc.). Moreover, with our matching method, users can obtain element-level n:m matching results. By tuning the parameters of the optimization function, the matching algorithm can obtain different matching results for users.



A Comparison between Greedy Matching and Hybrid Search

We use an experiment to compare the performance of greedy matching and hybrid search. For the seven samples in Section 7.1, Fig. 9 shows the average Precision, Recall, and Overall obtained by the greedy matching algorithm. Compared with the matching results of the hybrid algorithm (see Fig. 8), we can see that the quality of the matching results is improved by local search. The comparison between greedy matching and hybrid search is shown in Fig. 10.

Fig. 9 Average quality of matching samples by greedy matching

Fig. 10 Comparison of greedy matching and hybrid search

Comparative Results and Conclusions

Comparative Results

In Table 7, we compare the characteristics of five published matching methods with our Multi-labeled Graph Matching method.



| | Similarity Flooding [14] | LSD [10] | Cupid [13] | COMA [8] | S-Match [11] | Multi-labeled Graph Matching |
|---|---|---|---|---|---|---|
| Tested schema types | XML, relational | XML | XML, relational | XML | XML | XML, relational |
| Metadata representation | Directed labeled graph | XML schema trees | extended ER | DAG | tree | Multi-labeled graph |
| Match granularity | element / structure level | | element / structure level | | | element / structure level |
| Match cardinality | n:m | 1:1 | 1:1 and n:1 | 1:1 | 1:1 | n:m |
| Combination of matchers | hybrid | Composite matcher with automatic combination of matcher results | hybrid | hybrid, composite | semantic matcher | hybrid |
| Manual work / user input | user can adjust threshold weights | user-supplied matches for training sources; user can specify tuning parameters and integrity constraints to guide selection of match candidates | user can adjust threshold weights | user feedback matcher to capture match and mismatch information provided by the user, including corrected match results from the previous match iteration | - | user can adjust weights of objective function, threshold weights, choose initial matching candidates |
| Schema level match | syntactic | syntactic | syntactic | syntactic | semantic | syntactic / semantic |
| Instance level matchers | syntactic | syntactic | - | - | - | syntactic / semantic |
| Reuse / auxiliary information used | thesauri, glossaries | comparison with training matches; lookup for valid domain values | thesauri, glossaries | reuse, thesauri, glossaries | WordNet, thesauri, glossaries | WordNet, thesauri, glossaries |
| Pre-match effort | - | training, specifying domain synonyms, constraints | specifying domain synonyms | specifying domain synonyms | WordNet | specifying domain synonyms, WordNet |
| Subjectivity | 7 users | 1 user | 1 user | 1 user | 1 user | 5 users |
| Application area | metadata management | data integration with pre-defined global schema | data translation, but intended to be generic | integration of web data sources, data warehouse loading and XML message mapping | semantic integration | XML message mapping, semantic integration |
| Employed quality measures | Overall | Recall | - | Precision, Recall, Overall | Precision, Recall, Overall, F-measure | Precision, Recall, Overall |
| Precision | - | ~0.8 | - | 0.93 | ~1.0 | ~0.87 |
| Recall | - | 0.8 | - | 0.89 | ~0.88 | ~0.97 |
| Overall | ~0.6 | ~0.6 | - | 0.82 | ~0.88 | ~0.83 |
| F-measure | | | | | ~0.94 | |

Table 7. Characteristics of proposed schema match approaches

Conclusion and Future work

In this paper, we focus on how to formulate SMP as a combinatorial optimization problem and study approximate matching algorithms to solve this optimization problem. We first give the definition of the multi-labeled graph, which allows us to transform SMP into a multi-labeled graph matching problem. We present a similarity measure of multi-labeled graphs based on the Contrast Model, and we propose the best matching result based on the features of two schemas. Then, using the objective function of multi-labeled graph matching (Eq.10), we briefly discuss the branch and bound method, which is a complete algorithm for SMP. Because it is a costly method, we propose a hybrid matching algorithm to solve the graph matching problem, which combines the greedy matching algorithm with local search. The experimental results confirm that the hybrid algorithm is effective.

We mainly use three kinds of label features, i.e., name, concept, and type, and one kind of relation, i.e., part-of. Nevertheless, other features can also be attached as labels to the vertices and edges of the multi-labeled graph, and our matching method handles the extended multi-labeled graph effectively and easily. The multi-labeled graph model can flexibly integrate all the available features of schemas.

Therefore, first, we will use all features together to obtain more accurate matching results, such as data types and value ranges, uniqueness, optionality, relationship types and cardinalities, etc. Secondly, we will design some meta-heuristic strategies to tune the weights of the functions during the search process. We can also introduce fuzzy strategies to adjust the weights of Eq.8 and Eq.9, so that a desired matching state can be found quickly and accurately. Thirdly, for large-scale schema matching (XML, relational schema, etc.), we will design sub-labeled graph matching methods. We can first use schema segmentation to obtain subschemas and then use subgraph matching algorithms to achieve subschema matching. Moreover, we are going to design incremental algorithms for large-scale schema matching and a reuse framework for schema matching based on the CBR model.

Acknowledgment

This work was supported by the National 973 Information Technology and High-Performance Software Program of China under Grant No.G1998030408.

References

[1] E. Aarts, J. K. Lenstra. Local Search in Combinatorial Optimization. John Wiley & Sons, Chichester, 1997.

[2] E. Bengoetxea. Inexact Graph Matching Using Estimation of Distribution Algorithms. Ecole Nationale Supérieure des Télécommunications. 2002, PhD thesis.

[3] J. Berlin, A. Motro, Database Schema Matching Using Machine Learning with Feature Selection. LNCS 2348: 452-466.

[4] V. Blondel, L. Ninove, P. V. Dooren, Convergence of graph similarity algorithms, Proceedings of the 23rd Benelux Meeting on Systems and Control, Helvoirt, The Netherlands, paper FrP06-4, March 17-19, 2004.

[5] P. Bouquet, B. Magnini, L. Serafini, S. Zanobini, A SAT-Based Algorithm for Context Matching. LNAI 2680: 66-79.

[6] P. A. Champin, C. Solnon. Measuring the similarity of labeled graphs. Springer-Verlag, ICCBR 2003, LNAI 2689: 80–95.

[7] W. W. Cohen, P. Ravikumar, S. E. Fienberg, A Comparison of String Distance Metrics for Name-Matching Tasks. 2003, IJCAI-03: 3-78.



[8] H. H. Do, E. Rahm. COMA - A system for flexible combination of schema matching approaches. VLDB 2002.

[9] H. H. Do, S. Melnik, E. Rahm. Comparison of schema matching evaluations. LNCS 2693: 221–237.

[10] A. Doan, P. Domingos, A. Halevy, Learning to Match the Schemas of Data Sources: A Multistrategy Approach. Machine Learning. Kluwer Academic Publishers, 2003 (60):279-301.

[11] F. Giunchiglia, P. Shvaiko, M. Yatskevich. S-Match: An algorithm and an implementation of semantic matching. In Proceedings of ESWS'04.

[12] R. L. Goldstone. Similarity. MIT encyclopedia of the cognitive sciences. Cambridge, MA: MIT Press, 763-765.

[13] J. Madhavan, P. A. Bernstein, E. Rahm, Generic Schema Matching with Cupid. 27th VLDB Conference.

[14] S. Melnik, Generic model management - concepts and algorithms, Springer, 2004, LNCS 2967.

[15] S. Melnik, H. Garcia-Molina, E. Rahm, Similarity Flooding: A Versatile Graph Matching Algorithm. ICDE 2002.

[16] R. J. Miller, L. M. Haas, M. A. Hernández, Clio: Schema Mapping as Query Discovery. VLDB 2000.

[17] T. Pedersen, S. Patwardhan, J. Michelizzi, WordNet::Similarity - Measuring the Relatedness of Concepts. Proceedings of the Nineteenth National Conference on Artificial Intelligence, 2004, San Jose, CA.

[18] E. Rahm, P. A. Bernstein, On matching schemas automatically. Microsoft Research, Redmond, WA. Technical Report MSR-TR-2001-17, 2001.

[19] E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching. The VLDB Journal, 2001(10):334-350.

[20] S. Sorlin, C. Solnon: Reactive Tabu Search for Measuring Graph Similarity. GbRPR 2005: 172-182.

[21] A. Tversky. Features of similarity. Psychological Review, 84, 327-352.

[22] Z. B. Wu, M. Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994, 133-138.

[23] V. V. Vazirani, Approximation Algorithms, Springer-Verlag, Berlin, 2001.

[24] Z. Zhang, H. Y. Che, P. F. Shi, Y. Sun, J. Gu. An algebraic framework for schema matching. WAIM 2005, LNCS 3739.

[25] Z. Zhang, H. Y. Che, P. F. Shi, Y. Sun, J. Gu. Multi-labeled graph matching - An algorithm model for schema matching. ASIAN'05.

[26] Z. Zhang, H. Y. Che, P. F. Shi, Y. Sun, J. Gu. Schema homomorphism – An algebraic framework for schema matching. ASIAN'05.

[27] http://www.almaden.ibm.com/software/km/clio/clioxmldemo.shtml



Zhi Zhang received the B.S. and M.S. degrees in Mechanical Engineering from Sichuan University in 1998 and Southwest Jiaotong University in 2002, respectively. He is currently working towards the Ph.D. degree in Pattern Recognition and Intelligent Systems at the Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, China. His research interests include Metadata Management, Semantic Interoperability, Information/Data Integration, and Approximation Algorithms.

Haoyang Che received the B.S. and M.S. degrees in Electrical Engineering from Beijing Normal University in 1998 and 2001, respectively. He is currently working towards the Ph.D. degree in Computer Science with Institute of Software, the Chinese Academy of Sciences, China. His research interests include trust management, software engineering, and P2P networks.

Pengfei Shi received the Bachelor’s and Master’s degrees in electrical engineering from Shanghai Jiao Tong University (SJTU), Shanghai, China, in 1962 and 1965, respectively. In 1980, he joined the Institute of Image Processing and Pattern Recognition (IPPR), SJTU. During the past 23 years, he worked in the area of image analysis, pattern recognition, and visualization. He has published more than 80 papers. He is currently the director of the Institute of IPPR at SJTU and a professor of pattern recognition and intelligent systems on the Faculty of Electronic and Information Engineering. He is a senior member of the IEEE.

Jun Gu received the BS degree in electrical engineering from the University of Science and Technology of China in 1982 and the PhD degree in computer science from the University of Utah in 1989. He has been the associate editor-in-chief of the IEEE Computer Society Press Editorial Board, an associate editor of the IEEE Transactions on Knowledge and Data Engineering, the IEEE Transactions on VLSI Systems, the Journal of Global Optimization, the Journal of Combinatorial Optimization, and the Journal of Computer Science and Technology, and is on the advisory board of International Book Series on Combinatorial Optimization. He was a chair of the 1995 National Academy of Sciences Information Technology Forum and was a chair of the 1996 National Science Foundation special event in celebration of 25 years of research on the satisfiability problem. He is a member of the ACM, the ISA, the ISTS, the INNS, a senior member of the IEEE, and a life member of the AAAI.



The Connection, Communication, Consolidation, Collaboration Interoperability Framework (C4IF) for Information Systems Interoperability

Vassilios Peristeras, Greek National Center for Public Administration, [email protected]

Konstantinos Tarabanis, University of Macedonia, [email protected]

Abstract: In this paper, we review the growing literature of interoperability typologies for information systems and propose an interoperability typology framework to provide a synthesis. For this we employ a metaphor from linguistics. That is, we perceive the interaction amongst information systems as a discourse and use concepts from linguistics to outline different types of information systems interoperability. The derived framework has been named Connection, Communication, Consolidation and Collaboration Interoperability Framework (C4IF).

Introduction

The term “interoperability” has been heavily used – and sometimes misused – in the information systems literature. In this paper, we elaborate on the definition, the scope, the use and the types of interoperability. First, we present a set of definitions as given by established actors and give an overview of several interoperability typologies as introduced by various researchers in the field. We then propose our own classification framework of all interoperability types, which we call C4 Interoperability Framework (C4IF) from the initials of the words Connection, Communication, Consolidation, and Collaboration that constitute the core concepts of our framework. Last, we place the existing typologies vis-à-vis the C4IF framework and present its advantages when compared with the other approaches.

Defining interoperability

There are numerous definitions for interoperability in the literature. Instead of trying to add one more, we quote some of them:
• IEEE defines it as the ability of two or more systems or components to exchange information and to use the information that has been exchanged¹.
• Quoting from the IDABC, European Interoperability Framework, interoperability means the ability of information and communication technology (ICT) systems and of the business processes they support to exchange data and to enable sharing of information and knowledge².
• DARPA defines interoperability as (a) the ability of systems, units, or forces to provide services to and accept services from other systems, units or forces and to use the services so exchanged to enable them to operate effectively together, (b) the condition achieved among communications-electronics systems or items of communications electronics equipment when information or services can be exchanged directly and satisfactorily between them and/or their users³.
• TOGAF (The Open Group Architecture Framework) defined interoperability as: ‘(1) the ability of two or more systems or components to exchange and use shared information, and (2) the ability of systems to provide and receive services from other systems and to use the services so interchanged to enable them to operate effectively together’⁴.
• Vernadat defines interoperability as the ability to communicate with peer systems and to access their functionality⁵.

¹ IEEE (1990), IEEE (Institute of Electrical and Electronics Engineers): Standard Computer Dictionary - A Compilation of IEEE Standard Computer Glossaries
² IDABC (2004), European Interoperability Framework

Interoperability typologies

In this part, we present several classification frameworks that have been proposed to group together different types and aspects of interoperability. These typologies help clarify the term. A common feature identified in all typologies is an explicit or implicit evolutionary perspective. This means that there is an assumption that the various interoperability types follow a scale of advancement: the higher a type is placed in the scale, the more advanced the achieved interoperability is considered. For this reason, the interoperability types are sometimes called “levels”. As a result of the above observation, in the majority of these typologies there is a strict linearity: to reach an upper level of interoperability advancement, all the previous levels have to be successfully addressed. There are cases, though, where the linearity is looser. This means that certain features of an upper interoperability type may become available without fully addressing all the lower interoperability levels. For example, the organizational interoperability layer as introduced by the European Interoperability Framework exhibits a loose linearity with regard to the proposed lower semantic and technology interoperability layers.
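
The difference between strict and loose linearity can be sketched in a few lines of Python. This is an illustrative sketch only: the level names in `LEVELS`, the `achieved` set and both helper functions are assumptions introduced here and are not taken from any of the reviewed typologies.

```python
# Hypothetical ordered scale of interoperability levels (lowest first).
LEVELS = ["technical", "semantic", "organizational"]


def strict_level(achieved):
    """Strict linearity: a level counts only if every lower level is met."""
    reached = None
    for level in LEVELS:
        if level not in achieved:
            break
        reached = level
    return reached


def loose_levels(achieved):
    """Loose linearity: each level is reported independently of the others."""
    return [level for level in LEVELS if level in achieved]


# Organisations that cooperate well but lack shared semantics.
achieved = {"technical", "organizational"}
print(strict_level(achieved))  # technical
print(loose_levels(achieved))  # ['technical', 'organizational']
```

Under the strict reading the gap at the semantic level caps the result at "technical", whereas the loose reading still reports the organizational features that are already available.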

A short presentation of twelve interoperability typologies follows.

1) DARPA presented the Levels of Information System Interoperability (LISI) capabilities model [1] where a matrix structure was introduced with five interoperability maturity levels affecting four interoperability attributes. The levels introduced by LISI are the following:
• Isolated Systems: No physical connection exists (manual).
• Connected Systems: Electronically connected; separate data applications; homogeneous data exchange is possible (peer-to-peer).
• Distributed Systems: Minimal common functions; separate data & application; heterogeneous data exchange is possible (functional).
• Domain Systems: Shared data but separate applications; sophisticated collaboration (integrated).
• Enterprise Systems: Enterprise wide shared systems; advanced collaboration; interactive manipulation of shared data & applications (universal).

The attributes defined in LISI and affected by the above-presented maturity level are:
• Procedures
• Applications
• Infrastructure
• Data

³ Federal Standard 1037C, Department of Defense Dictionary of Military and Associated Terms in support of MIL-STD-188.
⁴ Open Group (2000), TOGAF: The Open Group Architecture Framework, Document No. 1910, Version 6.
⁵ Vernadat, F.B. (1996) Enterprise Modelling and Integration: principles and applications; Chapman & Hall, ISBN 0 412 60550 3

2) Within the context of the NATO C3 Technical Architecture (NC3TA) [2], the NC3TA Reference Model for Interoperability (NMI) is used. NMI uses the following categories:
• No Data Exchange: No physical connection exists
• Unstructured Data Exchange: Exchange of human-interpretable, unstructured data (free text)
• Structured Data Exchange: Exchange of human-interpretable structured data intended for manual and/or automated handling, but requiring manual compilation, receipt and/or message dispatch
• Seamless Sharing of Data: Automated data sharing within systems based on a common exchange model
• Seamless Sharing of Information: Universal interpretation of information through cooperative data processing

3) The Levels of Conceptual Interoperability Framework (LCIF) [3] defines five levels focusing on the data to be interchanged and the interface documentation that is available:
• 0-System Specific Data: No interoperability between two systems. Data are seen as a resource of the system, not meant to be shared with other systems.
• 1-Documented Data: Data is documented using a common protocol.
• 2-Aligned Static Data through metadata management: Data is documented using a common reference model based on a common ontology, common or shared reference models, and standardized data elements. However, the same object model can be used slightly or completely differently by different systems.
• 3-Aligned Dynamic Data: The use of the data within the federate/component is well defined using standard software engineering methods such as UML.
• 4-Harmonized Data and Processes: Semantic connections between data that are not related concerning the execution code are made obvious by documenting the conceptual model underlying the component. The systems model the same part of the real world and the same relationships.

4) J. Park and S. Ram [4] identified conflicts at (a) the data-level caused by multiple representations and interpretations of similar data (e.g. data-value, data representation, data precision and object versus attribute conflicts) and (b) the schema-level characterized by differences in logical structures and/or inconsistencies in metadata (i.e., schemas) of the same application domain (e.g. conflicts in naming, entity-identifiers, schema isomorphism, generalization, aggregation, schematic discrepancies).



5) Brutzman and Tolk [5] presented five levels of system interoperability:
• technically connected (technical level);
• use the same protocols to exchange data (syntactical level);
• know the context of the data in the form of unambiguous definitions of the entities, attributes and relations (semantic level);
• know how the information will be used when being transferred to a component (pragmatic level) and;
• know the functionality of the component within the common conceptual view of the world to ensure that assumptions and constraints are taken into account respectively (conceptual level).

6) MITRE [6], [7] has presented a matrix structure in order to document all types of interoperability mismatches. In one dimension six levels of interoperability are presented:
• Data
• Object
• Application
• System
• Enterprise
• Community

These levels are then positively correlated to three kinds of Integration:
• Syntactic
• Structural
• Semantic

Taxonomies are provided as examples of syntactic integration, database schemas of structural integration and theory of logic for semantic interoperability. Interestingly, semantic explicitness is positively linked to looseness of coupling. Thus, historically we move from tightly coupled to loosely coupled systems.

7) MITRE again presented [8] another framework for information interoperability, which defines four “problems levels”:
• Level 1: Overcome geographic distribution (infrastructure heterogeneity).
• Level 2: Match semantically compatible attributes. Some independently developed information systems use the same terms for the same concepts, but many don’t.
• Level 3: Mediate between diverse representations. Integrators must often reconcile different representations of the same concept.
• Level 4: Merge instances from multiple sources, through data correlation and data-value reconciliation (sometimes called fusion).

Two main types of information interoperability have been introduced:
• Exchange, in which a producer provides information to a consumer and the information is transformed to suit the consumer’s needs (levels 1-3).
• Integration, in which in addition to being transformed, information from multiple sources is also correlated and fused. In general, the consumer sees a single, coherent view rather than all the different systems’ views (level 4).
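
As a rough illustration of these problem levels, the Python sketch below matches attributes from two sources (level 2), mediates between two date representations (level 3), and merges records that describe the same entity (level 4). All record layouts, attribute names and the `ATTRIBUTE_MAP` table are hypothetical and serve only this example; level 1 (geographic distribution) is assumed to be solved already.

```python
from datetime import datetime

# Hypothetical records from two independently developed systems.
source_a = [{"cust_id": "17", "birth": "03/02/1980", "city": "Oldenburg"}]
source_b = [{"customer": "17", "date_of_birth": "1980-02-03", "phone": "555-0100"}]

# Level 2: match semantically compatible attributes under shared names.
ATTRIBUTE_MAP = {
    "cust_id": "customer_id", "customer": "customer_id",
    "birth": "birth_date", "date_of_birth": "birth_date",
}

# Level 3: mediate between representations of the same concept.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d")


def to_iso_date(value):
    """Convert any known date representation to ISO yyyy-mm-dd."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unknown date representation: {value!r}")


def exchange(record):
    """Exchange (levels 1-3): transform one producer record for a consumer."""
    out = {ATTRIBUTE_MAP.get(key, key): value for key, value in record.items()}
    out["birth_date"] = to_iso_date(out["birth_date"])
    return out


def integrate(*sources):
    """Integration (level 4): correlate and fuse records on customer_id."""
    merged = {}
    for source in sources:
        for record in source:
            clean = exchange(record)
            merged.setdefault(clean["customer_id"], {}).update(clean)
    return merged


view = integrate(source_a, source_b)
print(view["17"])
```

The consumer of `view` sees one fused record per customer rather than the two source layouts, which corresponds to the single, coherent view described for integration.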

8) Clark and Jones in [9] proposed an Organisational Interoperability Maturity Model. The model defines the levels of organisational maturity that describe the ability of organisations to interoperate. Five levels were identified, closely aligned with the descriptions of the LISI model.
• Unified: a unified organisation is one in which the organisational goals, value systems, command structure/style and knowledge bases are shared across the system.
• Integrated: The integrated level of organisational interoperability is one where there are shared value systems and shared goals, a common understanding and a preparedness to interoperate. For example, detailed doctrine is in place and there is significant experience in using it.
• Collaborative: The collaborative organisational interoperability level is where recognised frameworks are in place to support interoperability. Shared goals are recognised, and roles and responsibilities are allocated as part of on-going responsibilities; however, the organisations are still distinct.
• Ad hoc: At this level of interoperability only very limited organisational frameworks are in place, which could support ad hoc arrangements.
• Independent: This level describes the interaction between independent organisations.

9) Klischewski in [10] identifies and discusses two types of integration:
• information integration aims at facilitating information flow, i.e. providing access to structured informational resources across technical and organisational borders in order to enable new services based on a virtually shared information environment.
• process integration pertains to interrelating steps and stages of process performance across technical and organisational borders in order to enable new services based on an overarching monitoring and control of process flow.

10) The European Interoperability Framework (ref EIF) published by IDABC recognizes three interoperability levels:
• Technical, linking computer systems and services.
• Semantic, ensuring that the precise meaning of exchanged information is understandable by any other application that was not initially developed for this purpose.
• Organizational, defining business goals, modeling business processes and bringing about the collaboration of administrations [11].

11) Medjahed [12] adopts an interaction model similar to the previous one, which consists of three layers:
• Communication: Protocols for exchanging messages among remotely located partners.
• Content: Languages and models to describe and organize information in such a way that it can be understood and used.
• Business Process: Enable autonomous and heterogeneous partners to engage in peer-to-peer interactions with each other.

An additional set of parameters defines how applications interact on the Web. This set is applicable to enabling technologies and prototypes, and consists of the following parameters: coupling, autonomy, heterogeneity, external manageability, adaptability, security and scalability.



12) Mylopoulos and Papazoglou [13], in their seminal work introducing the notion of Cooperative Information Systems, identified, among others, two broad categories of challenges to IS design and development:
• Interoperation. This category covers topics such as generic, open architectures, distributed object management, network-centric computing, compartmentalized applications, factoring out global control from individual components, integration of user and subsystem communication, communication protocols, translation mechanisms, data-integration mechanisms, semantic metadata repositories, knowledge sharing and blackboard architectures.
• Coordination. The topics in this category include computer-supported collaborative work, synchronous and asynchronous sharing, virtual workspaces, performers and customers, concurrency control, transaction management, mediation architectures, workflow systems, AI planning, multiagent technologies, intelligent scheduling, self-describing systems and reflective architectures.

The C4 Interoperability Framework (C4IF)

In this part, we propose our own typology for information systems interoperability. We have called it the Connection, Communication, Consolidation, Collaboration Interoperability Framework (C4IF). The C4IF has been developed using some well-defined concepts from linguistics. Following a language/action perspective (e.g. [14]), we focus on the ways information systems communicate, modeling this communication as a discourse. To better understand, analyze and study this, we borrow well-elaborated concepts from language theories, such as language form, syntax, meaning, and the use of symbols and interpretations. Interestingly, these issues are considered common to all kinds of communication. The specific type of communication that interests us here is that amongst Information Systems. Thus, we transfer basic linguistics concepts to the domain of IS communication.

Linguistics is the study of language, which covers a wide range of phenomena: sounds (phonetics and phonology), word formation and word endings (morphology), word combinations (syntax), meaning (semantics) and language use (pragmatics).

• Phonetics is the study of the different sounds that are employed across all human languages.
• Phonology is the study of patterns of a language's basic sounds.
• Morphology is the study of the internal structure of words.
• Syntax is the study of how words combine to form grammatical sentences.
• Semantics is the study of the meaning of words (lexical semantics), and how these combine to form the meanings of sentences.
• Pragmatics is the study of how utterances are used (literally, figuratively, or otherwise) in communicative acts.

We transfer the above concepts to Information Systems communication and use these broad categories to build our interoperability typology, namely the C4IF.

The framework defines four interoperability types:

• Connection
• Communication
• Consolidation
• Collaboration

A short description of these types follows:

Connection refers to the ability of information systems to exchange signals. To succeed in this, a physical contact/connection should be established between two (or more) systems. In a linguistic analogy, this level guarantees that phonemes can be uttered and received by the two interlocutors.

Communication refers to the ability of information systems to exchange data. To succeed in this, a predefined data format and/or schema needs to be accepted by the interlocutors. The focus of this type is on the data content. At least two levels of communication can be considered:

• At the first level, the exchange is based on a commonly accepted data format. Agreement on the data format is needed between the interlocutors. The focus is on each separate data string (e.g. date=dd/mm/yyyy). We call this type of interoperability Morphological/Structural Communication.
• At the second, more advanced level, the exchange includes data that is placed in commonly accepted and agreed data syntax/schemas (e.g. Entity-Relationship diagrams, XML Schemas). Agreement on the data syntax/schema is needed between the interlocutors. The focus is extended from the separate data string to the data syntax/schema. We call this type of interoperability Syntactic Communication.

Consolidation refers to the ability of information systems to understand data. To succeed in this, a commonly accepted meaning for the data needs to be established between the interlocutors (e.g. a reference ontology). The focus here is on the data meaning, interpretation and semantics.

Collaboration refers to the ability of systems to act together. Action results in changes in the real world. To succeed in this, a commonly accepted understanding for performing functions/services/processes/actions needs to be established between the interlocutors or information systems in our case (e.g. agreed distributed workflow patterns). The focus is on the process, the behaviour and the use of data. Paraphrasing Austin’s seminal work in Speech Act Theory, “How to do things with Words” [15], we may say that this layer in information systems interoperability refers to “How to do things with Information”.
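
The distinction between the two Communication sublevels and Consolidation can be sketched as three successive checks on an incoming message. This is a minimal Python sketch under stated assumptions: the message layout, the `REQUIRED_FIELDS` schema and the `SHARED_VOCABULARY` table are hypothetical stand-ins for an agreed data format, an agreed data schema and a reference ontology.

```python
import re

# Morphological/structural agreement: format of one data string (dd/mm/yyyy).
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")

# Syntactic agreement: which fields a message must contain, and their types.
REQUIRED_FIELDS = {"applicant": str, "status": str, "date": str}

# Consolidation: local terms mapped to a commonly accepted meaning.
SHARED_VOCABULARY = {"approved": "GRANTED", "ok": "GRANTED", "refused": "DENIED"}


def format_ok(message):
    """Check that the date string follows the agreed data format."""
    return DATE_PATTERN.fullmatch(message.get("date", "")) is not None


def schema_ok(message):
    """Check that the message follows the agreed data schema."""
    return all(
        isinstance(message.get(name), kind)
        for name, kind in REQUIRED_FIELDS.items()
    )


def shared_meaning(message):
    """Return the agreed meaning of the status, or None if it is unknown."""
    return SHARED_VOCABULARY.get(message["status"].lower())


message = {"applicant": "A. Smith", "status": "OK", "date": "24/12/2005"}
print(format_ok(message), schema_ok(message), shared_meaning(message))
# True True GRANTED
```

A message can pass the first two checks and still fail the third, which is the case where data is exchanged but not understood. Collaboration would additionally require an agreed process for acting on the message and is outside this sketch.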

These four interoperability types are organized in three demarcated areas based on the characteristic object of the achieved interaction. The first and last types each constitute an area on their own, while the second and third types together form the remaining area. The objects of integration that define each area are:

• Channel: this refers to the connection layer and the ability of information systems to exchange signals.
• Information: this refers to the communication and the consolidation layers, and the ability of information systems to exchange data and information. Actually a continuum is defined with regard to this item: we may have morphological/structural (data format), syntactic (data schema) and/or semantic (data meaning) interoperability. The “continuum” character implies that these three types constitute analytical abstractions. In between these types several intermediate states can be identified. The first two types clearly refer to the communication layer, while the third to the consolidation layer.
• Process: this refers to the collaboration layer and the ability of information systems to act together.

The relationship amongst these three areas within the C4IF interoperability types is depicted in Fig. 1. These three areas are considered disjoint to a great extent. This means that the C4IF allows, for example, an advanced level of communication/consolidation with low IS interoperability at the connection level, and/or an advanced level of collaboration with low interoperability at all other levels, and the reverse. In other words, although advancements in one area usually provide the enabling infrastructure for advancements in another area, each area can be considered autonomous and may evolve separately. As an example, passing from a situation where data is exchanged manually between two IS (e.g. using disks) to an advanced mode of exchanging data via a wireless broadband network is clearly a substantial advancement occurring in the channel area and at the connection level of IS interoperability. However, this advancement alone does not guarantee any interoperability advancement in the other areas/levels (e.g. data/communication, process/collaboration). Similarly, agreeing on a common terminology to be used for describing the various entities modeled in the information systems of a group of organizations constitutes a remarkable advancement in the information exchange area (consolidation interoperability level). But this does not automatically lead to advancements in the process area (collaboration interoperability between these organizations).
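
One way to picture the independence of the three areas is to record a separate score for each of them instead of a single level. The Python sketch below is illustrative only: the `C4IFProfile` class, the 0-3 scale and the example values are assumptions made here and are not part of the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class C4IFProfile:
    """Hypothetical 0-3 scores for the three C4IF areas."""
    channel: int      # Connection
    information: int  # Communication and Consolidation
    process: int      # Collaboration

    def improved_areas(self, earlier):
        """Names of the areas in which this profile exceeds an earlier one."""
        return [
            name for name in ("channel", "information", "process")
            if getattr(self, name) > getattr(earlier, name)
        ]


# Moving from disks to a wireless broadband network changes one area only.
with_disks = C4IFProfile(channel=0, information=2, process=1)
with_wireless = C4IFProfile(channel=3, information=2, process=1)
print(with_wireless.improved_areas(with_disks))  # ['channel']
```

Because each area is stored separately, an advancement in the channel area leaves the information and process scores untouched, mirroring the disk-to-wireless example in the text.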

Fig. 1: The C4IF

The analogy to the above-presented linguistic concepts is presented in Table 1.

Language               Information Systems Communication   Substance     Focus
Phonetics, Phonology   Connection                          Channel       Communication channel
Morphology             Communication                       Information   Data Format
Syntax                 Communication                       Information   Data Schema
Semantics              Consolidation                       Information   Meaning
Pragmatics             Collaboration                       Process       Action/Behaviour

Table 1: The C4IF and its mapping to linguistics



An important aspect of the proposed framework should be mentioned. C4IF is clearly focused on information systems interoperability but at the same time avoids a purely technology-based approach to organizational integration. That is, it clearly separates advancement in information systems interoperability from advancement in organizational integration. Although technology-based integration (interoperability) is perceived as a catalyst for achieving advanced levels of organizational integration, organizations exhibiting low information systems interoperability, e.g. due to a generally low technology level, may nevertheless develop advanced integration. For example, organizations supported by weak technology (e.g. no computerized information system support) may exchange semantically rich information (e.g. using cryptographic code over telegraph) and participate in advanced collaboration patterns (e.g. complex and conditional pre-defined workflow patterns based on the messages received). In this case, we may have an advanced level of organizational integration between the interacting organizations. Conversely, although information systems interoperability is considered a powerful instrument and means for integration, it does not always guarantee a high degree of organizational integration, as other important “soft” aspects (e.g. culture) may affect the final organizational integration level achieved.

Each of the C4IF interoperability levels corresponds to and may be realized through a set of enabling technologies. We present examples of such technologies in the table that follows. The list of technologies presented is indicative.

Connection         Communication       Consolidation                 Collaboration
(out of content)   (out of context)    (out of usage)
• Cable            • Data Formats      • Thesaurus                   • Workflow Languages
• Infrared         • Data Dictionary   • Taxonomies                  • Distributed Workflows
• Bluetooth        • SQL               • Common Vocabularies         • BPML
                   • E-R Schemas       • RDF Schemas                 • Service Ontologies
                                       • Ontologies                  • SOAs
                                       • Semantic Web technologies   • Web Services
                                                                     • BPEL4WS
                                                                     • Semantic Web Service technologies

Table 2: Technologies used in the C4IF layers

Two technologies presented in the table deserve special attention:
• Semantic Web technologies combine state-of-the-art Connection (e.g. TCP/IP, HTTP) and Consolidation technologies (e.g. RDF Schemas, Ontologies).
• Semantic Web Service technologies combine the Semantic Web family of technologies with advanced collaboration technologies (e.g. service ontologies, distributed workflows). They are actually a combination of advanced technologies from all the other layers.

Lastly, we provide a mapping between all the above presented interoperability typologies and the interoperability levels they propose vis-à-vis the C4IF layers. The correspondence can be seen in the two figures that follow.



Figure 2.a: Mapping the interoperability typologies to C4IF layers

Figure 2.b: Mapping the interoperability typologies to C4IF layers

Concluding the presentation of C4IF, we identify the following advantages of the framework when compared to the other interoperability typologies.

• Sound theoretical foundation. The C4IF interoperability types are derived from a well-elaborated theory of language and thus have a sound theoretical foundation that cannot be found in any other of the proposed typologies.



• Introduction of an innovative perspective to the “interoperability” discussion. The linguistic metaphor that we employ provides a new approach to this demanding and active field. This perspective is based on a simple yet powerful metaphor, analyzing information systems communication (interoperation) as a discourse.

• Well-defined layers. As a result of the two above-mentioned points, the proposed types of interoperability are clearly defined and no ambiguity exists.

• Loosely related layers. The three broader model areas demarcated here (channel, data, and process) may evolve separately, influencing but not determining the advancement of the others. This provides the necessary flexibility to surpass crude categorizations imposed by other models that are based on a single categorization criterion. Moreover, it overcomes the simplistic, single-dimension linearity imposed by almost all other typologies.

• Clear focus on IS interoperability avoiding technological specificity. C4IF clearly distinguishes advancements in information system interoperability from advancements in organizational integration, which is rarely the case in other interoperability typologies.

• All-inclusive typology. C4IF is a framework where all other typologies can find a place.

Conclusions and Future Work

In this paper, we propose a new approach to information systems interoperability. We perceive the interaction amongst information systems as a discourse and thus we transfer some concepts from linguistics to information systems interoperability research. We plan to further elaborate on the C4IF and put it into practice. One direction is to identify generic types of interoperability obstacles per category. In this direction, concepts from linguistics may again prove valuable. We then plan to propose technological and/or other strategies/solutions as a roadmap to overcome the identified obstacles. Another research direction will involve identifying and studying interoperability problems in real eBusiness/eGovernment environments and attempting to group these problems according to the proposed framework. This exercise could also be employed for validation purposes. Interesting areas for future research could also include the interrelationship that exists amongst the various problems in a specific layer and the impact of possible advancements in one layer on others. Our aim is to keep C4IF as domain-independent, transferable and reusable as possible.

References

[1] C4ISR Architectures Working Group: Levels of Information Systems Interoperability (LISI), 1998.
[2] NATO Allied Data Publication 34 (ADatP-34): NATO C3 Technical Architecture (NC3TA), Version 4.0, 2003.
[3] Tolk A. and J. A. Muguira: The Levels of Conceptual Interoperability Model. Simulation Interoperability Workshop, Orlando, Florida, 2003.
[4] Park, J. and S. Ram: “Information Systems Interoperability: What Lies Beneath?” ACM Transactions on Information Systems 22(4): 595–632, 2004.
[5] U.S Air Force and Don Brutzman and Tolk A.: Report on JSB Composability and Web Services Interoperability via Extensible Modeling & Simulation Framework (XMSF), Model Driven Architecture (MDA), Component Repositories, and Web-based Visualization.
[6] Obrst, L. J.: Ontologies and Semantic Web for Semantic Interoperability. 2004 Semantic Technologies for e-Government Conference, USA, 2004.
[7] Obrst, L. J.: Ontologies & the Semantic Web for Semantic Interoperability, presentation in the SICoP Workshop, 2005.
[8] Seligman, L. and A. Rosenthal: “A Framework for Information Interoperability.” The Edge Mitre's Advanced Technology Newsletter 8(1): 3-4, 2004.
[9] Clark, T. and R. Jones: Organisational Interoperability Maturity Model for C2, 1999.
[10] Klischewski, R.: Information integration or process integration: How to achieve interoperability in administration. EGOV04 at DEXA, Zaragoza, Spain, 2004.
[11] IDABC: European Interoperability Framework. Luxembourg, European Communities, 2004.
[12] Medjahed, B.: Semantic Web Enabled Composition of Web Services. PhD Thesis, Virginia Polytechnic Institute. Falls Church, 2004.
[13] Mylopoulos, J. and M. Papazoglou: “Cooperative Information Systems, Guest Editors Introduction.” IEEE Expert Sep/Oct 1997, 1997.
[14] Johannesson, P.: A Language/Action based Approach to Information Modeling. Information Modeling in the New Millennium. Rossi M. and Siau K., IDEA Publishing, 2001.
[15] Austin, J. L.: How to do things with Words. Cambridge, MA, Harvard University Press, 1962.



Ontology Mapping for Web-Based Educational Systems Interoperability

Amel Bouzeghoub, Abdeltif Elbyed
GET/INT, 9, rue Charles Fourier, 91011 Evry, France
{Amel.Bouzeghoub, Abdeltif.Elbyed}@int-evry.fr

Abstract. In order to deal with the need to share learning objects within and across learning object repositories, most recent work argues for the use of ontologies as a means of providing a shared understanding of common domains. But with the proliferation of different ontologies, even for the same domain, a mapping process becomes necessary to achieve interoperability. Although many efforts in ontology mapping have already been carried out, few of them use resource properties to generate relations between local concepts. Our approach exploits these properties and uses inference rules to produce correspondences between concepts from the source and target ontologies.

Introduction

Web-based Educational Systems (WBES) are one of the leading domains where interoperability and sharing are in high demand. Indeed, the abundance of learning resources on the web makes sharing and reusing content a necessity. Typically, these digital learning objects (LOs) are content stored as text, audio or video media files. Organizations such as Ariadne [18] and EducaNext [19] have developed learning object repositories (LORs) in which learning objects are described by a set of metadata (based on a standard, LOM [20] in most cases). Although these repositories organize the content of their resources and exchange resources, several problems remain. First, search and answer accuracy is limited: semantics must be associated with metadata values to cope with linguistic differences, cultural differences and inconsistent use of terms. Tools from the semantic Web, namely ontologies, have to be integrated into repositories to organize the concepts covered by the stored resources in a so-called “knowledge domain ontology”. Moreover, the repositories do not offer powerful tools for reusing and composing existing resources. Finally, as the number of LORs increases, the problem of interoperability becomes more and more important, revealing problems of similarity, overlap and cooperation between knowledge domains. It becomes increasingly difficult for users to obtain relevant information.

Nowadays, domain ontologies are recognized as the most important issue in semantic web interoperability. The problem is that users are most familiar with their own domain ontology, and it is not easy for them to use the multiple ontologies of remote repositories. Most approaches treat interoperability at a low level. In [16], interoperability is based on common protocols that define the interactions between repositories: a set of methods referred to as the Simple Query Interface (SQI) has been proposed as a universal interoperability layer for educational networks. In other projects, such as Elena/Edutella [22] and eduSource [21], openness is supported by a communication protocol.



We consider in our work that interoperability may be supported at the ontological level, and we propose as a first result an algorithm that generates mappings among different ontologies. In this paper we present our approach, which is based on either rules derived by human experts or basic deduction rules. Hypothesis generation combines different similarity measures to find mapping candidates between two ontologies. The rest of the paper is structured as follows. First, we motivate our work by showing how ontologies are used in WBES in general and in our system, named SIMBAD, in particular. We then present the proposed mapping algorithm. After a comparison with related work, the paper ends with a conclusion and remarks on further work.

Motivations and Context

Ontologies offer great potential in higher education, in particular by enabling the sharing and reuse of information across educational systems and by supporting intelligent, personalized learner assistance. The increased functionality that ontologies bring will create new opportunities for e-learning: learners will be able to interact with distant educational systems easily and in a personalized way. An overview of ontologies for the education field and an initial report on the development of an ontology-driven web portal, O4E, are presented in [3]. We have developed a WBES named SIMBAD based on a domain ontology. To facilitate the exchange of resources between SIMBAD and other WBES, solutions are needed that allow various repositories of learning resources to cooperate. The user may seek resources outside his/her private reference ontology. The problem is that understanding a new classification (a new ontology) is expensive and does not constitute a justified investment. It is thus necessary to propose mechanisms that let the user access the resources of other repositories transparently, using his/her favourite WBES (and the associated shared reference ontology).

Ontologies for SIMBAD

This section presents the logical architecture of our system; a more detailed description is given in [2]. Our system is aimed at two categories of users: authors of resources and learners. It is based on three models: (1) the domain model, represented by an ontology that provides a normalized, common reference for all users of the system. This ontology is based on the ACM/CCS classification for the computer science domain [17] and serves to index learners and learning resources semantically; (2) the learner model, which is a view on the domain model: the learner's knowledge is modeled by links to the domain model; (3) the learning object model, which gives a semantic description of a learning object. In order to be found and reused, a resource must be described by a set of metadata. We distinguish two types of metadata: the first describes general characteristics of the resource (e.g., author, title, language, media) using the LOM standard, and the second describes the semantics of the resource. These semantics are structured in three parts and described in the same way as software components: prerequisites are the inputs of the resource (what is required by the resource), whereas content and acquisition function are its outputs (what is provided by the resource). A resource can be a set of web pages, a file or a program (a simulator, for example). We only assume that it is a unit accessible via a URI, and we consider it an instance of an ontology concept.

Algorithm Principles

In this paper we present an approach that combines different similarity measures to find mapping candidates between two ontologies. The algorithm is based on three steps. The first step generates information from the ontology: it compares instances to deduce relations (convergence, divergence) between concepts of the same ontology. The second step calculates the similarity between candidate pairs of concepts by using inference rules and rules derived by human experts. Before describing the algorithm, we need some more definitions and notations.

Definitions and Notations

φ-relation. A φ-relation describes semantic properties among resources. These properties are presented in [1]. We define φ : I → I′, where I and I′ are sets of instances. Given any elements r of I and r′ of I′, a φ-relation can be one of the relation types presented in Table 1 (pre(r) and cont(r) denote the prerequisite and content sets of r, respectively):

Table 1. φ-relation definition.

φ-relation          Definition
substitution        r can be substituted for r′ if pre(r) = pre(r′)
equivalence         r is equivalent to r′ if r can be substituted for r′ and cont(r) = cont(r′)
weak-precedence     r weakly precedes r′ if cont(r) ⊆ pre(r′)
strong-precedence   r strongly precedes r′ if cont(r) = pre(r′)
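The definitions of Table 1 translate directly into set comparisons. The following is a minimal sketch, assuming that prerequisites and contents are modeled as sets of concept names; the `Resource` class, the function name and the example URIs are hypothetical and are not part of SIMBAD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A learning resource with its prerequisite and content concept sets."""
    uri: str
    pre: frozenset   # concepts required by the resource
    cont: frozenset  # concepts provided by the resource


def phi_relations(r: Resource, r2: Resource) -> set:
    """Return the names of all relations of Table 1 that hold from r to r2."""
    found = set()
    if r.pre == r2.pre:
        found.add("substitution")
        if r.cont == r2.cont:
            found.add("equivalence")
    if r.cont <= r2.pre:              # subset test: cont(r) ⊆ pre(r')
        found.add("weak-precedence")
    if r.cont == r2.pre:
        found.add("strong-precedence")
    return found


# Hypothetical resources; URIs and concept names are made up for illustration.
intro = Resource("http://example.org/r/intro-sql",
                 pre=frozenset({"sets"}), cont=frozenset({"sql-basics"}))
joins = Resource("http://example.org/r/sql-joins",
                 pre=frozenset({"sql-basics"}), cont=frozenset({"joins"}))
intro_alt = Resource("http://example.org/r/intro-sql-video",
                     pre=frozenset({"sets"}), cont=frozenset({"sql-basics"}))

print(sorted(phi_relations(intro, joins)))      # precedence relations
print(sorted(phi_relations(intro, intro_alt)))  # substitution and equivalence
```

Note that, under these definitions, strong precedence always implies weak precedence and equivalence always implies substitution, which is why the function can return several relation names for one pair.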

These new relations among resources enrich the ontology relationships, which leads to the following series of definitions.

Enriched Ontology Definition. An ontology O is a tuple O = (C, R, <, σ, ⊥, |) where (i) C and R denote two disjoint sets called concept identifiers and relation identifiers respectively, (ii) < denotes a partial order on C called the concept hierarchy or taxonomy, (iii) σ : R → C×C denotes a function called signature that associates a relation with a pair of concepts, (iv) ⊥ denotes a relation called divergence between two concepts (i.e. there is no φ-relation between the resources associated with the concepts), (v) | denotes a relation called convergence between two concepts (i.e. there is at least one φ-relation between the associated instances). We adopt the logic notation presented in Table 2 to express relations between concepts; c and d are two concepts:

Table 2. Logical relation between concepts.

Condition   Logical notation
c < d       c ⊢ d
c ⊥ d       c, d ⊢
c | d       ⊢ c, d

Degree of Convergence. We define the degree of convergence to measure the convergence between concepts. Two concepts are more or less convergent depending on the number of resources they have in common. The degree of convergence between two concepts c and d, denoted Ð_{c,d}, is given by the following formula:

$$ \text{Ð}_{c,d} = \frac{\text{number of common instances between } c \text{ and } d}{\min(|c|,\,|d|)} $$

where |c| (resp. |d|) is the total number of instances linked to the concept c (resp. d).

Ontology Morphism Definition. The comparison of two concepts in the same ontology is equivalent to the comparison of their images in a different ontology. For example, if c precedes d in the first ontology, their corresponding concepts F(c) and F(d) satisfy the same relation. This leads to the following definition: an ontology morphism between two ontologies O = (C, R, <, σ, ⊥, |) and O′ = (C′, R′, <′, σ′, ⊥′, |′) is a pair of functions (F, G) such that F : C → C′ and G : R → R′. Given two elements c and d of C, the relations listed in Table 3 are generated:

Table 3. Morphism properties.

Condition in O    New relation in O′
c < d             F(c) <′ F(d)
c ⊥ d             F(c) ⊥′ F(d)
c | d             F(c) |′ F(d)
σ(r) = (c, d)     σ′(G(r)) = (F(c), F(d))
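The degree of convergence and the first three rows of Table 3 can be illustrated with a short sketch. It assumes that the instances of a concept are available as a set of resource identifiers, that the hierarchy is stored as ordered pairs, and that divergence and convergence are stored as unordered pairs; all function names and example data are hypothetical.

```python
def degree_of_convergence(instances_c: set, instances_d: set) -> float:
    """Shared instances divided by the size of the smaller instance set."""
    smallest = min(len(instances_c), len(instances_d))
    if smallest == 0:
        return 0.0  # assumption: a concept without instances converges with nothing
    return len(instances_c & instances_d) / smallest


def preserves_relations(F: dict, order: set, order2: set,
                        symmetric: set, symmetric2: set) -> bool:
    """Check that a concept mapping F carries relations of O over to O'.

    order / order2 hold ordered pairs (c, d) with c < d.
    symmetric / symmetric2 hold frozenset({c, d}) pairs for one symmetric
    relation (divergence or convergence).
    """
    hierarchy_ok = all((F[c], F[d]) in order2 for c, d in order)
    symmetric_ok = all(frozenset(F[x] for x in pair) in symmetric2
                       for pair in symmetric)
    return hierarchy_ok and symmetric_ok


# Hypothetical instance sets: two shared resources, smaller set has three.
db_instances = {"r1", "r2", "r3"}
sql_instances = {"r2", "r3", "r4", "r5"}
print(degree_of_convergence(db_instances, sql_instances))  # 2 shared / min(3, 4)

# Hypothetical mapping between two tiny ontologies.
F = {"Databases": "DB", "SQL": "QueryLanguages"}
order = {("SQL", "Databases")}
order2 = {("QueryLanguages", "DB")}
convergent = {frozenset({"SQL", "Databases"})}
convergent2 = {frozenset({"QueryLanguages", "DB"})}
print(preserves_relations(F, order, order2, convergent, convergent2))
```

The check only verifies a given mapping against known relations; in the approach described here the same properties are used in the other direction, to generate new relations in O′.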

Mapping Process

Our mapping approach is based on multiple iterations, in which different similarity measures are used by applying inference rules. In this section we describe the steps of the mapping process after a brief definition of ontology mapping.



Given two ontologies O and O′, mapping one ontology onto another means that for each concept c in the source ontology O, we try to find a corresponding concept c′ in the target ontology O′ that has the same intended meaning. The mapping process illustrated in Fig. 1 takes the two ontologies to be mapped as its input and includes four main steps. Ontology mappings are derived through a search over candidate mappings. The similarity computation determines similarity values for candidate mappings. Hypotheses are then generated using a rule base, which contains a set of deductive rules and may be enriched with new rules proposed by domain experts. The hypotheses are filtered, and the “best” similarity hypothesis is selected. Each step can be repeated for multiple rounds and exchanges messages with the previous step if necessary.

Fig. 1. Different steps of mapping process

Similarity Computation. The similarity computation is an iterative process. The first iteration provides a basic similarity between concepts, using linguistic tools to compare concept names. The i-th iteration takes the similarities produced in the (i-1)-th iteration and applies the inference rules. These inference rules are either inferred from the ontology morphism definition or proposed by the domain expert.

1st Iteration: Similarity Computation Using Linguistic Comparisons. In the first iteration, basic similarities are set via measures based on linguistic comparisons, which are independent of the subsequent similarity measures. Several ideas have been developed using comparisons of concept names [11], dictionaries (e.g. WordNet [24]), identifiers such as URIs, etc. We present below examples of methods and functions:

1. The use of existing tools based on dictionaries, as done by Silva and Rocha [15], who use WordNet to identify four types of relations between two concepts A and B:
− A ≡ B (i.e. sim(A,B) = 1) if there exists a meaning of A that is a synonym of a meaning of B
− A ⊇ B (i.e. sim(A,B) = 0.7) if there exists a meaning of A that is a hyponym of a meaning of B
− A ⊆ B (i.e. sim(A,B) = 0.3) if there exists a meaning of B that is a hyponym of a meaning of A
− A ⊥ B (i.e. sim(A,B) = 0) if there is no relation between the meanings of A and the meanings of B

2. String equality:

$$ \mathrm{SimStr}_{equ}(c,d) = \begin{cases} 1 & \text{if } |c| = |d| \text{ and } c.\mathrm{char}(i) = d.\mathrm{char}(i) \;\; \forall i \in [0, |c|] \\ 0 & \text{otherwise} \end{cases} $$

3. Similarity measure between two strings on a scale from 0 to 1:

$$ \mathrm{SimStr}(c,d) = \max\left(0,\; \frac{\min(|c|,|d|) - ed(c,d)}{\min(|c|,|d|)}\right) $$

where ed(c,d) denotes the edit distance between the two strings.

Combining these methods will bring better results.

i-th Iteration: Similarity Computation Using a Rule Base. After obtaining some relations between concepts from the linguistic comparison, we use a rule base to find new similarities between the ontologies. The rule base contains two sets of rules: a first set of basic deduction rules and a second set of rules proposed by the domain expert. Each rule gives an indication of whether two concepts are similar, but none provides the mapping by itself: the rules only give a similarity weight between two compared entities. A threshold on the similarity values determines correspondence or non-correspondence. Thanks to the ontology morphism definition, the set of deduction rules generates, at iteration i, new similarity relations from those of iteration i-1. For a progressive and dynamic similarity generation we define a similarity function F~ : C × C′ → [0, 1], which associates with each pair of concepts a degree of similarity between 0 and 1. The application of any comparison method (structural or semantic) increases the similarity value and therefore the value of F~. The following examples of deduction rules illustrate the mechanism of similarity computation; c and d (respectively, c′ and d′) denote two concepts of the ontology O (respectively, O′) and nbr-child(c) is the number of sub-concepts of c.

R1 IF F~(c,c′) increases its value THEN
∀(d,d′) ∈ C × C′ such that c ⊢ d and c′ ⊢ d′ (i.e. c < d and c′ <′ d′): F~(d,d′) = F~(d,d′) + (F~(c,c′) / nbr-child(c))

R2 IF F~(c,c′) increases its value THEN
∀d′ ∈ C′ such that ⊢ c′,d′ (i.e. c′ |′ d′): F~(c,d′) = F~(c,d′) + (F~(c,c′) × Ð_{c′,d′}) and
∀d ∈ C such that ⊢ c,d (i.e. c | d): F~(d,c′) = F~(d,c′) + (F~(c,c′) × Ð_{c,d})

Besides these rules, the domain expert can propose other rules to improve the quality of the result. In the next iteration the overall similarities between concepts are calculated based on the new similarity measures proposed by the expert and processed with an inference engine. Table 4 shows examples of rules for concept comparisons. This set of rules may evolve dynamically. The manual effort is necessary because ontology mapping is too complex to be handled by default rules alone.
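The two kinds of iteration described above can be sketched as follows: the string measures seed the similarity function, and rules R1 and R2 then propagate an increase to related pairs. This is only an illustration under stated assumptions: the edit distance is taken to be the Levenshtein distance, R1 is read as applying to the direct sub-concepts of c and c′ (which is how nbr-child(c) is interpreted here), pairs not yet scored start at 0, and values are capped at 1 so that F~ stays in [0, 1]. All function names, data structures and concept names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                     # delete ch_a
                           cur[j - 1] + 1,                  # insert ch_b
                           prev[j - 1] + (ch_a != ch_b)))   # substitute
        prev = cur
    return prev[-1]


def sim_str_equ(c: str, d: str) -> float:
    """String equality: 1 for identical names, 0 otherwise."""
    return 1.0 if c == d else 0.0


def sim_str(c: str, d: str) -> float:
    """max(0, (min(|c|,|d|) - ed(c,d)) / min(|c|,|d|))."""
    shortest = min(len(c), len(d))
    if shortest == 0:
        return 0.0  # assumption: an empty name is similar to nothing
    return max(0.0, (shortest - edit_distance(c, d)) / shortest)


def bump(F: dict, pair: tuple, amount: float) -> None:
    """Raise the similarity of one pair; the cap at 1.0 is an assumption."""
    F[pair] = min(1.0, F.get(pair, 0.0) + amount)


def apply_r1(F: dict, c: str, c2: str, children: dict, children2: dict) -> None:
    """R1: share the similarity of (c, c') among pairs of their sub-concepts."""
    kids = children.get(c, [])
    for d in kids:
        for d2 in children2.get(c2, []):
            bump(F, (d, d2), F[(c, c2)] / len(kids))


def apply_r2(F: dict, c: str, c2: str, conv: dict, conv2: dict) -> None:
    """R2: weight the similarity of (c, c') by the degree of convergence.

    conv maps (c, d) to the degree of convergence in O, conv2 maps (c', d')
    to the degree of convergence in O'.
    """
    base = F[(c, c2)]
    for (x2, d2), degree in conv2.items():
        if x2 == c2:
            bump(F, (c, d2), base * degree)
    for (x, d), degree in conv.items():
        if x == c:
            bump(F, (d, c2), base * degree)


# First iteration: seed F~ with a linguistic measure (hypothetical concepts).
F = {("Course", "Courses"): sim_str("Course", "Courses")}

# Next iteration: propagate the new value with the deduction rules.
apply_r1(F, "Course", "Courses",
         children={"Course": ["Lecture", "Lab"]},
         children2={"Courses": ["Lectures"]})
apply_r2(F, "Course", "Courses", conv={}, conv2={("Courses", "Exams"): 0.5})

for pair, value in sorted(F.items()):
    print(pair, round(value, 3))
```

In a full implementation the rules would fire again whenever one of the updated values increases, which is the role the text assigns to the inference engine.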

Table 4. Example of similarity rules between concepts.

Rule   Description
R1     Two concepts are similar if their names are similar
R2     Two concepts are similar if their URIs are similar
R3     Two concepts are similar if their “father” concepts are similar
R4     Two concepts are similar if their “child” concepts are similar
R5     Two concepts are similar if their associated instances are similar

Hypothesis Generation. Hypothesis generation at iteration i is based on either the mapping set or the similarities generated at iteration i-1. We use the deduction rules and the comparison rules to propose new correspondences between concepts. Mapping hypotheses are generated for all pairs of concepts, depending on the similarity value F~.

Hypothesis Filtering. During this step, hypotheses that violate certain constraints (e.g. c ⊥ d and F(c) |′ F(d)) are removed. The following are examples of rules that compare two hypotheses in order to eliminate the weaker one. Given two hypotheses hyp1: F~(c,c′) and hyp2: F~(d,d′) with F~(c,c′) > F~(d,d′):
IF c ⊢ d and d′ ⊢ c′ (i.e. c < d and d′ <′ c′) THEN eliminate hyp2.
IF (c,d ⊢ and ⊢ c′,d′) or (⊢ c,d and c′,d′ ⊢) (i.e. c ⊥ d and c′ |′ d′, or c | d and c′ ⊥′ d′) THEN eliminate hyp2.
Furthermore, we can define a similarity threshold below which hypotheses are not considered.

Hypothesis Choice. In this step, the list of hypotheses provided by the filter is browsed and the best similarities are chosen. If none of the received hypotheses is selected, steps 3 and 4 are repeated for multiple rounds with a decreasing threshold. In the last step, only the best similarities are kept in the final mapping table.
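The filtering and choice steps can be sketched as below. The sketch makes several assumptions that the text leaves open: a hypothesis is a pair (c, c′) with a score, any stronger hypothesis may eliminate a weaker conflicting one, “best” means the highest-scoring target for each source concept, and the threshold values and number of rounds are arbitrary. The `Relations` container and all names are hypothetical.

```python
from typing import NamedTuple


class Relations(NamedTuple):
    """Relations known inside one ontology."""
    below: set       # ordered pairs (c, d) with c < d
    divergent: set   # frozenset({c, d}) pairs with c ⊥ d
    convergent: set  # frozenset({c, d}) pairs with c | d


def conflicts(h1: tuple, h2: tuple, src: Relations, tgt: Relations) -> bool:
    """True when h1 = (c, c') and h2 = (d, d') violate one of the two rules."""
    (c, c2), (d, d2) = h1, h2
    crossed = (c, d) in src.below and (d2, c2) in tgt.below
    pair, pair2 = frozenset((c, d)), frozenset((c2, d2))
    mismatch = ((pair in src.divergent and pair2 in tgt.convergent) or
                (pair in src.convergent and pair2 in tgt.divergent))
    return crossed or mismatch


def filter_hypotheses(hyps: dict, src: Relations, tgt: Relations,
                      threshold: float) -> dict:
    """Drop hypotheses below the threshold, then the weaker of each conflict."""
    kept = {h: s for h, s in hyps.items() if s >= threshold}
    weaker = {h2 for h1, s1 in kept.items() for h2, s2 in kept.items()
              if s1 > s2 and conflicts(h1, h2, src, tgt)}
    return {h: s for h, s in kept.items() if h not in weaker}


def choose(hyps: dict, src: Relations, tgt: Relations,
           threshold: float = 0.8, step: float = 0.2, rounds: int = 4) -> dict:
    """Keep the best target per source concept, lowering the threshold if needed."""
    for _ in range(rounds):
        kept = filter_hypotheses(hyps, src, tgt, threshold)
        if kept:
            best = {}
            for (c, c2), score in kept.items():
                if c not in best or score > best[c][1]:
                    best[c] = (c2, score)
            return best
        threshold -= step
    return {}


# Hypothetical ontologies: Lecture < Course in O; Slides < Lectures < Courses in O'.
src = Relations(below={("Lecture", "Course")}, divergent=set(), convergent=set())
tgt = Relations(below={("Lectures", "Courses"), ("Slides", "Lectures")},
                divergent=set(), convergent=set())
hyps = {("Lecture", "Lectures"): 0.9,
        ("Course", "Slides"): 0.85,    # crosses the hierarchy with the pair above
        ("Course", "Courses"): 0.7}

print(choose(hyps, src, tgt, threshold=0.6))
```

Here the hypothesis (Course, Slides) is eliminated because Lecture lies below Course in O while Slides lies below Lectures in O′, which contradicts the stronger hypothesis (Lecture, Lectures).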

Prototype

We are currently implementing the steps of the mapping process as a multi-agent system, with one agent associated with each step of the process (e.g. similarity computation).



Agent definition: An agent is a software component that has a role to play in the functioning of the system [13]. The degree of granularity is not the same for all agents: some of them play more important roles than others. An agent should be able to interact with other agents, and possibly with humans (the expert), via an agent-communication language [9] [7]. The following properties can therefore be achieved:
• High performance: agents can run in parallel, and they can be cloned when their workload is too heavy;
• High flexibility: an agent can be programmed for any context, which means that the agent can interface directly with different ontologies;
• High modularity: the number of interconnected sources can increase without limit.

We use the JADE (Java Agent Development Framework) platform [25] to implement all the agents. JADE conforms to the FIPA (Foundation for Intelligent Physical Agents) standard. The agents communicate by exchanging messages in ACL (Agent Communication Language) [16]. We include a rule-based inference engine called Jess [23]; to deal with ontologies and to provide a programming environment for RDF, RDFS and OWL, we use the Jena framework [26] (A Semantic Web Framework for Java).

Related Work

Various approaches have been developed to support the mapping of ontologies. An interesting survey covering 35 works is presented in [10]; other surveys on ontology alignment can be found in [4, 10]. Most approaches describe heuristics for identifying corresponding concepts in different ontologies, e.g. comparing the names or the natural language definitions of two concepts, and checking the closeness of two concepts in the concept hierarchy. PROMPT [12] is an algorithm for ontology merging and alignment based on the identification of matching class names. A few approaches, like RDFT [14], compare resources to determine a similarity between concepts, but the problem is that the structures of the data instances are heterogeneous. RDFT proposes an approach to the integration of product information over the web by exploiting the data model of RDF, which is based on directed labelled graphs. RDFT discovers a similarity between classes (concepts) based on the instance information for each class, using a machine-learning approach. Like RDFT, GLUE [10] is a system that employs machine learning techniques to semi-automatically create mappings between heterogeneous ontologies. An ontology is considered here as a taxonomy of concepts, and the matching problem is reduced to: “for each concept node in one taxonomy, find the most similar node in the other taxonomy”. The problem with GLUE is that the reliability of the results depends on the quantity and the correctness of the examples used for machine learning. S-Match (Semantic Matching) [8] is an approach to matching classification hierarchies. The problem it addresses is the following: given two different classification hierarchies, each used to describe a set of documents (i.e. each term in the classification hierarchy describes a set of documents), how do the terms in one hierarchy relate to the terms in the other hierarchy? The proposed algorithm returns all possible similarities between both graphs based on synonym sets from thesauri, using a SAT solver.

The tools described above offer mappings between heterogeneous ontologies. Most of them are based on syntactic and semantic matching heuristics given by an expert. None uses deduction rules that can be applied to different application domains; deduction rules offer more flexibility to the system. The closest work to our approach is QOM [5, 6], which is designed as a trade-off between effectiveness and efficiency. One of the conclusions presented by its authors is: “Using an approach combining many features to determine mappings, clearly leads to significantly higher quality mappings”. In our mapping approach, we try to use as much as possible of the information available in the ontology. This information consists of the identifiers (names) of concepts and relations, the ontology structure, the resources (concept instances) and manual/automatic rules. The properties of resources generate new semantic relations between concepts of the same ontology.

Conclusion

Nowadays, a myriad of Web-Based Educational Systems exist, each storing its own learner models and resources. Solutions have to be defined to open these systems to each other. Learners should be able to access distant learning resources transparently (without changing their usual reference ontology). Our objective is to be able to query different LORs and thus improve the interoperability of such systems. In this paper we have introduced a mapping approach for bridging gaps between learning object repositories based on ontologies. The algorithm is applied to an existing WBES that allows learners and teachers to search, add and compose resources in a local repository. The particularity of the algorithm is that (i) it uses information on the resources to enrich the local ontology by generating relations between local concepts, and (ii) it is based on inference rules. Some of these rules are basic ones; others can be added by a domain expert. This flexibility permits its application to other domains. The prototype is based on multi-agent technology. It is implemented with the JADE platform and the Jess rule-based reasoning engine. In future work, we plan to add other matching techniques in order to resolve more complex mapping problems (e.g. cardinality n:m).

References

[1] Bouzeghoub, A., Defude, B., Ammour, S., Duitama, J.F., Lecocq, C.: A RDF Description Model for Manipulating Learning Objects. Proc. IEEE International Conference on Advanced Learning Technologies, Joensuu, Finland, August 2004.
[2] Bouzeghoub, A., Carpentier, C., Defude, B., Duitama, J.F.: A Model of Reusable Educational Components for the Generation of Adaptive Courses. Proc. of the International Workshop on Semantic Web for Web-based Learning, with CAISE'03 Conference, Klagenfurt, Austria, June 2003.
[3] Dicheva, D., Sosnovsky, S., Gavrilova, T., Brusilovsky, P.: Ontological Web Portal for Educational Ontologies. International Workshop on Applications of Semantic Web Technologies for E-Learning (SW-EL’05), July 2005.
[4] Euzenat, J., Le Bach, T., Barrasa, J., Bouquet, P., De Bo, J., Dieng-Kuntz, R., Ehrig, M., Hauswirth, M., Jarrar, M., Lara, R., Maynard, D., Napoli, A., Stamou, G., Stuckenschmidt, H., Shvaiko, P., Tessaris, S., Van Acker, S., Zaihrayeu, I.: State of the art on ontology alignment. Deliverable 2.2.3, IST Knowledge Web NoE, 21p., June 2004.
[5] Ehrig, M., Sure, Y.: Ontology Mapping - An Integrated Approach. ESWS 2004: 76-91, April 21, 2004.
[6] Ehrig, M., Staab, S.: QOM - Quick Ontology Mapping. ISWC 2004, 7-11 Nov. 2004, Hiroshima, Japan.
[7] Franklin, S., Graesser, A.: Is it an Agent, or just a Program? A Taxonomy for Autonomous Agents. Proceedings of the Third International Workshop on Agent Theories, Architectures, and Languages, Springer-Verlag, Berlin, Germany, 1996.
[8] Giunchiglia, F., Shvaiko, P.: Semantic matching. The Knowledge Engineering Review, 18(3):265–280, 2004.
[9] Jennings, N. R.: On agent-based software engineering. Artificial Intelligence, 2000.
[10] Kalfoglou, Y., Schorlemmer, M.: Ontology mapping: the state of the art. The Knowledge Engineering Review 18(1):1-31, January 2003.
[11] Levenshtein, V. I.: Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 1966.
[12] Noy, N. F., Musen, M. A.: Ontology versioning as an element of an ontology-management framework. To be published in IEEE Intelligent Systems, 2003.
[13] Nwana, H. S.: Software Agents: An Overview. Knowledge Engineering Review, 1996.
[14] Omelayenko, B.: Integrating vocabularies: Discovering and representing vocabulary maps. In Proceedings of the First International Semantic Web Conference (ISWC2002), Sardinia, Italy, 2002.
[15] Silva, N., Rocha, J.A.: Ontology mapping for interoperability in semantic web. In Proceedings of the IADIS International Conference WWW/Internet 2003 (ICWI’2003), Algarve, Portugal, 2003.
[16] Simon, B., Massart, D., Van Asche, F., Ternier, S., Duval, E., Brantner, S., Olmedilla, D., Miklós, S.: A Simple Query Interface for Interoperable Learning Repositories. WWW 2005, May 10–14, 2005, Chiba, Japan.

Web references

[17] The ACM Computing Classification System (1998), available from http://www.acm.org/class/1998/ccs98.html, accessed in Feb. 2006
[18] ARIADNE, Ariadne foundation for the European Knowledge Pool, available from http://www.ariadne-eu.org/, accessed in Feb. 2006
[19] EducaNext, available from http://www.educanext.org/, accessed in Feb. 2006
[20] IEEE. Draft Standard for Learning Object Metadata (IEEE P1484.12.1), available from http://ltsc.ieee.org, accessed in Feb. 2006
[21] Edusource, http://www.edusource.ca/, accessed in Feb. 2006
[22] Edutella, http://edutella.jxta.org/, accessed in Feb. 2006
[23] Jess, http://herzberg.ca.sandia.gov/jess/index.shtml
[24] Wordnet, http://www.wordnet.com
[25] Jade, http://jade.tilab.com/
[26] Jena, http://jena.sourceforge.net/



Biographies

Amel Bouzeghoub received her Ph.D. in Computer Science from Pierre et Marie Curie University (France). She is currently an associate professor in the Department of Computer Science at GET-INT (Institut National des Télécommunications, Evry, France). Her research interests include the semantic web, knowledge representation (ontologies and their interoperability, knowledge sharing and reuse), and personalized web-based learning systems. She is currently involved in the AIDA virtual multidisciplinary research team.

Abdeltif Elbyed received his M.Sc. degree in Computer Science from the University of Evry (France). He is currently a Ph.D. student at GET-INT (Institut National des Télécommunications, Evry, France). His research interests include distributed ontology engineering, data management and agent-based technology.



Call for Articles

New issues of the IBIS journal are released four times a year. Authors are encouraged to submit articles for inclusion in the next issue of the International Journal of Interoperability in Business Information Systems. We are interested in both theoretical and practical research, as well as in case studies and the results of recently finished research projects. Although our focus is on research results, we also accept a small number of practical articles. Contributions should be original, unpublished and of high quality. Topics of the journal include, but are not limited to:

• Integration of business information systems
• Enterprise modeling for Interoperability
• Interoperability architectures and frameworks
• Interoperability of business standards
• Intelligent mediators
• Coupling of information systems
• Interoperability of classification systems
• (Semi-)Automatic transformation of standards
• Interoperability of (meta) models
• Semantic integration concepts
• Interoperability between domain specific standards
• Semantic analysis of structured, semi-structured, and unstructured data
• Interoperability of catalog based information systems
• Cooperation in heterogeneous information systems
• Ontology matching, merging and aligning
• Semantic combination of heterogeneous standards
• Ontology- and model management
• Interoperability of sector specific systems

Authors are encouraged to submit high quality articles online. Please carefully look at the submission guidelines to prepare your submission. All submissions should be made online using our online submission system. In case of any problems, please contact us via email. For additional information and deadlines, please visit our website at

http://www.ibis-journal.net
