Semantic Web Nature

13
Nature Inspired Methods for the Semantic Web Monica Macoveiciuc and Constantin Stan Faculty of Computer Science, Alexandru Ioan Cuza University, Iasi Abstract. The Semantic Web is a vision of information that is under- standable by computers. Although there is great exploitable potential, we are still in “Generation Zero” of the Semantic Web, since there are few real-world compelling applications. The heterogeneity, the volume of data and the lack of standards are problems that could be addressed through some nature inspired methods. The paper presents the most im- portant aspects of the Semantic Web, as well as its biggest issues; it then describes some methods inspired from nature - genetic algorithms, arti- ficial neural networks, swarm intelligence, and the way these techniques can be used to deal with Semantic Web problems.

description

The Semantic Web is a vision of information that is understandable by computers. Although there is great exploitable potential, we are still in "Generation Zero'' of the Semantic Web, since there are few real-world compelling applications. The heterogeneity, the volume of data and the lack of standards are problems that could be addressed through some nature inspired methods. The paper presents the most important aspects of the Semantic Web, as well as its biggest issues; it then describes some methods inspired from nature - genetic algorithms, artificial neural networks, swarm intelligence, and the way these techniques can be used to deal with Semantic Web problems.

Transcript of Semantic Web Nature

Nature Inspired Methods for the Semantic Web

Monica Macoveiciuc and Constantin Stan

Faculty of Computer Science, Alexandru Ioan Cuza University, Iasi

Abstract. The Semantic Web is a vision of information that is under-standable by computers. Although there is great exploitable potential,we are still in “Generation Zero” of the Semantic Web, since there arefew real-world compelling applications. The heterogeneity, the volumeof data and the lack of standards are problems that could be addressedthrough some nature inspired methods. The paper presents the most im-portant aspects of the Semantic Web, as well as its biggest issues; it thendescribes some methods inspired from nature - genetic algorithms, arti-ficial neural networks, swarm intelligence, and the way these techniquescan be used to deal with Semantic Web problems.

Introduction

The World Wide Web is a universal medium for information and data exchange.Exploiting the huge amount of knowledge distributed on the Web is a significantchallenge. Humans can understand the information, but it takes great effort tofind and combine data from such a large number of sources; on the other hand,computers can easily browse through millions of pages in no time, but they arenot capable of understanding the content. The Semantic Web is a new paradigmfor the Web in which the semantics of information is defined, making it possiblefor the web to understand and satisfy the requests of people and machines to usethe web resources[1]. In other words, the Semantic Web is a vision of informationthat is understandable by computers. It contains a set of design principles anda variety of enabling technologies. Some of the elements are expressed in formalspecifications, while others are still to be rigorously described.

The ontology is a key aspect of the Semantic Web, although it does not havea universally accepted definition. It is described as “a formal specification of ashared conceptualization” [2]. There is no commonly agreed ontology that everydata provider would rely on; the information is heterogeneous and distributed.Existing reasoning techniques may not be able to deal with the different ontolo-gies describing the same piece of knowledge, with the high number of instances,with the lack of maintenance, the unreliability of the network, the variety of qual-ity of the information available on the web. Given this context, soft computinghas an important role in coping with knowledge, and the methods inspired fromnatures might be able to suggest interesting solutions for these problems.

This paper presents nature inspired techniques that can address some of themain issues of the Semantic Web. Genetic algorithms, swarm intelligence orneural networks could represent viable solutions for overcoming problems suchas ontology alignment, concept classification, RDF query path optimization etc.

Semantic Web

Advanced management information is the main benefit brought by the SemanticWeb vision. One should stop browsing documents and start performing concretequeries. New knowledge should be inffered from the existing facts. The potentialadvantages of these achievements are multiple:

– information can be located based on its meaning;

– information from different sources can be combined, summarized, presentedto the user in a improved format;

– information can be integrated across different sources.

1 Technologies

Semantic Web technologies can be considered in terms of layers, each of themresting on and extending the functionality of the layers beneath it. The hierarchyof the most important languages and technologies is described in the famous“Layer Cake” diagram [3].

Semantic Web Layer Cake

The core technologies are RDF (Resource Description Framework) and RDFS(RDF Schema). RDF is a markup language for describing information and re-sources on the web. Any object that is uniquely identifiable by an URI (orUniform Resource Identifier) is considered a resource. Resources have properties(attributes or characteristics).

The RDF model is a collection of facts, represented by statements (triples).Each triple consists of a subject, a predicate and an object. The most commonrepresentation of the triple is the graph-based one: subject-predicate-object isseen as a node-arch-node link. The statements are unambiguous and have a uni-form structure; each concept is defined in a dedicated space on the web. Forexample, the statement “Jane is Tom’s mother” can be expressed is RDF as:

<rdf:Description rdf:about=www.persons.org/#jane>

<s:isWoman>Jane</s:isWoman>

<s:hasChild rdf:resource= www.persons.org/#tom>

</rdf:Description>

In order to describe general statements about classes or groups of objects, weuse RDF Schema, or RDFS. RDFS provides a basic object model, while RDFrefers to specific objects. The statement above can be described in RDFS as “Awoman is someone’s mother”.

RDF and RDFS allow us to describe aspects of a domain, but the modelingprimitives are too restrictive to be of general use. The taxonomic structure ofthe domain, the restrictions and constraints cannot be described through thismodel. It is also not possible to reason over inference rules. All these limitationsare overcome with the use of ontologies. Ontologies provide a common under-standing of a domain of interest. The specification is formal, which means thatcomputers can perform reasoning about it. OWL (Web Ontology Language) isa family of ontology languages, and it is the W3C specification for creatingSemantic Web applications. OWL builds upon RDF and RDFS and defines hi-erarchies and relationships between resources. Semantic Web ontologies consistof a taxonomy and a set of inference rules from which machines can make logicalconclusions. A taxonomy is a system of classification that groups resources intoclasses and sub-classes based on their relationships and shared properties.

The top layers of the Layer Cake are very important in the context of Seman-tic Web applications deployment. The trust layer deals with authentication andreliability of data and services, through the use of digital signatures, ratings bycertification agencies, recommendations by trusted agents etc. The proof layerallows applications to give proof of their conclusions and it includes the actualdeductive process, validation etc. Several refinements have been proposed for theSemantic Web Layer Cake. One of them, suggested by sir Tim Berners-Lee in2006, includes new features, such as:

– Rules and Inferencing Systems (RIF). It is a language for representing rulesand for linking rule-based systems; the formalisms are being extended inorder to encapsulate probabilistic, temporal and causal knowledge.

– RDF Extraction. GRDDL “Gleaning Resource Descriptions from Dialects ofLanguages”) is a language that identifies when an XML document containsdata compatible with RDF and it is capable of extracting the data.

– Database Support for RDF. Oracle provides support for RDF and OWLdatabases; for the moment, the focus is on storage, rather than inferencingcapabilities. There are various open source projects that offer solutions forstorage - such as Jena, as well as query languages for RDF (SPARQL beingthe most important).

Revised Semantic Web Layer Cake

2 Current Problems

Although the Semantic Web vision has great potential, for more than a decadeit has been “a kind of academic exercise rather than a practical technology” [4].One of the main reasons is the lack of a common understanding of what theSemantic Web can offer and, more particularly, of the role of ontologies.

RDF and OWL can be confusing and complicated to understand for less-technicalpeople. There is a huge amount of information that needs to be annotated inorder to be processed and infered, and the two possible solutions for this areboth hard to put into practice: either an automatic process should apply an al-gorithm that takes a piece of text and produces RDF, or people should manuallyannotate existing documents. The first approach - of an intelligent algorithm -is unlikely, since having such an algorithm would make the RDF and OWL seemdeprecated. Manual annotation is inefficient and prone to error.

One of the biggest issues of the Semantic Web is that it seems to be scat-tered into small pieces. The existing initiatives and applications focus on smalldomains and the access to the Semantic Web seems limited from the perspectiveof the average user.

However, there are already a wide range of applications in existence or underdevelopment. Some typical areas seem to offer a great potential (although notfully exploited for the moment) for the development of such applications.

1. E-Science. These kind of applications involve large data collections that requirecomputationally intensive processing. The participants are usually distributed

across the world. A representative project is the Gene Ontology (GO) [5] one.GO is a major bioinformatics initiative with the aim of standardizing the repre-sentation of gene and gene product attributes across species and databases. TheHuman Genome Project, finalized in 2003, is probably the most famous e-scienceproject.

2. Travel Information Systems. There are efforts in the direction of buildingXML based specifications which would allow the interchange of information be-tween companies. The benefits would be major for the users, since they wouldbe able to easily plan the whole trip - accommodation, transportation etc. Thebig issue for the moment is the inexistence of an agreed ontology for this domain.

3. Digital Libraries. Over the past years, institutions such as universities, li-braries and museums have made their large inventories of materials availableonline. Although they have the same goal, the implementations of these sys-tems are totally different. It is difficult for an institution to access another one’scatalogues. One solution for this problem is the use of ontologies and of someontology mapping techniques, that would help achieve semantic interoperability.

4. Health Care. This domain stands to gain tremendous benefit by adoptionof Semantic Web technologies, as it depends on the interoperability of informa-tion from many domains and processes for efficient decision support.

At present, the Semantic Web is increasingly used by small and large business.Oracle (RDF management platform), IBM, Adobe (tool for adding RDF-basedmetadata to most of their file formats), Software AG, or Yahoo! are the most im-portant corporations that have already started working with these technologiesand are already selling tools, as well as complete business solutions. In August2008, Microsoft bought Powerset, a semantic search engine, for a reported $ 100million.

There are also open source applications, such as Protege [6] and Kowari [7],that provide building blocks for application development, making it more costeffective to develop Semantic Web products.

Nature Inspired Methods in the Context ofSemantic Web

The vast amount, the variety and heterogeneity of the data involved in the Se-mantic Web vision makes it sometimes difficult for applications to deal with it,turning many real world problems into NP-hard problems. Nature inspired rea-soning might be able to adress and solve some of these issues.

Natural computing finds its source of inspiration in biological phenomena andsocial behaviors from mainly insects and birds. Such algorithms are able to findacceptable results for NP-hard problems within a reasonable amount of time,rather than guarantee the optimal solution. The most important methods in-spired from nature include genetic algorithms, neural networks, particle swarmor ant colony optimization etc.

3 Genetic Algorithms

Genetic Algorithms (GAs) consist of some model techniques used by simple bi-ological system. These systems use reproduction to produce offspring that canbetter survive in their environment. Genetic algorithms use reproduction oper-ators (mutation and crossover) and strategies (’survival of the fittest’) inspiredfrom these realities, in order to improve the quality solutions to a particularproblem. The advantage of GAs compared to other algorithms and methods isthat they make only few assumptions about the underlying fitness landscapeand, therefore, they perform well in many different problem categories.

These algorithms proceed according to a simple scheme:

1. a population of random individuals is created;2. each individual is tested in order to determine its utility as solution;3. a fitness value is assigned to each individual, based on the previous evalua-

tion;4. a selection process filters out the individuals with low fitness and allows those

with good fitness to enter the mating pool with a higher probability;5. a reproduction process creates offspring by combining or varying the solution

candidates;6. if the termination criterion is met, the evolution stops; otherwise, it continues

starting with step 2.

Genetic algorithms can be a viable solutions for different problems that Seman-tic Web is confronting with, such as RDF query path and ontology alignmentoptimization, Semantic Web services composition etc.

The possibility of querying large amounts of data from different, heterogeneoussources, in an efficient way, is an unsolved problem at the moment. In this con-text, an interesting research field is the determination of query paths - the orderin which the parts of a query are evaluated. The order has a major role when itcomes to the execution time of the query, thus a good algorithm for determiningthe query path can contribute to quick, efficient querying. Genetic algorithmshave been already tested, with some success, in problems related to this field.The Iterative Improvement algorithm, followed by Simulated Annealing - alsoknown as the ’Two-Phase Optimization’ - addresses the optimal determinationof query paths.

A RDF query can be seen as a chain of subject-predicate-object triples. It canbe visualized as a tree, in which the leaf nodes represent the inputs and theinternal nodes are relational algebra operations. The nodes in such a query canbe ordered in many different ways, all of them producing the same result, butwith different execution time. In these conditions, the challenge consists in de-termining the order in which the nodes should be placed, in order to optimizethe response time.

It is not difficult to identify the solution space of the problem as the set ofall the possible RDF trees. A population can be created by randomly select-ing some of these threes (the chromosomes). A simple mutation operator wouldswitch the order of two random nodes (triples) in a chromosome. A crossoveroperator would pick some of the nodes from a chromosome, conserving the order,and put them together with the missing nodes taken from a second chromosome(also conserving the order in this second chromosome). The fitness function iscalculated based on the execution time. Long executing times are not desirablefor a GA in an RDF query execution environment, therefore the stopping con-dition should also consider (or be complemented with) a time limit.

Another interesting problem is the ontology alignment optimization. At the mo-ment, there is no general agreed standard when it comes to ontologies. Thediversity of data makes it even less likely that such a standard would be possiblein the near future - the standards do not often fit to the specific needs of allthe participants in a potential standardization process; and it is very difficultand expensive for many organizations to reach an agreement. Thus, ontologyalignment is a key aspect in order to make the knowledge exchange possible inthe context of Semantic Web.

Many attempts have been made to solve this issue using different combinationsof matchers, such as string normalization or similarity, data type comparison,linguistic methods, inheritance analysis, graph mapping, taxonomy analysis etc.A solution involving genetic algorithms would be able to cope with huge amountsof data, without requiring human intervention. There are two difficult tasks whendefining the problem from the GA point of view: the content of a tentative solu-

tion should be encoded in a string of values, and a good fitness function shouldbe provided (a similarity measure function between two ontologies). Geneticsfor Ontology ALignments (GOAL) [8] is a software tool for optimizing ontol-ogy matching functions. GOAL defines the alignment evaluation process basedon four goals: optimizing the precision, optimizing the recall, optimizing thef-measure or reducing the number of false positives. A chromosome is definedthrough a method that converts a bit representation to a set of floating-pointnumbers in the real range [0, 1]. The fitness function consists of selecting one ofthe parameters retrieved by an alignment evaluation. The parameters are:

– precision - the percentage of items returned that are relevant;– recall - the fraction of the items that are relevant to a query;– f-measure - a harmonic mean from precision and recall;– false positives - relationships which have been provided although they are

false.

The algorithm has its limitations, but it has managed to find the optimal so-lution for different instances of the ontology mapping problem, in an efficient way.

Semantic Web service composition consists in finding web services (availablein a repository) that are able to accomplish a certain task. The task is definedin a form of a composition request that contains a set of available input pa-rameters and a set of wanted output parameters. The parameters are not theexplicit values, but concepts from an ontology describing the semantics of thevalues. A sequence of services is called a composition. If the input parametersgiven in the request are provided, the services from this sequence can be subse-quently executed and will finally produce the desired output parameters. For agenetic algorithm approach, one needs to find a way of representing a web servicesequence as a chromosome. A simple solution is to use strings of service identi-fiers, which can be processed by standard genetic algorithms. Considering thatthe chromosomes can have variable length, the normal GA operators could bemodified in order to make the search more efficient. This operation either deletesthe first service from a sequence, or adds a promising service to the sequence.The other standard GA operations can be easily applied.

4 Neural Networks

An artificial neural network (ANN) is a system loosely modeled based on thehuman brain, an emulation of the biological neural system. It consists of an in-terconnected group of artificial neurons. The information is processed using aconnectionist approach to computation. Generally, an ANN is an adaptive sys-tem, changing its structure according to the information that flows through thenetwork during the learning phase. They can be used to model complex rela-tionships between inputs and outputs or to find patterns in data.

In the context of Semantic Web, artificial neural networks can be used in the

process of ontology mapping. The heterogeneity among different ontologies isone of the biggest issues in this field nowadays. Web applications are developedby different parties, that design their own ontologies, according to their ownviews of the world. Many approaches have been proposed, in order to deal withthis heterogeneity, but each of them has its drawbacks. A centralized ontologyis very unlikely, so the efforts are now focused on distributed solutions: tryingto match the individual ontologies, and possibly reuse each other as well. Mostof the existing techniques are either rule-based, or learning-based, but both cat-egories have their disadvantages.

A different approach combines rule-based and learning-based solutions, integrat-ing machine learning techniques, such that the weights of a concept’s semanticaspects can be learned from training examples, instead of being pre-defined. Inthe real world, a common problem that can occur is the lack of instance data- either in quantity or quality. This method avoids this problem, because thelearning process is carried out at the schema level, instead of the instance level.

Artificial neural networks are a good solution for the learning process, for manyreasons: instances are represented by attribute-value pairs; the target functionoutput is a real-valued one; fast evaluation of the learned target function ispreferable. ANN are also known to perform well in the presence of noisy data.If the ontologies are to be learned from uncontrolled data, such as real existingweb pages, the handling of noise becomes a real issue.

Another interesting approach to the problem of ontology mapping is the useof interactive activation and competition (IAC) neural networks (NN) to searchfor a global optimal solution to best satisfy the ontology constraints. An IAC neu-ral network consists of a number of competitive nodes connected to each other.Each of these nodes represents a hypothesis, while the connection between twonodes is a constraint between their hypotheses. The connection can be eitherpositive (activation) - if the hypotheses support each other - or negative (com-petition). Each connection has a weight, which is proportional to the strengthof the constraint. The activation of a node is determined by more sources:

– the initial activation;

– the input from its adjacent nodes;

– its bias;

– the external input.

The characteristics of ontology mapping and the mechanisms of the IAC networkhave common properties. The constraints in ontology mapping can be interactiveor competitive between mapping hypotheses. Before applying a neural networkbased algorithm for learning, a preliminary mapping is made, which estimatesboth the linguistic and structure information of ontologies. These prior knowl-edge can be sees as external inputs or bias of a node in the IAC network.

5 Swarm Intelligence

Swarm intelligence is another approach to problem solving that takes inspirationfrom social behaviors of insects and of other animals. Particularly, ant colonyoptimization is one of the most successful techniques. Ant colony optimization(ACO) is inspired by the ants that deposit pheromone on the ground in orderto mark some favorable path that should be followed by other members of thecolony. A similar mechanism has be transposed in an algorithm for solving op-timization problems.

Semantic Web reasoning systems deal with growing amounts of distributed, dy-namic resources. Swarm intelligence could be used in order to implement a RDFgraph traversal algorithm. Among the main properties of swarms are adaptive-ness, robustness and scalability. These correspond to three concepts - no centralcontrol, locality and simplicity. Thus, the combination of reasoning and swarmintelligence can be a viable solution for obtaining reasoning performance by ba-sic means.

A model of a decentralized system implies the traversal of a graph in orderto calculate the deductive closure of the graph, with respect to the RDFS se-mantics. The role of swarm intelligence is to reduce the computational cost. Inorder to calculate the RDFS closure over an RDF graph, a set of rules needto be applied repeatedly to the triples in the graph. In the metaphor of ants,each insect represent one of these rules, which might be (partially) instantiated.Ants communicate with each other only locally and indirectly. Whenever thecondition of a rule matches the node an ant is on, it locally adds the newlyderived triple to the graph. Only the active reasoning rules are moving in thenetwork and not the data, minimizing network traffic, since schema-data is farless numerous than instance-data. Having some transition capabilities betweengraph-boundaries, the method converges towards the closure. This model hasbeen successfully implemented and the results are described in [9].

Conclusions

The Semantic Web vision comes with the promise of a world in which a commonunderstanding of the meaning of data can help humans and computers coop-erate. However, it takes great effort to put in practice the revolutionary ideas,since it is very difficult to agree upon standards and, afterwards, to update theexisting resources according to the potential standards. For the moment, theSemantic Web seems to be scattered into small pieces, being available only ona small scale and for very specific domains. On the other hand, there is a hugeamount of knowledge that can be exploited through automated processing andadapted in order to be used. In this context, methods inspired from nature seemto have the potential to address the currently unresolved problems of the Se-mantic Web. These methods can deal with numerous data and can be used tobuild high scalable applications. Since there is no perfection, therefore, no op-timal solution for these problems, concepts such as genetic algorithms, artificialneural networks or swarm intelligence might be able to provide good results.

This paper presented some ideas of applying nature inspired methods in or-der to deal with Semantic Web’s challenges. The main aspects of Semantic Webhave been described, as well as the evolution during the past decade. The ar-eas of interest and some (potential) applications have been presented and themost important problems have been introduced and explained. Finally, the pa-per presented the way methods inspired from nature can address the problemsof Semantic Web. Three of the most important techniques - genetic algorithms,swarm intelligence, artificial neural networks - have been briefly described, alongwith the efforts of applying them for solving problems such as ontology mapping,RDF path optimization, RDF graph traversal, ontology alignment optimization.

References

[1] Tim Berners-Lee, James Hendler and Ora Lassila. The Semantic Web. ScientificAmerican, May 2001.

[2] Tom Gruber. What is an Ontology? http://www-ksl.stanford.edu/kst/what-is-an-ontology.html

[3] Semantic Web Layer Cake. http://c2.com/cgi/wiki?SemanticWebLayerCake[4] Alex Iskold. Semantic Web: Difficulties with the Classic Approach.

http://www.readwriteweb.com[5] Gene Ontology Project. http://www.geneontology.org/[6] Protege. http://protege.stanford.edu/[7] Kowari. http://kowari.sourceforge.net/[8] Jorge Martinez-Gil, Enrique Alba, Jose F. Aldana Montes. Optimizing Ontology

Alignments by Using Genetic Algorithms. Proceedings of Nature inspired Reasoningfor the Semantic Web (NatuReS), 2008.

[9] Kathrin Dentler, Stefan Schlobach, Christophe Guret. Semantic Web Reasoning bySwarm Intelligence. Vrije Universiteit Amsterdam

[10] Human Genome Project.http://www.ornl.gov/sci/techresources/Human Genome/home.shtml

[11] Riccardo Leardi. Nature Inspired Methods in Chemometrics: Genetic Algorithmsand Artificial Neural Networks, Volume 23 (Data Handling in Science and Technol-ogy). Elsevier BV, 2003.

[12] Alexander Hogenboom, Viorel Milea, Flavius Frasincar, Uzay Kaymak. GeneticAlgorithms for RDF Query Path Optimization. Proceedings of Nature inspired Rea-soning for the Semantic Web (NatuReS), 2008.

[13] Thomas Weise, Steffen Bleul, Diana Comes, and Kurt Geihs. Different Approachesto Semantic Web Service Composition. WowKiVS, 2009.

[14] http://en.wikipedia.org/wiki/Artificial neural network[15] John Cardiff. The Evolution of the Semantic Web. Social Media Research Group,

Institute of Technology Tallaght, Dublin, Ireland.