[IEEE 2013 7th International Conference on Application of Information and Communication Technologies...

The Method of Concept Formation for Semantic Search

Manana Khachidze

I.Javakhishvili Tbilisi State University, Faculty of Exact and

Natural Sciences, TSU B 11, Universities str.3, Tbilisi, 0143,

Georgia Email: [email protected]

Maia Archuadze I.Javakhishvili Tbilisi State

University, Faculty of Exact and Natural Sciences, TSU B 11,

Universities str.3, Tbilisi, 0143, Georgia

Email: [email protected]

Gela Besiashvili I.Javakhishvili Tbilisi State

University, Faculty of Exact and Natural Sciences, TSU B 11,

Universities str.3, Tbilisi, 0143, Georgia

Email: gela [email protected]

Abstract— The article considers a method of concept formation that can be used in semantic search of information. The method was developed by synthesizing two existing methods: Explicit Semantic Analysis (ESA) and analytical heuristics. The article examines a new and more convenient way of describing a concept (its attributes) and of calculating concepts. It describes concepts more compactly and with greater semantic precision. The described method can be used in conceptual search together with other methods.

Index Terms— Concept, Semantic search.

I. INTRODUCTION

At present, new search systems constantly emerge in the consumer market of information technologies; nevertheless, the creation of new algorithms for information search remains very important. In this respect, retrieving information relevant to the user's request from ever-increasing information flows gains significance. There are two main tasks here: a) the answer to the request should be as exact as possible, and b) it should be delivered as fast as possible.

To obtain a semantically exact answer, the user must formulate the request so that it describes its content well. We studied a number of methods that partly perform this task; however, most of them are based on statistical processing of knowledge presented in large volumes of text. The Explicit Semantic Analysis (ESA) method differs greatly from existing methods based on statistical analysis: it represents natural language texts of arbitrary size in a semantically meaningful manner [1]. The method is mainly based on concept formation through the analysis of Wikipedia articles.

Considering concepts and the methods of their formation is interesting in the context of ontology-based knowledge representation [2]. A formal view of lexical ontology leads to a semantic network whose nodes are concepts of natural languages, while the links represent different dependencies existing in languages [3].

An ontology includes an abstract part, which represents a high-level generalization of knowledge, and an applied part, which performs the concrete pragmatic tasks of the ontology. Since knowledge is a complete system about the universe that should be described by ontological methods and tools, a three-level hierarchy was developed within the ontology.

The three-level hierarchical ontology is a semantic network whose nodes are the lexical signs of natural language categories and concepts, connected on semantic grounds. The lexical units are called concepts, and they can be interpreted by a natural language vocabulary or by an expert system. In an ontology, a concept is represented by a language notion whose knowledge content participates in forming a person's understanding of the universe (internal and external). It is therefore of utmost importance to determine the concept accurately.

Concepts and connections are created by an expert system [4]. In the three-level hierarchical ontology model, connections may be formal (part-whole, element-set) or informal (by signs). According to their logical interactions they may be binary or multiple, and within the ontology they are united in one universal predicate connecting A and B, where A and B represent lexical concepts that may be united in relevant clusters.

Studies carried out in the 1960s that focused on modeling concept formation are still very important, and new or modified models could open new possibilities in the application of concepts [5, 6]. No less interesting is the analytical heuristic method for concept formation, image recognition and object classification [7, 8]. This method was successfully applied to knowledge base formation for different expert systems [9, 10, 11].

II. CONCEPT BASED METHOD IN SEARCH

The concept based method of information search introduces an alternative approach based on the real perception of the universe by humans. In concept based search, the representation of an object and a request by concepts differs from their representation by key words or terms (the bag-of-words, BOW, method) [12]. Search performed in concept space is less dependent on specific terms or key words, which is the advantage of the method. Homonyms are also a well-known problem in search. Search by concepts increases the relevance of the results, making them more precise and complete. One of the best examples of concept based representation of knowledge in mathematics is described in Lenat's work [13].

In the work by Vygotsky [14] it is stated that human perception of the universe, as well as the perception of objects and the accumulation of knowledge, is implemented by the concept based method. The concept is the main unit for forming the worldview and perception of the universe by humans, and also for accumulating and representing knowledge [15].

At present several methods of concept based information search are known, such as ESA (Explicit Semantic Analysis) [16], WordNet [17], LSA (Latent Semantic Analysis) [18], and WikiRelate [19].

Concepts, as well as the instruments of their representation, are also part of algebraic theories. Formal Concept Analysis (FCA) applies the theory of algebraic lattices, which provides its mathematical formalism; concepts and their hierarchy are described in algebraic language. The main idea of the FCA method was presented by Rudolf Wille [20]. In 1999 Ganter and Wille published the most complete monograph on the subject [21]. In the FCA framework, data are represented as object-attribute tables, and formal concepts are determined by Galois connections. Each formal concept is a pair (extent, intent), where the extent is a set of objects and the intent is the set of attributes they share; objects and attributes are arranged along the rows and columns of the table. During the last 30 years this theory has covered a lot of ground, from theoretical research to a great number of practical applications, and it is considered a vigorously developing field of applied mathematics. It has a wide range of applications: data analysis (machine learning and data processing), knowledge representation (ontology, taxonomy), information search, analysis of unstructured data (texts), software engineering, sociology, etc.
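The (extent, intent) pairs mentioned above can be illustrated with a minimal Python sketch. The toy context, the function names and the brute-force enumeration over all object subsets are assumptions made for illustration only; they are not taken from [20, 21], and practical FCA tools use far more efficient algorithms.

```python
from itertools import combinations

# Toy formal context (hypothetical data): object -> set of attributes.
CONTEXT = {
    "sparrow": {"flies", "has_feathers"},
    "penguin": {"has_feathers", "swims"},
    "trout": {"swims"},
}


def common_attributes(objects, context):
    """Intent: the attributes shared by every object in `objects`."""
    result = set().union(*context.values())
    for obj in objects:
        result &= context[obj]
    return result


def common_objects(attributes, context):
    """Extent: the objects that have every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset.

    Exponential in the number of objects, so suitable for toy data only.
    """
    concepts = set()
    objects = list(context)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset, context)
            extent = common_objects(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts
```

Calling `formal_concepts(CONTEXT)` returns the closed pairs of the toy context, for example the pair whose extent is {penguin, trout} and whose intent is {swims}.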

In this paper we present our methodology of concept formation, created by merging the Explicit Semantic Analysis and analytical heuristics methods. It can be easily integrated with other methods and used for applied information search. Natural language texts can then be processed by computer through the elaboration and formation of concepts that serve semantic search tasks.

III. SYNTHESIZING METHODS OF ANALYTICAL HEURISTICS

In order to apply the analytical heuristic method [8], it is first necessary to clearly define the main elements and concepts used in this method.

We take the set of signs/properties and connect it to the relevant ensemble of words of some language. Each such set can be represented as a union of eight to ten sub-sets Ai of words, one for each part of speech [lexical class (noun, verb, adjective, adverb, pronoun, preposition, conjunction, and interjection)]. The number of sub-sets depends on the specific properties of the language (for English the number is eight, while for Georgian it is ten). It is necessary to underline that each Ai is a set of Ni elements, where Ni is the number of words in the part of speech Ai.

We represent one of the texts describing a concrete concept C as a set of words wi, i=1,…,M, where the wi are all the different words in the examined text and M is the number of different words in that text. Other texts describing the concept C certainly exist, but every such set is a sub-set of the initial set. We can extend this set to an AL-set, and in this case we can fully use the method of analytical heuristics.

The ESA method uses Wikipedia as its source of data [1]. Each concept cj, j=1,…,Nwik (where Nwik is the number of concepts in the Wikipedia storage), is determined by one text. (We use an upper index to indicate the specific storage being used, for instance the Wikipedia storage.) Each text of this kind is represented by a vector of word weights, in particular by the so-called TF-IDF scheme [22]. The semantic interpreter iterates over the words of a text, retrieves the corresponding entries of the inverted index, and merges them into a concept vector that represents the text.
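The iteration over the words of a text and the merging of inverted-index entries into a concept vector can be sketched in Python as follows. The index contents, the concept names and the function name `concept_vector` are hypothetical and serve only as an illustration; this is a simplified reading of the procedure in [1, 16], not its reference implementation.

```python
from collections import defaultdict

# Hypothetical inverted index: word -> {concept: weight of the word
# in the text that defines the concept}.
INVERTED_INDEX = {
    "lattice": {"Formal concept analysis": 0.8, "Crystal": 0.5},
    "concept": {"Formal concept analysis": 0.6, "Ontology": 0.4},
    "atom": {"Crystal": 0.7},
}


def concept_vector(text_weights, inverted_index):
    """Merge inverted-index entries of the text's words into one vector.

    `text_weights` maps each word of the input text to its weight in that
    text (for example a TF-IDF weight). Words missing from the index are
    ignored.
    """
    vector = defaultdict(float)
    for word, weight in text_weights.items():
        for concept, entry in inverted_index.get(word, {}).items():
            vector[concept] += weight * entry
    return dict(vector)
```

For a text with weights `{"lattice": 1.0, "concept": 0.5}`, the concept with the largest component in the resulting vector is "Formal concept analysis".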

Following the procedure described in [16], let us regard the text describing the cj concept as an ensemble of words wi, i=1,…,Mwik, where Mwik is the number of words in the Wikipedia storage. This set corresponds to a TF-IDF vector in which each word wi has its weight.

The weight of a word (term) depends on its frequency. The same term can have different weights in different documents. In the vector space model the weights of the terms in a document are considered its coordinates in the term space. By combining the document space and the term space we get the document-term matrix.

To determine the weights of a term, an automatic weight generation scheme (term frequency / inverse document frequency) is used. It is written as “tf * idf”.
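As an illustration of the “tf * idf” scheme, the following Python sketch computes one weight dictionary per document. It assumes one common variant (relative term frequency multiplied by the natural logarithm of N/df) and whitespace tokenization; other variants are described in [12, 22], and the function name `tf_idf` is chosen here for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {
                term: (count / len(tokens)) * math.log(n_docs / df[term])
                for term, count in tf.items()
            }
        )
    return weights
```

With this variant, a term that occurs in every document receives weight zero, which shows why the same term can have different weights in different document collections.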

In a similar manner a vector of inverted-index entries is built, where each entry is the inverted index of word wi for the cj concept in the corresponding text of the Wikipedia storage. For the text that corresponds to the cj concept in the storage we thus obtain a vector of weights, one weight per word wi, i=1,…,Mwik.

Let us describe the cj concept obtained in a form acceptable for the Analytical Heuristics Method [8]. To simplify the notation, assume that each cj is described by L words. The number of words participating in the description depends on the point of view of the researcher (the developer of the concept), and different approaches can be used to determine it; the approach will depend on the text volume, the number of words with different meanings, etc. The presence of each word in the description is determined by the weight vector of the text corresponding to the concept. Whether a word enters the representation of the cj concept is determined not only by its weight but also by its part of speech, in other words, by which of the sub-sets Ai it belongs to.
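The selection of description words by weight and by part of speech can be sketched in Python as follows. The function name, the `limit` parameter (standing for the number of description words) and the part-of-speech labels are assumptions made for illustration; the part-of-speech tags are taken as given and no particular tagger is implied.

```python
def select_description_words(weights, pos_tags, limit, allowed_pos=("noun",)):
    """Pick up to `limit` highest-weight words with an allowed part of speech.

    `weights` maps word -> weight, `pos_tags` maps word -> part of speech.
    Ties in weight are broken alphabetically to keep the result stable.
    """
    candidates = [
        (word, weight)
        for word, weight in weights.items()
        if pos_tags.get(word) in allowed_pos
    ]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:limit]]
```

For instance, with `allowed_pos=("noun",)` a high-weight verb is skipped even if its weight exceeds that of every noun in the text.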

The majority of researchers give preference to nouns and to the pairs noun-pronoun and noun-verb. To define pairs of words, the weight vector of the words is not enough. We will consider this process in future papers.

Let us enlarge the concept elaboration space and use other storages similar to Wikipedia: www.AllRefer.com, www.bartleby.com, www.britannica.com, www.infoplease.com, www.Encyclopedia.com, techweb.com/encyclopedia, libraryspot.com/encyclopedias.htm, education.yahoo.com. In all these storages there are texts corresponding to the cj concepts, so the ESA method with subsequent description by L words can be used for these storages as well (here L is the number of words in the particular storage). If we perform this procedure for all existing storages, we obtain several, possibly different, descriptions of one cj concept. For every text corresponding to the cj concept we have a vector of weights.

It is obvious that words are repeated in the different “vectors” describing each cj concept. Let us join these words and obtain one general plurality W of words in the storages, that is, the union of the word sets of all the storages.

Table 1. Description of one cj concept in different storages (columns: Storages; cj concept description).

Let N be the maximal number of all different words in all the storages. In this case all the vectors describing the cj concept can be represented by vectors of one length N, with elements indexed by i=1,…,N.

Accordingly, Table 1 acquires the unified form:

Table 2. Unified description of cj concept in different storages (one row per storage, one column per word of W).
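The joining of per-storage descriptions into one vocabulary and the resulting unified table can be sketched in Python as follows. The sketch assumes that an element of the unified vector is 1 when the word is present in the storage's description and 0 otherwise; this binary encoding, the function name and the sample data are illustrative assumptions, since the exact element definition is not reproduced here.

```python
def unify_descriptions(descriptions):
    """Map per-storage word lists to 0/1 vectors over the joint vocabulary.

    `descriptions` maps storage name -> list of words describing the concept.
    Returns the sorted joint vocabulary and one vector per storage.
    """
    vocabulary = sorted(set().union(*descriptions.values()))
    vectors = {}
    for storage, words in descriptions.items():
        present = set(words)
        vectors[storage] = [1 if word in present else 0 for word in vocabulary]
    return vocabulary, vectors
```

Each returned vector has the same length as the joint vocabulary, so the rows for different storages can be compared position by position.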

Since the description of each cj concept is finite, so is the plurality W. This space can be represented as an AL-set without any limitations, and all operations that are defined on such pluralities can be applied in it. Proceeding from the above, each k-th realization of the cj concept (its description in the k-th storage) can be represented as a common implicant. The full realization of cj is then written in disjunctive normal form, as the disjunction of the implicants obtained from its different descriptions. By minimizing the disjunctive normal form we get the generalized representation of the cj concept.

In reality many more words take part in concept representation, but using the vector prepared by the Explicit Semantic Analysis method we can select the words with the highest weights. The cut-off value should depend on the number of words in the text, so that the ratio of selected words to text length is taken into account. This approach increases the semantic significance of the final concept description. The semantic adequacy of the results depends on the number of words; on the other hand, a large number of words complicates the use of the concept for information search.
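The reduction of a disjunctive normal form can be illustrated with a short Python sketch. It assumes that each description is encoded as a tuple over the joint vocabulary (1 for a word that is present, 0 for a word that is absent, None for a word that has been eliminated) and it performs only the merging phase of the Quine-McCluskey procedure, which yields all prime implicants rather than a guaranteed minimal cover. The encoding and the function names are illustrative and are not taken from [8].

```python
from itertools import combinations


def merge(a, b):
    """Merge two implicants that differ in exactly one fixed position.

    Returns the merged implicant with that position set to None, or None
    when the two implicants cannot be merged.
    """
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and a[diff[0]] is not None and b[diff[0]] is not None:
        i = diff[0]
        return a[:i] + (None,) + a[i + 1:]
    return None


def prime_implicants(terms):
    """Repeatedly merge implicants until no further merge is possible."""
    current = set(terms)
    primes = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(current, 2):
            m = merge(a, b)
            if m is not None:
                merged.add(m)
                used.update((a, b))
        primes |= current - used  # implicants that could not be merged
        current = merged
    return primes


def to_formula(implicant, words):
    """Render one implicant as a conjunction of (possibly negated) words."""
    literals = [
        word if value == 1 else f"NOT {word}"
        for word, value in zip(words, implicant)
        if value is not None
    ]
    return " AND ".join(literals)
```

For the three descriptions `(1, 1, 0)`, `(1, 1, 1)` and `(0, 1, 1)` over the words lattice, order and set, the sketch yields the two implicants "lattice AND order" and "order AND set", whose disjunction is equivalent to the original form.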

IV. PRESENTING SYNONYMS IN CONCEPTS

During semantic analysis one of the most important tasks is the determination of synonyms. If synonyms are not taken into account, concept formation may be strongly affected. As texts from different sources are used for forming concepts, there is a high probability that different textual descriptions of the same concept contain synonyms. The same feature should be taken into account for the databases of texts in which the search is performed. One possible way of solving this problem is the use of so-called quantum coefficients.

Let us consider the general plurality W of the words in storages as a space of quantum dimension. In such a space each element can be represented as a “Schrödinger's cat” [23]. If a word has synonyms, it has as many positions of a “live cat” as there are synonyms of the word.

If we take this fact into account, each word in the description will be a “Schrödinger's cat”. Thus, in describing the cj concept, if a word is present we will consider its superposition. This approach is still hypothetical; work is under way on bringing it to a full algorithm.
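Since the quantum formulation is not yet an algorithm, the following Python sketch shows only a classical stand-in for the idea: a word is treated as the set of its synonym variants and counts as present if any variant occurs in the text. The synonym table and the function name are hypothetical, and the sketch does not implement quantum coefficients.

```python
def word_present(word, text_words, synonyms):
    """Return True if the word or any of its synonyms occurs in the text.

    `synonyms` maps a word to an iterable of its synonyms (hypothetical
    data); `text_words` is the collection of words of the text.
    """
    variants = {word} | set(synonyms.get(word, ()))
    return not variants.isdisjoint(text_words)
```

With such a test, two descriptions of the same concept that use different synonyms produce the same value for the corresponding word position.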

V. TESTING OF THE METHOD

To evaluate the effectiveness of the presented method, a test experiment was carried out. The testing process included two steps: 1. formation of concepts; 2. search with the formed concepts. As a concept is an implicant (disjunction, conjunction), it matches the Boolean search algorithm [22], and the method was evaluated with exactly this algorithm. Texts on various topics were selected from different information storages for 5 concepts, 70 texts in total. For each concept the 10 highest-weight words were selected, and based on these words implicants were prepared for each describing text. For each concept 10-16 different description texts were processed. Following the above method, the concepts were formed, and disjunctive normal form descriptions of the 5 concepts were obtained.

The concepts were then used for search over 300 different texts from the storages. These were not the texts previously used for concept formation. For each concept 42-65 texts were selected. Based on the concepts obtained, the search procedure was performed for each concept separately. The search correctness ranged from 0.81 to 0.92.
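The Boolean search step and its evaluation can be sketched in Python as follows. The sketch assumes that a concept is given as a list of implicants, each mapping a word to True (the word must occur) or False (the word must not occur), and that the reported figure is the fraction of texts classified correctly. Both the encoding and this reading of the evaluation measure are assumptions, and the sample data are hypothetical.

```python
def matches(implicants, text_words):
    """True if the text satisfies at least one implicant of the concept."""
    return any(
        all((word in text_words) == required for word, required in imp.items())
        for imp in implicants
    )


def search_correctness(implicants, labelled_texts):
    """Fraction of texts whose match result equals their relevance label.

    `labelled_texts` is a list of (text, is_relevant) pairs; texts are
    split on whitespace after lowercasing.
    """
    correct = sum(
        matches(implicants, set(text.lower().split())) == relevant
        for text, relevant in labelled_texts
    )
    return correct / len(labelled_texts)
```

A concept such as `[{"lattice": True, "order": True}, {"galois": True}]` matches a text that contains both "lattice" and "order", or a text that contains "galois".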

VI. CONCLUSION

The testing and evaluation of the method have shown that concepts elaborated by the suggested method generally describe their semantic nature and properties. Based on a concept's descriptive non-structured metadata (texts), the method allows the formation of a structure with a generalized semantic nature. We believe this structure is one of the main components of an information search system.

Currently we are increasing the number of base texts for concept formation and selecting the optimal quantity of descriptive words. Based on this, the search algorithm will be optimized.

REFERENCES

[1] Gabrilovich E. and Markovitch S. “Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis”, in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07), Hyderabad, India, January 6-12, 2007, Morgan Kaufmann Publishers, pp. 1606–1611.

[2] Archuadze M., Besiashvili G., Khachidze M. and Kervalishvili P. “Knowledge Engineering: Quantum Approach”, in Philosophy and Synergy of Information: Sustainability and Security, publication supported by the NATO Science for Peace and Security Programme, Sub-Series E: Human and Societal Dynamics, 2012, vol. 93, ISSN 1874-6268, pp. 175-185.

[3] Jun Zhai, Yiduo Liang, Yi Yu and Jiatao Jiang, “Semantic Information Retrieval Based on Fuzzy Ontology for Electronic Commerce”, School of Management, Dalian Maritime University, Dalian 116026, P. R. China, Journal of Software, 2008, Vol. 3, No. 9, pp. 20-29.

[4] Alonso J.M., Magdalena L. “A Conceptual Framework for Understanding a Fuzzy System”, Proceedings of the Joint 2009 International Fuzzy Systems Association World Congress and 2009 European Society of Fuzzy Logic and Technology Conference, Lisbon, Portugal, July 20-24, pp. 119–124.

[5] Hunt E.B. Concept Learning: An Information Processing Problem. New York: Wiley, 1962.

[6] Hunt E.B., Sternberg R. Metaphors of Mind: Conceptions of the Nature of Intelligence. Cambridge: Cambridge University Press, 1990.

[7] Chavchanidze V. “Towards the General Theory of Conceptual Systems: (A New Point of View)”, Kybernetes, 1974, Vol. 3, Iss. 1, pp. 17-25.

[8] Chavchanidze V. “Heuristic Analysis of Artificial Intelligence in the Formation of Concepts, Pattern Recognition and Classification of Objects”, p. 20, Institute of Cybernetics, Georgian Academy of Sciences, 1970, dep. 2080-70, Tbilisi.

[9] Kvinikhidze K.S., Chavchanidze V.V. “Application of Conceptual Approach to Describe the Evolution of Protein Structure”, Report of the 8th International Congress on Cybernetics, Namur, Belgium, 1976.

[10] Khachidze M. “Artinformatic Knowledge and Some Ways of Its Presentation”, Bulletin of Georgian Academy of Sciences, 1998, vol. 165, no. 6, pp. 60-65.

[11] Mikeladze M., Khachidze M. “Modified Conceptual-Probabilistic Method of Formation Rules for Medical Diagnostic Expert Systems”, Proceedings of the XIV International Symposium “Large System Control”, Tbilisi, 2000, pp. 162-163.

[12] Salton G., Buckley C. “Term-weighting approaches in automatic text retrieval”, Information Processing and Management, 1988, vol. 24 (5), pp. 513–523.

[13] Davis R. and Lenat D. Knowledge-Based Systems in Artificial Intelligence. McGraw-Hill Advanced Computer Science Series, 1982.

[14] Vygotsky L.S. Thought and Language, MIT Press, Massachusetts, 1986.

[15] Bolton N. Concept Formation, Pergamon Press, Durhan, 1977.

[16] Egozi O., Markovitch S. and Gabrilovich E. “Concept-Based Information Retrieval using Explicit Semantic Analysis”, ACM Transactions on Information Systems, Vol. 29, No. 2, Article 8, April 2011.

[17] Budanitsky A. and Hirst G. “Evaluating WordNet-Based Measures of Lexical Semantic Relatedness”, Computational Linguistics, 2006, Vol. 32, Issue 1, pp. 13-47.

[18] Deerwester S., Dumais S., Furnas G., Landauer T. and Harshman R. “Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science, 1990, Vol. 41, No. 6, pp. 391–407.

[19] Strube M., Ponzetto S. P. “WikiRelate! Computing semantic relatedness using Wikipedia”, Proceedings of the 21st National Conference on Artificial Intelligence, 2006, Vol. 2, AAAI Press, Boston, MA, pp. 1419-1424.

[20] Wille R. “Restructuring Lattice Theory: an Approach Based on Hierarchies of Concepts”, in Ordered Sets, Ed. by I. Rival, Dordrecht; Boston: Reidel, 1982, pp. 445–470.

[21] Ganter B. and Wille R. Formal Concept Analysis: Mathematical Foundations, Springer, 1999.

[22] Manning C. D., Raghavan P., Schütze H. “Scoring, term weighting and the vector space model”, in Introduction to Information Retrieval, Cambridge University Press, 2008, pp. 109-133.

[23] Schrödinger E. (November 1935). “Die gegenwärtige Situation in der Quantenmechanik (The present situation in quantum mechanics)”. Naturwissenschaften.
