Improving Rank Algorithm of Search Engine with
Ontology and Categorization
By
Qiaowei Dai
A thesis submitted for the degree of
Master of Science (Computer and Information Science)
School of Computer and Information Science
Division of Information Technology, Engineering and the Environment
Supervisor: Dr. Jiuyong Li
1st June 2009
University of South Australia
Contents
INTRODUCTION...................................................................................................................................10
1.1 BACKGROUND................................................................................................................................10
1.2 MOTIVATION..................................................................................................................................13
1.3 RESEARCH AIM..............................................................................................................................13
1.4 SCOPE............................................................................................................................................14
1.5 THESIS ORGANIZATION..................................................................................................................14
TRADITIONAL LINK STRUCTURE-BASED RANK ALGORITHMS........................................15
2.1 OBJECTIVE AND USAGE OF HYPERLINK EXISTING IN WEB...........................................................15
2.2 PAGERANK ALGORITHM................................................................................................................16
2.2.1 Simplified PageRank Algorithm............................................................................................18
2.2.2 Improved PageRank Algorithm.............................................................................................19
2.3 HITS ALGORITHM.........................................................................................................................21
2.3.1 Analysis of HITS Algorithm...................................................................................................22
2.3.2 Analysis of HITS Link............................................................................................................23
2.4 SUMMARY......................................................................................................................25
LITERATURE REVIEW......................................................................................................................26
3.1 RESEARCH ON TRADITIONAL RANK ALGORITHMS OF SEARCH ENGINES.....................................26
3.1.1 Problems and Improvements of PageRank Algorithm..........................................................26
3.1.2 Problems and Improvements of HITS Algorithm..................................................................30
3.2 TRADITIONAL DOMAIN ONTOLOGY-BASED CONCEPT SEMANTIC SIMILARITY COMPUTATION.....33
3.2.1 Ontology................................................................................................................................33
3.2.2 Three Main Semantic Similarity Computation Models.........................................................34
3.3 SUMMARY......................................................................................................................................34
METHODOLOGY................................................................................................................................36
4.1 RESEARCH QUESTIONS..................................................................................................................36
4.2 RESEARCH STRATEGY....................................................................................................................36
4.3 EVALUATION TOOLS......................................................................................................................39
4.3.1 The Ontology Tool.................................................................................................................39
IMPROVE CONCEPT SEMANTIC SIMILARITY COMPUTATION MODEL..........................40
5.1 DISCUSSION ON TRADITIONAL COMPUTATION MODELS...............................................................40
5.1.1 Distance-based Semantic Similarity Computation Model....................................................40
5.1.2 Content-based Semantic Similarity Computation Model......................................................41
5.1.3 Attribute-based Semantic Similarity Computation Model....................................................41
5.2 DECISION FACTORS OF SEMANTIC SIMILARITY COMPUTATION.........................................................42
5.2.1 Directed Edge Category........................................................................................................43
5.2.2 Directed Edge Depth.............................................................................................................44
5.2.3 Directed Edge Density...........................................................................................................45
5.2.4 Directed Edge Strength.........................................................................................................45
5.2.5 Concept Node Attributes of the Two Sides of a Directed Edge.............................................46
5.3 ESTABLISHMENT OF IMPROVED COMPUTATION MODEL................................................................46
5.4 EVALUATION OF IMPROVED COMPUTATION MODEL......................................................................48
5.5 SUMMARY......................................................................................................................................50
IMPROVE RANK ALGORITHM BASED ON CATEGORIZATION TECHNOLOGY.............51
6.1 COMBINATION OF CATEGORIZATION TECHNOLOGY AND LINK STRUCTURE BASED ALGORITHM.51
6.2 BASIC IDEA OF CATEGORIZATION..................................................................................................54
6.2.1 Implementation of Categorization.........................................................................................54
6.2.2 Pre-categorization Processes................................................................................................55
6.2.2.1 Pre-categorization of Web Pages............................................................................................55
6.2.2.2 Pre-categorization of Keywords...............................................................................................55
6.3 MODELING.....................................................................................................................................58
6.3.1 Category Selective Mechanism.............................................................................................58
6.3.2 Integrating HITS Algorithm with Categorization.................................................................59
6.4 EVALUATION OF INTEGRATED ALGORITHM...................................................................................61
6.5 SUMMARY......................................................................................................................62
CONCLUSION......................................................................................................................................63
7.1 SUMMARY OF CONTRIBUTIONS......................................................................................................63
7.2 FUTURE WORK...............................................................................................................................64
REFERENCES.......................................................................................................................................65
SEMANTIC RELATIONS....................................................................................................................68
“Linear Structure” Ontology....................................................................................................................71
List of Figures
Figure 1 Directed link graph ...........................................................................17
Figure 2 Simplified PageRank Algorithm...........................................................19
Figure 3 Improved PageRank Algorithm............................................................21
Figure 4 HITS algorithm on six-nodes graph......................................................24
Figure 5 Function image of ............................................................................33
Figure 6 Research strategy framework................................................................38
Figure 7 Main interface of Protégé 3.4................................................................39
Figure 8 Ontology of linear structure..................................................................49
Figure 9 Screenshot of Linear Structure..............................................................49
Figure 10 Pre-categorization framework.............................................................56
Figure 11 Relation between authority value and hub value................................58
List of Tables
Table 1 PageRank of each node in Figure 1........................................................21
Table 2 Semantic relations...................................................................................43
Table 3 Experimental result.................................................................................49
Abstract
The appearance and rapid development of the Internet has greatly changed the
environment of information retrieval. The rank algorithms used by Internet search
engines directly determine the experience users have when performing information
retrieval in this new environment.
The existing rank algorithms for search engines are mainly based on the link structure
of Web pages, and the two main representative algorithms are the PageRank algorithm
and the HITS algorithm. Many scholars and research institutions have made new
explorations and improvements based on these two algorithms, and some mature
integrated rank models suitable for search engines have been produced.
In this thesis, we study the shortcomings of search engines and provide further
analysis of the PageRank and HITS algorithms. Besides, we discuss the existing
improved algorithms based on link structure and analyse the improvement ideas
behind existing search engine rank technology. Moreover, research on traditional
concept semantic similarity computation models based on domain ontology is
presented as well.
According to the characteristics and shortcomings of the existing models and algorithms,
we first propose an improved concept semantic similarity computation model. Then,
based on it, an improved rank algorithm that integrates categorization technology
with traditional link analysis is given, which improves the HITS algorithm in two
aspects: the pre-processing of Web pages and the analysis of Web page link
structure. Finally, evaluations are provided as well.
Declaration
I declare that:
this thesis presents work carried out by myself and does not incorporate without
acknowledgment any material previously submitted for a degree or diploma in any
university;
to the best of my knowledge it does not contain any materials previously published
or written by another person except where due reference is made in the text; and all
substantive contributions by others to the work presented, including jointly authored
publications, are clearly acknowledged.
Qiaowei Dai
1st June 2009
Acknowledgements
I wish to express my sincere gratitude to my master thesis supervisor Dr. Jiuyong Li,
who is a Lecturer in the School of Computer and Information Science, for his helpful
suggestions, unreserved support, and encouragement throughout the research and
writing of this thesis. Besides this, I would also like to thank my course coordinator,
Dr. Stewart Von Itzstein, for his encouragement and support. Last but not least, I
would like to express deepest thanks to my family for giving me the courage and their
support to study in Adelaide.
CHAPTER 1
Introduction
1.1 Background
Search engines have gradually become a highly efficient and convenient way for
people to query data and acquire information. With the continuous development of
search engine technology, the current mature commercial search engines have gone
through several generations of evolution. Meanwhile, Web information retrieval
technology, the essence of search engines, has existed for about 20 years, including
its commercial products. In this period, great progress has been made in retrieval
key technology, system structure design, query algorithms and so on, and many
commercial search engine services are in use on the Web.
Compared with this progress, the rapid growth of data on the Web weakens, to some
degree, the achievements obtained in the field of Web search research, and the
massive data quantity and frequent updates have brought completely new challenges
as well. Currently, according to my research, the shortcomings of Web information
retrieval are mainly shown in the following aspects:
Low query quality
Low query quality shows itself when a large number of result pages are returned
but the number that really accords with users' requirements is low. Moreover,
most of the relevant links do not appear at the top of the query results. Users
have to keep trying and turning pages to find valuable information, so a lot of
time is consumed by this process. In an age when the amount of Web information
is increasing continuously, this problem has become particularly prominent.
Improving Web query quality is the most critical subject of current intelligent
information retrieval research; after Web mining technology is integrated, the
query quality of search engines can be greatly improved.
Low query update speed
There are two reasons for the low update speed of Web query results. One is the
low efficiency of the Crawler system of search engines: the collection period
for documents is so long that, by the time the index is completed, differences
have emerged between the acquired content and the newest pages. The other is
that the update speed of Web documents has become faster and faster. Currently,
many Websites include dynamic pages generated from a background database, so a
change in the database directly causes these dynamic pages to change; the update
speed of some static pages is increasing as well. Between two consecutive visits
by the Web Crawler, a page may change far more than twice, so users cannot
obtain the content of these changes through query.
Lack of effective information categorization
Currently, most search engines provide query results as paged lists: all the
relevant and irrelevant links are put together without association, which is
quite inconvenient for users with an explicit query objective, because they have
to keep jumping or selecting between various links. Categorizing and clustering
query pages is an effective way to improve the quality of user navigation: users
can quickly select a category and further refine the query targets within it.
For example, if we input "mining" into Vivisimo, several categories such as
"data mining", "gold" and "Mining and Metallurgy" emerge, and users can make
further queries in each category.
Keyword-based Web query lacks understanding of user behavior
From the viewpoint of the development of Web retrieval technology, keyword-based
query will remain the most important retrieval method for quite a long time.
Keyword-based query is a retrieval mechanism implemented by the Boolean
combination of keywords. However, the query functions provided by current search
engines are quite limited: most search engines provide only the most basic
Boolean connections between keywords. For instance, Yahoo only provides two
logical operators, "AND" and "OR", and compulsorily applies one logical operator
to all keywords. In many cases, it is quite difficult to construct an effective
query combination.
On the other hand, even for the same keywords, the search objectives of
different users may differ; they are closely related to factors such as the
user's personal preference, the context of the current search, the previous
search history and so on. After these parameters are fully considered, a search
engine that accords with users' requirements can be designed. In Lawrence and
Giles' (1998) paper, a context-based Web retrieval and query correction method
was proposed.
Low index coverage rate of Web search engines
Currently, the coverage rate of search engines over the Web is lower than 50%;
it is quite difficult to completely index the whole Web because of resource
restrictions. Under a low index coverage rate, many search services adopt the
same download priority for each page when collecting documents, which leaves
many pages with low reference value in the index database while some relatively
important pages are not indexed. To solve this problem, discrimination of
resource quality is needed while the Crawler traverses: pages of high quality
should be downloaded with priority, and the index database constructed
accordingly. In Chakrabarti, van den Berg and Dom's (1999) paper, an algorithm
was proposed that analyzes Web document quality in real time and determines
download priority by means of focused crawling, which makes up for the
shortcoming of low coverage to some degree.
1.2 Motivation
According to the Intelligent Surfer Model, we can consider that user behavior in
browsing Web pages is not absolutely random or blind, but related to topic. That is,
among the numerous outbound links of a Web page, the outbound links that belong to
the same or a similar Web page category will have a higher click rate.
Both the PageRank algorithm and the HITS algorithm objectively describe the
essential characteristics of links between Web pages, but they rarely consider the
topic relativity of users' surfing habits. Link structure-based algorithms can be
integrated well with other technologies to improve their adaptability.
Categorization technology can simulate users' topic-related habits, so as to improve
this kind of link structure-based algorithm. Categorization technology overcomes the
unreliability brought by the assumption that users' behavior in visiting Web pages
is absolutely random, and distinguishes the direction relation between Web pages
according to category attributes; thus categorization technology can be regarded as
an important supplement to traditional algorithms.
1.3 Research Aim
My research aim is to establish an improved rank algorithm for search engines based
on domain ontology and categorization technology, so that the rank algorithm
simulates actual user behavior in browsing Web pages more accurately. To achieve
this research aim, we have three objectives.
The first objective is to analyze the traditional link structure-based rank
algorithms for search engines in order to gain insight into their principles and
into further studies of them.
Through research and analysis of traditional domain ontology-based concept
semantic similarity computation, we can gain a full understanding of the principles
and weaknesses of the three common computation models. Therefore, our second
objective is to analyze and improve the decision factors of concept semantic
similarity computation, and then develop an improved concept semantic similarity
computation model to determine the relation between two categories in the
categorization process.
The third objective is, according to the study of the category-integrated PageRank
algorithm, to first perform a pre-categorization process on Web pages and keywords
based on the improved concept semantic similarity computation model, and then
develop a category-based HITS algorithm to satisfy the final aim of this thesis.
1.4 Scope
This thesis focuses on rank algorithms for search engines, which need to adapt to
the link structure of the network and give accurate feedback to the information
queried by the user. A good rank algorithm should be able to filter the content of
Web pages, reject irrelevant Web pages, and move the Web pages that are most
relevant and closest to the query condition to the top of the list. Meanwhile, the
waiting time of this kind of rank computation should be within the user's
acceptable scope.
1.5 Thesis Organization
The thesis is structured as follows:
1 Chapter 2 (Traditional Link Structure-based Rank Algorithms)
2 Chapter 3 (Literature Review)
3 Chapter 4 (Methodology)
4 Chapter 5 (Improve Concept Semantic Similarity Computation Model)
5 Chapter 6 (Improve Rank Algorithm Based on Categorization Technology)
6 Chapter 7 (Conclusion)
CHAPTER 2
Traditional Link Structure-based Rank Algorithms
Hyperlinks are a very important component of the Web. Through the hyperlinks in a
page, users can link from a page on any WWW server in the world to a page on
another WWW server. Hyperlinks not only provide convenient information navigation,
but are also an information organization method that carries help information that
is very rich and effective for Web information retrieval.
2.1 Objective and Usage of Hyperlink Existing in Web
To make intra-Website navigation convenient, the internal hyperlinks of a Web page
allow users to jump freely between different Web pages, avoiding use of the "back"
button of the Web browser. A well-designed Website should allow a user to reach any
other page of the Website from an arbitrary page through multiple links. The main
function of this kind of hyperlink is to help users visit the whole Website content
in an orderly way.
Another kind of hyperlink is the extra-Website hyperlink, which is the most
important hyperlink form in Web hyperlink mining research. Generally, an
extra-Website hyperlink represents the page creator's attention to and preference
for some Website or content; in other words, some potential relation exists. For
example, adding a hyperlink to Yahoo on a page represents the author's
recommendation of and preference for Yahoo; adding a hyperlink to Kdnuggets, a
famous data mining Website, represents that the page author is interested in data
mining, and that the page itself is possibly related to data mining research. If
the URL of some page is linked many times on the Web, it indicates that the quality
of its content is high; otherwise, its importance is lower. This evaluation
mechanism is similar to citation in scientific papers: a paper referenced more
times by others is considered more important than one referenced less. In Web
retrieval, besides the number of times a document is linked by other documents, the
quality of the source document is also a factor in evaluating the quality of linked
documents: a document linked or recommended by a high-quality document tends to
have higher authority. In hyperlink analysis, the Web can be considered as a graph
structure, and analyzing the link relations between its nodes can help solve the
difficult problem that text content-based retrieval cannot evaluate content
quality.
Compared with traditional search engines that rank query results by word frequency
statistics, the advantage of hyperlink analysis-based algorithms is that they
provide an objective and cheat-resistant (some Web documents cheat traditional
search engines by adding invisible strings) method of Web resource evaluation.
Currently, link analysis algorithms are used in many Web information acquisition
tasks, including ranking search engine documents, searching for related documents,
prioritizing the URLs a Web Crawler crawls, etc. (Dean, J & Henzinger, MR 1999).
Compared with the word frequency statistics-based method used by traditional search
engines, Web retrieval algorithms based on hyperlink analysis, such as the PageRank
algorithm, greatly improve retrieval precision (Haveliwala, TH 1999).
2.2 PageRank Algorithm
PageRank is a global link analysis algorithm proposed by S. Brin and L. Page (Brin,
S & Page, L 1998). It gathers statistics on the URL link structure of the whole
Web, and assigns every URL a weight, called the PageRank value of that page,
according to factors such as the number of links. This PageRank value is fixed and
does not change with the query keyword, which is different from the local link
analysis algorithm HITS.
Figure 1 Directed link graph (nodes a, b, c, u, v, w)

For example, in Figure 1, if one page includes a hyperlink referring to another,
there is a directed edge between them. The hyperlinks between pages thus compose a
directed graph: taking every page as a node, a directed edge runs from one node to
another if and only if the first page includes a hyperlink referring to the second.
For any node in the graph, the nodes with directed edges pointing to it all
contribute to its weight value. The more directed edges refer to some node, the
higher the quality of that node (page) is. The main shortcoming of this kind of
algorithm is that only the link quantity is considered, meaning all links are
treated as equivalent, while the quality of the source node itself is ignored. In
fact, high-quality pages on the Web tend to include high-quality links, and for
evaluating the quality of a linked document, the quality of the source node often
matters more than the quantity of links. For example, links appearing on Yahoo
always have a certain reference value, because Yahoo itself is a relatively
authoritative Website, just as papers published in top venues always have higher
academic value.
The PageRank algorithm is recursive in form: a page's value relies on the number of
times it is linked and on the PageRank values of the source links (Brin, S & Page,
L 1998).
2.2.1 Simplified PageRank Algorithm
Simplified PageRank algorithm implements the basic recursive procedure of link
times and source PageRank. Let the pages on Web as , , …, , is the amount
of the extra-Website links of page , is the page set referring to page . Assume
Web is a strong connected graph (actually it is impossible, this problem will be
discussed in the next section), then the PageRank value of page can be expressed by:
The expression above can be written as , is the vector of , the arbitrary
element in matrix , which . If page refers to , then . Thus, vector
is the eigenvector of matrix . Because Web is assumed to be strong connected,
the eigenvalue of is .
From the definition above, we can see that PageRank accords with the Random Surfer
Model (Page, L, Brin, S & Motwani, R 1998). We can interpret the Random Surfer
Model this way: assume a user visits Web pages by randomly clicking hyperlinks;
moreover, he does not use the "back" function and keeps clicking continuously. The
PageRank of a page is essentially the probability of reaching that page while a
user browses the whole Web by random surfing. Motwani, R & Raghavan, P (1995) made
further studies of random walks, and that work can also be used to analyze Web
link properties.
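The Random Surfer Model can be sketched directly in code. The following Python snippet is an illustrative sketch only: the five-node graph is an assumption chosen for demonstration, not the graph drawn in any figure of this thesis. It simulates a surfer who repeatedly clicks a uniformly random outbound link; the long-run visit frequency of each page approximates its PageRank.

```python
import random

# Hypothetical strongly connected five-node graph (an assumption for
# illustration; not the exact graph used in the thesis figures).
links = {1: [2, 3], 2: [3], 3: [4], 4: [5], 5: [1]}

def random_surfer(links, steps=200_000, seed=0):
    """Estimate visit probabilities via the Random Surfer Model: start
    anywhere, then repeatedly click a uniformly random outbound link,
    never using the "back" button."""
    rng = random.Random(seed)
    page = 1
    visits = {p: 0 for p in links}
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(links[page])          # random click
    return {p: c / steps for p, c in visits.items()}

freq = random_surfer(links)
# Long-run visit frequencies; they sum to 1, like PageRank values.
print({p: round(f, 3) for p, f in sorted(freq.items())})
```

For this particular graph the stationary distribution can be worked out by hand, so the simulation gives a quick sanity check of the random-surfer interpretation.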
The computation of the simplified PageRank algorithm can use an iterative method:
iterate until the PageRank vector converges, that is, until the deviation between
successive iterations is small enough. For example, for Figure 2, the computation
procedure is:

1 Select an arbitrary random vector $r_0$;
2 Compute $r_{i+1} = A^{T} r_i$;
3 If $\|r_{i+1} - r_i\| < \varepsilon$ ($\varepsilon$ is the selected iterative
threshold value), stop the iteration; $r_{i+1}$ is the PageRank vector;
4 Otherwise, go back to step 2.
Figure 2 shows the rank value of every node in a small graph structure computed by
the simplified PageRank algorithm. According to the Random Surfer Model
interpretation of PageRank, the sum of the PageRank values of all nodes is 1.
Figure 2 Simplified PageRank Algorithm (r1 = 0.286, r2 = 0.286, r3 = 0.143, r4 = 0.143, r5 = 0.143)
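The iterative procedure above can be sketched in Python. This is a minimal sketch, assuming a hypothetical strongly connected five-node graph (not necessarily the graph drawn in Figure 2); it repeats the update r ← Aᵀr until successive vectors differ by less than a threshold.

```python
def simplified_pagerank(links, eps=1e-10):
    """Iterate r <- A^T r until successive vectors differ by less than eps.
    `links` maps each page to the list of pages it links to; the graph is
    assumed strongly connected, so the iteration converges and the ranks
    keep summing to 1."""
    pages = list(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}          # step 1: arbitrary start vector
    while True:
        nxt = {p: 0.0 for p in pages}
        for p, outs in links.items():        # each page splits its rank
            share = r[p] / len(outs)         # evenly among its out-links
            for q in outs:
                nxt[q] += share
        if sum(abs(nxt[p] - r[p]) for p in pages) < eps:
            return nxt                       # converged PageRank vector
        r = nxt

# Hypothetical five-node strongly connected graph (an assumption; not
# necessarily the graph of Figure 2).
links = {1: [2, 3], 2: [3], 3: [4], 4: [5], 5: [1]}
r = simplified_pagerank(links)
print({p: round(v, 3) for p, v in sorted(r.items())})  # ranks sum to 1
```

The returned vector is the eigenvector of Aᵀ with eigenvalue 1, exactly as derived in Section 2.2.1.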
2.2.2 Improved PageRank Algorithm
Simplified PageRank algorithm is only suitable for the ideal strong connected
environment, but in fact, Web is not a strong connected structure. Broder, A, Kumar,
R, & Maghoul, F’s (2000) paper shows there are only 28% pages on Web are strong
connected; 44% are one-way connected; and the remaining part forms Information
Isolated Island, which is neither linked by, nor links to other page. To simplify
PageRank algorithm, non strong connected Web exists two inextricable problems,
which are rank sink and rank leak. Rank sink refers to some local strong connected
Web graph doesn’t include the link referring to outside. Rank leak refers to the page
that doesn’t include any external hyperlink. Actually, it is a special case of rank sink
when there is only one node in the strong connected graph. They will cause deviation
generating when analyzing graph structure. For example, if we discard the link from 5
to 1 in Figure 2, nodes 4 and 5 will form rank sink situation. If we use RSM to
simulate, we will fall into the dead circulation from 4 to 5 at last. Moreover, the rank
values of 1, 2 and 3 tend to 0, and the nodes 4 and 5 will share the rank, which the
total value is 1, of whole graph. If we remove 5 and its related links form figure 2,
node 4 will become a leak node. Because once this node is visited, the rank procedure
will stop here, thus, the rank values of all nodes will converge to 0. Therefore, Page
and Brin (Brin, S & Page, L 1998) proposed two methods, one is discarding all the
leak nodes which their outdegrees are 0, another one is introducing damping fact (
) in simplified PageRank algorithm. The appearance of makes PageRank
contribute to not only the node which it links to, but also the other pages on Web. The
expression of improved PageRank algorithm is shown below:
is the total node amount of Web subgraph that Web Crawler visits. As we can see
from the expression, the simplified PageRank algorithm is the special case when
.
Figure 3 shows the computed PageRank value of every node after removing the
hyperlink from 5 to 1 by improved algorithm. Every node has been adjusted by
parameter , which make their values all converge to a non 0 value.
Figure 3 Improved PageRank Algorithm (r1 = 0.154, r2 = 0.142, r3 = 0.101, r4 = 0.313, r5 = 0.290)

For example, the PageRank value of each node in Figure 1 is shown in the table below:
Table 1 PageRank of each node in Figure 1

Node      a         b         c         u         v         w
PageRank  0.060210  0.071004  0.094177  0.047534  0.097881  0.125839
PageRank can be computed by iterative recursion. For the PageRank of each node in
Figure 1, about 15 iterations are needed; generally, in actual computation, 100
iterations are enough for convergence (Haveliwala, TH 1999). The PageRank algorithm
is currently applied by the Google search engine, which provides high-quality Web
retrieval service.
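A minimal sketch of the improved (damped) algorithm follows, assuming the normalized form of the expression above. The five-node graph is hypothetical: it is chosen so that dropping the link from node 5 back to node 1 leaves nodes 4 and 5 circulating rank between themselves, yet damping keeps every node's value above zero, as described for Figure 3.

```python
def pagerank(links, d=0.85, eps=1e-10):
    """Improved PageRank with damping factor d:
    r(p) = (1 - d)/m + d * sum of r(q)/N(q) over pages q linking to p.
    Setting d = 1 recovers the simplified algorithm."""
    pages = list(links)
    m = len(pages)
    r = {p: 1.0 / m for p in pages}
    while True:
        nxt = {p: (1.0 - d) / m for p in pages}   # random-jump share
        for p, outs in links.items():
            for q in outs:
                nxt[q] += d * r[p] / len(outs)    # link-following share
        if sum(abs(nxt[p] - r[p]) for p in pages) < eps:
            return nxt
        r = nxt

# Hypothetical graph with a rank sink: 4 and 5 link only to each other,
# and nothing links back to the rest of the graph.
links = {1: [2, 3], 2: [3], 3: [4], 4: [5], 5: [4]}
r = pagerank(links)
print({p: round(v, 3) for p, v in sorted(r.items())})
```

Without damping the sink nodes 4 and 5 would absorb all the rank and nodes 1, 2 and 3 would tend to 0; with d = 0.85 every node retains at least (1 − d)/m.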
2.3 HITS Algorithm
The HITS (Hypertext Induced Topic Search) algorithm is a rank algorithm that
analyzes Web resources based on local links; it was proposed by Kleinberg in 1998
(Kleinberg, J 1999). The difference between PageRank and HITS is that HITS is
query-dependent, while PageRank is query-independent. As mentioned in the section
above, the PageRank algorithm gives each page a single rank value unrelated to the
query keyword, but the HITS algorithm gives each page two values: an Authority
value and a Hub value.
Authority pages and Hub pages are two important concepts in the HITS algorithm, and both are defined relative to the query keyword. An Authority page is a page that is most relevant to the query keyword or keyword combination (Kleinberg, J 1999). For example, when querying "University of South Australia", the homepage of UniSA, http://www.unisa.edu.au/, is the page with the highest Authority value for this query, and the Authority values of other pages should theoretically be lower. A Hub page is a page that links to multiple Authority pages (Kleinberg, J 1999). A Hub page itself may not be directly related to the query content, but through it, the directly related Authority pages can be reached. For example, for a query combination such as "Australian university", the homepage of the Australian Education Network, http://www.australian-universities.com/, is a Hub page, as it contains links to each university in Australia. Hub pages can serve as auxiliary references when computing Authority pages, and can themselves be returned to the user as query results (Chakrabarti, B, Dom, B & Raghavan, P 1998).
2.3.1 Analysis of HITS Algorithm
The central idea of the HITS algorithm is as follows. First, a text-based retrieval algorithm is used to obtain a Web subset whose pages are all relevant to the user query. Then, HITS performs link analysis on this subset and finds the Authority pages and Hub pages related to the query. The initial subset in the HITS algorithm is acquired by keyword matching and is defined as the root set R. Link analysis then expands R into a larger set S, which contains the Authority pages that ultimately meet the query requirement. The process from R to S is called "Neighborhood Expansion", and the algorithm procedure for computing S is shown below:
1. Use text keyword matching to acquire the root set R, which includes thousands of URLs or more;
2. Initialize S to R, that is, S and R are equal;
3. For each page p in R, put the pages that p links to into set S, and put the pages linking to p into set S;
4. S is the acquired expanded neighbourhood set.
The HITS algorithm needs three parameters: the query keyword, the maximum size of the root set R, and the maximum size of the expanded neighbourhood S. After applying the algorithm above, the set S will contain more Authority pages and Hub pages that match the query keyword.
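The expansion procedure can be sketched in Python (an illustrative sketch; the toy link maps and the size limit are assumptions, not values from the thesis):

```python
def expand_root_set(root, out_links, in_links, max_size=200):
    """Expand root set R into the neighbourhood set S by one link step."""
    s = set(root)                           # initialize S to R
    for page in root:
        s.update(out_links.get(page, ()))   # pages that this page links to
        s.update(in_links.get(page, ()))    # pages that link to this page
        if len(s) >= max_size:              # respect S's maximum capacity
            break
    return s

# toy crawl data: r1 links out to a and b, r2 to b; c links in to r1
out_links = {"r1": ["a", "b"], "r2": ["b"]}
in_links = {"r1": ["c"], "r2": []}
print(sorted(expand_root_set({"r1", "r2"}, out_links, in_links)))
# ['a', 'b', 'c', 'r1', 'r2']
```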
2.3.2 Analysis of HITS Link
The process of HITS link analysis exploits the mutually reinforcing property of Authorities and Hubs to identify them in the expanded set S. Assume the pages in the expanded set S are numbered 1, 2, …, n. Let B(p) denote the set of pages linking to page p, and F(p) the set of pages that page p links to. HITS generates an authority value a(p) and a hub value h(p) for each page p in S. The initial values of a and h can be arbitrary; similar to PageRank, HITS uses an iterative method to reach the convergence values. There are two steps in its iterative procedure, step I and step O. In step I, the authority value of each page is set to the sum of the hub values of the pages linking to it. In step O, the hub value of each page is set to the sum of the authority values of the pages it links to. That is:
I:  a(p) = Σ_{q: q links to p} h(q)

O:  h(p) = Σ_{q: p links to q} a(q)
The two steps, I and O, are based on the fact that an Authority page is always linked to by many Hub pages, and a Hub page links to many Authority pages. The HITS algorithm computes the two steps iteratively until they converge; at last, a(p) and h(p) are the Authority and Hub values of page p. The procedure is shown below:

1. Initialize a(p) and h(p) for every page p;
2. Iterate steps I and O: perform iteration I, then perform iteration O;
3. Normalize the values of a and h so that their squared values sum to 1;
4. Repeat until the iteration converges.
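The procedure above can be sketched in Python (a minimal illustration; the three-node graph and iteration count are assumptions for the sketch):

```python
import math

def hits(links, iterations=50):
    """HITS I/O iteration; links maps each page to the pages it links to."""
    pages = list(links)
    a = {p: 1.0 for p in pages}   # authority values
    h = {p: 1.0 for p in pages}   # hub values
    for _ in range(iterations):
        # step I: authority = sum of hub values of the pages linking here
        a = {p: sum(h[q] for q in pages if p in links[q]) for p in pages}
        # step O: hub = sum of authority values of the pages linked to
        h = {p: sum(a[q] for q in links[p]) for p in pages}
        # normalize so that the squared values sum to 1
        na = math.sqrt(sum(v * v for v in a.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in h.values())) or 1.0
        a = {p: v / na for p, v in a.items()}
        h = {p: v / nh for p, v in h.items()}
    return a, h

# hubs 1 and 3 both point at page 5, echoing Figure 4's pattern
a, h = hits({1: [5], 3: [5], 5: []})
print(round(a[5], 3), round(h[1], 3))  # 1.0 0.707
```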
Figure 4 shows the application of the HITS algorithm to a subgraph of 6 nodes.

Figure 4 HITS algorithm on a six-node graph

As shown in Figure 4, the authority value of node 5 equals the sum of the hub values of nodes 1 and 3, which link to it; after normalization, the value is 0.816.
Assume A is the adjacency matrix of the subgraph, where the entry at position (i, j) is equal to 1 if page i links to page j, and 0 otherwise. (Node values shown in Figure 4: the hub nodes have h = 0.408, 0.816, 0.408 with a = 0, and the authority nodes have a = 0.408, 0.816, 0.408 with h = 0.) Let a = (a(1), …, a(n)) be the authority value vector and h = (h(1), …, h(n)) the hub value vector; then the iterations I and O can be expressed as a = Aᵀh and h = Aa. After completing the iteration, the authority and hub values respectively satisfy λa = AᵀAa and μh = AAᵀh, where λ and μ are constants that maintain the normalization condition. Thus, vector a and vector h respectively become eigenvectors of the matrices AᵀA and AAᵀ. This feature is similar to the PageRank algorithm; the convergence speed is determined by the eigenvalues of these matrices.
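This eigenvector property can be checked numerically (a NumPy sketch; the 3-page adjacency matrix is an assumed example):

```python
import numpy as np

# adjacency matrix of a tiny graph: pages 0 and 1 both link to page 2
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)

a = np.ones(3)
h = np.ones(3)
for _ in range(50):
    a = A.T @ h                  # step I in matrix form: a = A^T h
    h = A @ a                    # step O in matrix form: h = A a
    a /= np.linalg.norm(a)       # keep the vectors normalized
    h /= np.linalg.norm(h)

# a should be an eigenvector of A^T A (and h of A A^T)
lam = a @ (A.T @ A @ a)          # Rayleigh quotient gives its eigenvalue
assert np.allclose(A.T @ A @ a, lam * a)
```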
2.4 Summary
PageRank and HITS are currently the representative Web retrieval algorithms based on hyperlink mining. Through the analysis of Web hyperlink relations, we can greatly improve the accuracy of Web retrieval and overcome the disadvantages of content-matching-based methods. Currently, many search engines have begun to use similar algorithms to improve query precision; for example, Google uses PageRank, and Teoma and AltaVista also adopt similar technology.
On the other hand, both PageRank and HITS have defects. PageRank is query-independent, so its computational cost is relatively small, but it loses some performance on content matching. Although HITS is query-related, its link analysis is limited to a Web subgraph with thousands of nodes, and it cannot reflect the link structure of the whole Web.
CHAPTER 3
Literature Review
3.1 Research on Traditional Rank Algorithms of Search
Engines
3.1.1 Problems and Improvements of PageRank Algorithm
The PageRank algorithm favors old pages, because the probability of old pages being linked by other pages is much higher; in fact, however, new pages may contain more valuable information.
Another problem it may cause is the "topic drift" phenomenon. Consider the following situation: portal Websites tend to cover all aspects of information, so their homepages contain hyperlinks to Websites on a wide variety of topics. Meanwhile, many pages regard them as guides for further reference and include them among their own links. When certain keywords are searched, if these portal Websites are within the scope of consideration, they will undoubtedly acquire the highest authority, and thus the topic drift phenomenon arises. These portal Websites always appear at the top of the search results; although they do contain information required by users, their content is usually far more general than what the users expect, which is far from satisfactory. In contrast, some professional Websites are more authoritative in describing these topics.
The PageRank algorithm cannot distinguish whether the hyperlinks in a page are related to its topic; that is to say, it cannot judge the similarity of page content. Thus, it easily causes the topic drift problem. For example, Google and Yahoo are among the most popular Web pages on the Internet and have very high PageRank values; when a user inputs a query keyword, these pages often appear in the result set of the query and occupy the front positions, even though they are sometimes not related to the user's query topic at all.
In his HITS paper, Kleinberg (1999) explicitly pointed out that links back to the same Website should not be counted in the Web graph but should be discarded: they are nepotistic links and obviously carry no authority information. After the publication of Kleinberg's paper, Bharat and Henzinger (1998) pointed out that another kind of nepotistic link exists, namely the nepotistic link between two different Websites, and that this kind of link is increasing rapidly. Moreover, such nepotistic links may be generated accidentally during Website construction; for instance, all the sub-sites of Yahoo link back to the main Website. Nepotistic links between two Websites make their authorities keep increasing in the iterative process, whether for the PageRank algorithm or the HITS algorithm.
To solve the problem that the PageRank algorithm is too much concerned with old pages, Ling & Fanyuan (2004) proposed an accelerated evaluation algorithm. This algorithm lets valuable content on the network propagate at a faster speed, while the evaluation values of pages containing old data drop more quickly. The core idea of this algorithm is to predict the expected value of a certain URL over a future period by analyzing the change of its PageRank value over a time series, and to regard this prediction as an effective parameter of the retrieval service provided by the search engine. This algorithm defines a URL acceleration factor, which is given by:
where the normalization term is the document count of the whole page set. The expression of the accelerated PageRank is:

where its terms are, respectively, the PageRank value of the URL at the latest time, the slope of the quadratic curve fitted to this URL's PageRank values over a period of time, the number of days since the page was last downloaded, and the number of documents in the document set downloaded at the latest time. When users perform retrieval, the search engine decides the URL's position in the retrieval results according to the predicted PageRank value.
The WebGather search engine (Ming, L, Jianyong, W & Baojue, C 2001) developed by Peking University applied another way to overcome this weakness, which is to give compensation to new Web pages. The link count of a Web page (its LHN) can be divided into same-Website links and different-Website links. Different-Website links are called pure LHN, while gross LHN contains both; only pure LHN is considered here.
For new Web pages, which are not yet linked by other pages, compensation is given:

where the three time quantities are, respectively, the current time, the compensation limit time, and the time when the Web page was published. After the compensation weight is introduced, the new link weight is:

After standardization:
Haveliwala (2002) proposed a topic-sensitive PageRank algorithm to address the topic drift phenomenon. The algorithm considers that a page thought to be important in one field is not necessarily important in other fields. Therefore, it first lists 16 basic topic vectors according to the Open Directory (the Open Directory Project, a Web directory of over 2.5 million URLs), and then, for every Web page, computes the PageRank values with respect to these basic topic vectors offline. At query time, according to the query topic or query context input by the user, the algorithm computes the similarity between this topic and the known basic topics, and chooses the closest topic from the basic topic set to stand in for the user's query topic. The formal expression of the algorithm is as follows:

PR_j(u) = (1 - d) * v_j(u) + d * Σ_{w in B(u)} PR_j(w)/L(w)

where v_j is the topic-sensitive vector for topic j, B(u) is the set of pages linking to u, and L(w) is the out-degree of page w. This algorithm can effectively avoid some obvious topic drift; for example, when querying "jaguar", if context is available, the algorithm can explicitly distinguish what the user tried to search for:

1. the Jaguar car;
2. the Jaguars football team;
3. a Jaguar product;
4. the jaguar, which is a kind of mammal;

and thus provide a high-quality recommendation result set.
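The topic-biased teleport idea can be sketched in Python (illustrative only; the three-page graph, topic set and damping factor are assumptions, and this is a simplification of Haveliwala's full method):

```python
def topic_pagerank(links, topic_pages, d=0.85, iters=100):
    """PageRank whose random-jump mass lands only on one topic's pages."""
    nodes = list(links)
    # teleport vector v: concentrated on the pages of the chosen topic
    v = {u: 1.0 / len(topic_pages) if u in topic_pages else 0.0
         for u in nodes}
    pr = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        pr = {u: (1 - d) * v[u]
                 + d * sum(pr[w] / len(links[w])
                           for w in nodes if u in links[w])
              for u in nodes}
    return pr

g = {"sport": ["news"], "news": ["sport", "cars"], "cars": ["news"]}
biased = topic_pagerank(g, topic_pages={"cars"})
# the topic page "cars" now outranks "sport"
```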
3.1.2 Problems and Improvements of the HITS Algorithm
Some Websites possess links that recursively refer to each other for various reasons, which causes "faction attacks" among these Websites. For instance, some enterprise Websites are designed by the same Website design company for different companies, and it is quite possible that there are friendly links among them. The impact of faction attacks is similar to that of nepotistic links, but it is more difficult to detect, because a larger scope of the Web graph needs to be inspected. Another problem is the "mixed hub page" phenomenon, in which a hub page simultaneously possesses links to several completely different topic categories. For example, a hub page about a movie award usually includes many links to movie companies. Mixed hub pages are more difficult for computers to detect than faction attacks, and moreover they occur with higher probability. Mixed hub pages can blend Web pages of different topics together; especially in the HITS algorithm, it is quite easy to involve Web pages irrelevant to the current topic when constructing the extended set, and because these Web pages have a large number of links to pages with higher authority, they cannot be discarded from the results. In Google's PageRank algorithm, this impact can be reduced by adjusting the random surfer probability d.
In the HITS algorithm, a basic (root) set is constructed first and then extended to the expanded set, from which the analyzed Web graph is formed. The reason for doing this is that the result acquired by the information retrieval system in the first step may not include the pages that users really want. For instance, when querying with the keyword "browser", the pages returned by the information retrieval system usually do not contain the pages of Netscape Navigator or Microsoft Internet Explorer, because those pages usually avoid using words such as "browser" in their product promotion. Furthermore, many personal pages use phrases such as "best viewed with a frames-capable browser…", which causes the genuinely important Netscape and Microsoft pages to be excluded from the first-step results. This problem can be solved by the expanded set, because the required Web pages can be reached through hub pages. Due to this characteristic, however, the HITS algorithm is vulnerable to the nepotistic links, faction attacks and mixed hub pages mentioned above: when constructing the extended set, too many pages irrelevant to the topic are involved, and they obtain high authority by linking to each other. If we restrict the radius when the extended set is constructed, we may not get enough pages; the really relevant pages can be acquired only if the radius is large enough, but by then too many irrelevant pages have been involved, causing the "topic pollution" phenomenon. Besides, similar to the PageRank algorithm, the HITS algorithm also suffers from topic drift: after portal Websites are included through the extended set, HITS faces the same difficulty as PageRank.
Bharat and Henzinger (1998) improved the computation of authority and hub weights by introducing a relevance weight for each hyperlink: if the relevance weight of a hyperlink is smaller than a certain threshold, the impact of this hyperlink on page weight is considered negligible, and the hyperlink is discarded from the subgraph. Besides, Chakrabarti, Dom and Gibson (1999) proposed splitting big hub pages into smaller units. A page often includes many links that are not all relevant to the same topic. In this situation, better results are obtained by dividing the hub page into contiguous subsets, called pagelets, and processing them separately. A single pagelet refers to a topic more narrowly than the whole hub page, so better retrieval results can be acquired by computing a weight for every pagelet. In the Clever system, an application of the HITS algorithm, the authors computed the weight of a hyperlink by matching the query keyword against the text around the hyperlink and computing the word frequency, and then replaced the corresponding value in the adjacency matrix with the computed weight, thus achieving the objective of introducing semantic information (Chakrabarti, B, Dom, B & Raghavan, P 1998).
A time parameter has also been introduced to improve the HITS algorithm. For a reference from one Web page to another, the visiting time to a great extent reflects whether the referenced node is authoritative. In reality, the visiting time of the authority pages that users really want to visit should be relatively long, while for visits that act as occasional navigation or serve other purposes, the visiting time is relatively short. In other words, if a user's visiting time on a certain page is relatively long, we can consider that page to be the page the user wants to visit, i.e., the target page. If this information is applied in the computation of authority weights in the HITS algorithm, its accuracy can be greatly improved.
Xuelong, Xuemei and Xiangwei (2006) proposed a time-parameter control model described as follows. Define the hyperlink weight, related to a keyword, of the link from one page to another; its final value is determined by three facts: the existence of the link; the number of occurrences of the query keyword in the hyperlink text; and the visiting time of the link. To control the result more precisely, a coefficient is introduced to control the proportion of the semantic information of the surrounding text in the hyperlink weight, and another parameter is introduced to control the impact of visiting time on the weight. The weight control model with the time parameter is then given by:

where the coefficients can be adjusted for different page sets. The value of the weight will continuously increase in the iterative computation of authority weights, but only the relative values between weights matter, not the absolute values. The model reflects the non-linear increment of authority weight with increasing visiting time; other functions could be constructed to control the proportion of visiting time in the weight as well, and the function above is just the simplest form.
Figure 5 Function image of
3.2 Traditional Domain Ontology-based Concept Semantic
Similarity Computation
3.2.1 Ontology
Ontology was originally a term in philosophy, where it denotes a systematic and comprehensive explanation of objective existence whose core is to represent the abstract essence of objective reality (Zhihong, D & Shiwei, T 2002). In recent years, ontology research has been maturing, but across the literature the definition of ontology and the usage of related terminology are not completely consistent. Neches et al. (1991) introduced the concept of ontology into artificial intelligence and gave the earliest definition: an ontology consists of the basic terms and relations constituted by the related domain knowledge, together with the rules, determined by these basic terms and relations, for defining extensions to them. Gruber (1993) gave the
most popular definition of ontology: an ontology is an explicit specification of a conceptualization. Later, Studer, Benjamins and Fensel (1998) made further research on ontology and gave the most complete definition: an ontology is a formal, explicit specification of a shared conceptualization. Four levels of meaning are included here: conceptual model, explicitness, formality and sharedness (Berners-Lee, T, Hendler, J & Lassila, O 2001).
3.2.2 Three Main Semantic Similarity Computation Models
Concept semantic similarity computation is widely applied in information retrieval, information recommendation and filtering, data mining, machine translation, etc., and has become a hot topic of current information technology research (Sujian, L 2002). Currently, the semantic similarity between concepts is computed mainly from three different viewpoints. Leacock (2005) proposed a distance-based semantic similarity computation model. This kind of model is simple and intuitive, but it relies heavily on an ontology hierarchical network established in advance, and the network structure directly influences the computed similarity. Lin (2000) proposed an information-content-based semantic similarity computation model. This kind of model is more persuasive theoretically, because it makes full use of information theory and probability statistics when computing concept semantic similarity; however, it can only roughly quantify the semantic similarity between concepts and cannot distinguish the similarities in finer detail. Tversky et al. (2004) proposed an attribute-based semantic similarity computation model. This kind of model can simulate people's ordinary understanding and discrimination of things in the real world, but it considers only the attributes of things, so a detailed and comprehensive description of every attribute of the objects is required, which is quite difficult.
3.3 Summary
In this chapter, we first analyzed the problems of the PageRank and HITS algorithms and made some comparisons between them. Meanwhile, the ideas of several improved versions of the classical algorithms were given. Then, basic information about ontology and domain ontology-based concept similarity computation was introduced; further research and analysis will be given in Chapter 5.
CHAPTER 4
Methodology
4.1 Research Questions
My main research objective is to improve rank algorithms so that they pay more attention to users' surfing behaviors. To achieve this target and structure the research, the following research questions are listed:

What facts influence the relation between two concept nodes in the hierarchical network structure of a domain ontology?

A Web page may be related to several topics; how can the category it belongs to be determined? And if it is categorized into a certain category, how can the other categories it relates to be considered as well?

How can the categorization of Web pages and keywords be implemented?

Because the numbers of Web pages and keywords are huge, is any mechanism needed to reduce the unnecessary volume?
4.2 Research Strategy
The whole process of my research strategy is shown in Figure 6. It can be divided into two main steps: the first is to build an improved domain ontology-based concept similarity computation model, and the second is to integrate a rank algorithm with categorization technology.
In the first step, we will first discuss the three traditional domain ontology-based concept similarity computation models to gain a full understanding of their ideas, computing processes, advantages and disadvantages. Then, we will discuss and improve the decision facts that affect directed edge weights in the ontology hierarchical network. Finally, the improved domain ontology-based concept similarity computation model, including five decision facts, will be modeled and evaluated.
In the second step, we will first discuss an existing categorization-integrated rank algorithm, which combines PageRank with categorization technology, to provide theoretical support. Secondly, the basic idea of categorization in this thesis is given, describing how categorization is implemented and the processes of pre-categorization based on the improved model constructed in step one. Then, a screening mechanism is introduced to filter the massive amount of data. Finally, the improvement and evaluation of the HITS algorithm integrated with categorization will be provided.
Figure 6 Research strategy framework

(Flowchart contents: research on the traditional domain ontology-based concept semantic similarity computation models (distance-based, content-based, attribute-based); improvement of the decision facts (category, depth, density, strength, attribute); modeling and evaluating the improved domain ontology-based concept semantic similarity computation model; research on the combination of PageRank and categorization; defining the basic idea of categorization and how to implement it; constructing a categorization similarity table according to the improved model; pre-categorization of Web pages and keywords according to the categorization similarity table; a screening mechanism; combining HITS with categorization; modeling and evaluating the HITS algorithm integrated with categorization.)
4.3 Evaluation Tools
4.3.1 The Ontology Tool
The ontology tool adopted to evaluate the improved concept semantic similarity computation model in this thesis is Protégé 3.4, an ontology modeling tool. Protégé was designed at Stanford University for editing instances and acquiring knowledge, and is currently the most popular ontology development tool. It shields users from the shortcomings of many current ontology creation languages and provides a friendly GUI, shown in Figure 7, which makes it much easier to edit classes, instances and attributes.
Figure 7 Main interface of Protégé 3.4
CHAPTER 5
Improving the Concept Semantic Similarity Computation Model
5.1 Discussion on Traditional Computation Models
5.1.1 Distance-based Semantic Similarity Computation Model
The basic idea of this computation model is to quantify the semantic distance between concepts by the geometric distance between the two concepts in the hierarchical network (Qun, L & Sujian, L 2002). The simplest computation method is to consider the distances of all directed edges in the network as equally important, denoted by 1. Thus, the distance between two concepts is equal to the number of directed edges constituting the shortest path, in the hierarchical network, between the nodes to which these two concepts correspond. According to this idea, a simple semantic similarity computation model can be obtained:

where the model is expressed in terms of the maximum depth of the network structure and the number of directed edges on the shortest path between the two concept nodes.
However, the above computation model is very rough in computing the semantic similarity between concepts, since the differences between directed edges in the network structure are not considered. Leacock (2005) therefore improved the computation model and proposed an improved distance-based semantic similarity computation model:

where the model is expressed in terms of the closest common ancestor of the two concept nodes in the hierarchical network, the shortest distance between the two concept nodes, and the maximum depth of the network.
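As a concrete instance of the distance-based idea, the widely used Leacock–Chodorow measure, sim(a, b) = −log(dist(a, b) / (2·D)) with D the maximum taxonomy depth, can be sketched as follows (the toy hierarchy is an assumed example, not the thesis's own model):

```python
import math
from collections import deque

def shortest_dist(edges, a, b):
    """BFS shortest path over an undirected view of the hierarchy edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

def lch_similarity(edges, a, b, max_depth):
    """Leacock-Chodorow: -log(dist / (2 * max taxonomy depth))."""
    return -math.log(shortest_dist(edges, a, b) / (2.0 * max_depth))

# toy is-a hierarchy: animal -> mammal -> {cat, dog}; max depth 2
edges = [("animal", "mammal"), ("mammal", "cat"), ("mammal", "dog")]
print(round(lch_similarity(edges, "cat", "dog", max_depth=2), 3))  # 0.693
```

Concepts separated by a shorter path score higher, which matches the intuition described above.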
5.1.2 Content-based Semantic Similarity Computation Model
The basic principle of the content-based semantic similarity computation model is that the more information two concepts share, the higher the semantic similarity between them; conversely, the less information they share, the lower the similarity (Xiaofeng, Z, Xinting, T & Yongsheng, Z 2006). In a hierarchical network, every concept can be regarded as a refinement of its ancestor nodes, so every child node can be taken to include the information content of all its ancestor nodes. Thus, the semantic similarity of two concepts can be measured by the information content of their closest common ancestor node.
According to information theory, the more frequently a concept appears, the less information it carries; conversely, the less frequently a concept appears, the more information it carries. In the hierarchical network, the information content of every concept node is quantified by the formula:

IC(c) = -log p(c)

where p(c) is the probability that concept c appears in the training material, and IC(c) is the information content that concept c carries.

Thus, according to the above quantification of concept information, the semantic similarity computation model between any two concepts in the hierarchical network can be obtained:

sim(A, B) = IC(LCA(A, B))

where LCA(A, B) is the closest common ancestor node of concept nodes A and B in the hierarchical network.
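The content-based model can be sketched in Python (a minimal Resnik-style illustration; the taxonomy and corpus counts are assumed for the example):

```python
import math

# assumed corpus frequencies; a concept's count covers its descendants
counts = {"animal": 100, "mammal": 60, "cat": 25, "dog": 35}
parent = {"mammal": "animal", "cat": "mammal", "dog": "mammal"}
total = counts["animal"]

def ic(c):
    """Information content: rarer concepts carry more information."""
    return -math.log(counts[c] / total)

def ancestors(c):
    chain = [c]
    while c in parent:
        c = parent[c]
        chain.append(c)
    return chain

def resnik_sim(a, b):
    """Similarity = information content of the closest common ancestor."""
    anc_b = set(ancestors(b))
    lca = next(c for c in ancestors(a) if c in anc_b)
    return ic(lca)

print(round(resnik_sim("cat", "dog"), 3))  # IC("mammal") = -log(0.6) ≈ 0.511
```

Note how a concept compared with itself scores its own (higher) information content, while distant concepts share only a common, information-poor ancestor.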
5.1.3 Attribute-based Semantic Similarity Computation Model
In the real world, people generally distinguish and associate different things by comparing their inherent attributes (Qianhong, P & Ju, W 1999). If two things have many attributes in common, it indicates that these two things are very similar; conversely, they are dissimilar. Thus, the basic principle of the attribute-based semantic similarity computation model is to judge the degree of similarity of the attribute sets to which the two concepts correspond. Tversky proposed an attribute-based method for computing concept semantic similarity:

sim(A, B) = |A ∩ B| / (|A ∩ B| + α|A − B| + β|B − A|)

where A ∩ B is the attribute set that concepts A and B commonly possess, A − B is the attribute set that concept A possesses but concept B does not, B − A is the attribute set that concept B possesses but concept A does not, and α and β weight the two difference sets.
Besides, L. Rips proposed a multi-dimensional attribute-based semantic similarity computation model: suppose concepts A and B each have n attributes, with attribute values (a1, a2, …, an) and (b1, b2, …, bn) respectively; the similarity is then computed from the distance between these two attribute vectors, together with an adjustment factor.
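Tversky's ratio model can be sketched as follows (illustrative Python; the attribute sets and the α = β = 0.5 weights are assumptions for the example):

```python
def tversky_sim(attrs_a, attrs_b, alpha=0.5, beta=0.5):
    """Ratio model: shared attributes against the weighted differences."""
    common = len(attrs_a & attrs_b)    # attributes both concepts possess
    only_a = len(attrs_a - attrs_b)    # attributes only the first has
    only_b = len(attrs_b - attrs_a)    # attributes only the second has
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 0.0

cat = {"furry", "four-legged", "meows"}
dog = {"furry", "four-legged", "barks"}
print(round(tversky_sim(cat, dog), 3))  # 2 / (2 + 0.5 + 0.5) = 0.667
```

With α = β = 0.5 the measure coincides with the Dice coefficient; unequal weights make the comparison asymmetric, which Tversky used to model human similarity judgments.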
5.2 Decision Facts of Semantic Similarity Computation
In the directed acyclic hierarchical network constituted by a domain ontology, the weights of directed edges may differ; that is to say, the semantic similarity between the parent node and the child node located at the two ends of different directed edges is different. This indicates that the influence of edge weight needs to be considered when computing the distance between concepts. According to my research, there are five main facts that influence the weight of a directed edge in the ontology hierarchical network:

The category of the directed edge between parent node and child node;

The depth of the directed edge constituted by parent node and child node in the hierarchical network;

The density of the parent node and child node in the hierarchical network;

The strength of the directed edge constituted by parent node and child node in the hierarchical network;

The attributes of the concept nodes at the two ends of the directed edge.
5.2.1 Directed Edge Category
There are many categories of relations between concepts, as shown in Table 2:

Table 2 Semantic relations

Seq. No.  Semantic relation  Extraction rule
1         ISA                If … then …
2         AKO                If … then …
3         Have               If … then …
4         Can                If … then …
5         Is                 If … then …
6         Part-Of            If … then …
7         Composed-Of        If … then …
8         Belong-To          If … then …
9         Time               If … then … or …
10        Position           If … then …
11        Others             If … then …
However, in the hierarchical network constituted by a domain ontology, generally only three main relations are considered, namely the inheritance relation, the whole-part relation and the synonymy relation, because these three relations have the highest proportion.
The weights corresponding to different directed edge categories are different. For the synonymy relation, the nodes at its two ends represent the same meaning, so the weight of such an edge should be larger than those of the other two categories. Besides, the directed edge weight of the inheritance relation is generally considered larger than that of the whole-part relation. Thus, the following relation between directed edge weights and their categories can be obtained:

w(synonymy) > w(inheritance) > w(whole-part)

where w denotes the weight of the directed edge constituted by a child node and its parent node.
5.2.2 Directed Edge Depth
A domain ontology can be considered as a hierarchical network graph. There is only one ingress (root) node in this graph, which represents the broadest concept of the domain. The second-level nodes are a partition of the ingress node (the first-level node), the third-level nodes are a further refinement based on the second-level nodes, and so on: every level is a conceptual refinement of the level above. The meanings of concepts are concrete at lower levels and, conversely, abstract at higher levels. Thus, the weight of a directed edge is related to its depth in the hierarchical network, and a relation between the directed edge weight and its depth can be obtained:

where the quantity involved is the depth of the node in the hierarchical network.
5.2.3 Directed Edge Density
The overall density of the domain ontology hierarchical network is a fixed value, but the density differs from place to place. If the node density of a certain local area of the hierarchical network is larger, it indicates that the concept refinement there is finer, and the weight of the corresponding edge is larger. Thus, a relation between the directed edge weight and the density can be obtained:

where the quantities involved are the in-degrees of the parent node and the child node in the hierarchical network, their respective out-degrees, and the in-degree and out-degree of the hierarchical network graph as a whole.
5.2.4 Directed Edge Strength
In the hierarchical network constituted by a domain ontology, a parent node may have multiple child nodes. If one child node is more important to the domain than the other child nodes, the weight of the directed edge constituted by this child node and its parent node should be larger. Thus, if we use the conditional proportion to quantify the strength of a directed edge, the following can be obtained:

Strength(c, p) = P(c | p) = P(c ∩ p) / P(p) ≈ P(c) / P(p)

where Strength(c, p) > Strength(c', p) represents that the former child is more important than the latter to this domain; P(c ∩ p) ≈ P(c) because, in the hierarchical network, the places where a child node appears can nearly be considered as places where its parent node appears as well; and the concept probabilities P(·) are estimated according to the computation model based on information content. Strength(c, p) represents the strength of the directed edge constituted by child node c and parent node p, and its contribution to the edge weight is scaled by an adjustment factor.
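The conditional-proportion idea above can be sketched as follows. This is a minimal illustration, not code from the thesis: the concept names, occurrence counts, and the frequency-based estimate of P(·) are all hypothetical.

```python
# A minimal sketch of the edge-strength idea: quantify how strongly a child
# concept is bound to its parent by the conditional proportion P(child | parent).
# The approximation P(child AND parent) ~= P(child) follows the assumption that
# wherever a child concept appears, its parent can be considered to appear too.

# Hypothetical occurrence counts for concepts in some domain corpus.
counts = {"linear structure": 1000, "stack": 180, "queue": 150}
total = sum(counts.values())

def probability(concept: str) -> float:
    """Estimate P(concept) from corpus occurrence counts."""
    return counts[concept] / total

def edge_strength(child: str, parent: str) -> float:
    """Strength(child, parent) = P(child AND parent) / P(parent) ~= P(child) / P(parent)."""
    return probability(child) / probability(parent)

print(edge_strength("stack", "linear structure"))  # 180 / 1000 = 0.18
```

Note that the shared corpus total cancels in the ratio, so only the relative frequencies of child and parent matter.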
5.2.5 Concept Node Attributes at the Two Ends of a Directed Edge
A domain ontology hierarchical network not only correctly defines the concepts of the domain and their relations, but also describes the attributes of every concept in detail. Thus, if the concepts corresponding to the child node and parent node at the two ends of a directed edge possess more attributes in common, it indicates that the relation between the parent node and the child node is closer, and the weight of the directed edge constituted by them is larger. Thus, the relation between directed edge weight and attributes can be obtained:

w(c, p) ∝ |A(c) ∩ A(p)| / |A(c) ∪ A(p)|

where A(c) and A(p) are respectively the attribute sets of concepts c and p, A(c) ∩ A(p) is the intersection of the two attribute sets, A(c) ∪ A(p) is their union, and |·| is the number of attributes counted.
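The attribute-overlap factor can be sketched as a Jaccard-style ratio of shared attributes to all attributes. The attribute sets below are hypothetical examples, not taken from the thesis ontology.

```python
def attribute_similarity(attrs_child: set, attrs_parent: set) -> float:
    """Ratio of shared attributes to all attributes (Jaccard overlap)."""
    if not attrs_child and not attrs_parent:
        return 0.0  # no attributes at all: treat overlap as zero
    return len(attrs_child & attrs_parent) / len(attrs_child | attrs_parent)

# Hypothetical attribute sets for two concepts.
stack = {"linear", "LIFO", "sequential-or-linked"}
linear_list = {"linear", "ordered", "sequential-or-linked"}
print(attribute_similarity(stack, linear_list))  # 2 shared of 4 total = 0.5
```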
5.3 Establishment of the Improved Computation Model
In this section, according to the special characteristics of domain ontology, we establish an improved concept semantic similarity computation model by taking advantage of the five factors influencing directed edge weight analysed in Section 5.2. The procedure of establishment is described as follows:
1 The domain ontology completed by domain specialists can be considered as a hierarchical, directed, loop-free graph G = (V, E), where V is the set of all the nodes in the graph, each node representing a concept of the domain together with its attribute set, and E is the set of all the directed edges in the graph, each directed edge representing some kind of relation existing between nodes.
2 As mentioned in Section 5.2, the unit directed edge weight of the hierarchical network constituted by the domain ontology is related to five factors, so all of these factors need to be fully considered when quantifying the weight of a directed edge. After substituting the relations between directed edge weight and category, depth, density, strength and attributes analysed in Section 5.2, we can obtain:

w(c, p) = λ1·f_cat(c, p) + λ2·f_depth(c, p) + λ3·f_density(c, p) + λ4·f_strength(c, p) + λ5·f_attr(c, p)

where the λi are adjustment factors with λ1 + λ2 + λ3 + λ4 + λ5 = 1 and λi ≥ 0, and the f terms are the quantized category, depth, density, strength and attribute factors respectively.
3 Because the length of a unit directed edge is inversely proportional to its weight, the computation model of the unit directed edge length can be obtained:

len(c, p) = β / w(c, p)

where β is an adjustable factor.
4 As the computation formula of the unit directed edge length in the ontology hierarchical network is now known, the distance between any two concept nodes in the hierarchical network can be obtained (here, we still use the Leacock computation model to compute the distance between two concepts in a domain ontology): the distance Dist(c1, c2) is the sum of the unit edge lengths along the shortest path between c1 and c2, where the shortest path passes through the closest common ancestor node of c1 and c2 in the hierarchical network.
5 As the distance between any two concepts in the ontology hierarchical network is known, the semantic similarity computation model of any two concepts can be obtained:

Sim(c1, c2) = θ / (Dist(c1, c2) + θ)

where θ is an amplification factor.
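The five steps above can be sketched end to end on a toy hierarchy. This is only an illustrative sketch: the edge weights are given directly (in the model they would come from the weighted combination of the five factors), and the concept names, weights, and the β and θ values are all hypothetical.

```python
import heapq

BETA, THETA = 1.0, 1.0  # hypothetical adjustable and amplification factors

# (parent, child) -> edge weight; a tiny hypothetical fragment of a hierarchy.
edges = {
    ("linear structure", "stack"): 0.8,
    ("linear structure", "queue"): 0.7,
    ("stack", "linked stack"): 0.6,
}

def edge_length(weight: float) -> float:
    """Step 3: unit edge length is inversely proportional to its weight."""
    return BETA / weight

def build_graph():
    """Undirected adjacency so paths may climb to a common ancestor and descend."""
    graph = {}
    for (a, b), w in edges.items():
        graph.setdefault(a, []).append((b, edge_length(w)))
        graph.setdefault(b, []).append((a, edge_length(w)))
    return graph

def distance(c1: str, c2: str) -> float:
    """Step 4: shortest-path distance between two concept nodes (Dijkstra)."""
    graph = build_graph()
    dist = {c1: 0.0}
    heap = [(0.0, c1)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == c2:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nxt, length in graph.get(node, []):
            nd = d + length
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")

def similarity(c1: str, c2: str) -> float:
    """Step 5: similarity decreases as the concept distance grows."""
    return THETA / (distance(c1, c2) + THETA)

print(round(similarity("queue", "linked stack"), 3))
```

The path from "queue" to "linked stack" climbs through their common ancestor "linear structure", so heavier (closer) edges shorten the path and raise the similarity.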
5.4 Evaluation of the Improved Computation Model
Figure 8 shows part of the structure graph for the subject "Data Structure", constructed according to the construction rules of ontology; the numbers in the graph represent the information amounts of the corresponding concepts.
The ontology modeling tool Protégé 3.4 is adopted to create part of the data structure ontology in the experiment. The similarity values between the concept "Linear Structure" and other concepts are obtained by means of the improved concept semantic similarity computation model. When computing semantic similarity, the impact of factors such as attribute and category is generally stronger than that of density, strength and depth, so the parameter values are chosen accordingly, with larger weights assigned to the attribute and category factors. Figure 9 shows the created "Linear Structure" ontology.
Figure 8 Ontology of linear structure
Figure 9 Screenshot of Linear Structure
(The figures show the concept nodes of the "Linear Structure" hierarchy, from "Linear Structure" down to nodes such as "Linear List", "Stack", "Queue" and "String" and their refinements, annotated with the information amount of each concept.)
Table 3 shows the similarity values between the concept "Linear Structure" and several other concepts, obtained by means of different similarity computation methods on the same ontology structure. As can be seen from the table, compared with the traditional computation models, the improved computation model is much closer to expert experience in quantifying the semantic similarity between concepts.
Table 3 Experimental results

Concept pair | Improved computation model | Traditional (content-based) | Traditional (distance-based) | Expert experience
Sim(Linear structure, Linear List) | 91.3% | 62.9% | 87.6% | 90%
Sim(Linear structure, Stack) | 90.5% | 81.2% | 83.2% | 88%
Sim(Linear structure, String) | 89.9% | 81.1% | 78.1% | 85%
Sim(Linear structure, Queue) | 84.2% | 80.5% | 78.3% | 83%
Sim(Linear structure, Linked Stack) | 80.3% | 86.5% | 87.6% | 78%
Sim(Linear structure, Sequential Stack) | 77.4% | 63.3% | 78.5% | 77%
Sim(Linear structure, Sequential storage) | 76.3% | 80.1% | 64.6% | 75%
Sim(Linear structure, Linked storage) | 68.2% | 73.6% | 62.8% | 70%
Sim(Linear structure, Circular Queue) | 67.8% | 62.7% | 70.3% | 68%
Sim(Linear structure, Linked Queue) | 65.3% | 59.3% | 63.6% | 66%
5.5 Summary
In this chapter, we analyse and explain the three traditional semantic similarity computation models, and propose an improved domain ontology-based concept similarity computation model according to the advantages and disadvantages of these three models as well as the specific properties of domain ontology. In this computation model, we first effectively quantize the five factors that affect the directed edge weights of the ontology hierarchical network according to the characteristics of the ontology network, and then combine them in a linear weighted fashion according to their impacts on directed edge weights, so as to quantify the semantic similarity between the concept nodes in the ontology network more comprehensively.
As shown by the experimental results, the improved computation model reflects the semantic similarity between concepts with better accuracy, which provides an effective quantization of the semantic relations between concepts.
CHAPTER 6
Improved Rank Algorithm Based on Categorization Technology
The traditional rank algorithms for search engines are all based on link structure analysis of Web pages. In the traditional PageRank algorithm, the PageRank score represents the probability that a user browses a certain Web page, while the score of the HITS algorithm is based on the Authority and Hub values of Web pages. These algorithms all rest on the assumption that the process of a user browsing Web pages is absolutely random, and ignore the influence of Web page content similarity on the importance of links, which makes them unable to fully reflect how the importance of a Web page differs between users. In this chapter, we design an improved HITS algorithm based on categorization, which combines the rank algorithm with categorization technology.
6.1 Combination of Categorization Technology and Link Structure-Based Algorithms
In previous research, some categorization technology-based improved algorithms have already been proposed. For example, the CategoryRank algorithm (Weizhu, C, Ying, C & Yan, W 2005) is an integrated rank algorithm that combines PageRank with categorization technology.
For any link from page u to page v, the CategoryRank algorithm obtains a weight according to the categories that u and v belong to and the similarity degree between those categories, and adds this weight value to the rank propagated along the link, so as to modify the computation method of PageRank and form the single-link-based CategoryRank computation. If u and v belong to the same or similar categories, while u and another target page belong to different categories, the link from u to v is obviously more important than the link to the unrelated page, which is reflected in a larger weight.
For each link referring from u to v, the algorithm first uses a categorizer, which regards the contents of u and v as its input. Assume that, in the categorization results, u tends more to category c_u, and v tends more to category c_v, and that P(c_u | u) and P(c_v | v) respectively represent the probability that u belongs to category c_u and that v belongs to category c_v. Then, in the CategoryRank algorithm, the similarity between u and v can be expressed as:

Sim(u, v) = P(c_u | u) · P(c_v | v) · cos(V_{c_u}, V_{c_v})

where Sim(u, v) is the similarity degree between u and v, and V_{c_u} and V_{c_v} respectively represent the feature vectors of categories c_u and c_v.
Finally, Sim(u, v) needs to be normalized in order to satisfy:

Σ_{v: u→v} Sim(u, v) = 1

where the relation between u and v is u→v, i.e. there exists a link referring from u to v. Modifying the PageRank formula by replacing the uniform transition weight with the category-related value Sim(u, v), the value computed by the link-based CategoryRank algorithm is:

CR(v) = (1 − d)/N + d · Σ_{u: u→v} Sim(u, v) · CR(u)
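The link-based computation above can be sketched as a small power iteration. This is an illustrative sketch only: the link graph, raw similarity values, and damping factor are hypothetical, and the per-page normalization of Sim(u, v) follows the constraint just described.

```python
# Sketch of the link-based CategoryRank iteration: the uniform PageRank
# transition is replaced by a per-link category similarity Sim(u, v),
# normalized so the similarities on each page's out-links sum to 1.
DAMPING = 0.85

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical link graph
raw_sim = {("A", "B"): 0.9, ("A", "C"): 0.3, ("B", "C"): 1.0, ("C", "A"): 1.0}

# Normalize: for every page u, Sim(u, v) over its out-links must sum to 1.
sim = {}
for u, outs in links.items():
    total = sum(raw_sim[(u, v)] for v in outs)
    for v in outs:
        sim[(u, v)] = raw_sim[(u, v)] / total

def category_rank(iterations: int = 50) -> dict:
    pages = list(links)
    n = len(pages)
    cr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        cr = {
            v: (1 - DAMPING) / n
               + DAMPING * sum(sim[(u, v)] * cr[u]
                               for u in pages if v in links[u])
            for v in pages
        }
    return cr

ranks = category_rank()
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

Because the normalized similarities on each page sum to 1, the total rank mass stays at 1, just as in standard PageRank; only its distribution shifts toward category-consistent links.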
Besides the relations between Web pages, the different appearances of a single Web page in different categories also need to be considered. For example, a certain Web page may be unimportant in one category but very important in another. Thus, according to the different categories, the Web page-based CategoryRank algorithm computes different importance weight values. Assume a system includes m categories in all; then each Web page has m category weight values, one per category. Finally, the system generates m CategoryRank vectors according to the different categories, each vector corresponding to one category, with the vector element values being the category weights of the Web pages. Assume the Web includes n pages in all; then all the computation results can be expressed by a matrix with n × m elements, where W_{ij} is the element value of the matrix, representing the category weight of Web page i in category j.
When creating the vector of category j, several kinds of processing can be applied to W_{ij}. For example, it can be combined with the value used to compute relevance, or with the PageRank value. Here, the algorithm directly combines W_{ij} into the link-based CategoryRank computation formula mentioned above. We can consider that when the Category values are changed by introducing category information into the formula, these values will eventually be transferred to the rank score, thereby achieving the goal of introducing page categorization technology into ranking. Before modifying the link-based CategoryRank, the algorithm first normalizes W_{ij} in order to satisfy:

Σ_{i=1}^{n} W_{ij} = 1

where i is the Web page ID, j is the category ID, and n is the total number of Web pages in the whole Web.
Finally, for each category j, a single CategoryRank vector needs to be created. Thus, the formula can be modified to:

CR_j(v) = (1 − d)/N + d · Σ_{u: u→v} Sim(u, v) · W_{vj} · CR_j(u)

The formula above considers not only the category similarity of links, but also the category difference of Web pages. Finally, the category information of each Web page is integrated into the CategoryRank algorithm.
The CategoryRank algorithm is based on the category information between any two Web pages, and performs analysis and computation on the link graph. Compared with the PageRank algorithm, it can simulate users' habits in browsing Web pages more accurately. Meanwhile, for each Web page in the Web, the algorithm computes its category attributes, which directly reflect the differing importance of the page to different users.
Compared with the traditional PageRank algorithm, the CategoryRank algorithm adds category parameters and the corresponding computation.
6.2 Basic Idea of Categorization
6.2.1 Implementation of Categorization
In order to implement categorization, we propose two assumptions:
1 Every Web page is categorizable, and can be marked with a main category, denoted by C.
2 Relations of different degrees exist between categories, expressed by the category similarity degree.
For every Web page, there is always a related content topic, which belongs to a certain category of knowledge, or has the features of a certain category. These categories can be further subdivided into more professional and specific subcategories. The category division is made according to the content topic of the Web page, not its properties. For example, the homepage of the Google Website can be assigned to the search engine subclass of network technology, and a news page can be assigned to the class related to its news topic.
For the same Web page, its topic may be related to category A, and may also be related to categories B and C to different degrees. This relativity can be divided into:
Directly topic-related
The content of a Web page may include several categories of topics, so there is a directly topic-related relation between this Web page and the related topics. For example, the homepage of msn covers many different classes of content, so this Web page is directly related to these topics. The direct relativity between a Web page and each topic can be expressed by the vector D = (d_1, d_2, …, d_m), where m is the number of categories.
Indirectly topic-related
Relations also exist between the categories themselves. If Web page p is determined to be related to category A, and the relativity between category A and category B is very high, then p can be considered to have an indirect topic relativity to category B as well. Its relativity to B is determined jointly by the relativity between category A and category B and the relativity between Web page p and category A. The indirect relativity between a Web page and each topic can be expressed by the vector I = (i_1, i_2, …, i_m).
The vector sum of the direct and indirect topic category vectors of a Web page is its topic category vector:

T = D + I
6.2.2 Pre-categorization Processes
6.2.2.1 Pre-categorization of Web Pages
When search engines store data, they compute the categories of Web pages and save this category information. This process is called the pre-categorization of Web pages.
First, determine the main category of a Web page, i.e. extract the Web page topic from its contents, and compute the direct category vector of the page. Then, taking the direct category vector as input, compute the indirect category vector according to the category similarity table. Finally, combine the direct category vector with the indirect category vector to obtain the topic category vector of the Web page.
The category similarity table is based on the inherent attributes of the knowledge categories. Its establishment can follow the domain ontology-based semantic similarity computation model established in Chapter 5. Subcategories within the same main category have higher similarity, while the similarity between different main categories is relatively lower.
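The pre-categorization pipeline for one Web page can be sketched as follows. The categories, the similarity table, and the direct vector values are all hypothetical, and the way the similarity table propagates direct weight into indirect weight (excluding the diagonal so a category does not reinforce itself) is one plausible reading of the process described above.

```python
# Sketch of pre-categorization for one Web page: a direct category vector from
# the categorizer is propagated through the category similarity table to yield
# an indirect vector, and the two are summed into the topic category vector.
categories = ["search", "news", "sports"]

# Hypothetical category similarity table S[i][j].
similarity_table = [
    [1.0, 0.4, 0.1],
    [0.4, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]

direct = [0.8, 0.2, 0.0]  # direct category vector from the Web categorizer

# Indirect relativity toward category j: contributions from every other
# category i that the page is directly related to, weighted by S[i][j].
indirect = [
    sum(direct[i] * similarity_table[i][j]
        for i in range(len(categories)) if i != j)
    for j in range(len(categories))
]

topic = [d + ind for d, ind in zip(direct, indirect)]  # topic category vector
print([round(x, 2) for x in topic])  # → [0.88, 0.52, 0.14]
```

The same pipeline, with a keyword categorizer in place of the Web categorizer, gives the keyword pre-categorization of Section 6.2.2.2.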
6.2.2.2 Pre-categorization of Keywords
The process corresponding to the pre-categorization of Web pages is the pre-categorization of keywords. When a user inputs a keyword and retrieves Web pages with it, the categories of the candidate Web pages are screened according to the categories to which the meaning of the keyword belongs; the Web pages themselves are then screened according to the chosen categories. This two-level screening process composes the pre-processing model that implements categorization technology.
The pre-categorization model of keywords is the same as that of Web pages. First, obtain the main category of the keyword with the keyword categorizer. Then, compute the indirect topic category according to the category similarity table. Finally, obtain the weight values of the keyword in all categories by adding the direct and indirect topic categories.
Figure 10 Pre-categorization framework
(The figure shows two parallel pipelines sharing the category similarity table: Web pages pass through the Web categorizer to a direct topic vector and then to a per-page topic category result; keywords pass through the keyword categorizer to a direct topic vector and then to a per-keyword topic category result, each combining direct and indirect topic categories.)
6.3 Modeling
6.3.1 Category Selective Mechanism
Because the number of Web pages is huge and growing explosively, pre-processing for the searching process is particularly important. If selection can reduce the scope of the search results before all Web pages are searched, it helps to improve retrieval efficiency.
The process of the selective mechanism is described as follows: select a keyword similarity coefficient α and a category similarity coefficient β, with 0 ≤ α, β ≤ 1. When a user retrieves a keyword, the search engine first selects the categories whose similarity lies between α and 1 according to the category vector of the retrieval keyword. Then the search engine selects the Web pages whose category similarity lies between β and 1 according to the category vectors of the Web pages.
Through this two-level categorization selective mechanism, the searching scope is compared with the category similarity of Web pages according to the keyword, and the Web pages with great differences in category are discarded. Because users hardly ever look at low-ranked information when browsing the results returned by a search engine, the discarded Web pages do not affect the searching performance from the user's point of view. Moreover, the number of Web pages is greatly reduced, so the overhead of the iterative rank computation is reduced as well.
The selection of α: when the value of α tends to 1, the required similarity of the keyword category is highest, and only the direct topic category is used to select Web pages. When the value of α tends to 0, the mechanism is equivalent to making no category selection.
The selection of β: when the value of β tends to 1, the similarity between the Web page category and the keyword category is required to be highest, and only Web pages of the same category are retrieved. The coverage rate of the retrieved Web pages is then relatively low, and important information is easily missed. When the value of β tends to 0, the required similarity between the Web page category and the keyword category is lowest, all Web pages are retrieved, and the mechanism degenerates to one without pre-categorization.
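The two-level screening can be sketched directly from the α and β rules above. The categories, pages, threshold values, and similarity scores below are all hypothetical.

```python
# Sketch of the two-level category selective mechanism: keep categories whose
# similarity to the keyword lies in [alpha, 1], then keep Web pages whose
# similarity to at least one surviving category lies in [beta, 1].
ALPHA, BETA = 0.6, 0.5

keyword_category_sim = {"search": 0.9, "news": 0.7, "sports": 0.2}
page_category_sim = {
    "page1": {"search": 0.8, "news": 0.1, "sports": 0.0},
    "page2": {"search": 0.2, "news": 0.6, "sports": 0.1},
    "page3": {"search": 0.1, "news": 0.2, "sports": 0.9},
}

# Level 1: categories similar enough to the keyword's category vector.
kept_categories = {c for c, s in keyword_category_sim.items() if ALPHA <= s <= 1}

# Level 2: pages similar enough to at least one surviving category.
kept_pages = {
    p for p, sims in page_category_sim.items()
    if any(BETA <= sims[c] <= 1 for c in kept_categories)
}

print(sorted(kept_categories), sorted(kept_pages))
# → ['news', 'search'] ['page1', 'page2']
```

Raising α or β shrinks the candidate set (faster ranking, lower coverage); lowering them toward 0 recovers the unscreened search, matching the limit behaviour described above.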
6.3.2 Integrating the HITS Algorithm with Categorization
The basic principle of the HITS algorithm is that the number of links referring to a Web page from outside represents the authority degree of that page, measured by its authority value a; the number of links a Web page refers to externally represents how much information the page can provide as an information center, i.e. its hub degree, measured by its hub value h.
The relation between the authority value and hub value of Web pages is shown in Figure 11.
Figure 11 Relation between authority value and hub value
By several iterative computations over the link structure of the Web pages, the authority value and hub value of each Web page can be obtained.
We now consider the degree of relatedness between the links of a Web page and the topic categories: links whose target belongs to the same or a similar topic category are considered more important.
Take the sum of the differences between Web pages u and v on each category vector component as the difference degree of the two Web pages' categories.
Compute the difference degree between the category attributes of every Web page u that refers to Authority Web page v, and v itself, expressed by:

D_a(u, v) = Σ_{k=1}^{m} |U_k − V_k| / m

In the above formula, m is the total category amount, |U_k − V_k| is the difference degree between Web page u, which refers to Authority Web page v, and Web page v on category component k; dividing by m gives the ratio of the difference degree, and D_a(u, v) is the summation of the difference-degree ratios over the category components, so as to obtain the difference degree over all categories.
Compute the difference degree between the category attributes of every Web page v referred to by Hub Web page u, and u itself, expressed by:

D_h(u, v) = Σ_{k=1}^{m} |U_k − V_k| / m

In the above formula, m is the total category amount, |U_k − V_k| is the difference degree between Web page v, which is referred to by Hub Web page u, and Web page u on category component k; dividing by m gives the ratio of the difference degree, and D_h(u, v) is the summation of the difference-degree ratios over the category components, so as to obtain the difference degree over all categories.
Here U_k and V_k are respectively the vector attributes of Web pages u and v on category k; their values are generated in the pre-categorization process of the Web pages.
After normalization, so that all difference degrees lie between 0 and 1, the category similarities of the Web pages connected by links are computed from the category difference degrees:

Sim_a(u, v) = 1 − D_a(u, v),  Sim_h(u, v) = 1 − D_h(u, v)

According to the formulas above, the Authority value and Hub value formulas of the HITS algorithm are modified to:

a(v) = Σ_{u: u→v} Sim_a(u, v) · h(u),  h(u) = Σ_{v: u→v} Sim_h(u, v) · a(v)
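The category-weighted mutual reinforcement can be sketched as a small iteration. This is an illustrative sketch: the link graph and per-link similarity values are hypothetical, a single similarity per link stands in for the authority- and hub-side similarities, and the L2 normalization is the standard HITS device for keeping scores bounded.

```python
import math

# Sketch of the category-weighted HITS update: every link u -> v carries a
# category similarity that scales the hub/authority reinforcement, instead of
# treating all links equally.
links = {"A": ["C"], "B": ["C", "D"], "C": [], "D": []}  # hypothetical graph
sim = {("A", "C"): 0.9, ("B", "C"): 0.5, ("B", "D"): 0.8}

def categorized_hits(iterations: int = 30):
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: similarity-weighted sum of hub scores of in-linking pages.
        auth = {v: sum(sim[(u, v)] * hub[u]
                       for u in pages if v in links[u]) for v in pages}
        # Hub: similarity-weighted sum of authority scores of linked-to pages.
        hub = {u: sum(sim[(u, v)] * auth[v] for v in links[u]) for u in pages}
        # L2-normalize so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {p: x / a_norm for p, x in auth.items()}
        hub = {p: x / h_norm for p, x in hub.items()}
    return auth, hub

auth, hub = categorized_hits()
print(max(auth, key=auth.get))  # the most authoritative page
```

Here page C, which receives the category-consistent (high-similarity) links, ends up with the highest authority; setting every similarity to 1 recovers ordinary HITS.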
6.4 Evaluation of the Integrated Algorithm
The formulas above integrate the category information of each Web page into the HITS algorithm, and the category relativity of links is considered while computing over the link structure of the Web pages. Through this method, the shortcoming of the traditional HITS algorithm that every link between Web pages is treated equally is overcome. The category information of the Web pages themselves is combined with the link information and used as a correction parameter of the computation, so the accuracy of the HITS algorithm is improved.
In terms of computational complexity, the application of categorization information modifies the base set of the HITS algorithm by means of the pre-categorization mechanism. Because the HITS algorithm itself must perform iterative computation, the use of pre-categorization can greatly reduce the computation overhead by reducing the number of Web pages, so as to improve the performance of the HITS algorithm. Meanwhile, the computation overhead of the categorization process is proportional to the total category amount m. Because the magnitude of m is always smaller than that of the number of Web pages, the improved categorization-based HITS algorithm is better than the original HITS algorithm in algorithmic complexity.
6.5 Summary
In this chapter, we improve the HITS algorithm from two aspects, Web page pre-processing and the analysis of the link structure of Web pages, and provide a complete and detailed mathematical expression. The complete algorithm shows that, with category information, more accurate rank results can be obtained. The shortcoming of the HITS algorithm caused by analysis and computation based only on the link structure of Web pages is overcome, and Web pages in the same or similar categories gain higher relativity, so that the rank algorithm simulates users' real browsing habits more accurately.
CHAPTER 7
Conclusion
In this thesis, through research and analysis on the classical link structure-based algorithms and their related improvements, we propose an improved HITS algorithm based on categorization technology.
7.1 Summary of contributions
Improved domain ontology-based concept similarity computation based on five decision factors.
We effectively quantize the five factors that affect the directed edge weights of the ontology hierarchical network according to the characteristics of the ontology network, and then combine them in a linear weighted fashion according to their impacts on directed edge weights, so as to quantify the semantic similarity between the concept nodes in the ontology network more comprehensively.
Improved HITS algorithm based on categorization technology.
We integrate the category information of each Web page into the HITS algorithm, and the category relativity of links is considered when computing over the link structure of Web pages. Through this method, the shortcoming of the traditional HITS algorithm that every link between Web pages is treated equally is overcome. We combine the category information of the Web pages themselves with the link information and use it as a correction parameter of the computation, so as to enhance the accuracy of the HITS algorithm.
Besides, through the pre-categorization mechanism, the use of categorization information modifies the base set of the HITS algorithm. Because the HITS algorithm itself must perform iterative computation, the use of pre-categorization can greatly reduce the computation overhead and improve the performance of the HITS algorithm.
7.2 Future work
In this thesis, we mainly discuss theoretical research on, and improvements to, rank algorithms for search engines. Link structure-based algorithms are still the most common ranking technology in use today, and are relatively mature in commercial search engine applications.
With the passage of time, personalized user-oriented search engines will eventually take the place of today's mainstream search engines. However, a personalized search engine needs a long period of study of users' habits in using search engines, their topics of interest, and other usage characteristics, including their habits in choosing keywords and in choosing results from the returned query set, and the characteristics of their query content, so as to become a search engine specially customized to each user's habits. In this thesis, we make only a limited exploration of user habits at a general level; analysis and improved models aimed at a single user are not involved, so the simulation of user behaviour is relatively rough. Research on personalized search engines requires a large amount of user feedback and statistical data accumulated over a long period of time, which can be carried out in future work.
References
Berners-Lee, T, Hendler, J & Lassila, O 2001, 'The semantic web', Scientific American, 284(5), pp. 34-43.
Bharat, MK & Henzinger, R 1998, 'Improved algorithm for topic distillation in a hyperlinked environment', in Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval.
Brin, S & Page, L 1998, 'The anatomy of a large-scale hypertextual web search engine', in Proceedings of the WWW7 Conference, pp. 107-117.
Broder, A, Kumar, R & Maghoul, F 2000, 'Graph structure in the web: experiments and models', in Proceedings of the Ninth International World-Wide Web Conference.
Budanitsky, A & Hirst, G 2004, 'Evaluating WordNet-based measures of lexical semantic relatedness', Computational Linguistics, 1(1), pp. 1-49.
Chakrabarti, S, Dom, B & Gibson, D 1999, 'Mining the link structure of the World Wide Web', IEEE Computer, 32(8).
Chakrabarti, S, Dom, B & Raghavan, P 1998, 'Automatic resource compilation by analyzing hyperlink structure and associated text', in Proceedings of the Seventh International World-Wide Web Conference.
Chakrabarti, S, van den Berg, M & Dom, B 1999, 'Focused crawling: a new approach to topic-specific web resource discovery', in Proceedings of the Eighth International World-Wide Web Conference.
Dean, J & Henzinger, RM 1999, 'Finding related pages on the Web', in Proceedings of the WWW8 Conference, pp. 389-401.
Gan, KW & Wong, PW 2000, 'Annotating information structures in Chinese texts using HowNet', Hong Kong: Second Chinese Language Processing Workshop, pp. 85-92.
Gruber, T 1993, 'Ontolingua: a translation approach to portable ontology specification', Knowledge Acquisition, 5(2), pp. 199-200.
Haveliwala, HT 1999, 'Efficient computation of PageRank', Stanford Database Group Technical Report.
Haveliwala, HT 2002, 'Topic-sensitive PageRank', in Proceedings of the Eleventh International World Wide Web Conference.
Kleinberg, J 1999, 'Authoritative sources in a hyperlinked environment', Journal of the ACM, 46(5), pp. 604-632.
Lawrence, S & Lee Giles, C 1998, 'Context and page analysis for improved Web search', IEEE Internet Computing, pp. 38-46.
Ling, Z & Fanyuan, M 2004, 'Accelerated evaluation algorithm: a new method to improve Web structure mining quality', Computer Research and Development, 41(1), pp. 98-103.
Mendelzon, OA & Rafiei, D 2000, 'What do the neighbors think? Computing web page reputations', IEEE Data Engineering Bulletin, pp. 9-16.
Ming, L, Jianyong, W & Baojue, C 2001, 'Improved relevance ranking in WebGather', Journal of Computer Science and Technology, 16(5).
Motwani, R & Raghavan, P 1995, Randomized Algorithms, Cambridge University Press.
Neches, R, Fikes, R, Finin, T, Gruber, T, Patil, R, Senator, T & Swartout, WR 1991, 'Enabling technology for knowledge sharing', AI Magazine, 12(3), pp. 36-56.
Page, L, Brin, S & Motwani, R 1998, 'The PageRank citation ranking: bringing order to the Web', Technical report, Computer Science Department, Stanford University.
Qianhong, P & Ju, W 1999, 'Attribute theory-based text similarity computation', Computer Journal, 22(6), pp. 651-655.
Qun, L & Sujian, L 2002, 'CNKI-based word semantic similarity computation', Computer Linguistics and Chinese Information Processing, 2002(7), pp. 59-76.
Steichen, O & Daniel-Le Bozec, C 2005, 'Computation of semantic similarity within an ontology of breast pathology to assist inter-observer consensus', Computers in Biology and Medicine, (4), pp. 1-21.
Studer, R, Benjamins, VR & Fensel, D 1998, 'Knowledge engineering: principles and methods', Data and Knowledge Engineering, 25, pp. 161-197.
Sujian, L 2002, 'Research on semantic computation-based sentence similarity', Computer Engineering and Application, 38(7), pp. 75-76.
Weizhu, C, Ying, C & Yan, W 2005, 'Categorization technology-based rank algorithm for search engine – CategoryRank', Computer Application, 2005(5).
Xiaofeng, Z, Xinting, T & Yongsheng, Z 2006, 'Ontology technology-based Internet intelligent search research', Computer Engineering and Design, 27(7), pp. 1194-1197.
Xuelong, W, Xuemei, Z & Xiangwei, L 2006, 'Application and improvement of the time parameter in the HITS algorithm', Modern Computer.
Zhihong, D & Shiwei, T 2002, 'Ontology research review', Beijing University Journal (Natural Science Edition), (5), pp. 730-738.
APPENDIX A
Semantic Relations
In Table 2, S represents a sentence, NP represents a noun phrase, NP1 represents an individual noun phrase, NP2 represents a category noun phrase, VP represents a verb phrase, V represents an original verb phrase, and PP represents a prepositional phrase. All the verb forms in the rules include the various deformations of their verbs.
Seq. No. | Semantic relation | Extraction rule
1        | ISA               | If … Then …
2        | AKO               | If … Then …
3        | Have              | If … Then …
4        | Can               | If … Then …
5        | Is                | If … Then …
6        | Part-Of           | If … Then …
7        | Composed-Of       | If … Then …
8        | Belong-To         | If … Then …
9        | Time              | If … Then … or …
10       | Position          | If … Then …
11       | Others            | If … Then …
Explanations:
Rule 1: If a sentence can be expressed in the form …, where … is an
individual noun phrase and NP2 is a category noun phrase, then the semantic
relation can be extracted as ….
Rule 2: If a sentence can be expressed in the form …, where … is a
category noun phrase and … holds, then the semantic relation can be extracted
as ….
Rule 3: If a sentence can be expressed in the form …, then the
semantic relation can be extracted as ….
Rule 4: If a sentence can be expressed in the form …, then the semantic
relation can be extracted as ….
Rule 5: If a sentence can be expressed in the form …, then the semantic
relation can be extracted as ….
Rule 6: If a sentence can be expressed in the form …, then
the semantic relation can be extracted as ….
Rule 7: If a sentence can be expressed in the form … or
…, then the semantic relation can be extracted as ….
Rule 8: If a sentence can be expressed in the form … or
…, then the semantic relation can be extracted as ….
Rule 9: If a sentence can be expressed in the form …, then the semantic
relation can be extracted as … or ….
Rule 10: If a sentence can be expressed in the form …, then the
semantic relation can be extracted as ….
Rule 11: If a sentence can be expressed in the form …, and not
…, then the semantic relation can be extracted as ….
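Since the concrete phrase-structure patterns were lost from the table above, the flavour of pattern-based relation extraction can still be sketched. The sentence patterns below are illustrative assumptions only (simple regular expressions standing in for the original phrase-structure forms); the rule names match Rules 1, 3, 4, and 6.

```python
import re

# Hypothetical sketch of pattern-based semantic relation extraction in the
# spirit of the rules above. The regex patterns are placeholders for the
# original phrase-structure forms, which are not reproduced here.
RULES = [
    # (relation name, regex with two capture groups for the related phrases)
    ("ISA",     re.compile(r"^(\w[\w ]*?) is an? (\w[\w ]*)$")),
    ("Have",    re.compile(r"^(\w[\w ]*?) has (\w[\w ]*)$")),
    ("Can",     re.compile(r"^(\w[\w ]*?) can (\w[\w ]*)$")),
    ("Part-Of", re.compile(r"^(\w[\w ]*?) is part of (\w[\w ]*)$")),
]

def extract_relations(sentence: str):
    """Return (relation, arg1, arg2) triples matched by the rule patterns."""
    triples = []
    cleaned = sentence.strip().rstrip(".")
    for name, pattern in RULES:
        m = pattern.match(cleaned)
        if m:
            triples.append((name, m.group(1), m.group(2)))
    return triples
```

For example, "A stack is a linear structure" would yield an ISA triple, while a sentence matching none of the patterns yields nothing.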
APPENDIX B
“Linear Structure” Ontology
Linear Structure
Name Cardinality Type Other Facets
Attribute Multiple String Value=Stack, Linear List, String, Queue
Depth Required Single Integer Minimum=1, Maximum=4, Value=1, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=1.0, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.0, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=7.03, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.4, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Stack
Name Cardinality Type Other Facets
Attribute Multiple String Value=Linear Structure, Linked Stack, Sequential Stack
Depth Required Single Integer Minimum=1, Maximum=4, Value=2, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.18, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=8.61, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.2, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Linked Stack
Name Cardinality Type Other Facets
Attribute Multiple String Value=Stack
Depth Required Single Integer Minimum=1, Maximum=4, Value=3, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.12, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=9.02, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.0, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Sequential Stack
Name Cardinality Type Other Facets
Attribute Multiple String Value=Stack
Depth Required Single Integer Minimum=1, Maximum=4, Value=3, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.12, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=9.03, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.0, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Linear List
Name Cardinality Type Other Facets
Attribute Multiple String Value=Linear Structure, Sequential storage structure, Linked
storage structure
Depth Required Single Integer Minimum=1, Maximum=4, Value=2, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.53, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=8.92, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.2, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Sequential storage structure
Name Cardinality Type Other Facets
Attribute Multiple String Value=Linear List, Static Allocation, Dynamic Allocation
Depth Required Single Integer Minimum=1, Maximum=4, Value=3, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.24, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=9.31, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.2, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Static Allocation
Name Cardinality Type Other Facets
Attribute Multiple String Value=Sequential storage structure
Depth Required Single Integer Minimum=1, Maximum=4, Value=4, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.18, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=10.46, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.0, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Dynamic Allocation
Name Cardinality Type Other Facets
Attribute Multiple String Value=Sequential storage structure
Depth Required Single Integer Minimum=1, Maximum=4, Value=4, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.18, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=11.46, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.0, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Linked storage structure
Name Cardinality Type Other Facets
Attribute Multiple String Value=Linear List, Single Linked List, Circular Linked List,
Double Linked List, Double Linked Circular List
Depth Required Single Integer Minimum=1, Maximum=4, Value=3, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.35, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=9.81, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.4, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Single Linked List
Name Cardinality Type Other Facets
Attribute Multiple String Value=Linked storage structure
Depth Required Single Integer Minimum=1, Maximum=4, Value=4, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.18, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=13.46, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.0, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Circular Linked List
Name Cardinality Type Other Facets
Attribute Multiple String Value=Linked storage structure
Depth Required Single Integer Minimum=1, Maximum=4, Value=4, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.18, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=10.36, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.0, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Double Linked List
Name Cardinality Type Other Facets
Attribute Multiple String Value=Linked storage structure
Depth Required Single Integer Minimum=1, Maximum=4, Value=4, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.18, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=12.46, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.0, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Double Linked Circular List
Name Cardinality Type Other Facets
Attribute Multiple String Value=Linked storage structure
Depth Required Single Integer Minimum=1, Maximum=4, Value=4, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.18, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=10.86, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.0, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
String
Name Cardinality Type Other Facets
Attribute Multiple String Value=Linear Structure
Depth Required Single Integer Minimum=1, Maximum=4, Value=2, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.06, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=8.41, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.0, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Queue
Name Cardinality Type Other Facets
Attribute Multiple String Value=Linear Structure, Circular Queue, Linked Queue
Depth Required Single Integer Minimum=1, Maximum=4, Value=2, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.18, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=8.5, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.2, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Circular Queue
Name Cardinality Type Other Facets
Attribute Multiple String Value=Queue
Depth Required Single Integer Minimum=1, Maximum=4, Value=3, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.12, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=9.5, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.0, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
Linked Queue
Name Cardinality Type Other Facets
Attribute Multiple String Value=Queue
Depth Required Single Integer Minimum=1, Maximum=4, Value=3, Default=1
Importance Degree Required Single Float Minimum=0.0, Maximum=1.0, Value=0.12, Default=0.0
In degree Required Single Float Minimum=0.0, Value=0.1, Default=0.0
Information Amount Required Single Float Minimum=0.0, Value=9.61, Default=0.0
Out degree Required Single Float Minimum=0.0, Value=0.0, Default=0.0
Semantic Relation Required Single String Value=Entirety-part
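The frames above share one slot layout (Attribute, Depth, Importance Degree, In degree, Information Amount, Out degree, Semantic Relation) with the facet ranges listed in each table. A minimal sketch of one possible in-memory representation follows; the class and field names mirror the slot names, but the representation itself is an illustration, not the thesis implementation, and the example values are copied from the "Stack" frame.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Concept:
    """One ontology frame, with slots matching the tables above."""
    name: str
    attributes: List[str] = field(default_factory=list)  # Attribute: Multiple, String
    depth: int = 1                   # Integer, Minimum=1, Maximum=4, Default=1
    importance_degree: float = 0.0   # Float in [0.0, 1.0], Default=0.0
    in_degree: float = 0.0
    information_amount: float = 0.0
    out_degree: float = 0.0
    semantic_relation: str = "Entirety-part"

    def __post_init__(self):
        # Enforce the facet ranges listed in the tables.
        if not 1 <= self.depth <= 4:
            raise ValueError("Depth must lie in [1, 4]")
        if not 0.0 <= self.importance_degree <= 1.0:
            raise ValueError("Importance Degree must lie in [0.0, 1.0]")

# The "Stack" frame from this appendix, expressed with this representation:
stack = Concept(
    name="Stack",
    attributes=["Linear Structure", "Linked Stack", "Sequential Stack"],
    depth=2,
    importance_degree=0.18,
    in_degree=0.1,
    information_amount=8.61,
    out_degree=0.2,
)
```

Validating the facets at construction time keeps the Minimum/Maximum constraints from the tables in one place rather than scattered across the code that builds the ontology.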