Personalized Ontology Learning for
Enhancing Text Mining Effectiveness
Yan Shen
January 2013
Electrical Engineering and Computer Science
Science and Engineering Faculty
Queensland University of Technology
A final report submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet
requirements for an award at this or any other higher education institution. To
the best of my knowledge and belief, the thesis contains no material previously
published or written by another person except where due reference is made.
Signature
Date 06/05/2013
Acknowledgements
I would like to express my sincere gratitude to the people who have assisted and
encouraged me during the research journey. In particular, first of all, I wish to
thank my supervisors - Professor Yuefeng Li, Associate Professor Yue Xu, and
Adjunct Professor Renato Iannella. Without their consistent instructions and
strong support, the research project could not have been completed successfully. Beyond
the research, Prof. Yuefeng Li has also been a fine mentor, demonstrating a positive
attitude and a spirit of diligence in life.
Queensland University of Technology deserves my sincere gratitude for offering
an enjoyable working environment. Thanks especially to the Science and Engineering
Faculty and to my school, Electrical Engineering and Computer Science, which offered
a number of training courses and enabled me to attend conferences.
It is a pleasure to acknowledge the great collaboration with the e-Discovery lab.
Special thanks to my previous colleagues, Dr. Daniel Tao, Dr. Susan Zhou, and
Dr. Abdulmohsen Algarni. Their advice and shared experience were invaluable in
resolving the difficulties encountered during my study period.
I owe a great deal to my wife, Lilly Xu, and dedicate this thesis to my parents
and grandma. Without their selfless love and early education, the thesis would
have been an impossible mission.
Many thanks to the blind thesis examiners; their valuable advice and constructive
comments were important in refining the thesis and shaping future work. Some of
the general concerns had not been recognized before the examination.
Finally, thanks to Helen Whittle for proofreading the thesis, which made it
more readable for native and non-native speakers alike.
Yan Shen
26 April 2013
Abstract
Over the last decade, the majority of existing search techniques have been either
keyword-based or category-based, resulting in unsatisfactory effectiveness. Meanwhile,
studies have illustrated that more than 80% of users prefer personalized search
results. As a result, many studies (referred to as collaborative filtering) have
devoted a great deal of effort to investigating personalized notions for enhancing
retrieval performance. One of the fundamental yet most challenging steps is to capture
precise user information needs. Most Web users are inexperienced or lack the
capability to express their needs properly, whereas existing retrieval systems
are highly sensitive to vocabulary.
Researchers have increasingly proposed the utilization of ontology-based tech-
niques to improve current mining approaches. The related techniques are not only
able to refine search intentions within specific domains, but also to access
new knowledge by tracking semantic relations. In recent years, some researchers
have attempted to build ontological user profiles according to discovered user
background knowledge. The knowledge is drawn from both global and local
analyses, which aim to produce tailored ontologies from a group of concepts.
However, a key problem here that has not been addressed is: how to accurately
match diverse local information to universal global knowledge.
This research conducts a theoretical study on the use of personalized ontolo-
gies to enhance text mining performance. The objective is to understand user
information needs by a “bag-of-concepts” rather than “words”. The concepts are
gathered from a general world knowledge base named the Library of Congress
Subject Headings. To return desirable search results, a novel ontology-based
mining approach is introduced to discover accurate search intentions and learn
personalized ontologies as user profiles. The approach can not only pinpoint
users’ individual intentions in a rough hierarchical structure, but can also in-
terpret their needs by a set of acknowledged concepts. Along with global and
local analyses, another solid concept matching approach is carried out to address
the mismatch between local information and world knowledge. Relevance
features produced by the Relevance Feature Discovery model are determined as
representatives of local information. These features have been proven to be the best
alternative to user queries for avoiding ambiguity, and they consistently outperform
the features extracted by other filtering models. The two proposed approaches
are both evaluated in a scientific evaluation with the standard Reuters
Corpus Volume 1 testing set. A comprehensive comparison is made with a num-
ber of state-of-the-art baseline models, including TF-IDF, Rocchio, Okapi
BM25, the deploying Pattern Taxonomy Model, and an ontology-based model.
The gathered results indicate that top precision can be improved remarkably
with the proposed ontology mining approach, and that the matching approach
is successful, achieving significant improvements in most information filtering
measurements.
This research contributes to the fields of ontological filtering, user profiling,
and knowledge representation. The related outputs are critical when systems are
expected to return proper mining results and provide personalized services. The
scientific findings have the potential to facilitate the design of advanced preference
mining models, which impact people's daily lives.
Contents
Acknowledgements vii
Abstract ix
List of Figures xix
List of Tables xx
Notation, Terminology, and Abbreviations xxi
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Questions and Motivations . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Significance and Major Contributions . . . . . . . . . . . . . . . . 13
1.4 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Literature Review 19
2.1 Classic Information Retrieval Models . . . . . . . . . . . . . . . . 21
2.2 Web Personalization . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.1 Explicit and Implicit Feedback . . . . . . . . . . . . . . . . 29
2.2.2 User Information Needs . . . . . . . . . . . . . . . . . . . 30
2.2.3 User Profiling . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3 Knowledge Discovery and Pattern Mining . . . . . . . . . . . . . 34
2.3.1 Process of Knowledge Discovery . . . . . . . . . . . . . . . 35
2.3.2 Tasks of Data Mining . . . . . . . . . . . . . . . . . . . . . 36
2.3.3 Pattern Mining and Association Rules . . . . . . . . . . . 38
2.3.4 Association Rules and Frequent Itemsets Mining . . . . . . 40
2.3.5 Efficient Mining Algorithms . . . . . . . . . . . . . . . . . 41
2.3.6 Mining Diverse Patterns . . . . . . . . . . . . . . . . . . . 43
2.4 Ontology Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.4.1 Ontology-Based Techniques . . . . . . . . . . . . . . . . . 47
2.4.2 Concept Mapping . . . . . . . . . . . . . . . . . . . . . . . 50
2.4.3 Semantic Relations . . . . . . . . . . . . . . . . . . . . . . 53
2.4.4 Specificity and Exhaustivity Ontology Mining . . . . . . . 54
2.4.5 Knowledge Base: LCSH and its Subject Headings . . . . . 55
2.4.6 Ontology Construction Language . . . . . . . . . . . . . . 59
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3 Ontology-Based Technique for Search Intent Mining 71
3.1 Background and Overview of the Approach . . . . . . . . . . . . . 72
3.2 World Knowledge Base . . . . . . . . . . . . . . . . . . . . . . . . 75
3.3 Personalized Ontology Learning . . . . . . . . . . . . . . . . . . . 79
3.3.1 Subject-Based Model . . . . . . . . . . . . . . . . . . . . . 79
3.3.2 Semantic Extraction . . . . . . . . . . . . . . . . . . . . . 80
3.3.3 In-Level Ontology Mining Method . . . . . . . . . . . . . . 82
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4 Relevance Feature Matching 91
4.1 Design and Definitions . . . . . . . . . . . . . . . . . . . . . . . . 92
4.1.1 Definitions of Patterns and Closed Patterns . . . . . . . . 94
Closed Sequential Patterns . . . . . . . . . . . . . . . . . . 95
4.1.2 Global Knowledge Base: the LCSH . . . . . . . . . . . . . 96
4.2 Relevance Feature Acquisition . . . . . . . . . . . . . . . . . . . . 97
4.2.1 Specificity of Low-Level Features . . . . . . . . . . . . . . 98
4.2.2 Term Weighting . . . . . . . . . . . . . . . . . . . . . . . . 100
4.3 Concept Matching Method . . . . . . . . . . . . . . . . . . . . . . 104
4.3.1 Limitations of Query-Based Method . . . . . . . . . . . . . 104
4.3.2 Concept Intension & Extension . . . . . . . . . . . . . . . 106
4.3.3 Distributed Matching . . . . . . . . . . . . . . . . . . . . . 110
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5 Evaluation 115
5.1 Evaluation Environment . . . . . . . . . . . . . . . . . . . . . . . 116
5.1.1 Experimental Dataset . . . . . . . . . . . . . . . . . . . . 116
5.1.2 Baseline Models . . . . . . . . . . . . . . . . . . . . . . . . 119
5.1.3 Experimental Measurements . . . . . . . . . . . . . . . . . 123
5.2 Experiment Design and Settings . . . . . . . . . . . . . . . . . . . 124
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.3.1 Evaluation of Ontology-Based Mining Approach . . . . . . 127
5.3.2 Evaluation of Distributed Matching . . . . . . . . . . . . . 131
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6 Conclusion 137
6.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Bibliography 143
List of Figures
1.1 An insight of the classic IR model [38] . . . . . . . . . . . . . . . 2
1.2 Global concepts and local information matching . . . . . . . . . . 10
2.1 The coverage of literature review . . . . . . . . . . . . . . . . . . 20
2.2 The steps of knowledge discovery in databases . . . . . . . . . . . 37
2.3 The constructed ontology with the proposed semantic relations . 54
2.4 The concept of Specificity and Exhaustivity . . . . . . . . . . . . . 55
2.5 Subclass relationships between OWL and RDF/RDFS [6] . . . . . 64
3.1 Overview of the ontology-based approach . . . . . . . . . . . . . . 74
3.2 Knowledge classification . . . . . . . . . . . . . . . . . . . . . . . 76
3.3 The screenshot of subject-based Model . . . . . . . . . . . . . . . 80
3.4 The extracted semantic relations . . . . . . . . . . . . . . . . . . 81
3.5 A Segment of the Subject Headings . . . . . . . . . . . . . . . . . 82
3.6 The backbone of in-levels hierarchy . . . . . . . . . . . . . . . . . 83
4.1 Overview of the relevance feature matching approach . . . . . . . 93
4.2 Occurrence of terms in the first 50 RCV1 topics . . . . . . . . . . 105
4.3 Difference between extension and intension . . . . . . . . . . . . . 110
5.1 RCV1 data structure . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2 Primary user interface of the coded pre-processing program . . . . 120
5.3 Experiment design for POM . . . . . . . . . . . . . . . . . . . . . 126
5.4 Top 20 precision for three stages . . . . . . . . . . . . . . . . . . . 129
5.5 Comparison of 11-points in first 50 topics . . . . . . . . . . . . . . 130
5.6 11 points result after matching . . . . . . . . . . . . . . . . . . . . 134
5.7 Original 11 points result . . . . . . . . . . . . . . . . . . . . . . . 135
List of Tables
2.1 An example of vertical format of database . . . . . . . . . . . . . 42
2.2 Types of subject headings . . . . . . . . . . . . . . . . . . . . . . 59
2.3 RDF schema constructors . . . . . . . . . . . . . . . . . . . . . . 62
2.4 Comparison between three species of OWL . . . . . . . . . . . . . 67
3.1 Examples for redefined relations . . . . . . . . . . . . . . . . . . . 77
4.1 Example of pattern mining . . . . . . . . . . . . . . . . . . . . . . 101
5.1 Comparison results for different parameter settings . . . . . . . . 126
5.2 Overall performance of first 50 topics . . . . . . . . . . . . . . . . 128
5.3 Number of terms extracted by all the models . . . . . . . . . . . . 131
5.4 Comparison results after matching . . . . . . . . . . . . . . . . . 132
5.5 Comparison of deploying results . . . . . . . . . . . . . . . . . . . 133
5.6 Comparison of original results . . . . . . . . . . . . . . . . . . . . 134
Notation, Terminology, and
Abbreviations
Notation
Θ An ontology.
Terminology
Query The data structure given by a user to information
gathering systems for the expression of an information
need.
Abbreviations
WWW World Wide Web
IR Information Retrieval
HTML Hyper Text Markup Language
XML Extensible Markup Language
OWL Web Ontology Language
RDF Resource Description Framework
LC Library of Congress
LCC Library of Congress Classification
DDC Dewey Decimal Classification
LCSH Library of Congress Subject Headings
RCV1 Reuters Corpus Volume 1
RFD Relevance Feature Discovery model
SPE Positive Specific Group in RFD
PTM Pattern Taxonomy Model
PDS PTM model using deploying weighting
QUT Queensland University of Technology
NIST National Institute of Standards and Technology
Chapter 1
Introduction
1.1 Background
The Web is now considered to be an essential part of human life. It is seen
as a powerful tool for seeking and sharing information, and is widely used by
companies, universities, organizations, and individuals. Due to its popularity,
diverse resources are made available on the Web at an exponential rate every day [49].
Hence, the development of effective ways to retrieve the expected information is
a continuing concern in the fields of Information Retrieval (IR) and Data
Mining (DM) [10, 89].
Web searching is a significant remedy to alleviate retrieval problems. It can
be studied as Web mining, which utilizes data mining methodologies to extract
useful information or patterns from large amounts of Web documents and ser-
vices [10]. Hence, the combination of data mining and Web searching creates
the concept of Web mining. Its basic process is nearly the same as the classi-
cal IR process depicted in Figure 1.1. Existing search mechanisms have been
further developed by various sophisticated techniques in order to properly filter
irrelevant information. However, the majority of these techniques are mainly limited
to keyword-based searches (e.g. Google search, AltaVista, InfoSeek, All the
Web) or category-based searches (e.g. Yahoo!, Google Directory) [21]. These search
techniques still suffer from two distinct problems: information overload and mis-
match [49, 55, 62, 94]. Information overload occurs when a search returns a
large number of results, which users must manually traverse to obtain valuable
information; this is time-consuming and frustrating for the users. The mismatch
problem means that such search models are inefficient at gathering the appropriate
information to satisfy users' needs. One main reason is that these techniques follow
the "bag-of-words" approach and primarily focus on the terms themselves rather
than their inter-relations and meanings [98].
Figure 1.1: An insight of the classic IR model [38]
Personalization is another challenge in Web searching. Current information
filtering systems lack the intelligent capability to judge user preferences. To take a
classic example, when a user types "apple" as a search query, a conventional search
model will return results for "apple" the fruit and "apple" the IT product at the same
time. The results are not tailored or applicable to individual users [49]. In other
words, this problem is referred to as ambiguity.
Studies have illustrated that more than 80% of users prefer personalized
results [25, 26]. To meet users' search expectations, user profiling has been
investigated for more than a decade. It plays an important role in personalized
searching because it can refine user search intentions by referring to their personal
data. This is especially the case with adaptive filtering models, which start with
user profiles or a limited number of feedback documents to generate user profiles,
and then operate filtering tasks [67,76,134]. Unlike the traditional search query,
user profiles are persistent, and tend to reflect long-term information needs. When
a user’s information need is stable over a long period of time, a filtering system can
gain sufficient feedback from which to create a satisfactory profile by interacting
with the user. As a result, the information delivered would be customized and
cater more closely to the user’s need [66,87].
In fact, extracting a user profile is a complicated and iterative process for
collaborative filtering. Diverse sorts of data, such as click-throughs, scrolling frequency,
browsing history, and page staying time, usually need to be collected in order
to describe users' interests and behaviors [62]. The process suffers from scalability [91]
and cold-start [92] problems. Another common component of a user profile is feed-
back, namely relevance feedback and pseudo feedback. Relevance feedback is
produced by interacting with users, whereas pseudo feedback is generated
by systems [116], such as from top-ranked results. With respect to the collections, different
types of user profiles can be established for understanding background informa-
tion so as to capture the user’s specific demands. Their significance has been
demonstrated by numerous information filtering models [13, 67, 85]. However,
these models encounter some distinct drawbacks: their user profiles are either
expensive to extract or inaccurate in description [10]. Meanwhile, they are
incompatible with new knowledge generation and acquisition.
To overcome these limitations, ontology-based techniques have been put
forward as a crucial part of advanced searching in the last decade. Ontologies
constitute a formal conceptualization of a particular domain of interest that is
shared by a group of people [38,58,71]. Their components are primarily made up
of explicit concepts and their inter-relationships. In computer science, ontology-
based techniques are widely applied in sharing information among human or soft-
ware agents, reusing domain knowledge, and offering scientific classification [102].
In addition, they can help to disambiguate word senses in the form of a taxonomic
structure. For example, if the word "jaguar" is allocated in the taxonomy
"jaguar - cat - animal", it is easy to determine that this "jaguar" is not the
British car because it has the parent class "cat". This characteristic enables
a search to obtain hints towards resolving vague expressions (such as polysemy,
synonym, hypernym, and metonymy) [104].
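The taxonomic disambiguation described above can be sketched in a few lines. Everything here (the toy taxonomy, the sense labels, the helper names) is an illustrative assumption, not part of the thesis's model:

```python
# Illustrative taxonomy: child -> parent. Labels are invented.
TAXONOMY = {
    "jaguar(animal)": "cat",
    "cat": "animal",
    "jaguar(car)": "car",
    "car": "vehicle",
}

def ancestors(concept):
    """Walk parent links from a concept up to the taxonomy root."""
    chain = []
    while concept in TAXONOMY:
        concept = TAXONOMY[concept]
        chain.append(concept)
    return chain

def senses_consistent_with(senses, context_concept):
    """Keep the senses whose ancestor chain contains the context concept."""
    return [s for s in senses if context_concept in ancestors(s)]

# A "jaguar" mentioned in an animal context cannot be the car sense.
print(senses_consistent_with(["jaguar(animal)", "jaguar(car)"], "animal"))
# -> ['jaguar(animal)']
```

Walking the parent chain is the same hint a search system can use to resolve polysemy: the context concept constrains which branch of the taxonomy a term can belong to.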
More recently, researchers [105,109] have attempted to build ontological user
profiles. Their aim is to understand user background knowledge properly by a
group of relevant on-topic concepts rather than features mined by the pure data
mining techniques. Some of the gathered knowledge is integrated from two sides,
namely global analysis and local analysis [104, 105]. The global analysis employs
a global knowledge base (ontologies, thesauruses, or online knowledge
bases) that mirrors the content of the Web for common knowledge representation,
whereas the local analysis investigates native information or observes user
behaviors from local instances [104]. According to the generated ontology-based
profiles, the related work outperforms previous techniques and shows encouraging
results in search effectiveness. However, some problems are not fully addressed,
and a theoretical support in the form of pre-conditions is urgently required for
the production of ontological profiles.
1.2 Questions and Motivations
The discussion in the previous section demonstrates that the existence and acqui-
sition of user profiles is important to achieve Web personalization. The identified
problems indicate an urgent need to conduct a comprehensive study of personalized
ontologies, especially in regard to the profiling aspect. This section focuses
on the three essential questions that arose in the present thesis:
• Rather than user queries, what is the alternative for capturing user information
needs?
• How can local information be associated with global knowledge in a world
ontology?
• How can weights be properly assigned to the terms in the concepts?
The motivations to address the questions are also described in detail.
Imprecise and uncertain information, or so-called "noisy" information, is mainly
caused by three factors when implementing an IR or DM system. These factors
are the representations of user queries, the representations of documents, and the
relevance indicants between user queries and documents [5,117]. User queries are
generally treated as containers to transfer information needs from the user side
to the system side. To guarantee search quality, offering a good query is impor-
tant. Thus, there exists a recognized hypothesis that if user information needs
can be captured and interpreted for retrieval models, more useful and meaningful
information can be gathered.
A search query consists of a small number of keywords. If a person is asked to
determine the correct keywords in participating documents, she/he would have to
rely upon some background knowledge accumulated over time from other docu-
ments and experiences [24,26]. In reality, this is also true for existing Web search
techniques. The majority of retrieval methods require a precise query upfront to
express search intentions, which is fairly difficult for any user who has no back-
ground knowledge or past experience. Another factor is that Web users formulate
their queries diversely because of different personal perspectives, terminological
habits, and vocabularies [33, 70]. As a result, capturing user information needs
through a given query is challenging and unrealistic. It is difficult for Web users
to deliver perfect information at the beginning. The problem leads to the first
question arising in our study:
1. Rather than user queries, what is the alternative to capturing
user information needs?
Instead of user queries, relevance features can be adopted in order to capture
users’ information needs. These features are called “relevance features” because
they need to be extracted based on relevance feedback, where a user judges the
results from previous retrievals and tells the system whether the documents are
relevant or not [65]. The Reuters Corpus Volume 1 (RCV1) dataset contains two
types of user feedback. One is positive feedback, which indicates that a
document is relevant. The other is negative feedback, which points out the
irrelevant documents. According to the feedback, a model called
the Relevance Feature Discovery (RFD) model [5, 65] classifies all features into
three subsets: positive specific, general, and negative specific. Although
the RFD claims that negative documents are useful to revise user needs,
here we only adopt the positive specific partial features from positive documents
throughout the thesis since they are sufficient to reveal user needs straightfor-
wardly in an optimal amount of feature space. A preliminary analysis was made
to support the accuracy of this decision.
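The RFD-style three-way split described above can be sketched as follows. The partitioning rule used here (a term is positive specific if it occurs only in positive feedback, negative specific if it occurs only in negative feedback, and general otherwise) is a simplified assumption standing in for the model's actual scoring, and the sample documents are invented:

```python
from collections import Counter

def partition_features(positive_docs, negative_docs):
    """Split terms into positive specific, general, and negative specific
    subsets based on which feedback documents they occur in (a simplified
    assumption, not RFD's actual scoring functions)."""
    pos = Counter(t for d in positive_docs for t in set(d))
    neg = Counter(t for d in negative_docs for t in set(d))
    positive_specific, general, negative_specific = set(), set(), set()
    for term in set(pos) | set(neg):
        if neg[term] == 0:
            positive_specific.add(term)     # seen only in relevant documents
        elif pos[term] == 0:
            negative_specific.add(term)     # seen only in irrelevant documents
        else:
            general.add(term)               # seen on both sides
    return positive_specific, general, negative_specific

# Invented feedback documents, tokenized into term lists.
pos_docs = [["ontology", "mining", "profile"], ["ontology", "search"]]
neg_docs = [["mining", "coal"], ["coal", "price"]]
spe, gen, nspe = partition_features(pos_docs, neg_docs)
print(sorted(spe))  # -> ['ontology', 'profile', 'search']
```

Adopting only the positive specific subset, as the thesis does, keeps the features that most directly reveal the user's need while discarding terms shared with irrelevant documents.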
Due to the fact that feature selection can speed up computation [125], most
information retrieval and filtering models also extract different amounts of features
before implementing a weighting algorithm. For traditional IR models (including
Boolean, vector space, and probabilistic models), their processes all begin by iden-
tifying each document through a set of representative keywords. Later, a feature
selection step is applied to choose optimal features [65, 69]. The selection is
usually dependent on a ranking algorithm or a threshold control which attempts
to establish a simple sequence of all the keywords [69]. The selected features
are almost all in the term-based format, and are often utilized in the remaining
steps, such as indexing, filtering and delivering information, and classifying [69].
In addition, feature selection is a common technique for addressing overfitting
[125]. In a restricted sense, the representative terms have some meanings of
their own and correlations with other terms, but both are completely disregarded.
To overcome these limitations, the text mining research community began to
pay more attention to phrases or patterns instead of terms, because it was believed
that phrases or patterns would be more discriminative [65, 119, 133]. One of
the well-known pattern-based models, the Pattern Taxonomy Model (PTM) [118],
was introduced by Wu et al. in 2006. It not only adopts the concept of closed
patterns to prune irrelevant patterns, but also combines sequential pattern mining
to discover useful patterns. Rather than directly working on terms, a set of
patterns {p1, p2, ..., pm} is extracted for candidate features. Eventually, the pat-
terns are deployed into terms which are weighted according to their appearances.
Similar to most information filtering models, the PTM only considers positive
feedback and ignores negative feedback. Yet, it has shown a certain improvement
in effectiveness.
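The closed-pattern pruning idea that PTM relies on can be sketched as below. Plain itemsets stand in for sequential patterns to keep the example short, so this illustrates the closure test itself, not the PTM algorithm:

```python
def support(pattern, transactions):
    """Number of transactions containing the pattern."""
    return sum(1 for t in transactions if pattern <= t)

def closed_patterns(patterns, transactions):
    """Keep a pattern only if no strict super-pattern has equal support."""
    closed = []
    for p in patterns:
        sup = support(p, transactions)
        if not any(p < q and support(q, transactions) == sup for q in patterns):
            closed.append((p, sup))
    return closed

# Invented transaction database and candidate frequent patterns.
db = [frozenset("abc"), frozenset("abc"), frozenset("ab")]
candidates = [frozenset("a"), frozenset("ab"), frozenset("abc")]
# {a} is pruned: its super-pattern {a, b} has the same support (3),
# so {a} carries no extra information.
result = closed_patterns(candidates, db)
print([(sorted(p), s) for p, s in result])
# -> [(['a', 'b'], 3), (['a', 'b', 'c'], 2)]
```

Pruning non-closed patterns shrinks the candidate feature set without losing support information, which is exactly why PTM applies the closure condition before deploying patterns into term weights.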
The RFD, term-based, and PTM models all contain a feature selection process
for their own purposes. It is difficult to identify which model can produce the
best features as the alternative to user queries. This problem motivated the
present research to conduct a comprehensive comparison of a number of well-
known models that generate features. Since all the features are term-based and
obtained through local RCV1 training documents, we recognize them as low-level
local information. In order to take advantage of ontology filtering to constrain
search spaces and interpret high-level concepts, the second question arises:
2. How can local information be associated with global knowledge
in a world ontology?
Ontologies are commonly considered to deal with expert knowledge repre-
sentations [102]. They cover adequate concepts and their explicit relationships
to describe the world, and hence are referred to as the world knowledge base.
The world knowledge base is a taxonomic specification of commonsense knowl-
edge acquired by people through their experience and education [104]. Due to
the fact that commonsense knowledge is the collection of facts and information
that an ordinary person is expected to know, the ontologies have been identified
as the suitable representations of global knowledge. In 2011, Tao et al. [106]
explored a way to produce personalized ontologies by analysing both local infor-
mation and global knowledge. The local instances were adopted from the real
library database at the Queensland University of Technology. The global knowledge
relied on concepts existing in a universal ontology named the Library of
Congress Subject Headings (LCSH). They assumed that the information items in
the local repository had a connection to the concepts in the LCSH. However, the
assumption left an emerging issue: how to reasonably build that connection.
Figure 1.2 illustrates the matching scenario.
Figure 1.2: Global concepts and local information matching
Local information is based on individual characteristics and interests; it is
unique or informal. The information is usually factual but unstructured, and
in many cases it is textual. Conversely, global knowledge is learned from relevant
education and experience, and is general and structured. Information scientists
consider taxonomies, subject headings, and classification schemes as represen-
tations of knowledge [71, 114, 132]. These differences may lead to information
mismatch and knowledge shortage. As a result, a goal of this thesis is to pro-
pose an appropriate approach to match local information to global knowledge.
Within the existing methods [104, 115, 131, 134], the basic idea is to use conditional
probability to determine relevant concepts c for describing the themes of
the local instances, P(c|F) = P(c ∩ F)/P(F), where F is a set of features discovered
in the local instances. However, this technique usually achieves a low level of
performance because of the mismatch problem [80, 111]. Here in particular, the
mismatch means that some relevant specific concepts have been omitted. The problem
occurs when the popular features (usually very frequent features) match many
relevant but general concepts (usually appearing near the top of the ontology) but
do not match some relevant specific concepts (usually appearing at the bottom of
the ontology). This problem motivates us to develop an ontology-based mining
approach for distinguishing the concepts’ generality and specificity.
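A minimal sketch of the conditional-probability matching above, estimating P(c|F) from the overlap between a concept's terms and the discovered feature set; the concept labels and their term sets are invented for illustration:

```python
def p_concept_given_features(concept_terms, features):
    """Estimate P(c|F) as the fraction of discovered features covered by
    the concept's terms: |c ∩ F| / |F| (a simple overlap-based estimate)."""
    if not features:
        return 0.0
    return len(concept_terms & features) / len(features)

# Invented feature set F and concept labels c.
features = {"library", "classification", "subject"}
concepts = {
    "Libraries": {"library"},                                  # general concept
    "Library classification": {"library", "classification"},  # specific concept
}
for name, terms in concepts.items():
    print(name, round(p_concept_given_features(terms, features), 3))
```

The sketch also hints at the mismatch problem: a general concept whose single term is a very frequent feature still scores, while a specific concept only scores higher when several of its terms are actually among the discovered features.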
Personalized ontologies can be constructed by adopting the proposed auto-
matic matching method to discover relevant concepts from the common knowl-
edge base. As described before, the concepts are considered for capturing user
information needs, but the process to measure and evaluate the discovered con-
cepts remains unknown. Unlike other filtering systems [39, 98, 109], a number of
data sources are available (such as Web pages, documents, Extensible Markup
Language (XML) files) to rank documents according to term weights and simi-
larity measures. The research conducted by Tao et al. (2011) took into account
that the library database merely retains a huge amount of subjects. Each subject
is understood as a single concept, which consists of a few terms. In such a case,
applying traditional weighing schemes is not feasible because of the constrained
sources. However, in terms of the foundation of information filtering systems, a
term weight pair is generally required at first. Thus, the third question arises:
3. How can weights be properly assigned to the terms in the con-
cepts?
To answer this question, the thesis attempts to use two methods: language
modelling and the deploying method. Language modelling is a quite general
formal approach to IR, with many variant realizations [50, 76]. The original and
basic method for using language models in IR is the query likelihood model.
Following a similar theory, we construct a model from each subject si in the collection
of matched subjects SC, where SC = {s1, s2, ..., sn}. The low-level
relevance features are utilized as the query Q. Our goal is to rank subjects by
rel(si) = |si ∩ Q|/|Q|, where the score of a subject is interpreted as the
likelihood that it is relevant to the query.
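The ranking formula above can be sketched directly; the subject term sets and query features here are illustrative placeholders, not material from the LCSH:

```python
def rel(subject_terms, query_terms):
    """rel(s_i) = |s_i ∩ Q| / |Q|: overlap between a subject and the query."""
    return len(subject_terms & query_terms) / len(query_terms)

# Relevance features acting as the query Q; subject term sets are invented.
Q = {"economic", "espionage", "trade"}
SC = {
    "Business intelligence": {"business", "intelligence"},
    "Espionage": {"espionage"},
    "Trade secrets": {"trade", "secrets"},
}
ranking = sorted(SC, key=lambda s: rel(SC[s], Q), reverse=True)
print(ranking)  # subjects sharing a query term outrank the one sharing none
```

Because the score is normalized by |Q|, it behaves like the likelihood that the subject covers the expressed information need, which is what the query likelihood analogy requires.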
Regarding the deploying method [118], our evaluation of term supports (weights)
differs from that of other term-based approaches. In the term-based approaches,
an assigned term weight is evaluated based on the term's appearances in documents.
Within our model, terms are weighted according to their appearances in
matched concepts.
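A minimal sketch of this deploying idea, under the assumption that a term's support is accumulated over the matched concepts containing it, normalized by concept size (one simple choice, not necessarily the model's exact scheme):

```python
from collections import defaultdict

def deploy_term_weights(matched_concepts):
    """Accumulate each term's support over the matched concepts that
    contain it, normalized by concept size (an assumed simple scheme)."""
    weights = defaultdict(float)
    for terms in matched_concepts:
        for t in terms:
            weights[t] += 1.0 / len(terms)
    return dict(weights)

# Invented matched concepts, each a small set of terms.
concepts = [{"ontology", "learning"}, {"ontology"}, {"text", "mining"}]
w = deploy_term_weights(concepts)
print(w["ontology"])  # -> 1.5  (0.5 from the first concept + 1.0 from the second)
```

A term shared by many matched concepts thus receives a higher weight than one confined to a single concept, mirroring how the deploying method replaces document frequency with concept frequency.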
Due to this uncertainty, we cannot determine in advance which method is better
for weighting all the terms appearing in the matched concepts. Thus, an empirical
comparison of the two methods is conducted after gathering the results of the
proposed approach and all baseline models.
In sum, this section has outlined the three important questions that motivate us throughout the whole study. Some general information was also provided to contextualize the questions, and the related discussion sketched the concepts of the corresponding solutions.
1.3 Significance and Major Contributions
This research is particularly significant due to the nature and fast-growing use of the Web: Web users demand quality information rather than quantity. Popular search engines are, however, struggling to deliver the desired information. This has prompted the present study to explore a concept-based filtering model with the semantic scalability and adaptability to serve personalized content to individual users. This research contributes to the domains of ontological user profiling, information filtering, and knowledge representation and engineering. The related outputs are critical as filtering systems are expected to return satisfactory retrieval results and offer customized services.
In this thesis, two approaches are proposed: 1) an ontology-based mining approach for producing personalized ontologies, and 2) a relevance feature matching approach that links the local information represented by relevance features with the global concepts allocated in a world knowledge base. Both approaches share the same objective: to capture proper user information needs so as to improve text mining effectiveness. A systematic scientific evaluation is conducted to measure the performance of the approaches. Their respective contributions are listed below:
• Ontology-Based Learning Approach
1. We propose an ontology-based learning approach for allocating personal search intentions in a common ontology;
2. We introduce an innovative in-levels mining method to locate user information needs at an appropriate level of an abstract hierarchical structure;
3. We develop a new tracking method for knowledge representation in a two-dimensional, zoomable way.
• Relevance Features Matching Approach
1. We develop a solid matching approach for transferring low-level terms to high-level concepts;
2. We provide a promising methodology for evaluating term weights based on their distributions in subjects (concepts) and queries (features). The evaluation also demonstrates that the applied weighting algorithm performs better than the common deploying method;
3. We outline a method to connect local information with worldwide knowledge for extracting personalized ontologies.
1.4 Publications
This section lists all work published during the candidature.
• Yan Shen, Yuefeng Li, Yue Xu, Renato Iannella, Abdulmohsen Algarni, &
Xiaohui Tao. An ontology-based mining approach for user search intent dis-
covery. In Proceedings of the Sixteenth Australasian Document Computing
Symposium (ADCS’11), pp. 39-46, Canberra, Australia.
• Hang Jin, Yanming Feng, & Yan Shen. Accurate urban road model recon-
struction from high resolution remotely sensed imagery based on Support
Vector Machine and Gabor filters. In Joint Urban Remote Sensing Event
(JURSE 2011), 11-13 April 2011, Munich, Germany.
• Yan Shen, Yuefeng Li, Yue Xu, & Xiaohui Tao. Matching Relevance Fea-
tures with Ontological Concepts. Accepted by the International Workshop
on Web Personalization and Recommender Systems (WPRS2012) in con-
junction with the IEEE/WIC/ACM International Conference on Web In-
telligence (WI’12), 4-7 December, Macau, China.
• Xiaohui Tao, Yuefeng Li, Bin Liu, & Yan Shen. Semantic Labelling for Doc-
ument Feature Patterns Using Ontological Subjects. Accepted by the 2012
IEEE/WIC/ACM International Conference on Web Intelligence (WI’12),
Macau, China, 4-7 December 2012.
• Yan Shen, Yuefeng Li, & Yue Xu. Adopting Relevance Feature to Learn Personalized Ontologies. Accepted by the Twenty-Fifth Australasian Joint Conference on Artificial Intelligence (AI'12), 4-7 December, Sydney, Australia.
1.5 Thesis Organization
This thesis is organized in six chapters.
Chapter 2 reviews a wide range of significant related work on useful pattern discovery methods, ontology-based user profiling, and similarity-based concept matching. The chapter also covers the knowledge necessary to understand existing techniques regarding basic information retrieval models, Web personalization, and personalized ontologies.
Chapter 3 introduces an ontology-based mining approach. The model aims to
discover and define user search intentions in the in-levels hierarchical structure.
The chapter explains the entire approach step-by-step.
Chapter 4 outlines a distributed matching approach to address the problem of mismatch between local information and global knowledge. The matching approach incorporates data mining and ontology learning. The chapter begins with an overview of feature acquisition based on pattern mining techniques. Then, a primary control is developed on the learning side to incorporate the relevant concepts from a large-scale ontology.
Chapter 5 discusses the evaluation of the approaches proposed in Chapters 3 and 4. The evaluation is based on a number of scientific experiments. To ensure scalability and reliability, different types of baselines are compared using a number of information filtering measurements. The gathered results are displayed and discussed.
Chapter 6 brings the dissertation to a conclusion. The important points and
findings are summarized. Recommendations and plans for future investigation
are also outlined.
Chapter 2
Literature Review
This chapter presents a preliminary literature review of the present study. The
purposes are: 1) to identify our particular research questions; and 2) to under-
stand the current knowledge including substantive findings as well as theoretical
and methodological contributions to personalized ontologies.
As depicted in Figure 2.1, all the involved domains are outlined clearly. The red circle represents the coverage of the conducted review; some contents overlap multiple domains. We divided the whole review into two categories. The first category takes a computer-centered view. It covers fundamental Web searching techniques in information retrieval, popular knowledge discovery approaches in machine learning, and various filtering methods in data mining. All relevant aspects, including efficient indexing, useful feature selection, query processing, and weighting algorithms, mainly concern machine processes that improve the "quality" of answer sets [1]. The other category concentrates on Web personalization, including user profile acquisition, the capture of user information needs, and representative ontology-based methods. These factors belong to the human-centered view, which commonly makes use of user preferences to understand users' central needs and thereby affects the operation of retrieval models.
Figure 2.1: The coverage of literature review
The aforementioned classification is also based on different retrieval functions. A typical information retrieval system returns potentially valuable documents from a large document set in accordance with criteria specified by users. This function is named document search, or ad hoc retrieval, and can be referred to the computer-centered view: it is the selection of documents from a collection of documents. The other function is called document routing, or filtering, and can be linked to the human-centered view: it is the dissemination of incoming documents to the correct users on the basis of user profiles.
2.1 Classic Information Retrieval Models
Over the last two decades, the mainstream of Web search models has been limited to either keyword-based [55, 94] or category-based [95, 98] techniques. It is noted that these models play an essential role in gathering statistically significant information because they share common tasks and problems of information retrieval and information filtering. This section chooses some representatives to expose the core theories, approaches, and evaluation measurements. All of them can be classified into Boolean, vector, and probabilistic models. These classic models consider that each document is described by a set of representative keywords at an early stage. Based on the indexed keywords, Information Retrieval (IR) models can perform feature selection and compute weights for all the keywords. Finally, all the documents can be ranked according to the computed weights. Generally speaking, the documents in the top positions or with high similarity scores will be returned as results for users [76, 107].
Term Frequency-Inverse Document Frequency
A traditional theory [55] used for Web searching first builds an index of all documents. The index can be used to look up whether the indexed keywords appear in the documents. Term Frequency-Inverse Document Frequency (TF-IDF) was designed to facilitate this process in traditional Web searching [52, 109]. To ensure search quality, a preliminary step is required to specify the importance of each term in a document. The importance of a term t_i within a document d_j can be measured by its normalized frequency, defined as below:
TF_{i,j} = n_{i,j} / Σ_k n_{k,j}

where the numerator n_{i,j} is the number of occurrences of t_i in d_j, and the denominator is the sum of the occurrences of all terms in d_j. The inverse document frequency is a measure of the general importance of the term, defined as below:

IDF_i = log( |D| / |{d : t_i ∈ d}| )

|D| refers to the total number of documents, and |{d : t_i ∈ d}| refers to the number of documents that contain the term t_i. Finally, the TF-IDF weight of term t_i in document d_j is the product of the two:

(TF-IDF)_{i,j} = TF_{i,j} × IDF_i
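As a quick illustration, the formulas above can be computed with a few lines of Python (a minimal sketch; the tokenized documents and names below are our own):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns {doc_index: {term: tf-idf weight}}."""
    N = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for d in docs for t in set(d))
    weights = {}
    for j, d in enumerate(docs):
        counts = Counter(d)
        total = sum(counts.values())  # sum_k n_{k,j}
        weights[j] = {t: (n / total) * math.log(N / df[t])
                      for t, n in counts.items()}
    return weights

docs = [["web", "search", "web"], ["ontology", "search"]]
print(tf_idf(docs))
```

Note that a term appearing in every document receives IDF = log(1) = 0, so TF-IDF discounts terms that carry no discriminating power across the collection.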
Similarity Measurement
Cosine similarity is one of the most popular measures and is widely used by many search mechanisms [29, 68, 94, 98, 105, 109]. The aim of using cosine similarity is to quantify how similar two vectors are, so that their relation can be inferred from the similarity value. The vectors can represent diverse elements, such as terms, documents, concepts, patterns, objects, or subjects. Cosine similarity can also be seen as a method of normalizing document length during comparison. There are many variants; the basic formula is as follows:
Similarity Measure = cos(θ) = (A · B) / (‖A‖ ‖B‖)
The attribute vectors A and B are usually the term frequency vectors of two
documents.
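The formula above can be sketched for sparse term-frequency vectors represented as dictionaries (an illustrative implementation; the names are our own):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for sparse vectors a, b (term -> count)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

d1 = {"web": 2, "search": 1}
d2 = {"web": 1, "mining": 1}
print(round(cosine_similarity(d1, d2), 3))  # → 0.632
```

Because both vectors are normalized by their lengths, a long document does not dominate a short one, which is the length-normalization property mentioned above.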
Query Processing and Feature Selection
Web searching is becoming increasingly complex [33, 55]. The existing forms of query are diverse, including words, phrases, questions, passages, and even documents. Instead of a single document or answer, the returned result is frequently an array of relevant information strengthened by precise navigation to related information and topics, which can help searchers discover insightful results or obtain more specific information. Query processing is a necessary step to retrieve consistently superior results. Our proposed approach endeavours to determine what kinds of information are available in a query, and how they can be extracted.
During searching, a successful process comes from understanding what users are expressing in their queries [24, 70]. Some user queries are simply stated, while others are stated in a Boolean format ("apples AND iphone OR ipad"). Hence, a searching platform must have a range of tools to accurately identify and then reformulate what information is being requested by the entered query [24].
In order to obtain better performance in text retrieval systems, many attempts [15, 63] have focused on another aspect: query expansion. Query expansion with a generic ontology, like WordNet [78], has been shown to be potentially useful for improving recall, as it permits matching a query to relevant documents that do not contain any of the original query terms. Voorhees [113] manually expanded 50 queries over a TREC collection using WordNet, and observed that expansion was useful for short, incomplete queries, but brought no such guarantee for complete topic statements. Further, for short queries, automatic expansion is not trivial since it may degrade rather than enhance retrieval performance.
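A minimal sketch of thesaurus-based query expansion is shown below. The synonym table is a hypothetical stand-in; in practice the synonyms would be looked up from a resource such as WordNet:

```python
SYNONYMS = {  # hypothetical thesaurus entries (stand-in for WordNet lookups)
    "car": ["automobile", "auto"],
    "search": ["retrieval", "lookup"],
}

def expand_query(terms, thesaurus=SYNONYMS, max_per_term=2):
    """Append up to max_per_term synonyms for each original query term."""
    expanded = list(terms)
    for t in terms:
        expanded.extend(thesaurus.get(t, [])[:max_per_term])
    return expanded

print(expand_query(["car", "search"]))
# → ['car', 'search', 'automobile', 'auto', 'retrieval', 'lookup']
```

Capping the number of synonyms per term (`max_per_term`) is one simple way to limit the topic drift that, as noted above, can degrade rather than enhance retrieval performance.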
Feature selection is a fundamental stage for the majority of IR models. It intends to remove non-informative features according to corpus statistics and to improve the accuracy of classification by eliminating useless features. Usually, a set of keywords is applied as elements in the vector of the feature space. The bag of words can be obtained by diverse term selection approaches, both in the context of IR and in data mining. Furthermore, feature selection, i.e. selecting a subset of the available features for describing the data before applying a learning algorithm, is a common technique for addressing the over-fitting issue [69, 125]. It has been widely observed that feature selection can be a useful strategy to simplify or speed up computations, and, when employed appropriately, it can lead to little loss in classification quality. Nevertheless, general theoretical performance guarantees are modest, and it is often difficult to claim more than a vague intuitive understanding of why a particular feature selection algorithm performs well when it does. Indeed, selecting an optimal set of features is in general difficult, both theoretically and empirically; hardness results are known [10], and in practice greedy heuristics are often employed [114, 125].
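One of the simplest greedy heuristics of this kind keeps the top-k terms by document frequency (an illustrative sketch; corpus-statistics criteria such as information gain or chi-square would be plugged in the same way):

```python
from collections import Counter

def select_features(docs, k):
    """docs: list of token lists. Return the k terms with the highest document frequency."""
    df = Counter(t for d in docs for t in set(d))
    return [t for t, _ in df.most_common(k)]

docs = [["web", "search"], ["web", "mining"], ["ontology"]]
print(select_features(docs, 2))
```

The selected terms then define the dimensions of the feature space in which documents are represented before learning.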
Term-based feature selection approaches struggle with the problem of selecting an appropriate number of features from an enormous set of terms while guaranteeing retrieval efficiency [133]. Some information gain models [93, 123] are restricted to term statistics over the entire collection and do not take relevance information (e.g. user feedback) into account. As an extension of the term-based approaches, pattern mining techniques have been investigated by data mining communities for many years. These techniques extract useful patterns from large data collections instead of term-based features. Some studies categorize the extracted patterns into frequent, closed, and closed sequential patterns [65], and measure the specificity of patterns explicitly according to relevance feedback from human beings. The pattern taxonomy model (PTM) was first introduced by Wu et al. [119] in 2004. It improves search performance by using both frequent and closed sequential patterns. A more advanced mining method for the PTM (PDS), for using frequent patterns in text, was proposed by Wu et al. [118] in 2006. This method addresses the difficulties of using specific long patterns in text by using the patterns to accurately weight low-level terms based on their distributions in the patterns. It was later extended in [67] into a two-stage model for irrelevant information filtering. Li et al. [65] built the Relevance Feature Discovery (RFD) model by mining patterns from positive and negative documents respectively. The related output shows potential for enhancing information filtering and user profiling.
Precision and Recall Performance Measuring Schema
In the field of IR, effectiveness is usually measured using the precision and recall performance measuring schema [55, 68, 98, 105]. The precision of a set of results is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. It is an effective standard to measure how well a system retrieves only relevant results. In contrast to precision, recall measures the fraction of all the relevant documents in the entire collection that are retrieved. Consequently, there is a strong argument that, for the purpose of searching, recall is less important than precision: users are usually concerned with the top-K ranked results having high precision, rather than with retrieving every relevant document.
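The two definitions above can be sketched directly, with document sets represented by ids (illustrative names of our own):

```python
def precision_recall(retrieved, relevant):
    """precision = |retrieved ∩ relevant| / |retrieved|
       recall    = |retrieved ∩ relevant| / |relevant|"""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)  # 2 hits: precision 0.5, recall ≈ 0.667
```

The example illustrates the trade-off discussed above: retrieving more documents can raise recall while lowering precision, which is why top-K precision is often the measure users actually care about.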
Whilst Web searching has been recognized as a sophisticated tool for IR, a variety of limitations of the existing techniques can still be discovered easily. The traditional solution to the issue of precision and recall employs keyword-based search techniques [10]. Results are only retrieved when they contain the exact keywords specified by the user. For this reason, these techniques suffer from two serious problems, resulting in low precision.
Here, we summarize two main reasons why the classic IR models cause low performance:
• Users are usually unable to express their information needs accurately [49,
55].
• The keyword-based search techniques are too sensitive to vocabulary [7].
Many documents containing the desired semantics are disregarded.
2.2 Web Personalization
The volume of information on the Internet is increasing with unpredictable speed
every day. People often suffer from the issues of information mismatch and over-
load. To alleviate this problem, Shahabi and Chen [95] argued that personaliza-
tion is one of the effective remedies to customize the Web environment for users.
They [95] outlined two components within the domain of Web personalization.
One is recommendation systems; the other is personalized Web search systems. Even though existing Web search engines are considerably successful and useful, the majority of them are not personalized Web search systems. They are good at discovering pages relevant to a query (typically keyword-based) using specific algorithms, but disregard who submits the query. As a result, no matter who inputs a query, the search results will be exactly the same. Furthermore, how to acquire user information needs efficiently and effectively is another essential research concern. Thus, this section focuses on reviewing the relevant knowledge of personalized Web search systems, and on comprehending some existing approaches that utilize both explicit and implicit information from user profiles.
In recent years, the lack of personalization has been a common concern for most keyword-based and category-based Web searches. Existing approaches force users to spend a lot of time searching through and reviewing a large amount of listed information. The goal of personalized searching is to help users optimize the process of information retrieval according to their preferences [87]. However, these processes are more complicated than traditional ones. Two basic requirements are always needed for personalized searching [126]: 1) the user's interest is known and stored in a user profile, which commonly serves as a workspace for the user's preferences; and 2) a mapping between the user's interest and the search domain exists. In essence, personalized searches [55] not only take a few keywords to describe users' information needs, but also consider the users' individual information, such as behaviors, communities, locations, browsing histories, or glance time. There are many different approaches to realizing personalized notions. All these approaches share the motivation of returning the most relevant results after filtering irrelevant information based on individual data. For example, Sieg et al. [98] utilized the user context to personalize search results by re-ranking the results from a search engine for a given query.
2.2.1 Explicit and Implicit Feedback
A fundamental source of user information is relevance feedback. It can be pro-
vided by examined documents or agent actions explicitly or implicitly [57]. Its
goal is to overcome the problem of translating user information need into a
query [57]. A user gives explicit feedback by using one or more ordinal or qual-
itative scales, whereas implicit feedback is estimated by agents according to ob-
servation of a group of interest indicators.
A central issue of explicit feedback is that users have to examine items to assign them a value on a rating scale or to write comments about them. The explicit feedback is no doubt more reliable; however, its collection burdens the users with an additional cognitive load caused by the necessity of evaluating each information item [117]. The typical approaches for capturing explicit feedback are based on either user interaction or observation, and users may be required to complete a large set of questionnaires. Even though this method can generate "perfect" training sets that accurately reflect user information needs, its pitfall is the high cost in time, money, and patience. A lack of descriptive data is therefore a common problem in the collected feedback.
Implicit feedback [116], on the other hand, is calculated on the basis of one or more implicit interest indicators, which act as surrogate measures for predicting user interest in a given information item. With implicit feedback, also named pseudo relevance feedback, the retrieval system attempts to estimate which items the users may be interested in [35, 96]. In many retrieval tasks such as Web searching, users are not willing to provide explicit (clearly stated) relevance feedback to systems. Consequently, implicit feedback has become an active area of research, especially for personalization processing [1]. Query expansion and user profiling are two essential techniques for implicit feedback [55, 116]; for this study, user profiling is the main domain that we focus on. The production of a user profile can be simply understood in the following way: an information filtering system first records user interaction data such as click-through, scrolling frequency, browsing history, and page staying time. The system then infers the user's information need through the analysis of these interaction data. A new query is composed to re-rank documents, and the re-ranked documents are finally presented to the users. A number of experimental results have shown that implicit feedback can infer the user's information need and indicate the user's interest intelligently through the implicit data provided in the profile. However, generating the implicit feedback is either expensive or time-consuming [96]. It is also undesirable when the user's interest is changeable.
2.2.2 User Information Needs
For user information need acquisition, many efforts have been undertaken to improve accuracy. Closely related to our work, a user ontology consisting of both conceptual structure and semantic relations was presented by Jiang and Tan [52] in 2009. Their objective in building the ontology was to represent and capture users' interests in target domains. Subsequently, a method they called spreading activation theory was employed for providing personalized services. Li and Zhong [66] carried out a term-based ontology learning method for acquiring user information needs. More recently, Tao et al. [106] developed an ontology-based knowledge retrieval framework to capture user information needs by considering the user's background knowledge and local instance repository with association rule and data mining algorithms.
Other works also reflect the importance of user information needs. They treat user interests as implicit feedback and store them in user profiles. Gauch et al. [39] and Liu et al. [68] learned a user's profile from her/his browsing history. Sieg et al. [98] utilized an ontological user profile on the basis of the user's interaction with a concept hierarchy that captures domain knowledge. Tao et al. [105] required users to specify their preferences manually for profiling. In short, the discussed works aim to enhance search performance by directly asking the users for explicit feedback or by collecting implicit feedback automatically. Few of them incorporated both the explicit and implicit feedback to assist their filtering task.
2.2.3 User Profiling
For an "ideal" personalized search system [55], user profiling is required to accomplish the implementation of personalization. The profiling includes the user data collection method, profile storage (client-side or server-side), profile construction, interface, and personalization method. Referring to our proposed research, ontology modelling is considered another potential notion to facilitate personalized search. Therefore, a series of questions is raised, such as: how to present the personalized results in an ontological format? Is it possible to develop a special-purpose or customized client application? In order to have semantic capability, is it possible to build ontological user profiles? We expect to answer these questions after understanding the concept of user profiles precisely. This section reviews the general knowledge of user profiles and discusses some related works that are useful for working out the relative solutions.
Gathering a user profile serves to better understand implicit user wants in searching. In other words [134], this objective is related to search intention acquisition. Chau et al. [20] required direct inputs about interesting phrases from users. However, because users expect minimal involvement during search, the approach was unsuitable. An approach proposed by Liu et al. [68] in 2004 learned a user's favorite categories from her/his search history, and then constructed a user profile based on that history. Simultaneously, a general profile was constructed based on the Open Directory Project category hierarchy. To improve Web search effectiveness, the above profiles are used to deduce appropriate categories for each user query. Similar works were also done by Sendhilkumar and Geetha [94], and Tao et al. [106], but their purposes were different. In particular, Tao et al. built the general profile, named the world knowledge base (based on global analysis), and the instances of a library collection, named the local instance repository (based on local analysis). By reviewing the previous works, we realized that the user profile can be built from diverse data sources. Its purpose is to filter out useless information in accordance with the stored data.
Currently, two sorts of information can be used to build a user profile: explicit and implicit information. In essence, the input query is treated as explicit information, because it is a partial expression of the user's information need [55, 95]. The explicit information is easy to obtain by computing the term frequencies of the input query. On the other hand, the implicit information describes a user's interest and behavior. To capture accurate implicit information, it should be collected with minimal user involvement [98]. This can be done by observing related data, such as page visiting frequency, page visiting time, and other actions including adding a page to bookmarks or setting a page as the home page, etc. [55, 98].
An easy approach to constructing a user profile is to describe the profile through a set of keywords and to require the user to provide the necessary keywords for more adequate descriptions [10]. Although this approach is simple, it demands substantial user involvement: if the user is not familiar with the system, the user might find it difficult to provide keywords that describe the preference appropriately. Furthermore, attempting to become familiar with all the vocabulary in the documents is time-consuming and a heavy burden for the user. Therefore, a more elaborate alternative is expected to accomplish user profiling, one that can not only precisely describe the user preference but also minimize user involvement. The ontological user profile is a possible solution to address this concern because it is composed of a group of concepts, which can benefit preference inference and the study of the user's background knowledge.
2.3 Knowledge Discovery and Pattern Mining
Knowledge Discovery and Data Mining (KDD) [31] is an interdisciplinary area
focusing upon methodologies for extracting useful knowledge from data. The
ongoing rapid growth of online data due to the Internet and the widespread
use of databases have created an immense need for KDD methodologies. The
challenge of extracting knowledge from data draws upon research in statistics,
databases, pattern recognition, machine learning, data visualization, optimiza-
tion, and high-performance computing, to deliver advanced business intelligence
and web discovery solutions.
Knowledge discovery is the non-trivial process of extracting implicit, previ-
ously unknown, and potentially useful information or patterns in data [36]. Ac-
cording to [30], the knowledge discovery can be formally defined as follows: Given
a set of facts (data) F , and a language L, a pattern is an expression S ∈ L that
describes a subset of the data or a model applicable to the subset. A pattern is
called knowledge if it is interesting to some users or criteria. Hence, the process
of knowledge discovery is to extract interesting patterns from the set of facts in
a database.
2.3.1 Process of Knowledge Discovery
The knowledge discovery process typically involves numerous steps with many
decisions made by the user. As shown in Figure 2.2, the steps of knowledge
discovery may consist of the following: data selection, data preprocessing, data
transformation, pattern discovery, and pattern evaluation. Moreover, knowledge
discovery is “the overall process of discovering useful knowledge from data, while
data mining refers to a particular step in this process. Data mining is the appli-
cation of specific algorithms for extracting patterns from data... The additional
steps in the KDD process, such as data preparation, data selection, data clean-
ing, incorporation of appropriate prior knowledge, and proper interpretation of
the results of mining, are essential to ensure that useful knowledge is derived from
the data. Blind application of data-mining methods can be a dangerous activity,
easily leading to the discovery of meaningless and invalid patterns” [30]. These
steps are briefly described as follows:
• Data Selection: This process includes generating a target dataset and
selecting a dataset or a subset of large data sources where discovery is to be
performed. The input of this process is a database and output is a target
data.
• Preprocessing: This process involves basic operations for data cleaning and noise removal. It also includes collecting the information required to model or account for noise, and providing appropriate strategies for dealing with missing data and redundant data.
• Transformation: The preprocessed data needs to be transformed into a predefined format, depending on the data mining task. This process needs to select an adequate type of features to represent the data. In addition, feature selection can be used at this stage for dimension reduction. At the end of this process, a set of features is recognized as a data set.
• Data Mining: This process involves searching for patterns of interest in a
particular representational form or a set of such representations, including
classification rules, trees, and clustering. The user can aid the data-mining
method by correctly performing the preceding steps.
• Pattern Evaluation: The discovered patterns are evaluated to determine
whether they are valid, novel, and potentially useful for users to meet their
information needs. Only those evaluated to be interesting in some manner
are viewed as useful knowledge. This process should decide whether a pattern
is interesting enough to form knowledge in the current context.
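The five steps above can be sketched end-to-end as a chain of small functions over a toy record set. Every name and record here is invented purely for illustration; real KDD pipelines are far more elaborate.

```python
# A minimal sketch of the KDD pipeline: selection -> preprocessing ->
# transformation -> mining -> evaluation. All names are illustrative.

raw_database = [
    {"id": 1, "age": 25, "bought": "milk"},
    {"id": 2, "age": None, "bought": "milk"},   # missing value (noise)
    {"id": 3, "age": 41, "bought": "bread"},
    {"id": 4, "age": 38, "bought": "bread"},
]

def select(db):
    # Data selection: keep only the subset relevant to the discovery task.
    return [r for r in db if r["bought"] in {"milk", "bread"}]

def preprocess(records):
    # Preprocessing: drop records with missing values (one simple strategy).
    return [r for r in records if r["age"] is not None]

def transform(records):
    # Transformation: represent each record by the features the miner needs.
    return [(r["age"], r["bought"]) for r in records]

def mine(features):
    # Data mining: a trivial "pattern" - the majority item per age band.
    young = [item for age, item in features if age < 30]
    older = [item for age, item in features if age >= 30]
    return {"age<30": max(set(young), key=young.count) if young else None,
            "age>=30": max(set(older), key=older.count) if older else None}

def evaluate(patterns):
    # Pattern evaluation: keep only patterns that were actually supported.
    return {k: v for k, v in patterns.items() if v is not None}

knowledge = evaluate(mine(transform(preprocess(select(raw_database)))))
print(knowledge)   # {'age<30': 'milk', 'age>=30': 'bread'}
```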
2.3.2 Tasks of Data Mining
According to the process of knowledge discovery, data mining methods are per-
formed for extracting patterns from data. These methods can have different goals.
It is possible that several methods that have different goals may be applied to
achieve a desired result. For example, to recommend a product (item) to a
particular user, a recommender system might need to group users in the customer
database who have similar tastes to that user, and then predict the rating that
the user would give to the product, based on these similar users.

Figure 2.2: The steps of knowledge discovery in databases
Generally, data mining goals fall under the following categories:
• Classification: Classification is the process of assigning data objects to
predefined categories or classes. It can be viewed as the process of
finding a proper method to distinguish data classes or concepts. Generally,
training data is required for concept learning before classification can
proceed.
• Clustering: Given a set of data objects, clustering is the task of partition-
ing the object set into a finite number of groups such that the objects in
the same group have similar characteristics. In other words, the principle of
clustering is to maximize the intra-class similarity and minimize the inter-
class similarity. The major difference between classification and clustering
is that the latter analyses objects without consulting class labels, whereas
the former needs such information in a supervised setting.
• Summarisation: This task is to analyse a set of data objects and describe
their common or characteristic features. Redundant features are also
removed to generate a set of compact patterns representing the concept of
these objects.
• Association Analysis: Given a set of data objects, the association task is
to find implicit relationships between features (items or attributes) in the
data set with respect to a given criterion. For example, these relations may
be associations between attributes within a data item (intra-patterns) or
associations between different data items (inter-patterns). The investigation
of relationships between items over a period of time is often referred
to as sequential pattern analysis.
2.3.3 Pattern Mining and Association Rules
Pattern discovery is a very useful tool for knowledge discovery tasks. However,
the growing popularity of pattern discovery is often obstructed by the
overwhelming number of patterns returned. The complete collection of patterns
affects not only efficiency, but is also difficult for humans, or even machines,
to explore, because it contains non-informative and redundant patterns. Several
previous attempts have been proposed to solve the issue of overly large output,
but they all fall into two major approaches. The first approach
is to discover informative patterns with a certain significance measure; the
other, more widely used, is to eliminate redundant patterns with a given condensed
representation. With the latter method, large databases can be reduced to only a
few representative data entries. Among such condensed representations, closed
and maximal frequent patterns are the two most commonly used for redundancy
reduction of a large collection of patterns. However, while closed patterns offer
the advantage of recovering support information, it has been found that their
mining result is still too large to be effectively explored. On the other hand, the
mining result of maximal patterns is more concise, but support information is
lost. This leads to a recent approach that balances the trade-off, named
pattern summarization. Given the huge number of patterns generated by mining
processes, the objective of pattern summarization is to summarize a collection
of itemset patterns using only a few representatives, producing a small number of
patterns that a user can handle easily.
With regard to text mining, pattern-based text mining models have recently
emerged as a novel approach to knowledge discovery in text. Based on our
survey, several studies have shown encouraging improvements in the performance of
text mining tasks such as text classification, document clustering, and information
filtering. Nevertheless, we believe that the effectiveness of such systems is
still limited by the explosion in the number of discovered patterns, which makes
global analysis difficult. To enhance interpretability, a novel approach is needed
for concisely representing a large collection of frequent
patterns extracted from text documents.
An association rule is an implication statement that uncovers interesting associations
among items in a set of data items. Association rules were originally
proposed by [2] for market basket analysis, searching for interesting relationships
between shoppers and the items they bought. Currently, association rule mining is
one of the most important and well-studied data mining techniques [43].
2.3.4 Association Rules and Frequent Itemsets Mining
Let I = {I1, I2, . . . , Im} be a set of m distinct items (or attributes), T be a
transaction that consists of a set of items such that T ⊆ I, and D be a database
of transactions. An association rule is an implication of the form
X ⇒ Y , where X, Y ⊂ I are sets of items called itemsets, and X ∩ Y = ∅. X is
called the antecedent and Y the consequent; the rule means X implies Y .
Since a large number of association rules can be generated, two basic
measures are applied to discover interesting ones: support and confidence. The
support reflects the generality of a discovered rule, while the confidence reflects
the certainty or validity of the rule. Given itemsets A, B ⊂ I in D, the support
of the rule A ⇒ B can be calculated by the following equation.
support(A⇒ B) = P (A ∪B) (2.1)
The confidence of the rule can be defined as the following expression.
confidence(A ⇒ B) = P (A ∪ B) / P (A) (2.2)
Generally, an association rule is interesting if it meets both a pre-defined mini-
mum support threshold and a minimum confidence threshold. Currently, there
are various types of interesting association rules proposed based on different in-
terestingness measures [103].
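Equations (2.1) and (2.2) can be illustrated by computing support and confidence directly over a small transaction database; the transactions below are invented purely for illustration.

```python
# Support and confidence of an association rule A => B over a toy database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    # P(itemset): fraction of transactions containing every item in itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    # P(A u B) / P(A): certainty that B occurs whenever A occurs.
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} => {milk}
print(support({"bread"} | {"milk"}))              # 0.6  (3 of 5 transactions)
print(round(confidence({"bread"}, {"milk"}), 2))  # 0.75 (3 of 4 bread baskets)
```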
To mine association rules from large databases, a variety of efficient mining
algorithms have been proposed based on a two-step approach, where frequent itemsets
are generated in the first step and association rules are generated from those
frequent itemsets in the second.
2.3.5 Efficient Mining Algorithms
The first well-known efficient algorithm for finding frequent itemsets is Apriori [4].
This algorithm searches for all frequent itemsets with minimum support and
uses an anti-monotonic relation property to improve the efficiency. However, the
Apriori algorithm is computationally intensive due to the overhead of repeatedly
scanning the database for support counting of all candidates.
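A minimal sketch of the Apriori idea (level-wise candidate generation with anti-monotone pruning) might look as follows. This is an illustrative simplification, not the exact algorithm of [4], and the transactions are invented.

```python
from itertools import combinations

# Minimal Apriori sketch: level-wise candidate generation with the
# anti-monotone property (every subset of a frequent itemset is frequent).
transactions = [{"a", "b", "d"}, {"b", "c", "d"}, {"a", "b", "d"}, {"a", "c"}]
min_count = 2

def apriori(transactions, min_count):
    items = sorted({i for t in transactions for i in t})
    frequent, level = [], [frozenset([i]) for i in items]
    while level:
        # One database scan per level: count candidate supports.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = [c for c, n in counts.items() if n >= min_count]
        frequent.extend(current)
        # Generate (k+1)-candidates; prune any whose k-subsets are infrequent.
        k = len(current[0]) + 1 if current else 0
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent

result = apriori(transactions, min_count)
print(sorted(tuple(sorted(s)) for s in result))
```

Note how each level requires one full pass over the transactions for support counting, which is exactly the repeated-scan overhead mentioned above.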
Another efficient algorithm for mining frequent itemsets is FP-growth [48].
The main advantage of the FP-growth algorithm is that it mines all frequent itemsets
without generating any candidates. To achieve this, the FP-tree, a tree-like structure,
Table 2.1: An example of the vertical format of a database

Item  TID List                                  Absolute Support
A     100, 200, 600, 1000                       4
B     300, 400, 500, 600, 700, 800              6
C     200, 400, 500, 800, 900                   5
D     100, 200, 300, 400, 600, 700, 900, 1000   8
E     200, 700, 800                             3
F     100, 900                                  2
G     800                                       1
is designed to compress the database. The FP-growth algorithm requires only
two database scans to find all frequent itemsets. The first scan discovers
frequent itemsets of size 1 with minimum support, and the second sorts
the frequent items in each transaction in descending order of support values
for constructing the FP-tree.
Once the FP-tree has been built, frequent itemsets are recursively extracted from
it without consulting the original database.
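The two scans can be sketched as follows; FP-tree construction and the recursive mining step are omitted, and the transactions are invented for illustration.

```python
from collections import Counter

# Sketch of FP-growth's two database scans (FP-tree construction omitted).
transactions = [["a", "b", "d"], ["b", "c", "d"], ["a", "b", "d"], ["a", "c"]]
min_count = 2

# Scan 1: find the frequent items of size 1.
counts = Counter(i for t in transactions for i in t)
frequent = {i for i, n in counts.items() if n >= min_count}

# Scan 2: keep only frequent items and sort each transaction by descending
# support, so that shared prefixes compress well in the FP-tree.
ordered = [sorted((i for i in t if i in frequent),
                  key=lambda i: (-counts[i], i)) for t in transactions]
print(ordered)  # [['a', 'b', 'd'], ['b', 'd', 'c'], ['a', 'b', 'd'], ['a', 'c']]
```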
A quite different algorithm for mining frequent itemsets, named Eclat, was
proposed in [128]. While the two previous algorithms mine frequent itemsets in the
original format of the transactional database, Eclat first transposes the original
database into a different format, named the vertical data format, in which each
item is associated with the list of transaction-ids of the transactions containing
it, as shown in Table 2.1.
Once the database has been transposed, Eclat explores the search space by
intersecting the transaction-id lists of items, where frequent (k + 1)-itemsets are
identified from the resulting intersections of the transaction-id lists
of frequent k-itemsets. For instance, given Supmin = 0.3, the itemset BD is
obtained by intersecting the transaction-id lists of items B and D. Like FP-growth,
Eclat does not require repeated database scans, and the support of an itemset is
easy to determine. Continuing the example, if Supmin = 0.3, both items F and G are
infrequent, since their TID lists contain fewer than three transactions.
The main drawback of the Eclat algorithm is that it requires a large amount of
memory to hold a large part of the database.
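The intersection step can be illustrated with the vertical database of Table 2.1, taking Supmin = 0.3 over the ten transactions (i.e. a minimum absolute count of 3):

```python
# Eclat-style support counting on the vertical database of Table 2.1.
tid_lists = {
    "A": {100, 200, 600, 1000},
    "B": {300, 400, 500, 600, 700, 800},
    "C": {200, 400, 500, 800, 900},
    "D": {100, 200, 300, 400, 600, 700, 900, 1000},
    "E": {200, 700, 800},
    "F": {100, 900},
    "G": {800},
}
n_transactions = 10
min_count = 3          # Supmin = 0.3

# Items F and G are infrequent: their TID lists are shorter than min_count.
frequent_items = {i for i, tids in tid_lists.items() if len(tids) >= min_count}
print(sorted(frequent_items))                 # ['A', 'B', 'C', 'D', 'E']

# The 2-itemset BD is obtained by intersecting the TID lists of B and D;
# its support is simply the size of the intersection.
bd = tid_lists["B"] & tid_lists["D"]
print(sorted(bd), len(bd) / n_transactions)   # [300, 400, 600, 700] 0.4
```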
2.3.6 Mining Diverse Patterns
Sequential Patterns
There are many applications involving sequence data such as customer shopping
sequences, Web click-streams, text, and biological sequences. With these applica-
tions, databases are represented as sequences of items or events, recorded with or
without timestamps. Sequential pattern mining was first introduced by [3] to
discover sequential patterns or itemsets from such databases. Here, the problem of
sequential pattern mining can be defined as follows: Given a sequence database,
D = {s1, s2, . . . , sn}, the support of a sequence α is the number of sequences in D
which contain α. If the support of a sequence α satisfies a pre-specified min sup
threshold, α is a frequent sequential pattern.
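The support definition can be sketched as below. For simplicity, each sequence element here is a single item rather than an itemset, which is a simplification of the general problem in [3]; the sequences themselves are invented.

```python
# Support of a sequence alpha in a sequence database D: the number of
# sequences in D that contain alpha as a (not necessarily contiguous,
# order-preserving) subsequence.
def contains(sequence, alpha):
    it = iter(sequence)
    return all(item in it for item in alpha)   # greedy order-preserving scan

D = [["a", "b", "c", "d"], ["b", "d"], ["a", "c", "b"], ["b", "c", "d"]]

def support(alpha):
    return sum(1 for s in D if contains(s, alpha))

min_sup = 2
print(support(["b", "d"]))             # 3
print(support(["b", "d"]) >= min_sup)  # True: <b, d> is a frequent pattern
```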
Since many sequence databases consist of long sequences of items, efficiently
mining sequential patterns from such databases can be quite time-consuming.
As a result, a variety of mining algorithms have been proposed to
improve the mining efficiency, such as GSP [101], PrefixSpan [47], FreeSpan [46],
ApproxMAP [61], and SPADE [129].
Frequent Patterns
Although frequent pattern mining is a very powerful paradigm, a major challenge
in mining frequent patterns from a large data set is that a data mining
process may easily generate many thousands of patterns from a given set of data,
especially when the support threshold is very low. The large number of generated
patterns often makes it difficult for users to examine them. Moreover, many studies
have demonstrated that using all frequent patterns can reduce the performance
of knowledge discovery systems [23, 43].
To overcome this challenge, the data mining community has developed several
pattern mining techniques to improve both the efficiency and effectiveness
of discovered frequent patterns. Generally, the pattern mining methods can be
divided into two main approaches: 1) mining interesting patterns and 2) mining
approximate patterns.
Interesting and Useful Patterns
Although frequent patterns provide highly detailed information about data, a
particular user tends to be interested in only a small subset of them. Many studies
in data mining have contributed to mining interesting patterns or rules to meet user
interests, including constraint-based mining and the use of interestingness measures.
In constraint-based mining, the patterns that satisfy user-specified constraints
are reported as interesting patterns. A variety of constraints have been proposed
to mine particular patterns. Studies have shown that these constraints can be
categorized into a finite number of categories according to how they interact
with the mining process. For example [32], succinct constraints can be pushed
into the initial data selection process, anti-monotone constraints can be pushed
deeply into the mining process to improve the mining efficiency, and monotonic
constraints reduce constraint checking. Currently, new constraints
for mining interesting patterns have been proposed such as pattern set mining as
well as constraints for mining complex patterns [83,135].
Many studies have been conducted on discovering interesting patterns using
interestingness measures. Both objective and subjective measures have been widely
proposed to evaluate the interestingness of discovered patterns [99]. Objective
measures are data-driven, determining the interestingness of a pattern in
statistical terms. For example, support and confidence are two
well-known measures for finding interesting rules. In [41], lift and χ2 were
proposed as interestingness measures for selecting correlation rules. More recently,
the focus has been on concepts from information theory for ranking and selecting
interesting patterns, such as information gain [23] and maximum entropy models [73, 108].
Subjective measures mainly focus on finding interesting or unexpected
patterns relative to a user's prior knowledge. For example, in [51] the user's prior
knowledge is expressed by a Bayesian network. The interestingness of an itemset
is defined as the absolute difference between its support estimated from the data
and from the Bayesian network. In [120], the user's interactive feedback is
treated as prior knowledge to guide the discovery of interesting patterns.
A data-driven approach to selecting rules for a particular user was proposed in [22].
Approximate Patterns
To reduce the huge set of frequent patterns generated in data mining, studies
have focused on mining an approximate set of frequent patterns. Many
approximate-pattern approaches try to reduce redundancy, and can in some sense
be regarded as pattern summarization. In general, they fall into two
main categories: pattern compression and representative patterns.
Earlier approaches focus on pattern compression, aiming to eliminate redundant
patterns. For example, the concept of closed patterns was proposed in [130]
as a lossless compression of frequent patterns. More powerful compression
usually relies on lossy techniques such as maximal pattern
mining [11] or clustering-based compression [121]. However, the results of
pattern compression are still too large for users to interpret and use.
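The distinction between the two condensed representations can be illustrated on a toy database by brute force (real miners compute closed and maximal patterns far more efficiently; the transactions are invented):

```python
from itertools import combinations

# Closed vs. maximal frequent itemsets on a toy database.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
min_count = 2

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = sorted({i for t in transactions for i in t})
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if count(frozenset(c)) >= min_count]

# Closed: no proper superset has the same support (lossless compression).
closed = [f for f in frequent
          if not any(f < g and count(g) == count(f) for g in frequent)]
# Maximal: no proper superset is frequent at all (lossy: supports are lost).
maximal = [f for f in frequent if not any(f < g for g in frequent)]

print([sorted(c) for c in closed])    # [['a'], ['a', 'b'], ['a', 'c']]
print([sorted(m) for m in maximal])   # [['a', 'b'], ['a', 'c']]
```

Here {a} is closed (its support 4 differs from every superset's) but not maximal; dropping it, as the maximal representation does, loses that support information.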
More recently, the focus has been on mining k representative patterns for the
whole collection of frequent ones, such as the top-k most frequent closed patterns,
k-covering sets, and k-pattern profiles [121]. Because many redundant patterns are
generated, the top-k patterns are often not representative of the whole
set of patterns. Some studies have proposed combining interestingness
and redundancy measures to mine the top-k interesting patterns with very low
redundancy. In [60], a number of quality measures were designed for
finding top-k patterns that take the redundancy of patterns into account.
Nevertheless, the challenging issue in mining top-k patterns is how to determine
the best value of k, which is often non-trivial.
2.4 Ontology Learning
Due to increasing product complexity, globalization, and visualization requirements,
the most recent IT approaches suggest the use of systems that allow information
to be captured, structured, and retrieved through a structured and machine-readable
description. The significance of ontologies is growing in the
fields of knowledge management, information integration, cooperative information
systems, information retrieval, and electronic commerce [9]. In essence, an
ontology [98, 106] is an explicit specification of concepts and the relationships
between them. It is effective for knowledge sharing and reuse. The following
reviews cover important work on personalized ontology learning, the LCSH world
knowledge base, and the construction of ontologies.
2.4.1 Ontology-Based Techniques
Generally speaking, an ontology-based technique is richer, more precise, and less
ambiguous than a keyword-based model. It provides an adequate clustering for the
representation of conceptual user interests (e.g. interest in individual instances
such as a sports team, an actor, or a stock value) in a hierarchical way, and can be
a key factor in dealing with the subtleties of user preferences.
An ontology is defined as a collection of concepts and their interrelationships,
which provide an abstract view of an application domain [58, 132]; it is an
explicit specification of a conceptualization. Ontology learning [72] is recognized
as a subtask of information retrieval. The objective is to extract relevant concepts
and relations from a defined corpus or other kinds of data sets to facilitate the
construction of ontologies. One of the main purposes of using ontologies and
data mining techniques is to achieve automatic processes of search, retrieval,
storage, integration, and delivery of information [56].
In recent years, researchers have often held the hypothesis that ontology-based
approaches would perform better than traditional data mining methods,
since ontologies are more discriminative and arguably carry more “semantics”.
As a result, much work has been undertaken to take advantage of ontologies for
enhancing retrieval effectiveness. In 2004, Staab and Studer [102] formally defined
an ontology as a tuple:
(C, R, I, A)
where C denotes a set of concepts; R is a set of relations; I is a set of instances and
A is a set of axioms. Maedche [29, 72] gave another definition that differentiates
relations into hierarchical and plain relations. They also proposed an ontology-
learning framework for the Semantic Web. The framework extends typical
ontology engineering environments by using semi-automatic ontology construction
tools with human intervention, and constructs ontologies adopting the paradigm
of balanced cooperative modelling. In 2009, Yang and Liu [122] introduced the
application of ontologies to represent organizational memory metadata.
The stored information concerns the online support of a given client, and ontologies
are used to facilitate search through information keywords for future use. A
blog was developed to give online customer support. This blog was
a means of facilitating storage, research, sharing, and reuse in companies, and
it was also used for its low cost compared to developing a Web page
defining all the flows required by a client.
Besides, ontology-based techniques are also frequently employed to access common
knowledge and to build user profiles. Zhong [132] proposed a learning approach
for task (or domain-specific) ontologies, which employs various mining techniques
and natural-language understanding methods. Li and Zhong [66] proposed
an automatic ontology learning method, in which a class is called a compound
concept, assembled from primitive classes that are the smallest concepts and cannot
be divided any further. Navigli et al. [79] built an ontology called OntoLearn to
mine the semantic relations among concepts from Web documents. Gauch
et al. [39] used a reference ontology based on the categorization of online portals
and proposed learning personalized ontologies for users. IntelliOnto, developed by
King et al. [59], is built on the Dewey Decimal Classification system and
attempts to describe background knowledge. Sieg et al. [98] utilized an ontological
user profile, based on the user's interaction with a concept hierarchy,
that aims to capture domain knowledge.
The above ontology learning works cover a constrained number of concepts,
and use only the “Is-A” (super-class/sub-class) relation in the backbone
of the built ontology. To overcome this limitation, the backbone
of a personalized ontology should include more relations to simulate the real
concept model.
More recently, Tao et al. [104] proposed an ontology-based knowledge retrieval
framework, namely the ONTO model, to capture user information needs by analysing
general knowledge and a local instance repository. However, they disregarded the
vital step of matching, and assumed that the local information could be ideally
mapped to the proper concepts in a global knowledge base.
Of all these related works, the process in [104] seems the most similar to ours,
but there are two differences: 1) our study adopts relevance feedback as the
expression of user information needs, rather than asking users to specify their
needs manually; and 2) our study provides a sound solution to cope with the
mismatch between local information and global knowledge.
2.4.2 Concept Mapping
Concept mapping has been formally defined in the domain of ontology engineering.
Its purpose is to map a set of concepts into a real (or abstract) system view,
which can facilitate sense-making and meaningful learning. The mapping is used
for tacit knowledge discovery, ontology merging, semantic enrichment,
tag recommendation, and so on. Concept mapping is commonly used
in scenarios of collaborative ontology learning. For example, the work of Gaines
and Shaw [37] describes a scenario in which individual students develop concept
maps for a specific domain of interest and link them to associated materials.
Peers then assess these maps, modify and enhance them, and provide alternative
versions. While many of the tools and methods developed for concept mapping
might also be applied in the context of ontology development, current approaches
have limitations. Based on this example, we found that concept
mapping is often performed as an isolated task, solely focused on the explication
and negotiation of concepts without being embedded within a purposeful activity.
This might hinder people from seeing the mediating and dynamic nature of ontologies.
This section also covers the significant work on concept mapping similarity.
The aim is to determine whether two concepts are relevant when they
occur in the same structure. Several methods have been proposed in the context
of concept mapping, because such a similarity is usually required to specify which
two concepts should be merged.
Kalfoglou and Schorlemmer [54] explored a rational method to measure the
similarity of two concepts in a hierarchy. Here, we give a brief explanation,
because it is fairly close to our approach for measuring concepts. The method
first ensures that two concepts C1 and C2 exist in one bottom-up hierarchy,
where the upward cotopy UC(Ci, Hc) of a concept collects the concept together
with the concepts above it:

UC(Ci, Hc) := {Cj ∈ C | Hc(Ci, Cj) ∨ Cj = Ci}

The similarity between the concepts C1 and C2 can then be computed as:

Sim(C1, C2) := |UC(C1, Hc) ∩ UC(C2, Hc)| / |UC(C1, Hc) ∪ UC(C2, Hc)|
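As a sketch of this measure, with an invented toy hierarchy, UC can be computed by walking up the parent links, and the similarity taken as the Jaccard overlap of the two upward sets:

```python
# Kalfoglou-and-Schorlemmer-style concept similarity: the overlap of the two
# concepts' upward cotopies UC(C, Hc). The hierarchy below is invented.
parent = {                      # child -> parent (the Is-A backbone)
    "soccer": "sports",
    "tennis": "sports",
    "sports": "recreation",
    "chess": "recreation",
    "recreation": None,
}

def upward_cotopy(concept):
    # UC(Ci, Hc): the concept itself plus all of its ancestors.
    uc = set()
    while concept is not None:
        uc.add(concept)
        concept = parent[concept]
    return uc

def sim(c1, c2):
    u1, u2 = upward_cotopy(c1), upward_cotopy(c2)
    return len(u1 & u2) / len(u1 | u2)

print(sim("soccer", "tennis"))   # 0.5: shares {sports, recreation} of 4 concepts
print(sim("soccer", "chess"))    # 0.25: shares only {recreation}
```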
Giunchiglia et al. [40] conducted a matching experiment between two large
scale knowledge organization systems (NALT & LCSH) to find significant cor-
respondences. They aim to test a prototype of a concept-based system after
applying their minimal mapping method. However, their work suffers from
information mismatch caused by the bag-of-words nature of the data: around 50%
of concepts are not present in the background knowledge. Wang et al. [114] proposed a method
using features extracted from training documents to represent the documents, and
then maps them to concepts in a concept hierarchy. They aim to overcome drawbacks
suffered by text classification and feature selection algorithms. However, the method
produces too many candidate terms to represent the training documents (e.g. 7,634
terms for 200 documents). In fact, only a small number of the total features may be
heavily relevant, and using all the features may affect performance. Naphade
et al. [77] introduced a query-to-concept matching approach. The approach finds
concepts in WordNet with the right balance of specificity and generality
to match queries, aiming to break down the semantic space using a small
number of concepts. However, this approach does not provide a scientific
evaluation; more importantly, the matched concepts need to be judged manually by
humans. All the related works discussed apply cosine similarity
to measure the relevance of two concepts, but do not consider applying
relevance features discovered by data mining techniques.
2.4.3 Semantic Relations
Semantic relations are an important element in the construction of ontologies and
models of problem domains. The semantic relations “Is-A” (generalization-
specialization), “Part-Of” (whole-part/aggregation), and “Related-To” (an
association among otherwise unrelated classes) were specified by [104] for a
personalized ontology. In order to enrich the expressive capacity for new concepts
generated within the ontology, Tao et al. [104] proposed an approach using the
specific semantic relations “Is-A”, “Part-Of”, and “Related-To”. This also
facilitates construction that simulates the user's concept model when building
user personalized ontologies. The “Is-A” relation is a direct relation in which
one subject is a kind of another subject; its properties are transitivity and
asymmetry. The “Part-Of” relation is a direct relation describing the relationship
between a compound subject class and its component class; its properties are
likewise transitivity and asymmetry. The “Related-To” relation is a non-taxonomic
relation describing the relationship between two subjects that overlap in their
semantic spaces; its property is symmetry.
Figure 2.3 produced by Tao et al. [104] provides a precise overview of defined
semantic relations:
Figure 2.3: The constructed ontology with the proposed semantic relations
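These relations and their properties can be sketched with a tiny relation store over invented subjects: transitivity allows Is-A and Part-Of links to be chained, while Related-To pairs are stored symmetrically.

```python
# Sketch of the three semantic relation types and their properties.
is_a = {("soccer", "sports"), ("sports", "recreation")}   # transitive, asymmetric
part_of = {("wheel", "car"), ("car", "fleet")}            # transitive, asymmetric
related_to = {frozenset({"soccer", "fitness"})}           # symmetric

def transitive_closure(pairs):
    # Repeatedly chain (a, b) and (b, d) into (a, d) until nothing changes.
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

print(("soccer", "recreation") in transitive_closure(is_a))   # True: chained Is-A
print(("wheel", "fleet") in transitive_closure(part_of))      # True: chained Part-Of
# Symmetry of Related-To: the order of the two subjects does not matter.
print(frozenset({"fitness", "soccer"}) in related_to)         # True
```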
2.4.4 Specificity and Exhaustivity Ontology Mining
For ontology mining, Tao et al. [105] proposed an innovative two-dimensional
ontology mining method, called Specificity and Exhaustivity. This method is
important to our research because it is proposed to achieve a zoomable effect.
Specificity describes the focus of a semantic subject's meaning on a given topic,
whereas Exhaustivity restricts the extent of the semantic meaning covered by a
subject corresponding to the topic. The method [105] aims to analyse the
semantic relations held by the subjects existing in the ontology with reference to
a topic, and the following Figure 2.4 is used for illustration:
Figure 2.4: The concept of Specificity and Exhaustivity

Taking the topic of sports as an example, Figure 2.4 helps to explain
the internal relationships between the two dimensions. The extent of sports
involves soccer, whereas the US and UK soccer teams both belong to the field
of soccer. Therefore, if the value of Exhaustivity increases, the preliminary
extent becomes more general. On the other hand, Specificity has the inverse
function of Exhaustivity: with an increased Specificity value, the extent of the
topic becomes more focused and specific. The idea is similar to the Google Maps
application (while zooming in, the location shown becomes more specific; zooming
out makes it more general). It can also be related to the generalization-based
data mining model [45], which constructs a set of sophisticated generalization
operators for the generalization of complex data objects.
2.4.5 Knowledge Base: LCSH and its Subject Headings
LCSH comprises a thesaurus of subject headings covering subjects or topics in
world knowledge, applied to every information item within a library's
collection. The LCSH classification facilitates a user's access to items in the
library catalogue that pertain to similar subject matter, and can improve the
efficiency and efficacy of retrieving useful information items from a library
collection [18]. Work on creating the LCSH list started in 1898, and the first
edition was published in 1909. The latest edition (the 27th) was published in June
2004, and is the edition selected for our research. It has over 270,000
usable headings, and over 490,000 headings used for cross-referencing; this
edition added 7,200 new terms to the list. Originally, LCSH was designed as
a controlled vocabulary for representing the subject and form of the books and
serials in the Library of Congress (LC)∗ collection, with the purpose of providing
subject access points to the bibliographic records contained in the Library
of Congress catalogues. As an increasing number of other libraries have adopted
the LCSH system, it has become a tool for subject indexing of library catalogues
in general. In recent years, it has also been used as a tool in a number of online
bibliographic databases outside the LC.
With regard to LCSH specifically, library experts have argued about its stability
and consistency for over four decades. A basic question is whether a new controlled
vocabulary, better suited to the requirements of electronic resources, should be
constructed [19]. An authoritative subcommittee considered and examined the
options, such as developing a new vocabulary or adopting more of the existing
vocabulary. The second was recommended, which means using LCSH, with
modifications, as a general controlled vocabulary covering all subjects. However,
LCSH is still widely used for diverse research purposes. Some of the main
reasons [18] are:
• LCSH is a rich vocabulary whose headings cover all subject areas, making
it a large general indexing vocabulary in English. It is a pre-coordinate
system that ensures precision in retrieval. More specific subjects make
the classification more appropriate for representing world knowledge.

∗http://www.loc.gov/index.html
• To indicate the subject’s relationship precisely, LCSH contains rich semantic
relations among terms.
Having investigated and compared LCSH with other library classifications
(the Library of Congress Classification and the Dewey Decimal Classification), we
found LCSH more than suitable for our research purposes. It provides the ideal
elements for learning an ontology: the hierarchical system can provide a taxonomic
backbone, and the summarized information items can provide an expert-specified
information set [8, 71]. The taxonomic backbone refers to the hierarchical structure
of the classes forming an ontology; an ontology can thus be based on a taxonomic
backbone formed by subject headings.
Subject headings are a type of controlled vocabulary used to make searching
predictable by using a single term to describe a subject [19]. For
example, the American Revolutionary War can be described using the following
associated terms [75]:
association terms [75]:
American Revolution
American Revolutionary War
Revolution, American
Revolutionary War, American
War of the American Revolution
American War of Independence
War of American Independence
War of Independence
By using a controlled vocabulary, all materials about the American Revolutionary
War are listed under one heading in LCSH:
United States – History – Revolution, 1775-1783
It also creates consistent subdivisions:
United States – History – Revolution, 1775-1783 – Battlefields
United States – History – Civil War, 1861-1865 – Battlefields
World War, 1914-1918 – Battlefields
Using a standardized controlled vocabulary allows searches to be done quickly and
with more accuracy. Also, since most libraries use the same controlled vocabulary
in their catalogs, only one vocabulary style needs to be learned.
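As a sketch of how a controlled vocabulary collapses variant terms onto one authorized heading (the mapping below is illustrative, not taken from the actual LCSH authority files):

```python
# Illustrative sketch: a controlled vocabulary maps many variant terms
# for the same subject onto a single authorized heading.
HEADING = "United States -- History -- Revolution, 1775-1783"
VARIANTS = {
    "american revolution": HEADING,
    "american revolutionary war": HEADING,
    "revolution, american": HEADING,
    "war of the american revolution": HEADING,
    "american war of independence": HEADING,
}

def authorized_heading(term):
    # Normalize the user's term, then look up the authorized heading.
    return VARIANTS.get(term.strip().lower())

print(authorized_heading("American Revolutionary War") == HEADING)  # True
```

Because every variant resolves to the same heading, all materials about the subject are gathered under one entry, as described above.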
In the LCSH, the types of subject heading are grouped by the number of words
that comprise the heading, and also by word order [17]. Table 2.2, drawn
from the related works, helps to illustrate the types of subject headings:
Heading type                          Example
one-word                              Animals
two-word                              Domestic animals
two-word, inverted                    Animals, Fossil
with subdivision                      Animals–Mortality
with qualifier                        Adaptation (Biology)
three+ words:
  AND expresses a relationship        Computers and college students
  AND combines similar headings       Educational tests and measurements
  Prep. phrase, normal word order     Education of princes in literature
  Prep. phrase, inverted              Plants–Effect of pesticides on

Table 2.2: Types of subject headings
Most names (personal, corporate, events, etc.) are not listed in the LCSH. Geo-
graphical names are traditionally inverted to keep the generic terms from being
the main focus [75].
2.4.6 Ontology Construction Language
In order to understand how to create a domain ontology, this section introduces
a range of knowledge regarding the fundamental ontology languages. At present,
the Web is primarily composed of documents written in semi-structured formats
like HTML (Hyper Text Markup Language) and XML (Extensible Markup
Language). HTML and XML are useful for visual presentation and data
exchange, but they say little about the meaning of the content.
The next generation of the Web will combine existing Web technologies with
knowledge representation formalisms [42]. This statement suggests that the
information on the Web is defined not only for display purposes, but also for
enhancing the interoperability and integration between systems and applications.
Various new approaches and principles regarding data structure, interoperation
between learning concepts, and domain modelling in terms of XML, the
Resource Description Framework (RDF), and OWL have been developed and
implemented as the semantic Web has progressed. The semantic Web is a
suitable way to address the issues mentioned above, because it can express
explicit meanings for interpretation. Another crucial element required
for the semantic Web is the Web Ontology Language (OWL), which is a popular
World Wide Web Consortium (W3C)† ontology language standard. OWL can
formally describe the semantics of classes, properties, and instances in a logic
language, so that detailed, accurate, consistent, sound, and meaningful ontologies
can be constructed [6]. Therefore, a precise objective of the semantic
Web can be described as making possible the processing of Web information by
computers. Berners-Lee et al. [14] defined that
The Semantic Web is not a separate Web but an extension of the
current one, in which information is given well-defined meaning, better
enabling computers and people to work in cooperation.
For our research, we need to clearly identify the relations between classes and sub-
classes of the specific personalized ontology. A number of the specified relations
are expected to be presented, including the relations of “is-a”, “part-of”, and
“related-to”. In the semantic Web, ontologies play a key role in the consensual
and formal conceptualizations of domains, enabling knowledge sharing and reuse.
In turn, the semantic Web has sufficient expressiveness to represent the logical
†http://www.w3.org/
relations of ontologies [81]. In particular, OWL, which is in the semantic Web
family released by the W3C in 2004, is designed for describing ontologies in
applications [12]. As mentioned previously, the purposes of OWL are not only
to present information to humans, but also to allow computers to process the
content of that information. The following section analyzes in detail why we
adopt OWL as our ontology description language, after comparing it with XML
and RDF (Resource Description Framework)/RDFS (Resource Description
Framework Schema).
The aims of this section are: 1) to outline the reasons why the proposed
research adopts OWL to describe the specified ontologies, 2) to compare its
expressive capability with XML and RDF, 3) to present a range of constructors
and axioms of OWL, and 4) to discuss some related works that have been done
with OWL for further study. Some relevant background knowledge is also
covered in this section to make sense of the construction of OWL.
Resource Description Framework Schema
RDF [74] is a language defined by the W3C for describing resources, such as Web
pages. It can be used to make simple assertions about web resources or any other
entity that can be named [81]. A simple assertion is a statement that an entity has
a property with a specific value. For example, a university has an abbreviation
property with the value “QUT”. Because RDF is limited to such simple
assertions, it is only used to present resources or portions of content in an
ontology. RDF has a well-defined syntax, parts of which were extended into RDF
Schema (RDFS) and, via DAML+OIL, into OWL. With these extensions, RDFS
provides property and class constructs, which enable it to build simple
ontologies hierarchically. The main
constructors of RDF Schema are presented in Table 2.3:
Classes and        rdfs:Class
subclasses         rdfs:subClassOf

Properties         rdfs:domain       declares the class of the subject in a triple
                                     using this property as predicate
                   rdfs:range        declares the class or datatype of the object
                                     in a triple using this property as predicate

Utility            rdfs:seeAlso      indicates a resource that might provide
properties                           additional information about the subject
                                     resource
                   rdfs:isDefinedBy  indicates a resource defining the subject
                                     resource

Others             rdfs:label
                   rdfs:comment
                   rdfs:Literal      the class of literal values
                   rdfs:Datatype     the class of datatypes

Table 2.3: RDF Schema constructors
The property entries in this table need to be understood before using OWL,
since OWL distinguishes two kinds of properties: object properties and data
properties. The way a property is defined in RDF is quite different from OWL.
Both object properties and data properties will be discussed in detail later in this
review.
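As a minimal sketch of how these constructors are used, the following represents RDF statements as plain subject–predicate–object triples (a toy stand-in for an RDF toolkit; the class names are hypothetical):

```python
# A minimal sketch of RDF triples and the RDFS constructors from
# Table 2.3, using plain Python tuples instead of an RDF toolkit.
triples = {
    ("DomesticAnimal", "rdfs:subClassOf", "Animal"),
    ("Animal", "rdfs:subClassOf", "LivingThing"),
    ("Animal", "rdfs:label", "Animal"),
}

def subclasses_of(cls):
    # Collect the direct subclasses asserted via rdfs:subClassOf.
    return {s for s, p, o in triples if p == "rdfs:subClassOf" and o == cls}

print(subclasses_of("Animal"))  # {'DomesticAnimal'}
```

The `rdfs:subClassOf` triples alone are enough to build the simple class hierarchies mentioned above.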
There is another major reason why we do not use RDF in our proposed re-
search. Various limitations of the expressive power of RDF Schema, outlined
by Antoniou and Van Harmelen [6], are listed below. In our
project, OWL is used to address these limitations:
• Local scope of properties: e.g. cows eat only plants
• Disjointness of classes: e.g. male and female are disjoint
• Boolean combinations of classes: e.g. the class person is the disjoint union
of the classes male and female
• Cardinality restrictions: e.g. exactly, at least
• Special characteristics of properties: e.g. transitive (“greater than”), unique
(“is mother of”), or the inverse of another property (“eats” and “is eaten by”)
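For illustration, in the description-logic notation underlying OWL (a sketch; the class and property names are hypothetical), such axioms can be written as:

```latex
% Local scope of properties: cows eat only plants
\mathit{Cow} \sqsubseteq \forall \mathit{eats}.\mathit{Plant}
% Disjointness of classes: male and female are disjoint
\mathit{Male} \sqcap \mathit{Female} \sqsubseteq \bot
% Boolean combination: person is the disjoint union of male and female
\mathit{Person} \equiv \mathit{Male} \sqcup \mathit{Female}
% Cardinality restriction: every person has exactly one mother
\mathit{Person} \sqsubseteq (= 1\, \mathit{hasMother})
```

None of these axioms can be stated in RDF Schema, which motivates the move to OWL discussed next.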
As we mentioned in the previous section, OWL is an extension of RDF/RDFS
based on the same syntax and semantic representations. The majority of their
constructions, applications, and implementations are similar. However, OWL is
more expressive, because it provides two kinds of proper-
ties for describing ontologies. This offers an effective way to express otherwise
ambiguous meanings precisely within ontologies. Firstly, object properties,
which relate objects to other objects, describe general logical conceptu-
alizations. For example, a student is taught by her/his teacher. There are two
objects in this statement: one is the student, and the other is the teacher; the
relation, or property, is “isTaughtBy”. Secondly, data properties relate objects
to datatype values, that is, values belonging to the objects themselves. For
example, a person has a name, an age, and a gender; their values are the data
properties of the object “person”. In this case, by defining various properties of
these two types, ontologies can be represented more accurately and precisely.
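The distinction can be sketched with plain objects; this is an illustrative analogy in Python, not OWL syntax, and the names are hypothetical:

```python
# Illustrative analogy for OWL's two property kinds:
# data properties relate an object to literal values,
# object properties relate an object to another object.
class Person:
    def __init__(self, name, age):
        self.name = name  # data property: a string value
        self.age = age    # data property: an integer value

class Student(Person):
    def __init__(self, name, age, teacher):
        super().__init__(name, age)
        self.is_taught_by = teacher  # object property: links to a Person

teacher = Person("Dr. Smith", 45)
student = Student("Alice", 20, teacher)
print(student.is_taught_by.name)  # Dr. Smith
```

The "isTaughtBy" relation links two objects, whereas name and age carry datatype values, mirroring the two property kinds described above.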
Figure 2.5 highlights the differences between RDF/RDFS and OWL:
Figure 2.5: Subclass relationships between OWL and RDF/RDFS [6]
Web Ontology Language
Baader et al. [9] argued that, without basic research in Description Logics (DLs)
over the last decade, knowledge representation languages could not have played
such an important role in the field of ontology. For this reason, this part of the
review treats description logics as a preliminary and fundamental topic before
starting to study OWL. DLs can represent the knowledge of an application
domain in a structured and formal way, using a range of constructors and various
kinds of specified expressions. For instance, suppose one wants to define the
concept of “a female human who is married to a professor and has at least two
children”. In description logic this is expressed as:

Human ⊓ Female ⊓ ∃married.Professor ⊓ (≥ 2 hasChild)
Since DLs provide both a well-defined semantics and pow-
erful reasoning tools, they are ideal candidates for ontology languages,
including the new OWL standard. DL-based reasoners, such as RACER and
FaCT++, can be used to check the consistency and the intended logical
relations of an ontology [9]. Throughout the whole proposed research, it is
important to keep in mind the basic concepts of DL processing with OWL
constructors.
Much effort has also been made by the W3C to develop an appropriate OWL,
resulting in OWL as the current standard [6]. It is noted that OWL is an
extension of RDF. As the current standard ontology language, OWL
allows users to express explicit, formal conceptualizations of domain models. An-
toniou & Van Harmelen [6] stated that a number of requirements are satisfied
by OWL, which are listed as follows:
• A well-defined syntax
• A well-defined semantics
• Efficient reasoning support
• Sufficient expressive power
• Convenience of expression
These satisfied requirements demonstrate the capability and the benefits of using
OWL. The importance of a well-defined syntax is clear, because it is the
precondition for machine processing of information. OWL builds on the
DAML+OIL syntax, which in turn builds on the syntax of RDF. A well-defined
semantics is necessary to describe precisely the meaning of knowledge. Reasoning
support follows from the formal semantics; several reasoners, such as RACER
and FaCT++, can be utilized with OWL. Antoniou & Van Harmelen [6] stated
that the aims of this support are to check the consistency of the ontology and
the knowledge, to check the relationships between classes, and to classify
instances automatically into classes.
OWL was divided by the W3C's Web Ontology Working Group into species that
fulfill different requirements [110]. There are three species in total: OWL Full,
OWL DL, and OWL Lite. OWL Full is the entire language, and uses all
the OWL language primitives. OWL DL is a sublanguage of OWL Full that
restricts the way in which the constructors from OWL and RDF can be used;
the advantage is that it permits efficient reasoning support. For this reason, the
OWL DL species enables us to accomplish our proposed research. OWL Lite is
a sublanguage of OWL DL with even more restrictions; however, it is easier to
implement. The detailed comparisons among these three species are
outlined in Table 2.4:
                   OWL Full                       OWL DL                        OWL Lite
Expressive power   High. Uses all the OWL         Medium. Application of        Low.
                   language primitives; also      OWL's constructors to each
                   allows the combination of      other is essentially
                   these primitives in            disallowed, thus ensuring
                   arbitrary ways with RDF        that the language
                   and RDF Schema.                corresponds to a
                                                  well-studied description
                                                  logic.
Reasoning support  Low                            Medium
Upward             Any legal RDF document is      An RDF document will in
compatibility      also a legal OWL Full          general have to be extended
with RDF           document; any valid            in some ways and restricted
                   RDF/RDF Schema conclusion      in others before it is a
                   is also a valid OWL Full       legal OWL DL document.
                   conclusion.                    Every legal OWL DL
                                                  document is a legal RDF
                                                  document.

Table 2.4: Comparison between the three species of OWL

This section has covered much fundamental knowledge about OWL. At the begin-
ning of the section, the review summarized the context of OWL development
in the semantic Web. Then, as with the other existing Web resource description
languages, XML and RDF/RDFS were analyzed critically. The objective was to
find out which language is suitable for representing personalized ontologies in our
research. Through comprehensive and systematic comparisons, OWL was demon-
strated to be the most appropriate ontology language for this research. Finally,
general knowledge of OWL was summarized, including language syntax,
description logics, constructors, reasoning support, and the three species (OWL
Full, OWL DL, and OWL Lite).
2.5 Summary
This chapter presented several key issues from the conducted literature review.
The literature review introduced the basic techniques of current information
retrieval models. The challenges of information mismatch and overload remain
unsolved by current works. The literature review also pointed out that the key
to gathering meaningful and useful information for Web users is to improve Web
information gathering techniques from keyword-based to concept-based.
The literature review in this chapter also noted the issues in Web person-
alization, focusing on user profiling, feedback, and user information needs in
personalised Web information gathering. The related work confirmed that the
concept-based models using both user local instances and background knowledge
can help gather useful and meaningful information. However, the representation
and acquisition of user profiles need to be improved along with the effectiveness
of user information need capture.
The literature review also presented the essential processes and tasks of data
mining, especially the related methods and association rules for mining diverse
patterns in text mining.
Since the significance of ontologies is rising in the IR and data mining
communities, the literature review covered ontology-based techniques, ontology
learning, and semantic relations. The literature review indicated that ontologies
can provide a basis for matching user information needs against the existing
concepts and relations. This helps to acquire conceptual user profiles. User back-
ground knowledge can be specified by using personalised ontologies. However,
the existing ontologies and ontology learning methods have limitations that need
to be addressed by an approach that matches global concepts with local
information. Finally, OWL was indicated as the best candidate for encoding the
ontologies.
Chapter 3
Ontology-Based Technique for
Search Intent Mining
This chapter presents an ontology-based approach to discovering user search
intentions. We first hypothesize that a user's search intent exists somewhere in
an ontology. It serves as the user's information need and is represented by a
specific range of knowledge, which can be discovered via a reasonable method.
In order to avoid placing extra burdens on the user, we attempt to minimize user
involvement so that only a query is required as input for the proposed approach.
Then, a novel ontology-based mining approach is introduced to extract person-
alized ontologies by identifying adequate concept levels for potential user search
intentions. An iterative mining algorithm is designed to evaluate potential
intents level by level until the best result is met. Finally, the proposed
approach is evaluated on the large RCV1 data set, and experimental re-
sults indicate a distinct improvement in top precision compared with baseline
models.
3.1 Background and Overview of the Approach
For an effective search engine, retrieving the information desired by the
user is the primary objective, and finding the optimal method of information
retrieval has motivated researchers for many years. To improve existing
search capabilities, a series of advanced algorithms and processes, along with
solid experimental support, have been developed [29]. Generally speaking, re-
searchers have directed their attention to the techniques associated with indexing,
matching, and re-ranking procedures [55]. The related contributions are helpful
and useful for enhancing search effectiveness. However, a significant issue,
ambiguity [65, 68], continues to hinder the discovery of user search intentions.
User search intent is a significant aspect of the user’s motivation to obtain use-
ful information. No doubt it is a crucial expression of user information needs [97].
According to recent investigations [104, 106, 134], Web search intentions can be
studied in two aspects, namely specificity and exhaustivity. Specificity de-
scribes how focused the topic is, that is, the user's interests have a narrow,
focused goal; whereas exhaustivity describes the opposite extent, that is, the
user's interests have a general, wider scope. However, a hard problem
is how to discover precise user intentions. Is there a feasible approach to
capturing the two-dimensional intents above? An existing method is quantifi-
cation, that is, using the relevance weight of a pattern [134]. Another possible
solution
is to discover and characterize search intents in a concept hierarchy. Dakka and
Ipeirotis [28] evaluated a system that extracts facets from documents and then
populates them into a hierarchy. In faceted search, items are first classified into
one or many groups named facets (the same as our concepts), and the user may
narrow or expand her/his search intent using those facets. The clustering
algorithm uses external online encyclopedias as sources of knowledge to identify
significant words/phrases, which enrich the clustering quality. Moreover,
Fontoura et al. [34] introduced a model of query relaxation along multiple
hierarchical taxonomies. The authors focus on classifying documents into
taxonomy nodes (which we call concepts) and employing a scoring function to
implement the matching in practice. These related works indicate that diverse
hierarchies are useful for locating items of interest to the user.
In a similar vein to these related works, our task is to propose a framework
for relaxing user search intents over ontologies. The ontologies are more
semantic than taxonomies, and are entirely defined by the information need. A
number of levels are partitioned for storing different types of concepts. This
information might be related to other levels by tracking ontological references
in a common knowledge base. A user search intent is assumed to exist at each
of the levels. The higher the level in the hierarchy, the broader the extent it
covers. Conversely, the extents are more specific in the lower levels. The idea
is similar to a zooming process.
The proposed approach aims to apply a group of related concepts to describe
the user's information need. We believe that interpreting user information needs
by a set of concepts is more discriminative than a short query.
Figure 3.1: Overview of the ontology-based approach
The general architecture of our ontology-based mining approach is illustrated
in Figure 3.1. To simulate search activities in the real world, users are first
required to translate their implicit search intents into explicit queries. To cope
with the produced queries, a subject-based search model is developed in order
to retrieve matched results from the enormous world knowledge database. The
function of the model is fairly similar to a keyword-based search, except that
the returned results are a list of subject headings in LCSH. For all matched
subjects, a standard pre-processing method is conducted, following the Natural
Language Processing (NLP) practice in IR. The pre-processing mainly consists
of stemming and the removal of stop-words to reduce bias. A particular semantic
extraction method is carried out for all the matched subjects in order to build
personalized ontologies. The reason to build the personalized ontologies is based
on our belief that using just one ontology for all the users is not appropriate,
and that it is better to construct the customized one based on the individual
user’s preferences. After that, all the terms occurring in the subjects are used
to do a query expansion, and then find semantically similar matches, rather than
only lexically matched results, by taking the extracted relations into account.
The related methods for defining a world knowledge base, learning a personalized
ontology, and the in-level ontology mining method are each explained in the
following subsections. Eventually, an objective experiment is designed to
evaluate the approach on the standard RCV1 test set. A number of comparisons
are also made to highlight the precision improvement of the search results.
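The query-expansion step mentioned above can be sketched as follows (an illustrative fragment; the term sets are hypothetical, not drawn from the actual LCSH data):

```python
# Sketch of query expansion: terms from the matched subject headings
# are merged into the original query's term set.
def expand_query(query_terms, matched_subjects):
    expanded = set(query_terms)
    for subject_terms in matched_subjects:
        expanded |= set(subject_terms)  # add every term from each subject
    return expanded

query = {"economic", "espionage"}
matched = [{"business", "intelligence"}, {"trade", "secrets"}]
print(sorted(expand_query(query, matched)))
# ['business', 'economic', 'espionage', 'intelligence', 'secrets', 'trade']
```

The expanded term set can then be matched semantically against documents, rather than relying on the short original query alone.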
3.2 World Knowledge Base
Understanding a user’s background information is significant for the retrieval
of accurate and personalized search results. Many research efforts have been
devoted to gathering and analysing this information. However, current
techniques designed to address the issue, such as user profiling and relevance
feedback, are time-consuming and cumbersome. In this section, an alternative
is presented that simplifies this process by taking advantage of a comprehensive
world knowledge base. World knowledge is the common-sense knowledge
acquired by people through their experience and education. It can be considered
an exhaustive repository of the knowledge known to human beings [27, 127].
Based on the variety
of types of knowledge, we classify all the world knowledge into three categories:
inherent knowledge, implicit knowledge, and explicit knowledge. The
relationships among these knowledge types can be observed in Figure 3.2. In
the circle, implicit knowledge is treated as the user background knowledge in
our proposed approach, which can be acquired through a particular mining
method, as described later.
Figure 3.2: Knowledge classification
As illustrated in the diagram, the different types of knowledge can be converted
through specified processes. The proposed approach follows the path from
inherent knowledge to implicit knowledge to explicit knowledge. Inherent
knowledge starts from the world knowledge side. Here, a universal ontology, the
Library of Congress Subject Headings (LCSH), is adopted to store and arrange
the world knowledge in a taxonomic and systematic structure. Beyond the
LCSH ontology, existing universal ontologies have different characteristics that
satisfy individual criteria. Tao et al. [104, 105] provided a general comparison
of the major world knowledge taxonomies applied in previous studies.
Based on their comparative results, the LCSH is determined to be an ideal world
knowledge representation, or world knowledge base, because it is a rich
vocabulary using headings to cover all subject areas. It indicates subject
relationships precisely and contains rich semantic relations among terms. In
the LCSH, subject headings are the basic semantic units for conveying domain
knowledge and concepts. They have four main types of references: broader term
(BT), narrower term (NT), used-for term (UF), and related term (RT). A BT is a
narrower term (NT), used-for Term (UF), and related term (RT). A BT is a
hypernym, that is, a more general term; for example, “pressure” is a generaliza-
tion of “blood pressure”. An NT is a hyponym, that is, a more specific term; for
example, “economic crisis” is a specialization of “crisis”. These two references
are used in our model to indicate the is− a relation among subjects in the world
knowledge base. To facilitate the in-level ontology construction later in this
approach, these references are redefined as the ancestor and descendant lexical
relations, respectively: ancestor corresponds to BT, and descendant to NT.
More information about these terms can be found in Table 3.1. All subjects
are formalized as:
Type        Paraphrase               Example
Ancestor    is the general term for  “profession is the general term for scientist”
                                     =⇒ Ancestor(profession, scientist)
Descendant  is-a                     “scientist is a profession”
                                     =⇒ Descendant(scientist, profession)

Table 3.1: Examples for redefined relations
Definition 1 (Subject): Let S denote the set of subject headings in LCSH. A
subject s ∈ S is formalized as a triple (label, ancestor, descendant), where
• label is the heading of s in the LCSH thesaurus;
• ancestor is a function giving the subjects that are more general than s and
located at a higher level in the world knowledge base;
• descendant is a function giving the subjects that are more specific than s and
located at a lower level in the world knowledge base.
Definition 2 (is− a relation): An is− a semantic relation is formally repre-
sented in the LCSH by the reciprocal references ancestor and descendant, which
provide a hierarchic parent-child relationship between two subjects.
Definition 3 (Attribute): Let Pij be the binary predicate, “i is-a j”, hence
we can define the following axioms on P :
• Reflexivity: Pxx;
• Transitivity: Pxy ∧ Pyz ⇒ Pxz;
• Anti-symmetry: Pxy ∧ Pyx ⇒ x = y.
The anti-symmetry axiom implies that the is−a relation is unidirectional. The
transitivity axiom in Definition 3 is crucial in acknowledging that the part− of
relation (referred to as UF terms before) is also hierarchic in nature. Note that
is − a and part − of are similar insofar as they both “involve membership of
individuals in a larger set”; however, “meronymic relationships...are determined
on the basis of characteristics that are extrinsic to the individual members them-
selves”, whereas parent-child relationships are “determined by similarity to other
members based on intrinsic characteristics” [112]. At this stage, only one
relation, r = (is − a), is considered by our approach, and the world knowledge
base can be formalized as:
Definition 4 (World Knowledge Base): A world knowledge base ontology is
a directed acyclic graph structure defined as a pair Θ := (S, r), consisting of
• S is a set of subjects in LCSH S = {s1, s2, ..., sn};
• r is the semantic relation r = (is− a) existing among the subjects in S;
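The structure of Definitions 1–4 can be sketched in code. The following is an illustrative Python sketch (the class and method names are assumptions, not the thesis's actual implementation): subjects carry ancestor/descendant references, and the is-a query applies the reflexivity and transitivity axioms of Definition 3.

```python
from dataclasses import dataclass, field

@dataclass
class Subject:
    label: str
    ancestors: set = field(default_factory=set)    # direct BT references
    descendants: set = field(default_factory=set)  # direct NT references

class WorldKnowledgeBase:
    """A directed acyclic graph of subjects linked by the is-a relation."""
    def __init__(self):
        self.subjects = {}

    def add_subject(self, label):
        self.subjects.setdefault(label, Subject(label))

    def add_is_a(self, child, parent):
        # "child is-a parent": parent becomes an ancestor (BT) of child.
        self.add_subject(child)
        self.add_subject(parent)
        self.subjects[child].ancestors.add(parent)
        self.subjects[parent].descendants.add(child)

    def is_a(self, x, y):
        # Transitive, reflexive closure of is-a (Definition 3).
        if x == y:
            return True  # reflexivity: Pxx
        seen, stack = set(), [x]
        while stack:
            s = stack.pop()
            for parent in self.subjects[s].ancestors:
                if parent == y:
                    return True  # transitivity: follow ancestor chains
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return False

kb = WorldKnowledgeBase()
kb.add_is_a("scientist", "profession")
kb.add_is_a("physicist", "scientist")
print(kb.is_a("physicist", "profession"))  # True, by transitivity
```

Anti-symmetry is not enforced here; it holds by construction as long as the reference data remains acyclic.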
3.3 Personalized Ontology Learning
3.3.1 Subject-Based Model
As seen in Figure 3.1, a subject-based search model was developed in the Java
programming language to serve the preliminary study. The purpose of this
model is to retrieve relevant subjects and prune irrelevant ones. When a user
enters a query, the model mimics a term-based model, which is good at
returning the subjects overlapping with the query. A collection of the LCSH
library from QUT in 2006 is specified as the database for the model. It is im-
portant to be aware that both exactly matched subjects (e.g. subjecti = {t1, t2},
query = {t1, t2}, then subjecti = query) and partially overlapping subjects (e.g.
subjectj = {t2, t3}, subjectj ∩ query ≠ ∅) are returned simultaneously. This
is because both “AND” and “OR” operators are employed while executing the
model. This process might incur a redundancy issue due to the large number
of overlapping subjects. However, from the IR perspective, it extends the
scope to cover potential user intents with correlative information. To illustrate
the subject-based model, Figure 3.3 provides a screenshot.
Figure 3.3: A screenshot of the subject-based model
As shown in the figure, we can easily observe that all the matched subjects and
their corresponding semantic relations are displayed on the screen. In the right
corner, a clear number indicates how many satisfactory subjects are gathered
from the database. Along with the subjects, there might exist various semantics
that can be utilized to point out the correlations with other subjects in the LCSH
database. To recognize the special representation of the embedded semantics, a
suitable method is designed for extraction and future uses.
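The matching behaviour described above can be sketched as follows; the subject data and function names are hypothetical, not the actual Java implementation:

```python
# Sketch of the subject-based matching step: subjects and queries are
# treated as sets of terms; a subject is returned whenever it shares at
# least one term with the query ("AND"/"OR" behaviour).
def match_subjects(query_terms, subjects):
    query = set(query_terms)
    exact, partial = [], []
    for label, terms in subjects.items():
        terms = set(terms)
        if terms == query:
            exact.append(label)    # subject_i = query
        elif terms & query:
            partial.append(label)  # subject_j ∩ query ≠ ∅
    return exact, partial

subjects = {
    "Economic espionage": {"economic", "espionage"},
    "Espionage": {"espionage"},
    "Economic policy": {"economic", "policy"},
}
exact, partial = match_subjects({"economic", "espionage"}, subjects)
print(exact)            # ['Economic espionage']
print(sorted(partial))  # ['Economic policy', 'Espionage']
```

The partial matches are the source of the redundancy noted above, but they also widen the scope to cover potential user intents.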
3.3.2 Semantic Extraction
Extracting semantic relations is a preliminary step before constructing a domain
ontology. Without the extraction, it is difficult to understand the logical rela-
tions among subjects or other instances [14]. It is also important to convert
information in such a way that it is readable and perspicuous to both humans
and computers [58]. This section presents an overview of the method to obtain
the desired semantics and to deal with their unique and original representations
defined by the LCSH database.
The extraction occurs only for a subject having the is − a relation as redefined
in Section 3.2. All the references of ancestor and descendant are taken into
account for ontology learning, since the is−a relation is based on these
references. For each subject, its assigned unique ID is used for searching, and
its references are stored in a string format in the LCSH database, as in the
example in Figure 3.4. Apart from the ancestor references, the subject
“economic espionage” has no other references.
1: Total matched subject(s): 1593
2: User selected: Economic Espionage
3: Ancestors: T19[T18087[T115130[T81991[T69623[T64842[T98712,],],],],],T102353,],
4: Descendants:
5: UsedFor:
6: RelatedTo:
Figure 3.4: The extracted semantic relations
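As a sketch of how such reference strings might be unpacked: the grammar of the stored format is not specified in the text, so we simply assume each “T<number>” token is a subject ID and collect all of them:

```python
import re

# Sketch: unpack an ancestor reference string like the one in Figure 3.4.
# We assume "T<digits>" tokens are subject IDs; the brackets nest further
# ancestor references, so collecting every ID yields the full reference set.
def extract_ids(reference_string):
    return re.findall(r"T\d+", reference_string)

refs = "T19[T18087[T115130[T81991[T69623[T64842[T98712,],],],],],T102353,]"
print(extract_ids(refs))
# ['T19', 'T18087', 'T115130', 'T81991', 'T69623', 'T64842', 'T98712', 'T102353']
```

Each recovered ID can then be resolved back to a subject heading to place it at the appropriate level of the hierarchy.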
All the references specified in the world knowledge base help our approach
acquire sets of new subjects at other levels. The levels extend in two directions:
upper and lower. The pilot level is selected as the benchmark in the hierarchy.
Nodes that appear at a higher level than the pilot are located in upper levels;
conversely, nodes are in lower levels if they appear below the pilot. Some
subjects might occur only in the pilot level because they have no semantic
relations. Meanwhile, some of them might have related subjects only in upper
or only in lower levels, because they hold just one type of reference (ancestor or
descendant). However, apart from the subjects in the pilot, all subjects at other
levels definitely have at least one semantic relation. Taking Figure 3.5 as an
example, “economic espionage” is a subject at the pilot level. It has three
ancestor references, but no descendant. In this case, the subject can only
contribute three more general subjects to the upper level. Further subjects can
then be found in higher levels based on the references of these three subjects.
Figure 3.5: A Segment of the Subject Headings
3.3.3 In-Level Ontology Mining Method
The concept hierarchy is an essential subtask of ontology learning [90]. In
theory, it is a prerequisite hierarchy in which a number of nodes represent
concepts in a domain, and the links between them serve as prerequisite
relationships. For this approach, we create a special hierarchy format to satisfy
our research purposes. The hierarchical backbone is drawn in Figure 3.6. This
step is similar to the knowledge diagram depicted in Figure 3.2, in that it
mines the implicit knowledge
for the explicit knowledge. One of the objectives is to take advantage of this
hierarchy to structure information into categories, thus enhancing the effective-
ness and reusability of the subject-based search model. In addition, to acquire
implicit knowledge by tracking internal relationships among concepts is another
objective. Since we believe user search intentions are somewhere in the world
knowledge base, the concept hierarchy is pre-defined into several abstract levels.
The gathered implicit knowledge will be used to estimate whether the user search
intention appears in a certain level.
Figure 3.6: The backbone of the in-level hierarchy
At the beginning, the matched results are a set of subjects retrieved by
implementing the subject-based search model. Each subject is treated as a
concept here, represented by a single node in Figure 3.6. To learn a
personalized ontology, the scope of a domain needs to be confirmed by taking
these subjects into account, because all the subjects within this domain are
identified as discussing information correlated with the entered topic. They
are all in the same abstract level, initially defined as the “pilot level”.
First, a formalized definition for the domain is provided as:
Definition 5 (Domain for a Level): Let Ci denote a set of subjects Ci :=
{s1, s2, ..., sh} in a level li. We define its domain by dom(li) := dom(s1) ∪ dom(s2) ∪
... ∪ dom(sh), where dom(s) is the set of all terms contained in the subject s.
Dashed circles in Figure 3.6 indicate the domains of different
levels. With respect to the ontology learning, we also formalize:
Definition 6 (Personalized Ontology): the personalized ontology for a topic
is a 4-tuple Θp := (C, L, DOM, e), where
• C is a superset containing all subjects across the levels, C = {s1, s2, ..., sh};
• L is a set of levels, L = {l1, l2, ..., le}, where each level consists of a domain and a set of subjects drawn from C;
• DOM is the set of level domains, DOM := (dom(l1), dom(l2), ..., dom(le));
• e is the number of levels.
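Definition 6 can be sketched as a small data structure. The following Python is illustrative only; the class and method names are our own assumptions, not code from the thesis:

```python
# A minimal sketch of Definition 6: a personalized ontology as a 4-tuple
# (C, L, DOM, e). A subject is modeled as a frozenset of terms.

class PersonalizedOntology:
    def __init__(self):
        self.levels = []          # L: list of levels, each a list of subjects
        # C and DOM are derived from the levels rather than stored separately.

    def add_level(self, subjects):
        """Append a level; each subject becomes a frozenset of terms."""
        self.levels.append([frozenset(s) for s in subjects])

    @property
    def C(self):
        """C: all subjects across all levels."""
        return [s for level in self.levels for s in level]

    def dom(self, i):
        """dom(l_i): the union of the terms of all subjects in level i."""
        d = set()
        for s in self.levels[i]:
            d |= s
        return d

    @property
    def e(self):
        """e: the number of levels."""
        return len(self.levels)

# Pilot level with two subjects, plus one upper level (toy data)
onto = PersonalizedOntology()
onto.add_level([{"economic", "espionage"}, {"espionage"}])
onto.add_level([{"business", "ethics"}])
```

Keeping C and DOM as derived properties mirrors the definition: both are fully determined by the levels.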
The major in-level concept is also related to a knowledge generalization
process [44]. In our view, a subject in an upper level covers more general
knowledge than a subject in a lower one. For instance, in Figure 3.5, “economic
espionage” is in a lower level and has an is-a semantic relation with the
subject “business ethics” in the upper level. “Business ethics” is a broader
topic that covers the knowledge of “economic espionage”, which can be expressed
in the statement “economic espionage is a business ethics issue”. In other
words, the knowledge in a level can be summarized by the knowledge in the next
upper level. Eventually, all the knowledge in the world knowledge base is
summarized under philosophy. This is the main reason why the domains of the
upper levels become smaller when moving towards the peak of the backbone in
Figure 3.6, which looks like a cone. However, the question arises as to why the
same happens in the lower levels. From the perspective of IR, the subject-based
search model tends to return specific subjects based on keywords. The majority
of these subjects lack the semantic relations needed to associate them with
more specific knowledge. As a result, the number of subjects in the lower
levels decreases, and so do their domains. Therefore, the shape becomes an
inverse cone. Note that the backbone structure is not a formal tree: a node
can have more than one parent or child.
Before proving the discussed hypothesis, an iterative ontology mining method
is proposed in this section. It starts from the pilot level and then builds a
personalized ontology (the backbone of the in-level hierarchy) in order to find a
suitable level for the search intent. The building process simply uses the is-a
relation to find all the parents for an upper level, or all the children for
a lower level. For clarity, a worked example is separated into two
phases to explain the method in detail. Each phase involves several steps.
Using features for level representation
There are two main objectives: 1) to decide subjects and weights for the pilot
level lρ; and 2) to represent it as a query Qρ.
Firstly, a number of matched subjects are retrieved for the pilot level lρ after
implementing the subject-based search model. Then, a weight is calculated for
all the subjects s ∈ Cρ by using the following formula:
weight(s) = |q ∩ s| / |s|

where |q ∩ s| denotes the number of terms that appear in both the query q and the
subject s, and |s| denotes the total number of terms in the subject. A set of
subject–weight pairs is thus obtained: S(w) = {< s1, w1 >, < s2, w2 >, ..., < sn, wn >};
Secondly, the query is expanded to a set of terms by taking the union of all the
terms from the submitted query and the matched subjects, letting Qρ = {t1, t2, ..., tm}.
For example, if the submitted query is query = {t1, t2} and the matched subjects
are s1 = {t1, t2}, s2 = {t1, t2, t6}, and s3 = {t1, t2, t5, t8}, then Qρ =
query ∪ s1 ∪ s2 ∪ s3 = {t1, t2, t5, t6, t8};
Thirdly, term–weight pairs are calculated for all terms t ∈ Qρ via the
following formula:

weight(t) = Σ_{s∈Cρ, t∈s} weight(s) / |s|

We then obtain a set of term–weight pairs as a feature Fρ = {< t1, w1 >,
< t2, w2 >, ..., < tm, wm >} to represent this level.
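The two feature-building steps above can be sketched as follows. The function names are our own, and the toy query and subjects reuse the worked example:

```python
# A sketch of the pilot-level feature computation: subject weights,
# query expansion, and term weights.

def subject_weight(query, subject):
    # weight(s) = |q ∩ s| / |s|
    return len(query & subject) / len(subject)

def level_feature(query, subjects):
    """Return the expanded query Q_rho and the term-weight feature F_rho."""
    weights = {s: subject_weight(query, s) for s in subjects}
    Q = set(query).union(*subjects)        # Q_rho = query ∪ s1 ∪ ... ∪ sn
    F = {}
    for t in Q:
        # weight(t) = sum over subjects s containing t of weight(s) / |s|
        F[t] = sum(w / len(s) for s, w in weights.items() if t in s)
    return Q, F

q = frozenset({"t1", "t2"})
subjects = [frozenset({"t1", "t2"}),
            frozenset({"t1", "t2", "t6"}),
            frozenset({"t1", "t2", "t5", "t8"})]
Q, F = level_feature(q, subjects)
# Q == {"t1", "t2", "t5", "t6", "t8"}, as in the worked example
```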
Determining the best level for user search intent
The goal is to determine the suitable level for characterizing the user search intent
according to a training set.
Let Dt stand for the set of documents in the training set (the approach uses
the RCV1 training set), where t denotes a certain topic of the documents. All
these documents have been labelled with a value of either 0 or 1 by linguists,
indicating whether a document is relevant to the topic: 0 means the document is
irrelevant, whereas 1 means it is relevant. We calculate a weight for each
document in the training set by using the feature of a level, and rank Dt
using Algorithm 1, provided as follows:
Algorithm 1 Ranking Algorithm based on a level
Input: The set of documents Dt in the training set of RCV1; level i.
Output: Sorted Dt.
1: for each d ∈ Dt do
2:   // initialize document d with weight 0
3:   rank(d) = 0
4:   for each t ∈ Qi do
5:     if t ∈ d then
6:       rank(d) = rank(d) + weight(t)
7:     end if
8:   end for
9: end for
10: return Sort Dt based on the rank function
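A minimal sketch of Algorithm 1 in Python, treating each document as a set of terms and the level feature as a term-to-weight map; the sample feature and documents are made up:

```python
# Rank training documents by a level's feature: rank(d) is the sum of
# weight(t) over the feature terms t that occur in d.

def rank_documents(docs, feature):
    """Return docs sorted by descending rank under the given feature."""
    def rank(d):
        return sum(w for t, w in feature.items() if t in d)
    return sorted(docs, key=rank, reverse=True)

F = {"carbon": 0.9, "emiss": 0.6, "tax": 0.1}   # hypothetical feature F_i
docs = [{"tax"}, {"carbon", "emiss"}, {"carbon"}]
ranked = rank_documents(docs, F)
# The document containing both "carbon" and "emiss" ranks first
```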
Step one, the top-K precision precision(lρ) of the ranked documents, based on
the specified relevance values, is calculated by applying the formula below:

precision(lρ) = (Σ_{i=1}^{K} f(di)) / K

where f(di) = 1 if document di is relevant, and f(di) = 0 otherwise;
Step two, shift to the upper level lρ+1 in the hierarchy. A new set of
subjects in lρ+1 is returned by collecting all subjects s that have an is-a
relationship with any subject in the pilot level lρ. Repeat the steps in
Phase 1 to rank the documents Dt using the feature Fρ+1 for level lρ+1, and
then calculate the top-K precision precision(lρ+1);

In the last step, compare the top-K precision of the pilot level,
precision(lρ), with that of the upper level, precision(lρ+1). If precision(lρ) > precision(lρ+1),
return lρ as the level of search intents. Otherwise, go to step two and
repeat the same procedure for lρ+2.
Phase 2 keeps looping through upper levels until a satisfactory level is found
based on precision performance. The entire process is presented in Algorithm 2,
where the parameter µ is used to control the distance between the selected level
li and the pilot level lρ. If a level is too far away from the pilot level, we
assume that it is less significant to search intentions; hence, we set the
parameter as a threshold to test this assumption. To save space, we omit the
explanation for the lower levels, because its algorithm is quite similar to
Algorithm 2.
According to the two phases outlined above, we are able to obtain the level with
the best top-K precision among all the hierarchical levels. This level is considered
Algorithm 2 for Phase 2 (consider going to upper levels only)
Input: The set of documents Dt in the training set of RCV1; parameter µ.
Output: A level of search intents.
1: Let j = ρ, i = j
2: i = i + 1 // shift to the upper level
3: Get Qi and Fi
4: Use Fi to rank Dt
5: if precision(li) < precision(lj) then
6:   return j
7: else
8:   if i − ρ > µ then
9:     return i
10:  end if
11: end if
12: j = i
13: Go to 2
as the output of user search intents from our proposed approach.
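The Phase 2 loop of Algorithm 2 can be sketched as follows. Here `precision_of` is a stand-in for the full pipeline (building Qi and Fi, ranking Dt, taking the top-K precision), and the per-level precision values are hypothetical:

```python
# Climb upper levels while top-K precision keeps improving, bounded by the
# distance threshold mu, as in Algorithm 2.

def top_k_precision(ranked_relevance, k):
    # precision = (sum of f(d_i) over the top K documents) / K
    return sum(ranked_relevance[:k]) / k

def select_level(precision_of, rho, mu):
    """Return the index of the chosen level, starting from pilot level rho."""
    j = rho
    while True:
        i = j + 1                             # shift to the upper level
        if precision_of(i) < precision_of(j):
            return j                          # precision dropped: keep level j
        if i - rho > mu:
            return i                          # too far from the pilot level
        j = i

# Hypothetical per-level precisions: improves once, then degrades
precisions = {0: 0.4, 1: 0.6, 2: 0.5, 3: 0.5}
best = select_level(lambda i: precisions[i], rho=0, mu=3)
# Level 1 is returned, since precision falls when moving to level 2
```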
By taking the approach above, a personalized ontology can be constructed
according to the definition specified in Section 3.3.3. A set of concepts is
obtained from the global LCSH ontology, partitioned by a clear domain and
abstract level. As the level selection is based on the precision measurement,
the resulting personalized ontology is expected to enhance top-K performance
in retrieval evaluation.
3.4 Summary
In this chapter, a novel ontology-based mining approach is introduced for the
discovery of user search intents. The discovery concerns personalized ontology
extraction, whose purpose is to characterize user information needs. The
entire approach is supported by knowledge generalization and representation
theories. A two-dimensional mining method is first developed based on the theory
of specificity and exhaustivity in search intent. To effectively filter out
irrelevant information, a subject-based search model is implemented and fully
functional. It can also map matched results into a world knowledge base, namely
the LCSH. The method for semantic extraction is outlined for ontology learning.
A concept-based hierarchy is built by applying the semantic relations extracted
from the world knowledge base, which enables possible user intents to be
characterized at a virtual level. Eventually, the personalized ontology can be
obtained to facilitate retrieval with the aim of achieving high top-K precision.
For the evaluation, the large RCV1 test bed is employed for various measurements.
A benchmark and baseline models are carefully chosen for comparison.
Chapter 4
Relevance Feature Matching
In Chapter 1, one crucial question was how to reasonably match local
information with global knowledge. To answer this question, this chapter
introduces a sound approach to match the relevance features extracted from
local documents to ontological concepts in a world knowledge base. The proposed
approach incorporates both pattern mining and ontology learning techniques. From
the mining perspective, the chapter begins by describing the method of relevance
feature acquisition, and then explains why this feature is the best candidate
for local information. Regarding the ontology learning, a creative distributed
matching is developed following the theory of conceptual intension and
extension. It accomplishes two expected tasks: 1) to ensure that the core
feature is first converted into high-level concepts,
and 2) to obtain potentially relevant concepts for the core feature.
The entire approach helps to interpret user information needs through
personalized ontologies. This is particularly significant because filtering
systems aim to achieve personalization functions by studying user-specific
information.
4.1 Design and Definitions
Compared with the query-based process, the relevance feature-based process is
more complicated and relies on text mining techniques. First, the relevance
features are mined from a set of the user's local documents. These documents
implicitly reflect the user's personal interests and preferences. It is
worthwhile to take advantage of the context among the documents to better
understand and refine user needs.
Figure 4.1 provides an overview of the proposed approach. Local information
is the input and consists of two parts: relevance feedback and a set of
training documents. The relevance feedback is initialized as positive feedback
and negative feedback. Positive feedback indicates that a document is relevant
to the user's interests, whereas negative feedback indicates that it is
irrelevant. By applying the method discussed in Section 4.2, the extracted
relevance features can be categorized into three groups: the positive specific
group (SPE), the general group, and the negative specific group. Here, our
approach only selects terms in the SPE group as candidates because they carry
more topic-related interest than the other groups [65], as demonstrated in the
preliminary study also reported in this chapter (Section 4.3.1). The LCSH, which
is a large thesaurus of subject headings, is chosen as the representation of
global knowledge. Each subject heading denotes a concept in the knowledge
base. Normally, a concept is a short phrase containing one or several
terms. We match the SPE terms to the subject headings, and the learning process
of personalized ontologies in Chapter 3 can eventually take advantage of the
matched concepts and their enclosed relations in the LCSH.
Figure 4.1: Overview of the relevance feature matching approach
The design of the approach can be understood in two separate parts. One is
drawn in the left portion of Figure 4.1. It is based on data mining (more
precisely, the pattern mining technique) to obtain local information. The local
information is recognized as a set of terms after deploying the patterns found
in RCV1 training documents. The positive feedback plays an important role in
mining the closed sequential patterns that represent the documents. The other
part is related to ontology learning and concerns concepts, which are
realized as the subjects in the LCSH. To build a relationship between
the two parts, a distributed matching guided by the theory of concept
extension and intension defines the coverage of the matched subjects. Once the
matched subjects are confirmed, the major components, including the concepts
of the personalized ontologies and their interrelationships, can be determined
for construction.
4.1.1 Definitions of Patterns and Closed Patterns
The purpose of relevance feature discovery is to find useful features, including
patterns, terms and their weights, in a training set D, which consists of a set of
positive documents, D+, and a set of negative documents, D−. In this study, we
assume that all documents are split into paragraphs, so a given document d yields
a set of paragraphs PS(d). To clearly understand the concept of patterns, we
first introduce normal patterns and closed patterns, and then discuss closed
sequential patterns. These definitions can be found in [118] or [67].
Let T = {t1, t2, . . . , tm} be a set of terms extracted from D+. Given
a termset X (a set of terms) in document d, coverset(X) = {dp | dp ∈ PS(d), X ⊆ dp}.
Its absolute support is

supa(X) = |coverset(X)|

and its relative support is

supr(X) = |coverset(X)| / |PS(d)|.
A termset X is called a frequent pattern if its supa (or supr) ≥ min_sup, a minimum
support. Given a set of paragraphs Y ⊆ PS(d), we can define its termset, which
satisfies

termset(Y) = {t | ∀dp ∈ Y ⇒ t ∈ dp}.

Let Cls(X) = termset(coverset(X)) be the closure of X. We call X closed if
and only if X = Cls(X). Let X be a closed pattern. We have

supa(X1) < supa(X)    (4.1)

for all patterns X1 ⊃ X.
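The covering-set and closure definitions above can be sketched directly; the paragraphs and termsets below are toy data:

```python
# coverset, closure, and the closed-pattern test for termsets, where each
# paragraph of a document is modeled as a frozenset of terms.

def coverset(X, paragraphs):
    """All paragraphs of the document that contain the termset X."""
    return [dp for dp in paragraphs if X <= dp]

def closure(X, paragraphs):
    """Cls(X) = termset(coverset(X)): terms common to every covering paragraph."""
    cov = coverset(X, paragraphs)
    return frozenset.intersection(*cov) if cov else frozenset()

def is_closed(X, paragraphs):
    return closure(X, paragraphs) == X

PS = [frozenset({"carbon", "emiss", "tax"}),
      frozenset({"carbon", "emiss"}),
      frozenset({"air", "pollut"})]
X = frozenset({"carbon"})
# {"carbon"} is not closed: every paragraph containing it also contains "emiss"
```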
Patterns can be structured into a taxonomy by using the is-a (or subset)
relation among closed patterns. Put simply, a pattern taxonomy is a set of
patterns in which the relation is the subset relation. Smaller patterns
in the taxonomy are usually more general, because they may occur frequently
in both positive and negative documents; larger patterns are usually more
specific, since they may occur only in positive documents.
Closed Sequential Patterns
A sequential pattern s = ⟨t1, . . . , tr⟩ (ti ∈ T) is an ordered list of terms.
A sequence s1 = ⟨x1, . . . , xi⟩ is a sub-sequence of another sequence s2 =
⟨y1, . . . , yj⟩, denoted s1 ⊑ s2, iff ∃ j1, . . . , ji such that 1 ≤ j1 < j2 < . . . < ji ≤ j
and x1 = yj1, x2 = yj2, . . . , xi = yji. Given s1 ⊑ s2, we usually say s1 is a
sub-pattern of s2, and s2 is a super-pattern of s1. In the following, we refer to
sequential patterns simply as patterns.

Given a sequential pattern X in document d, coverset(X) is still used to
denote the covering set of X, which includes all paragraphs ps ∈ PS(d) such that
X ⊑ ps, i.e., coverset(X) = {ps | ps ∈ PS(d), X ⊑ ps}. Its absolute support and
relative support are defined in the same way as for the normal patterns.
A sequential pattern X is called a frequent pattern if its relative support
≥ min_sup, a minimum support. The property of closed patterns (see Equation (4.1))
can be used to define closed sequential patterns: a frequent sequential pattern
X is called closed if there does not exist any super-pattern X1 of X such that
supa(X1) = supa(X).
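A sketch of the sub-sequence relation and the closed-pattern filter; the patterns and supports below are invented for illustration:

```python
# Sub-sequence test and a brute-force closed-sequential-pattern filter:
# a pattern is closed if no proper super-pattern has the same support.

def is_subsequence(s1, s2):
    """True iff s1 ⊑ s2: s1's terms appear in s2 in order (gaps allowed)."""
    it = iter(s2)
    return all(t in it for t in s1)   # `in` advances the iterator

def closed_patterns(patterns_with_support):
    """Keep patterns that have no proper super-pattern of equal support."""
    closed = []
    for p, sup in patterns_with_support:
        if not any(p != q and is_subsequence(p, q) and sup == sup_q
                   for q, sup_q in patterns_with_support):
            closed.append((p, sup))
    return closed

pats = [(("emiss",), 3), (("global",), 3), (("emiss", "global"), 3),
        (("carbon",), 2)]
# ⟨emiss⟩ and ⟨global⟩ are absorbed by ⟨emiss, global⟩ (same support)
```

This quadratic filter is only for illustration; SPMining avoids such exhaustive comparisons via the Apriori property.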
4.1.2 Global Knowledge Base: the LCSH
Global knowledge is the common-sense knowledge acquired by people based on
their experience and education. The LCSH is an ideal global knowledge repre-
sentation because a rich vocabulary is used to cover all subject areas. In the
LCSH, subject headings are basic elements to convey knowledge in the format
of concepts, where they have three main types of references: broader term (BT),
narrower term (NT) and related term (RT). The related definitions are clarified
as follows.
Definition of subject headings: Let C denote the set of subject headings in the
LCSH. A subject s ∈ C is formalized as a pair (label, reference), where
• label is the heading of s in the LCSH thesaurus;
• reference is a statement regarding all references that the subject s has.
The LCSH world knowledge base can be formalized as:

Definition of world knowledge base: A world knowledge base is an ontology in a
directed acyclic graph structure, defined as a pair Θ := (C, R), where

• C is the set of subjects in the LCSH, C = {s1, s2, ..., sn};

• R is the set of semantic relations R = {ref1, ref2, ..., refm} existing among the
subjects in C.
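The world knowledge base definition can be sketched as a small directed graph of subjects. The class names are our own, and the sample headings and their references are illustrative, not actual LCSH records:

```python
# Subjects with BT/NT/RT reference lists, forming a directed acyclic graph
# in which a BT edge is the is-a relation used for ontology learning.

class Subject:
    def __init__(self, label):
        self.label = label
        self.references = {"BT": [], "NT": [], "RT": []}

class WorldKnowledgeBase:
    def __init__(self):
        self.subjects = {}                      # C: label -> Subject

    def add(self, label):
        self.subjects.setdefault(label, Subject(label))
        return self.subjects[label]

    def relate(self, narrower, broader):
        """Record a BT reference and its inverse NT: an is-a edge in R."""
        n, b = self.add(narrower), self.add(broader)
        n.references["BT"].append(b)
        b.references["NT"].append(n)

kb = WorldKnowledgeBase()
kb.relate("Economic espionage", "Business ethics")
kb.relate("Economic espionage", "Business intelligence")
parents = [s.label for s in kb.subjects["Economic espionage"].references["BT"]]
```

Storing both directions of each reference lets the level-building step walk upward (BT) or downward (NT) from the pilot level.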
4.2 Relevance Feature Acquisition
In general, the concept of relevance is subjective. People can easily determine
the relevance of a topic (or document) in terms of specificity or generality.
However, it is difficult to use these concepts to interpret relevance features
in text documents. This section first discusses the use of these concepts for
understanding the different roles of low-level feature terms in answering what
users want. We also present ideas for accurately weighting terms based on their
specificity and their distributions in the discovered higher-level features.
Finally, we describe algorithms for both the discovery of higher-level features
and the revision of the weights of low-level terms.
4.2.1 Specificity of Low-Level Features
A term’s specificity describes the extent to which the topic of the term focuses on
what users want. It is very difficult to measure the specificity of terms because a
term’s specificity depends on users’ perspectives on their information needs [104].
In this study, the terms are grouped into three groups (SPE, general, and
negative specific terms) based on their appearances in a training set. Given a term
t ∈ T, its coverage+ is the set of positive documents that contain t, and its
coverage− is the set of negative documents that contain t. We assume that
terms frequently used in both positive and negative documents are general
terms. Therefore, we want to classify terms that are used more frequently in
the positive documents into the positive specific category, and terms that are
used more frequently in the negative documents into the negative specific
category. Based on the above analysis, we define the specificity of a given term t
category. Based on the above analysis, we define the specificity of a given term t
in the training set D = D+ ∪D− as follows:
spe(t) = (|coverage+(t)| − |coverage−(t)|) / n
where coverage+(t) = {d ∈ D+|t ∈ d}, coverage−(t) = {d ∈ D−|t ∈ d}, and
n = |D+|. spe(t) > 0 means that term t is used more frequently in positive
documents than in negative documents. We present the following classification
rules for determining the general terms G, the SPE terms T+, and the negative
4.2. Relevance Feature Acquisition 99
specific terms T−:
G = {t ∈ T |θ1 ≤ spe(t) ≤ θ2},
T+ = {t ∈ T |spe(t) > θ2}, and
T− = {t ∈ T |spe(t) < θ1}.
where θ2 is an experimental coefficient giving the maximum boundary of the
specificity for the general terms, and θ1 is an experimental coefficient giving
the minimum boundary. We assume that θ2 > 0 and θ2 ≥ θ1. It is easy to verify
that G, T+ and T− are pairwise disjoint; therefore, {G, T+, T−} is a partition
of all terms. To describe the relevance features for a given topic, we believe
that specific terms are very useful for distinguishing the topic from other
topics.
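The specificity measure and the classification rules can be sketched as follows, with made-up documents and thresholds:

```python
# spe(t) and the partition into general (G), positive specific (T+),
# and negative specific (T-) terms. Documents are modeled as term sets.

def spe(t, pos_docs, neg_docs):
    cov_pos = sum(1 for d in pos_docs if t in d)   # |coverage+(t)|
    cov_neg = sum(1 for d in neg_docs if t in d)   # |coverage-(t)|
    return (cov_pos - cov_neg) / len(pos_docs)     # n = |D+|

def classify(terms, pos_docs, neg_docs, theta1, theta2):
    G, T_pos, T_neg = set(), set(), set()
    for t in terms:
        s = spe(t, pos_docs, neg_docs)
        if s > theta2:
            T_pos.add(t)
        elif s < theta1:
            T_neg.add(t)
        else:                                      # theta1 <= s <= theta2
            G.add(t)
    return G, T_pos, T_neg

D_pos = [{"global", "emiss"}, {"global", "carbon"}]
D_neg = [{"market", "emiss"}]
G, T_pos, T_neg = classify({"global", "emiss", "market"},
                           D_pos, D_neg, theta1=-0.25, theta2=0.5)
# "global" is positive specific, "emiss" is general, "market" is negative specific
```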
Basically, we can understand the specificity of terms based on their positions in
a concept hierarchy. For example, terms are more general if they are in the upper
part of the LCSH hierarchy; otherwise, they are more specific. However, in many
cases, a term’s specificity is measured based on the topics we are talking about.
For example, “knowledge discovery” would be a general term in the data mining
community; however, it may be a specific term when we talk about information
technology.
4.2.2 Term Weighting
This section develops equations for deploying patterns on low-level terms by
evaluating term supports based on their appearances in patterns. The evaluation
of term supports (weights) is different from term-based approaches. For
a term-based approach, the evaluation of a given term’s weight is based on its
appearances in documents. However, here, terms are weighted according to their
appearances in discovered patterns.
In the Pattern Taxonomy Model (PTM), relevance features are discovered from
a set of positive documents. To improve the efficiency of the PTM, an algorithm,
SPMining(D+, min_sup) [119], was proposed (also used in [118]) to find the closed
sequential patterns of all documents in D+, using the well-known Apriori
property to reduce the search space. For each positive document
di ∈ D+, the SPMining algorithm discovers all of its closed sequential patterns,
SPi, based on a given min_sup. (We omit this algorithm to save space.)
Let SP1, SP2, ..., SPn be the sets of discovered closed sequential patterns for
all documents di ∈ D+(i = 1, · · · , n), where n = |D+|. For a given term t, its
support (or weight) in the discovered patterns can be described as follows:

support(t, D+) = Σ_{i=1}^{n} ( |{p | p ∈ SPi, t ∈ p}| / Σ_{p∈SPi} |p| )    (4.2)

where |p| is the number of terms in p.
Table 4.1 illustrates an example of the sets of closed sequential patterns
discovered for D+ = {d1, d2, · · · , d5}. For example, the term global appears in
three documents (d2, d3 and d5). Therefore, its support is evaluated based on
the patterns in the sets of closed sequential patterns that contain global:

support(global, D+) = 2/4 + 1/3 + 1/3 = 7/6.

Table 4.1: Example of pattern mining

Doc.  Discovered Closed Sequential Patterns (SPi)
d1    {〈carbon〉, 〈carbon, emiss〉, 〈air, pollut〉}
d2    {〈greenhous, global〉, 〈emiss, global〉}
d3    {〈greenhous〉, 〈global, emiss〉}
d4    {〈carbon〉, 〈air〉, 〈air, antarct〉}
d5    {〈emiss, global, pollut〉}
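Equation (4.2) can be checked against the Table 4.1 data; using exact fractions reproduces the worked value 7/6 for the term global:

```python
# support(t, D+) from Equation (4.2), computed over the Table 4.1 pattern sets.

from fractions import Fraction

SP = [
    [("carbon",), ("carbon", "emiss"), ("air", "pollut")],   # d1
    [("greenhous", "global"), ("emiss", "global")],          # d2
    [("greenhous",), ("global", "emiss")],                   # d3
    [("carbon",), ("air",), ("air", "antarct")],             # d4
    [("emiss", "global", "pollut")],                         # d5
]

def support(t, sp_sets):
    total = Fraction(0)
    for sp_i in sp_sets:
        denom = sum(len(p) for p in sp_i)        # sum of |p| over SP_i
        numer = sum(1 for p in sp_i if t in p)   # patterns in SP_i containing t
        if numer:
            total += Fraction(numer, denom)
    return total

support("global", SP)  # Fraction(7, 6), matching the worked example
```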
After the supports of the terms have been computed from the training set,
the following rank is assigned to every incoming document d to decide its
relevance:

rank(d) = Σ_{t∈T} weight(t) · τ(t, d)    (4.3)

where weight(t) = support(t, D+); and τ(t, d) = 1 if t ∈ d, otherwise τ(t, d) = 0.
Because of significant levels of noise in the discovered patterns (an inherent
disadvantage of data mining), the evaluated supports are not accurate enough.
To improve the effectiveness of the PTM, we use the negative documents in the
training set to remove the noise. If a document's rank (see Equation 4.3)
is less than or equal to zero, the document is clearly negative to the system.
If a negative document has a high ranking position, it is called an
offender [66] because it forces the system to make a mistake. The offenders are
normally defined as the top-K negative documents in a ranked set of negative
documents, D−. The basic hypothesis is that the relevance features should be
mainly discovered from the positive documents. Therefore, in our experiments,
we set K = n/2, half of the number of positive documents.
There are two major issues in effectively using negative documents. The first
is 1) how to select a suitable set of negative documents, because we can
usually obtain a very large set of negative samples. For example, a Google
search can return millions of documents; however, only a few documents are
interesting to a Web user. Obviously, it is not efficient to use all of the
negative documents. The second issue is 2) how to accurately revise the
features discovered in the positive documents.
Many people believe that negative documents can be helpful if they are used
appropriately. The existing methods can be grouped into two approaches:
revising terms that appear in both positive and negative documents; and
observing how often terms appear in positive and negative documents. However,
how much improvement in accuracy can be achieved by using negative feedback
remains an open question.
In this section, we present an innovative solution for these issues. We show
how to select a set of negative samples, and we also show the revision process.
Once we select the top-K negative documents, the set of negative documents
D− is reduced to include only the K offenders. The next step is to classify
the terms into three categories, namely G, T+ and T−, based on D+ and the
updated D−. We can easily verify that the experimental coefficients θ1 and θ2
satisfy the following properties if K = n/2:

0 ≤ θ2 ≤ 1, and −1/2 ≤ θ1 ≤ θ2.
Here, we show the basic process of revising the discovered features in a
training set. This process helps the reader to understand the proposed
strategies for revising the weights of low-level terms in the different
categories. Formally, let DP+ be the union of all closed sequential patterns
discovered in D+, let DP− be the union of all closed sequential patterns
discovered in D−, and let T be the set of terms that appear in DP+ or DP−,
where a closed sequential pattern of D+ (or D−) is called a positive pattern
(or negative pattern).

It is obvious that ∃d ∈ D+ such that t ∈ d for all t ∈ T+, since spe(t) > θ2 ≥ 0
for all t ∈ T+. Therefore, each t ∈ T+ obtains an initial weight by the
deploying method on D+ (using the higher-level features, see Equation 4.2).

For a term in (T− ∪ G), there are two cases. If ∃d ∈ D+ such that t ∈ d, t
obtains its initial weight by the deploying method on D+; otherwise it obtains
a negative weight by the deploying method on D−.
The initial weights of the terms are finally revised according to the
following principles: increase the weights of the SPE terms, decrease the
weights of the negative specific terms, and leave the weights of the general
terms unchanged. The details are described as follows:

weight(t) = w(t) + w(t) × spe(t),   if t ∈ T+
weight(t) = w(t),                   if t ∈ G
weight(t) = w(t) − |w(t) × spe(t)|, if t ∈ T−    (4.4)

where w is the initial weight (or the support in Equation 4.2).
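Equation (4.4) translates directly into code; the group labels and sample numbers here are illustrative:

```python
# The revision rule of Equation (4.4): boost SPE terms, leave general terms
# unchanged, and penalize negative specific terms.

def revise_weight(w, spe_t, group):
    if group == "T+":
        return w + w * spe_t            # increment by w(t) * spe(t)
    if group == "G":
        return w                        # general terms are unchanged
    if group == "T-":
        return w - abs(w * spe_t)       # decrement by |w(t) * spe(t)|
    raise ValueError("group must be 'T+', 'G' or 'T-'")

revise_weight(0.5, 0.4, "T+")   # 0.7 : boosted
revise_weight(0.5, 0.1, "G")    # 0.5 : unchanged
revise_weight(0.5, -0.4, "T-")  # 0.3 : penalized
```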
4.3 Concept Matching Method
4.3.1 Limitations of Query-Based Method
Before introducing our new matching method, this section first discusses the
basic theory of query-based matching and its associated disadvantages. It is
important to be aware that these disadvantages lead to a serious problem when
we attempt to incorporate data mining and ontology learning techniques.
Generally speaking, query-based matching operates on two sets of terms, and
the matching is based on the degree of their overlap. In this scenario, for
example, a relevance feature extracted by the earlier pattern mining approach
is treated as a query, q = {t1, t2, t3, t4, t5}. One subject in the LCSH
represents a concept, s = {t0, t3, t5, t6}. The traditional method [50, 88]
would primarily consider the number of overlapping terms, |q ∩ s| = 2. The
larger the number of overlapping terms, the better the match. From the data
mining point of view, this reasoning is sound: if a term frequently appears in
the text, it must be a specific and highly relevant term for a topic, and such
terms describe significant information needs. Query-based matching thus offers
a reasonable way to estimate the similarity between queries and subjects.
However, it is not good enough, owing to the disadvantage of redundancy.
According to the output of a preliminary study, the general terms defined in
the previous section appear frequently in the LCSH database. Consequently, the
retrieved subjects contain many useless terms that decrease effectiveness.
Figure 4.2: Occurrence of terms in the first 50 RCV1 topics
The preliminary study was carried out to investigate the occurrence of the
SPE and general terms in the LCSH, using the first 50 RCV1 topics for testing.
Each term in the two groups was assigned a value computed by:

SPEonto(t) = 1 / |coverage(t)|

where |coverage(t)| is the number of subjects containing t. To normalize all
the values to between 0 and 1, we applied:

SPEonto(t) = log10(N / |coverage(t)|) / log10(N / MAX)
where N is the total number of subjects and MAX is the maximum of |coverage(t)|
over all t ∈ T+ ∪ G. As shown in Figure 4.2, the “ONTO-SPE” curve is generated
from the results of SPEonto(t), t ∈ T+, whereas the “ONTO-G” curve is from the
results of SPEonto(t′), t′ ∈ G. It is apparent that the terms belonging to the
general group occur frequently in the majority of the topics. A likely
consequence is that the matched concepts are not only irrelevant, but also
general rather than specific.
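The normalized measure above can be transcribed directly; the helper name and coverage counts below are our own, chosen only to illustrate the arithmetic:

```python
# A direct transcription of the SPE_onto formula used in the preliminary
# study: terms occurring in fewer subjects receive larger values.

import math

def spe_onto(coverage_t, N, max_coverage):
    """SPE_onto(t) = log10(N / |coverage(t)|) / log10(N / MAX)."""
    return math.log10(N / coverage_t) / math.log10(N / max_coverage)

# With N = 1000 subjects and MAX = 100, the most frequent term scores 1,
# and a rarer term (coverage 10) scores higher
spe_onto(100, 1000, 100)  # 1.0
spe_onto(10, 1000, 100)   # 2.0
```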
4.3.2 Concept Intension & Extension
In the study of formal concepts, every concept is understood as a unit of thoughts
that consists of two parts, namely, the intension and extension of the concept [124].
A concept can be described jointly by its intension and extension.
In order to find accurate concepts in the LCSH for building personalized
ontologies, this section presents a creative strategy that supervises concept
acquisition prior to matching. The strategy takes concept intension and
extension into account to theoretically support the proposed matching. Depending
on the objective, there are diverse explanations of intension and extension.
Here, we understand them in the logical way [84], as correlative notions that
indicate the reference of a term or concept: “intension” indicates the internal
content of a term or concept that constitutes its formal definition, and
“extension” indicates its range of applicability by naming the particular
objects that it denotes. For instance, the intension of “ship” as a substantive
is “vehicle for conveyance on water”, whereas its extension embraces such things
as cargo ships, passenger ships, battle ships, and sailing ships. It is easy to
see from the example that the intension concerns the attributes of a concept
itself, and the extension is a list of members having those attributes.
In other words, the intension represents the properties and characteristics of
the entire concept, while the extension represents a specific part of this
concept [124]. A concept is thus described jointly by its intension and
extension. The extension therefore depends on the intension and cannot stand on
its own. The major advantage of this characteristic is that the extension can
always be used to define different specific characteristics within the main
concept set, since it can take various combinations of elements of that set;
its main disadvantage is that the extension cannot be defined independently.
The extension can create various forms of unique characteristics contained in a
single developed intension, and thus it is of great importance for expanding
various aspects of model development. To develop an advanced matching approach,
a proper management of intension attributes should therefore be considered, so
as to obtain strong unique characteristics while defining the extension aspect.
The goal of using concept intension and extension can now be understood. We
believe that once a core concept is identified, potentially useful related
concepts can be obtained by enlarging the coverage of its extension. These
additional concepts may contribute to personalized ontologies so as to capture
more precise information needs. However, how to adopt the intension and
extension is a challenging issue. Two main questions arise: 1) which candidates
should be chosen for the intension; and 2) how much extension is enough for the
candidates. Along with these questions, we first define intension and extension
as below:
Definition (Intension): A set of terms T that is originally used in the LCSH to
describe the meaning of the concept s. For instance, "blood pressure" is a
subject heading s from the LCSH. The set of terms T = {blood, pressure} is the
intension of the concept s.
Definition (Extension): The extension consists of the specific terms in the LCSH.
These terms describe the ideas, properties, or corresponding signs that are
related to the concept s.
The specific terms can be clarified by the LCSH reference types in Chapter 3,
Section 3.2. In total, four reference types can be associated with the concepts.
Broader terms mean that the extension covers more general concepts, whereas
narrower terms indicate that the related concepts are more focused; these two
form the "Is-A" relation. The other two types are the used-for terms and the
related-to terms. The used-for terms are somewhat similar to the "Is-A" relation
because they carry a whole/part meaning; hence, the extension of the used-for
terms describes potential meronymic and inclusive concepts. The related-to terms
present concepts that share characteristics with a concept's intension.
For the first question, we determine to use the SPE terms as the candidates for
intension. This decision aims to match the specificity of the low-level
features. Since the SPE terms are highly focused on what the user wants in one
specific topic, each SPE term is treated as a quality or property that can make
up a clear concept. Yet, a term must satisfy a pre-condition to grant the
intension:

⋃_{s∈C} Intension(s) ⊇ T+    (4.5)

where Intension(s) denotes the intension of a matched subject in the LCSH. By
meeting condition 4.5, the concept intensions are well-defined and coherent
with the SPE terms. They always act as core elements to interpret specific
concepts.
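Condition 4.5 is straightforward to verify programmatically: the union of the matched subjects' intensions must cover T+. A minimal sketch, with an assumed set-based representation of intensions:

```python
def covers_all_spe(intensions, spe_terms):
    """Check condition (4.5): the union of the intensions of the matched
    subjects must be a superset of the SPE term set T+.

    intensions: list of sets, one per matched subject
    spe_terms:  set of SPE terms T+
    """
    union = set().union(*intensions) if intensions else set()
    return union >= set(spe_terms)
```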
The second question is difficult to answer without an explicit scope of
extension. Earlier, we described that the extension can help to cover more
related portions of the intension. However, there is no universal judgement for
all possible cases. Take a real instance: a concept intension is "network", and
one of its extensions is "network security". Apparently, after adding one more
term, the new concept is more specific than the intension alone. Meanwhile,
another extension is formed by adding a large number of terms, "qut b block
network security training". Now the new concept may be confusing and
inappropriate. Our approach is to make the potential extensions as specific as
possible, or to keep them small. The main reason is that, since our goal is to
find the most specific concepts, newly added terms may be excessive and useless.
Again, they might introduce redundancy. As a result, we set up another condition
to minimize the extension in a virtual space, formulated as shown below. To
better understand the condition, a diagram is shown in Figure 4.3; condition 4.6
seeks the smallest shaded portion.
Figure 4.3: Difference between extension and intension
minimize ∫_{s1}^{sn} (Extension(s) − Intension(s)) ds    (4.6)
where Extension(s) denotes the extension of a matched subject in the LCSH, and
ds denotes the integration variable. The above two conditions guide the design
of our new distributed matching method.
4.3.3 Distributed Matching
The matching occurs between the terms mined in the previous sections and the
subjects s ∈ C appearing in the LCSH. Note that the mined terms are divided
into three groups, and their values can be calculated by the weight function in
Equation 4.4. In this method, we only use terms from the first scenario,
t ∈ T+, for matching concepts in the LCSH. According to our prior testing, the
SPE terms are sufficient and indispensable for representing the user's desired
information.
Closely following the specified conditions 4.5 and 4.6, this section provides a
distributed method for matching relevance features and concepts. The method is
called "distributed" because each SPE term is assigned its corresponding
concepts individually. In order to satisfy condition 4.5, the candidate
concepts must contain or exactly cover the particular SPE term, i.e., t ∈ s_i
for each t ∈ T+. The matching method is divided into several steps, described
in detail as follows:
1) To rank the terms in T+ by importance, we first sort them based on the
weight values from Equation 4.4. This step identifies the significant terms
from the text mining perspective and highlights them.
2) The second step consists of two phases, and here condition 4.6 comes into
play. Since the scope of extension is difficult to determine, our strategy is
to keep all the original intensions and to minimise their extensions. To apply
this strategy in a practical way, phase one takes the top 25% of terms in T+
based on their values; they are considered the core portion of the local
information, and the four most relevant subjects s ∈ C are assigned to each of
these top 25% terms. In the second phase, the relevance of subjects is computed
by rel(s) = |T+ ∩ s| / |s|, where |T+ ∩ s| denotes the number of overlapping
terms between T+ and subject s, and |s| is the total number of terms in the
subject s. According to the rel values, four subjects are confirmed. The choice
of the 25% cutoff and four subjects is supported by empirical experiments;
modifying these parameters affects the matching performance, and a comparison
with other parameter settings is given in the evaluation section.
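The relevance score rel(s) and the top-four selection of phase two can be sketched as follows; the representation of subjects as term sets is an assumption for illustration:

```python
def rel(subject_terms, spe_terms):
    """rel(s) = |T+ ∩ s| / |s|: the fraction of a subject's terms
    that are SPE terms."""
    s = set(subject_terms)
    return len(s & set(spe_terms)) / len(s)

def top_relevant(candidates, spe_terms, k=4):
    """Return the k candidate subjects with the highest rel values."""
    return sorted(candidates, key=lambda s: rel(s, spe_terms), reverse=True)[:k]
```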
Algorithm 3 Concept Matching Algorithm
Input: A set of SPE terms T+; weight(t) from Eq. 4.4; a set of LCSH subjects C.
Output: A set of matched concepts SC.
1:  Let SC = ∅;
2:  Sort T+ using weight(t) in descending order;
3:  Let K = |T+| / 4, and let T+_1 be the top-K terms in T+;
4:  for each t ∈ T+_1 {
5:    Let s1 = s2 = s3 = s4 = t;
6:    Let c(t) = {s ∈ C | t ∈ s};
7:    Select the top-4 relevant concepts in c(t) using rel(s), and let s1, s2,
      s3, s4 be these subjects; // if |c(t)| < 4, t is the default value of si
      because of step 5
8:    Let SC = SC ∪ {s1, s2, s3, s4} }
9:  for each t ∈ (T+ − T+_1) {
10:   Let s1 = t, rel(s1) = 0;
11:   for each (s ∈ C and t ∈ s)
12:     if |s ∩ T+| / |s| > rel(s1) then
13:       Let s1 = s, rel(s1) = |s ∩ T+| / |s|;
14:   Let SC = SC ∪ {s1} }
3) This step handles the remaining 75% of terms in T+. To find accurate
subjects for each of these terms, we select only the single most relevant
subject based on the rel value instead of four. This step also guarantees that
all the intensions of the SPE terms are covered, while the majority of the
extensions are disregarded to reduce the chance of redundancy.
4) A set of specified concepts and their references in the LCSH is obtained to
form a personalized ontology. By taking advantage of the ontology, the scope of
the user's background knowledge can be defined, and search systems can offer
tailored results after understanding precise user preferences.
In most cases, a term t ∈ T+ can successfully find a subject or a set of
subjects s ∈ C in the LCSH. It becomes a challenging issue when a term cannot
match any concept in the global knowledge base. For example, "dutroux" is not a
valid word in the vocabulary, but it appears frequently in the training
documents and could be important for describing user needs; yet no subject can
be matched by the proposed method. To overcome this issue, we count the term
itself as a subject directly whenever c(t) = {s ∈ C | t ∈ s} = ∅.
Algorithm 3 illustrates the entire process of our concept matching method,
which makes the work repeatable and optimizable. Noting that the output is a
set of specific concepts SC, the process can also be understood as a transition
from informative descriptors to conceptual descriptors. These acquired concepts
and their semantic relations in the LCSH are used to construct personalized
ontologies.
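Algorithm 3 can be sketched in Python as follows. The subject representation (name mapped to its intension term set) and the fallback for unmatched terms such as "dutroux" follow the description above, but the data layout itself is an assumption:

```python
def concept_matching(spe_terms, weight, subjects):
    """Sketch of Algorithm 3: match each SPE term to LCSH subjects.

    spe_terms: list of SPE terms T+
    weight:    dict term -> weight(t) from Eq. 4.4
    subjects:  dict subject name -> set of intension terms (the LCSH C)
    """
    t_plus = set(spe_terms)

    def rel(name):
        terms = subjects[name]
        return len(terms & t_plus) / len(terms)

    ranked = sorted(spe_terms, key=lambda t: weight[t], reverse=True)
    k = len(ranked) // 4                       # top 25% get four subjects each
    matched = set()
    for i, t in enumerate(ranked):
        cands = sorted((n for n, terms in subjects.items() if t in terms),
                       key=rel, reverse=True)
        if not cands:
            matched.add(t)                     # unmatched term becomes a subject
        else:
            matched.update(cands[:4] if i < k else cands[:1])
    return matched
```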
4.4 Summary
This chapter presents a systematic approach to building personalized ontologies
by adopting relevance features. The purpose of the approach is to address a
hard question: how to associate low-level features (local information) with
high-level concepts (global knowledge). The approach also builds a bridge for
interpreting user information needs by studying the produced personalized
ontologies.
The user-oriented features are used to represent user information needs and to
match concepts appearing in the global knowledge base. At the beginning, the
pattern mining technique for acquiring relevance features is fully introduced,
including related definitions, feature classification, and term weighting. To
match appropriate concepts in the global knowledge base, the theory of
intension and extension is employed to guide a distributed matching process.
The process is an effective solution to the mismatch between local informative
references and common knowledge. The final deliverable is the tailored
personalized ontology, composed of "is-a" superior and subordinate concepts in
the world knowledge base. For evaluation, standard topics and a large test-bed
are employed for scientific experiments. We expect to gather substantial
results showing that the proposed matched concepts outperform baseline models
on information filtering measures.
Chapter 5
Evaluation
In the previous chapters, two ontology-based approaches were fully discussed.
One is the ontology-based mining approach, abbreviated as OM in this chapter;
the other is the approach of using relevance features to match concepts in the
world knowledge base, abbreviated as POM. A common hypothesis in this
dissertation is that personalized ontologies contain useful content for
capturing user information needs; they can offer more accurate results when
users participate in retrieval activities. User search intentions can be
discovered within a hierarchical concept-based structure by the first approach.
In the second approach, we consider the specificity features extracted from
positive documents containing user-focused needs. When the relevance features
are adopted for ontology matching, the gathered concepts should help to improve
search effectiveness. Related experiments were conducted to evaluate and
support this hypothesis. A similar ontological mining model, ONTO, has been
selected as one of the baseline models. This chapter provides an overview of
the experiment design, data collection, baseline models, information filtering
measurements, and the obtained results.
5.1 Evaluation Environment
5.1.1 Experimental Dataset
The LCSH was chosen as the database for the development of the proposed
approaches. It is originally a 719-megabyte database stored in Microsoft Office
Access (.mdb) format. Comprehensive reviews and discussion of the related
literature led to the specification of its definition, types, roles, and
subject headings. In total, the database contains 20 tables storing data on
semantics, topics, and subject headings. After classifying and arranging the
useful information, we built a table called ontology to store the data relevant
to ontology construction, including the items ID, subject heading, and the
assigned semantic relations. Initially, 491,250 subject headings and the
internal references between them were extracted, covering the relation types
ancestor, descendant, part-of, and related-to.
The Reuters Corpus Volume 1 (RCV1) consists of all and only the English-language
stories produced by Reuters journalists between August 20, 1996, and August 19,
1997: a total of 806,791 documents covering a very wide range of topics and
information. TREC (2002) developed and provided 100 assessor topics [85] for
the filtering track, aimed at building robust filtering systems. These topics
were developed by human assessors of the National Institute of Standards and
Technology, who also made the relevance judgements on the RCV1. The assessor
topics are more reliable than any artificially constructed topics [100]. For
each topic, the corresponding documents in the RCV1 are divided into a training
set and a testing set. Both sets contain two kinds of documents, namely,
positive documents and negative documents: the positive documents have been
manually judged relevant to the topic, whereas the negative documents have been
judged irrelevant. Figure 5.1 clarifies the structure and relations of the
RCV1.
According to Buckley et al. [16], the first 50 topics are stable and sufficient
for high-quality experiments. This dissertation therefore uses the RCV1 and the
first 50 assessor topics to evaluate the proposed model. The documents in RCV1
are marked up in XML. To avoid bias in the experiments, all of the meta-data in
the collection has been ignored, and the documents are treated as plain text
during pre-processing. For this purpose, an associated program has been built
in the Java programming language. It provides two functions, namely, stemming
and stopword removal. The stemming uses the Porter stemming algorithm (the
Porter stemmer), a process that removes the most common morphological and
inflexional endings from words in English. Its main use is as part of a term
normalization process that is usually carried out when setting up information
retrieval systems. For instance, before stemming, a text is presented as
follows:
At least 44 people were feared drowned when their vessel capsized in the
Nagavalli river in the southern state of Andhra Pradesh, the United News of
India said on Sunday. It quoted official sources as saying the boat was
carrying some 50 people, mainly tribespeople, when it sank on Saturday. Six
people swam to safety, it said.
Then, the text becomes the following statement after stemming:
least peopl fear drown vessel capsiz nagav river southern state andhra
pradesh unit new india sundai quot offici sourc sai boat carri peopl mainli
tribespeopl sank saturdai six peopl swam safeti
The aim of stopword removal is to prune meaningless terms in a text (e.g. the,
a, for, of, etc.). Besides the above functions, the program can also calculate
term frequencies in a text and compute the relevance value rel(s) specified in
Algorithm 3 for each subject. Figure 5.2 shows the primary user interface of
the pre-processing program.
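The two pre-processing functions can be illustrated with a simplified sketch. Note that the real program uses the full Porter stemmer, whereas the naive suffix stripping below only mimics its effect, and the stopword list is abbreviated:

```python
import re

STOPWORDS = {'the', 'a', 'an', 'for', 'of', 'to', 'in', 'on', 'it',
             'was', 'were', 'their', 'when', 'at', 'as', 'some'}

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and strip common suffixes."""
    out = []
    for tok in re.findall(r'[a-z]+', text.lower()):
        if tok in STOPWORDS:
            continue
        for suffix in ('ing', 'ed', 'es', 's'):   # crude, not the Porter rules
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[:-len(suffix)]
                break
        out.append(tok)
    return out
```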
5.1.2 Baseline Models
For evaluation, we employed a number of baseline models. These models belong to
different categories: three well-known IR models, two effective pattern mining
models, and one ontology mining model. The selected IR models and their
parameter settings are outlined as follows:
Figure 5.2: Primary user interface of the coded pre-processing program
• TFIDF [88]: This model has been introduced in the related work and is widely
used. A term t is weighted by w(t) = TF(d, t) × IDF(t), where the term
frequency TF(d, t) is the number of times term t occurs in document d (d ∈ D),
and D is the set of documents in the dataset; DF(t) is the document frequency,
i.e., the number of documents in which term t occurs at least once; and IDF(t),
the inverse document frequency, is defined as log(|D| / DF(t)).
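The TFIDF weighting can be sketched directly from the definition; documents are represented here as simple term lists, an assumption for illustration:

```python
import math

def tfidf(term, doc, docs):
    """w(t) = TF(d, t) * IDF(t) with IDF(t) = log(|D| / DF(t))."""
    tf = doc.count(term)                         # TF(d, t)
    df = sum(1 for d in docs if term in d)       # DF(t): docs containing t
    return tf * math.log(len(docs) / df) if df else 0.0
```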
• Rocchio [53]: This method generates a centroid to represent the user profile
by extracting terms from positive documents and revising the term weights using
the negative documents. The centroid c of a topic is generated as follows:

c = α · (1/|D+|) Σ_{d ∈ D+} d/||d|| − β · (1/|D−|) Σ_{d ∈ D−} d/||d||    (5.1)
5.1. Evaluation Environment 121
where ||d|| is the norm of the vector for document d, and α and β are two
control parameters for the effect of relevant and non-relevant data,
respectively. According to [?, ?], there are two recommendations for setting
these parameters: α = 16 and β = 4, or α = β = 1.0. We tested both
recommendations on the assessor topics and found the latter to perform best;
therefore, we let α = β = 1.0.
• Okapi BM25 [86] is one of the state-of-the-art term-based models. The term
weights are estimated using the following probabilistic model-based equation:

W(t) = [tf · (k1 + 1)] / [k1 · ((1 − b) + b · DL/AVDL) + tf]
       · log( [(r + 0.5)/(n − r + 0.5)] / [(R − r + 0.5)/(N − n − R + r + 0.5)] )    (5.2)
where N is the total number of documents in the training set; R is the
number of positive documents in the training set; n is the number of doc-
uments which contain term t; r is the number of positive documents which
contain term t; tf is the term frequency; DL and AVDL are the document
length and average document length, respectively; and k1 and b are the ex-
perimental parameters, where the recommended values of k1 and b for this
data collection are 1.2 and 0.75, respectively [133].
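Equation 5.2 translates directly into code; the parameter names below mirror the definitions above, with k1 = 1.2 and b = 0.75 as recommended:

```python
import math

def bm25_weight(tf, dl, avdl, n, r, N, R, k1=1.2, b=0.75):
    """BM25 term weight (Eq. 5.2): a tf-saturation factor times the
    relevance-weighted inverse document frequency."""
    tf_part = tf * (k1 + 1) / (k1 * ((1 - b) + b * dl / avdl) + tf)
    idf_part = math.log(((r + 0.5) / (n - r + 0.5)) /
                        ((R - r + 0.5) / (N - n - R + r + 0.5)))
    return tf_part * idf_part
```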
The pattern mining models are listed here:
• PDS [118]: A data mining method that uses frequent patterns in text. It
addresses the difficulty of using long, specific patterns in text by using the
patterns to accurately weight low-level terms based on their distributions in
the patterns. Given a term t ∈ D+, the support of term t is computed by the
following function:
w(t) = Σ_{i=1}^{|D+|} Σ_{t ∈ p ⊆ SP_i} Sup_a(p, d_i) / |p|    (5.3)
where SPi denotes a set of closed sequential patterns in document di and
|p| indicates the length of pattern p. The extracted low-level terms are used
to score a test document based on the total weight of the terms contained
in the document.
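Given the discovered patterns and their supports, Equation 5.3 reduces to a double sum; the pattern representation below is an assumption for illustration:

```python
def pds_weight(term, doc_patterns):
    """w(t): sum over positive documents of sup_a(p, d_i) / |p|
    for every discovered pattern p that contains t (Eq. 5.3)."""
    return sum(support / len(pattern)
               for patterns in doc_patterns      # one list per document d_i
               for pattern, support in patterns  # (pattern tuple, support)
               if term in pattern)
```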
• RFD [65]: The RFD model was fully introduced in Section 4.2 of Chapter 4. The
core idea is to discover both positive and negative patterns in text documents
as higher-level features in order to accurately weight low-level features
(terms) based on their specificity and their distributions in the higher-level
features. All the terms are pre-classified into three groups and used for
document weighting and sorting.
The last baseline model is an ontology-based model.
• ONTO [104]: The ONTO model builds personalized ontologies for improving
retrieval effectiveness and is the model closest to ours. Its idea is to use a
similarity measure to determine relevant concepts c for describing the themes
of the local instances, P(c|F) = P(c ∩ F) / P(F), where F is a set of features
discovered in the local instances. A conceptual ontology user profile was also
proposed by this model.
5.1.3 Experimental Measurements
In order to assess the accuracy and feasibility of our approaches, an objective
experiment is conducted using five state-of-the-art IR metrics: the top-20
precision based on the relevance judgements in RCV1 (top@20), the precision
averages at 11 standard recall levels (11-points), the Mean Average Precision
(MAP), the F1-measure (F1), and the break-even point (b/p).
The top-20 precision is considered the most important standard in the
evaluation, since a Web searcher mostly looks at the top 20 documents [64]. In
IR, precision is the percentage of retrieved documents that are relevant;
obviously, the precision of search results improves once precise user intents
are discovered. Using the ranking algorithm 4.3 presented in the previous
chapter, we obtained a list of documents sorted by their weights. The relevance
of each document in the RCV1 has already been judged and scored as 0 or 1.
Against these judgements, the top-20 precision is computed as:

top@20 = |{first 20 sorted docs} ∩ {relevant docs}| / 20
MAP is based on average precision (Ave(p)), which is the average of the
precision values at each relevant document retrieved in the ranked sequence.
Using Ave(p), MAP is defined as:

MAP = (1 / |Q|) Σ_{q=1}^{|Q|} Ave(p)

where |Q| is the number of queries. The F1-measure was first introduced by
C. J. van Rijsbergen [111]. It combines recall and precision with equal weight
in the following form:

F1 = 2 × precision × recall / (precision + recall)
The 11-points measure is also used to estimate the performance of retrieval
models by averaging precision at 11 standard recall levels (i.e. recall = 0,
0.1, 0.2, ..., 1).
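The measures above can be sketched as follows; MAP is then simply the mean of average_precision over all queries (function names are illustrative):

```python
def top20_precision(ranked, relevant):
    """top@20: fraction of the first 20 ranked documents that are relevant."""
    return sum(1 for d in ranked[:20] if d in relevant) / 20

def average_precision(ranked, relevant):
    """Ave(p): mean of the precision values at each relevant document."""
    hits, precisions = 0, []
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def f1_measure(precision, recall):
    """F1 = 2 * precision * recall / (precision + recall)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```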
5.2 Experiment Design and Settings
Related experiments are required to evaluate the two proposed approaches. Since
their experiment designs differ, we explain them separately. This section
starts with an overview of the experiment design for the first ontology-based
mining approach (abbreviated as OM). In essence, the design aims to answer
which level of the LCSH ontology contains the most appropriate user information
needs. A simple way would be to use the performance of the pilot level (the
first matching level) as the benchmark, and then check the performance of all
the levels. However, this method was felt to be too trivial and time-consuming
because of the large number of levels occurring in the LCSH. Our preliminary
study shows that topic R117 has the greatest depth of levels (28 levels
including upper and lower levels), while the smallest depth belongs to topic
R116 (10 levels in the two directions). For the first 25 topics in the RCV1,
the average depth is 20.32. Based on these observations, we noticed that all 25
topics have related subjects at upper level 7 and lower level 2; therefore, we
defined these two levels as additional benchmarks for testing. Finally, we
designed an experiment to estimate the performance of the first 7 upper and 2
lower levels one by one, which helps to determine which level is the best
candidate to interpret the information of interest to the user. The ONTO model
was chosen as the baseline for comparison because it is a pure ontology-based
model. To conduct a thorough evaluation, all the results are considered under
the information measurements specified above.
The second approach is a learning approach that adopts relevance features,
named POM here. The goal of the approach is to match a set of concepts in the
LCSH ontology for personalized ontology creation. The design aims to clarify
two points: 1) what the best candidates for concept matching are, and 2)
whether the proposed matching approach is useful. Figure 5.3 provides an
overview of the entire experimental design. First of all, all the models make
use of the RCV1 training set for feature selection. The produced features are a
set of term-weight pairs pending to be matched with concepts. After applying
the distributed matching method, the features are converted into concepts. The
RCV1 testing set is used to evaluate whether the concepts are useful for
improving mining effectiveness.
As explained in Section 4.3.3, two parameters are defined for distributed
matching.

Figure 5.3: Experiment design for POM

Table 5.1 shows the reason for predefining the parameters as the top 25% of
terms with 4 subjects and 1 for the rest. A number of combined settings, such
as 50%, 33%, and 25% with different numbers of subjects, were tested but are
omitted here for space. After repeated attempts, we found that the combination
of the top 25% with 4 subjects and 1 for the rest produces the best result.
This choice also meets the constraints of conditions 4.5 and 4.6.
#subjects   %    top@20   MAP      F1       b/p
4           50   0.43     0.3967   0.4103   0.3880
4           33   0.42     0.3941   0.4084   0.3855
4           25   0.46     0.4124   0.4195   0.4042
3           50   0.44     0.4029   0.4141   0.3968
3           33   0.44     0.4008   0.4125   0.4001
3           25   0.45     0.4053   0.4157   0.3971
2           ...  ...      ...      ...      ...
Table 5.1: Comparison results for different parameter settings
5.3 Experimental Results
This section outlines the gathered results regarding the performance of both the
OM and POM approaches.
5.3.1 Evaluation of Ontology-Based Mining Approach
Table 5.2 summarizes the results for the OM. It gives an intuitive view of all
the levels in the constructed hierarchy. To quantify the influence of the
levels, the percentage change in performance was used to compute the difference
in the top-20 precision, MAP, and F1-measure results among the levels. It is
formulated as:

%chg = (Result_after − Result_before) / Result_before × 100

The larger the %chg value, the more significant the improvement.
By observing the output, we can identify that upper level 7 is the desired
location for storing the user search intent, because it shows a major
improvement (21.19%) in top precision compared with the pilot and other levels.
All these statistics are averages over the first 50 topics in the RCV1. Another
vital point is the number of subjects at the different levels. The pilot level
has the largest number of subjects but low performance, which indicates that
the majority of its subjects are redundant, i.e., not useful for representing
user wants. For the upper and lower levels, however, the number of subjects
reduces dramatically in descending order. This can be understood from the
nature of the LCSH ontology: not every subject has been specified with internal
references. In addition, we can picture the shape of the upper levels as a
cone, and an inverted cone for the lower levels (Chapter 3, Section 3.3). Based
on the table, we can conclude that the OM approach works for top precision
enhancement.
Level        # Subjects   pr@20    % Chg    MAP      % Chg    F1       % Chg
Upper Lv.7   25.96        0.204    21.19    0.228    0.07     0.281    -1.025
Upper Lv.6   37.76        0.199    18.42    0.224    -1.43    0.279    -1.66
Upper Lv.5   54.04        0.193    14.39    0.225    -1.01    0.281    -1.02
Upper Lv.4   75.96        0.18     6.49     0.221    -2.69    0.278    -2.1
Upper Lv.3   114.16       0.183    8.53     0.223    -1.9     0.28     -1.35
Upper Lv.2   178.8        0.188    11.79    0.229    0.55     0.284    0.08
Upper Lv.1   365.16       0.18     7.03     0.231    1.5      0.287    1.15
Pilot Lv.    2132.04      0.168    -        0.228    -        0.284    -
Lower Lv.1   370.04       0.17     1.19     0.228    0.39     0.284    0.21
Lower Lv.2   103.84       0.19     11.31    0.23     0.88     0.285    0.5
Lower Lv.3   32.52        0.174    3.54     0.222    -2.41    0.2798   -1.63
Lower Lv.4   11.39        0.2      16.14    0.229    0.48     0.284    0.04
Table 5.2: Overall performance of first 50 topics
Focusing on the effect on top precision, an expanded test was conducted to
examine performance at different experimental stages (topics 1-25, 1-50,
1-100). We separated the testing into three stages according to the RCV1
topics. The results include the averages and comparisons against the benchmarks
and the ontology-based baseline, the ONTO model. As seen in Figure 5.4, the
results produced by our model consistently outperform the others. In the first
stage (topics 1-25), the improvement is most distinct. In the next stage,
topics 1-50, the result of our model is similar to that of upper level 7. In
the final stage, the improvement becomes obvious again; a possible reason is
that the last 50 topics were generated by a machine learning algorithm.
Figure 5.5 shows the comparison of the 11-points results on the first 50 RCV1
topics. The performance indicates that our approach is better than the other
approaches at the early recall levels, and slightly worse at the end. A
possible reason is redundancy: when moving across levels, an uncertain number
of terms is added for each RCV1 topic, and there is no way to prove that the
added terms are all relevant to the information needs. Based on our
observation, we claim that the majority of these terms are noise from the
perspective of text mining. For example, the precise terms "economy espionage"
have a parent "business"; the terms are apparently related from a logical point
of view, but the term "business" could become noise under an information
filtering measure.
Limitations: Three main limitations exist in this approach. The first is that
our investigation focuses mainly on the use of is-a relations in the LCSH; the
other relations, including used-for and related-to, are not considered. As a
result, the maximum depth of relations detected in the constructed concept
hierarchy is 28, not the 37 specified in the LCSH specification [104]. Some
useful implicit knowledge might therefore not be entirely discovered from the
world knowledge representation.

Figure 5.5: Comparison of 11-points in first 50 topics

The second limitation arises naturally from the LCSH. In reality, user
interests change all the time. The choice and form of headings are not
necessarily current: the LCSH terms have evolved over time, but they can never
be totally up-to-date, which might lead to misinterpretation of user search
intentions. The last limitation concerns the dataset used for evaluation: it is
a textual collection of news, whereas the database used for searching is a
collection of library subject headings, which might influence the experimental
results.
5.3.2 Evaluation of Distributed Matching
As shown in Figure 5.3, all the models need features before the matching is
taken into account. Consequently, every model was first executed with its
particular feature selection technique. All the generated features are treated
as a set of term-weight pairs, feature = {<t1, w1>, <t2, w2>, ..., <ti, wi>},
where i denotes the number of terms in the feature. In fact, the numbers of
terms produced by the various models are completely different; Table 5.3
summarizes the differences. The number of terms for each model is the average
over the first 50 RCV1 topics. Here, we list an item "RCV1 Title": the number
of terms appearing in the original RCV1 titles, which were written by news
experts. "RFD-T+" represents the number of terms appearing in the specificity
group of the RFD model.
Model        #terms
RCV1 Title     3.04
RFD-T+        23.12
TFIDF        147.32
Rocchio      621.96
BM25         615.86
PDS          154.82
ONTO          75.96

Table 5.3: Number of terms extracted by all the models
Given this diversity, we devised a procedure to evaluate the proposed matching
approach fairly. First, all the terms extracted by a model are sorted by
their assigned weight wi. Then, a cut-off is made to keep the terms in the first
23 positions. Finally, these top 23 terms are taken as representatives of
the model to establish the concept-based matching. They are regarded as a long
query Q for expressing user preferences.
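The cut-off procedure above can be sketched as follows. The feature list and function name are illustrative assumptions, not the actual implementation:

```python
# Sketch of the fair-comparison cut-off: sort a model's extracted
# (term, weight) pairs by weight and keep the top 23 as the long query Q.

def top_k_terms(feature, k=23):
    """feature: list of (term, weight) pairs; returns the k highest-weighted terms."""
    ranked = sorted(feature, key=lambda tw: tw[1], reverse=True)
    return [term for term, _ in ranked[:k]]

# Toy feature set; weights are invented for illustration.
feature = [("economy", 0.8), ("espionage", 0.7), ("business", 0.2), ("news", 0.1)]
Q = top_k_terms(feature, k=3)
print(Q)  # ['economy', 'espionage', 'business']
```

The same cut-off is applied to every model, so each model contributes an equally long query regardless of how many terms it originally extracted.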
To prove that our matched concepts can truly contain user information needs,
all terms in the matched concepts are revised with a new weight as follows:
weight(t, Θ) = ∑_{t ∈ s, s ∈ SC} rel(s) / |Q|
where Q denotes the long query defined above. For each model, we use the
revised weights computed by Equation 4.3 to rank documents for the first 50
topics in the RCV1 testing set. Finally, the comprehensive experimental results
for the first 50 topics are shown in Table 5.4. As displayed, RFD-T+
achieves excellent performance, with an average percentage change of 11.82%
(maximum 13.88%, minimum 7.36%) over all four measures. The deploying
method is:

weight(t, Θ) = 1 / |occurrences of t|
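A minimal sketch of the two weighting schemes being compared is given below. The data structures, subject names, and relevance scores rel(s) are illustrative assumptions; the actual definitions come from Equation 4.3 and its surrounding context:

```python
# Sketch of the two term-weighting schemes compared in this section.
# subjects: matched concepts, each subject s mapped to its set of terms;
# rel: a given relevance score rel(s) per subject; Q: the long query.

def prior_weight(t, subjects, rel, Q):
    """weight(t, Θ): sum of rel(s) over subjects s containing t, divided by |Q|."""
    return sum(rel[s] for s, terms in subjects.items() if t in terms) / len(Q)

def deploy_weight(t, occurrences):
    """Deploying method: weight(t, Θ) = 1 / (number of occurrences of t)."""
    return 1.0 / occurrences[t]

# Toy data, invented for illustration.
subjects = {"s1": {"economy", "espionage"}, "s2": {"economy", "business"}}
rel = {"s1": 0.75, "s2": 0.25}
Q = ["economy", "espionage", "business", "news"]

print(prior_weight("economy", subjects, rel, Q))   # (0.75 + 0.25) / 4 = 0.25
print(deploy_weight("economy", {"economy": 2}))    # 1 / 2 = 0.5
```

The prior weighting aggregates evidence from every matched subject containing the term, whereas the deploying method only penalizes frequent terms, which is one reason its results fall behind, as discussed below.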
Model     #terms   top@20      MAP       F1      b/p
RFD-T+     23.12    0.467   0.42193  0.42416  0.41356
TFIDF      23.00    0.317   0.30218  0.34484  0.29754
Rocchio    23.00    0.463   0.42266  0.42575  0.39965
BM25       23.00    0.452   0.41328  0.41872  0.41231
PDS        23.00    0.448   0.41435  0.42059  0.40016
ONTO       23.00    0.335   0.34035  0.37876  0.33552
%chg               +12.57%  +13.47%   +7.36%  +13.88%

Table 5.4: Comparison results after matching
In Table 5.6, we list the original results produced by all the models for analysis.
5.3. Experimental Results 133
A big difference is the number of terms extracted by the different techniques.
There are 23.12 SPE terms on average, which is approximately 7 times fewer than
the number of terms extracted by TFIDF (147.32) and PDS (154.82), and around
27 times fewer than Rocchio (621.96) and BM25 (615.86). This demonstrates that
the SPE terms are sufficient to summarize all user needs with a
small number of words. The matched concepts should capture all concrete user
knowledge. It is clear that, after adopting the developed matching process, all
the models perform with similar results compared with their original results.
Some models, including BM25 and ONTO, even achieve slightly better
results.
Model     #terms   top@20      MAP       F1      b/p
RFD-T+     23.12    0.413   0.37591  0.39476  0.37036
TFIDF      23.00    0.318   0.30049  0.34118  0.30245
Rocchio    23.00    0.439   0.39196  0.40463  0.38596
BM25       23.00    0.457   0.42098  0.42633  0.41588
PDS        23.00    0.444   0.39963  0.41115  0.38465
ONTO       23.00    0.260   0.29237  0.33729  0.29207

Table 5.5: Comparison of deploying results
From the results displayed in Table 5.5, we found that the deploying
method is not suitable for the prediction of concept matching. Although deploying
methods are widely applied by data mining models for prediction or classification
on new data, the results are not as good as those of the prior weighting method
weight(t, Θ).
Figure 5.6 shows the 11-points comparison after adopting the matching
approach. It is noted that the relevance features of RFD-T+ perform a little
better than the other models. This is the evidence upon which we identify the
relevance features of RFD-T+ as the best alternative for queries to express user
information needs.

Model     #terms   top@20       MAP       F1      b/p
RFD-T+     23.12    0.489    0.44826  0.44457  0.43804
TFIDF     147.32    0.369   0.344278  0.37128  0.34418
Rocchio   621.96    0.475    0.4305   0.4299   0.4201
BM25      615.86    0.445    0.4069   0.414    0.4074
PDS       154.82    0.496    0.44     0.444    0.439
ONTO       75.96    0.328    0.3455   0.37237  0.33015

Table 5.6: Comparison of original results
Figure 5.6: 11-points result after matching
In addition to Table 5.6, we also provide Figure 5.7 to show the original 11-points
results of the models. Comparing Figure 5.6 and Figure 5.7, there is no major
difference between the original results and the results after matching. The trends
of the curves are consistent. After matching, the concepts can recover the
performance achieved by the features. This outcome was expected, because it
confirms that the matching works properly and the matched concepts are relevant
to the user's desired information.
Figure 5.7: Original 11-points result
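The 11-points curves in Figures 5.6 and 5.7 are standard 11-point interpolated precision plots. A generic sketch of how such a curve is computed from a single ranked result list follows; the ranking and relevance judgments are invented for illustration:

```python
# Sketch: 11-point interpolated precision for a single ranked result list.
# `ranking` marks each retrieved document as relevant (True) or not (False).

def eleven_point_precision(ranking, total_relevant):
    # Record (recall, precision) after each retrieved document.
    points, hits = [], 0
    for i, relevant in enumerate(ranking, start=1):
        hits += relevant
        points.append((hits / total_relevant, hits / i))
    # Interpolated precision at recall level r is the maximum precision
    # observed at any recall >= r, for r = 0.0, 0.1, ..., 1.0.
    curve = []
    for level in (i / 10 for i in range(11)):
        candidates = [p for r, p in points if r >= level]
        curve.append(max(candidates) if candidates else 0.0)
    return curve

# Toy ranking: relevant documents at ranks 1 and 3, two relevant in total.
curve = eleven_point_precision([True, False, True, False], total_relevant=2)
print(curve[0], curve[10])  # precision at recall levels 0.0 and 1.0
```

Averaging such curves over the 50 topics yields the comparison plots; a curve that sits higher across recall levels indicates a better-performing model.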
5.4 Summary
In this chapter, we discussed the evaluation of the proposed approaches. At the
beginning of the chapter, we introduced the datasets and baseline models that
were used for comparison, as well as the applied information filtering measure-
ments. The designs were then clearly outlined to provide a general overview of the
evaluation. Finally, it was explained that the ontology-based mining approach
was proven to be successful based on the experiments. The improvements are
consistent and significant according to the top-20 precision measure. The related
results indicate that the overall performance is better than that of the baseline models.
For the proposed matching approach, the substantial results show that it can
successfully address the problem of the mismatch between local instances and
global knowledge, and they indicate an effective way to discover desired concepts
by adopting relevance features for building personalized ontologies. The
improvements are significant in comparison with three state-of-the-art IR, two
popular text mining, and one ontology-based baseline models. More importantly,
three fundamental points can be summarized from the discussion in this chapter:
1) The performance of the proposed matching approach is consistently better
for all feature selection models.
2) Based on the experimental results, using the weighting function is better
than using the deploying method for concepts in the LCSH ontology.
3) The results produced by the proposed matching approach are approximately
equal to those of the state-of-the-art IR and effective data mining models.
Chapter 6
Conclusion
In this dissertation, we have proposed a novel semantic approach
to discover user search intent from personalized ontologies. The key aspect of
our innovation is the deployment of an ontology-based mining approach that
facilitates responses to requests for information with high top-20 precision. The
approach constructs the backbone of a concept hierarchy to store relevant subjects
gained by a term-based search model. By learning the concepts and their
references from a general ontology, more potentially related concepts can be
inferred at other virtual levels and tracked in a two-dimensional structure. An
iterative algorithm is developed to evaluate the performance level by level until
the best result is reached. A large test bed was utilized for a number of
experiments. The gathered experimental results demonstrated that this approach
is effective and
promising. It can enhance search effectiveness in terms of top precision. We
are also confident that the fundamental conceptual framework is sound, and its
implementation is feasible from a technical standpoint.
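The level-by-level iteration described above can be sketched as follows. The evaluation function and the simple stopping rule are assumptions for illustration; the actual algorithm is defined in the earlier chapters:

```python
# Sketch of the iterative level-by-level search: expand the concept
# hierarchy one virtual level at a time and stop once performance no
# longer improves. `evaluate` stands in for the real filtering measure.

def best_level(evaluate, max_levels):
    """Return (best level, best score) under a stop-on-no-improvement rule."""
    best_score, best = float("-inf"), 0
    for level in range(1, max_levels + 1):
        score = evaluate(level)
        if score <= best_score:      # no improvement: previous level was best
            break
        best_score, best = score, level
    return best, best_score

# Toy performance profile, invented for illustration, peaking at level 3.
scores = {1: 0.30, 2: 0.38, 3: 0.42, 4: 0.40, 5: 0.35}
print(best_level(scores.get, max_levels=5))  # (3, 0.42)
```

This greedy stopping rule assumes performance is roughly unimodal over the levels, which matched our observations in the experiments.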
This dissertation has also introduced an automated approach to allocate local
information into global knowledge. This is significant work that transfers low-
level features to high-level concepts. It builds a bridge for constructing personalized
ontologies by adopting the results from different types of models. The approach
uses term-to-concept matching between user requests mined from documents and
concepts learned from a world ontology rather than keyword-to-keyword match-
ing. Therefore, the key problem in the use of this technique is, on the one hand,
to identify and match appropriate concepts that describe specific items of interest
to the user and, on the other hand, to employ the candidate features to express
the user's information needs. It is critical to ensure that irrelevant concepts will not be
associated and matched, and that relevant concepts will not be discarded. To ad-
dress these concerns, the approach takes advantage of the theory of intension and
extension for concept selection. It is a creative attempt because the theory has
not previously been used in ontology learning, only for granular and rough set
mining. The intension ensures that the relevant concepts would not be missed,
whereas the extension supplies possibly relevant concepts. In the evaluation,
standard assessment topics and a large testbed were employed for scientific
experiments. The overall results prove that the proposed approach is successful
and reliable in comparison with diverse baseline models.
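A schematic sketch of the term-to-concept matching idea is given below. The concept labels, term sets, and overlap score are illustrative assumptions, not the thesis's exact selection criteria:

```python
# Sketch of term-to-concept matching: instead of comparing query keywords
# to document keywords, match the extracted relevance features against the
# term sets of ontology concepts and keep the best-overlapping concepts.

def match_concepts(features, concepts, min_overlap=1):
    """features: set of terms; concepts: {label: set of terms}.
    Returns (label, overlap) pairs sharing >= min_overlap terms, best first."""
    scored = [(label, len(features & terms)) for label, terms in concepts.items()]
    matched = [(label, s) for label, s in scored if s >= min_overlap]
    return sorted(matched, key=lambda ls: ls[1], reverse=True)

# Toy relevance features and toy concept term sets.
features = {"economy", "espionage", "trade"}
concepts = {
    "Espionage, Economic": {"economy", "espionage"},
    "Business": {"business", "trade"},
    "Sports": {"football"},
}
print(match_concepts(features, concepts))
# [('Espionage, Economic', 2), ('Business', 1)]
```

In this view, the intension-driven side keeps every concept overlapping the features so that relevant concepts are not missed, while the extension side supplies further candidate concepts beyond those directly matched.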
The above innovations and developments enable us to claim that
the ontology-based technique will become more popular as a search mechanism
expected to offer intelligent services in the near future. In terms of our study,
data mining techniques are very effective at mining valuable information. At the
current stage, it is difficult for a pure ontology-based model to outperform the
data mining models, for several reasons. The first reason is that the data mining
models always rely on keywords or core features that are related to the keywords.
However, the keywords might disappear when the model describes the user need
in concepts. For example, the keywords “optical mouse” can help a user to
find the desired information. Once interpreted as a concept, the description
could change to “computer accessory”. As a result, the retrieved results would
be totally different. The second reason for the poorer performance of a pure
ontology-based model is the length of a concept. In general, a concept consists
of a small number of terms. This is a limitation when a user directly uses the
concept for searching, because the small number of terms provides only restricted
information to represent the user's wants. Many data mining models have already
proven that long patterns achieve better results than short patterns. This is
because the long patterns have sufficient terms to make the expression
discriminative. In addition, the data mining models have been trained on the
context of documents, which is not available to the ontology-based models. The
third reason relates to the ontologies themselves. Ontologies are developed by
domain experts for individual objectives. They might not cover all the concepts
for each domain. This leads to a series of problems such as misinterpretation,
lack of content, and complexity in use. The last reason relates to ontology
evaluation. After producing an ontology for a model, researchers struggle with a
common issue: how to evaluate the ontology. There is no existing standard,
suitable measurement, or baseline recognized by most research communities for
ontology examination. Consequently, it is difficult to judge whether an ontology
is correct or not.
6.1 Future work
At this stage, two ontological mining approaches were developed separately,
though they have the same objective. The first approach builds the backbone of a
personalized ontology based on the overlapping concepts of user-proposed queries. In
order to improve the retrieval performance, the second approach was carried out
to match relevant concepts by relevance features extracted from local documents.
In other words, the second approach was proposed to serve the first. However,
the conducted evaluation did not cover the integration of the two approaches to
facilitate the knowledge generalization process. As a result, an urgent task for
the next step is to explore the possibility of combining the two approaches. A
complete personalized ontology system is expected to discover users' desired
information needs.
In the future, we plan to investigate the use of learned ontologies by utilizing
the abundant semantic relations among concepts, including part-of and related-to.
The part-of semantic relation is formalized in the LCSH by the reciprocal
reference Used-For and provides a mereological part-hood relationship between
two subjects. The related-to semantic relation is formally represented in the
LCSH as a bi-directional associative relationship, Related-To, in which two
subjects are linked in some manner other than hierarchy. It is noted that
ontologies play an important role as the backbone for facilitating access to
information in knowledge management systems. Consequently, how to properly
take advantage of these relations is an interesting question worth studying,
since the present work aimed to define user wants in terms of extensive concepts
in world knowledge. In addition, we are quite interested in weight-revising
methods for performance enhancement. Future investigations would extend the
applicability to the majority of existing Web documents, and maximize the
contribution of the present work.
For future study, we would also aim to try the approaches with other ontology
datasets, such as WordNet. This plan is motivated by the problem that the
existing LCSH dataset is a collection of library subject headings, yet, in the
evaluation, the employed testing dataset was a collection of news documents.
The two appear disparate and inconsistent. The WordNet dataset classifies all
knowledge into seven main separate categories. Each category is an individual
ontology that is used to describe a focused domain. In addition, WordNet has
relatively complete relations that are similar to those in the LCSH, but they
are defined over a variety of word classes, including verbs, nouns, adjectives,
and adverbs. Besides, we are also interested in comparing with other language
modeling approaches for XML component retrieval, such as the hierarchical
language model (HLM) [82], because the model can smooth with parent/child
elements and apply differential weighting for each element type. Once the
personalized ontology is constructed from our approach, it can be interpreted
as a semi-structured document. In this sense, the score representation and
ranking for XML could possibly be applied.
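As a rough illustration of the smoothing idea behind the HLM, a term's probability in an XML element can be interpolated with its parent element and the collection. The interpolation weights and probabilities below are invented; see [82] for the actual model:

```python
# Sketch: hierarchical smoothing of a term's probability in an XML element
# by linear interpolation with its parent element and the whole collection.

def smoothed_prob(p_element, p_parent, p_collection, lam_e=0.6, lam_p=0.3):
    """P(t|e) smoothed with the parent element and the collection model."""
    lam_c = 1.0 - lam_e - lam_p          # remaining mass for the collection
    return lam_e * p_element + lam_p * p_parent + lam_c * p_collection

# Toy probabilities, invented for illustration.
print(smoothed_prob(0.2, 0.1, 0.01))  # ≈ 0.151
```

Assigning different weights per element type would then realize the differential weighting mentioned above.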
Bibliography
[1] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by
incorporating user behavior information. In Proceedings of the 29th ACM
SIGIR conference on Research and development in information retrieval
(SIGIR ’06), page 26. ACM, 2006.
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between
sets of items in large databases. In ACM SIGMOD Record, volume 22,
pages 207–216. ACM, 1993.
[3] R. Agrawal and R. Srikant. Mining sequential patterns. In Data Engineer-
ing, 1995. Proceedings of the Eleventh International Conference on, pages
3–14. IEEE, 1995.
[4] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules.
In Proc. 20th Int. Conf. Very Large Data Bases, VLDB, volume 1215, pages
487–499, 1994.
[5] A. Algarni, Y. Li, and X. Tao. Mining specific and general features in
both positive and negative relevance feedback. In Text Retrieval Conference
(TREC 2009), 2009.
[6] G. Antoniou and F. Harmelen. Web ontology language: OWL. Handbook on
ontologies, pages 91–110, 2009.
[7] G. Antoniou and F. Van Harmelen. A semantic web primer. The MIT
Press, 2004.
[8] G. Antoniou and F. van Harmelen. A Semantic Web Primer, (Cooperative
Information Systems). 2008.
[9] F. Baader. The description logic handbook: theory, implementation, and
applications. Cambridge Univ Pr, 2003.
[10] R. Baeza-Yates, B. Ribeiro-Neto, et al. Modern information retrieval.
Addison-Wesley Reading, MA, 1999.
[11] R. J. Bayardo Jr. Efficiently mining long patterns from databases. In ACM
Sigmod Record, volume 27, pages 85–93. ACM, 1998.
[12] S. Bechhofer, F. Van Harmelen, J. Hendler, I. Horrocks, D. McGuinness,
P. Patel-Schneider, L. Stein, et al. OWL web ontology language reference.
W3C recommendation, 10:2006–01, 2004.
[13] N. Belkin and W. Croft. Information filtering and information retrieval:
two sides of the same coin? Communications of the ACM, 35(12):29–38,
1992.
[14] T. Berners-Lee, J. Hendler, O. Lassila, et al. The semantic web. Scientific
american, 284(5):28–37, 2001.
[15] J. Bhogal, A. Macfarlane, and P. Smith. A review of ontology based query
expansion. Information Processing & Management, 43(4):866–886, 2007.
[16] C. Buckley and E. Voorhees. Evaluating evaluation measure stability. In
Proceedings of the 23rd ACM SIGIR conference on Research and develop-
ment in information retrieval (SIGIR ’00), pages 33–40, 2000.
[17] A. Carlyle. Matching LCSH and user vocabulary in the library catalog.
Cataloging & Classification Quarterly, 10(1):37–63, 1989.
[18] L. Chan. Library of Congress subject headings: principles of structure
and policies for application. Cataloging Distribution Service, Library of
Congress, 1990.
[19] L. Chan. Exploiting LCSH, LCC, and DDC To Retrieve Networked Re-
sources: Issues and Challenges. 2000.
[20] M. Chau, D. Zeng, and H. Chen. Personalized spiders for web search and
analysis. In Proceedings of the 1st ACM/IEEE-CS joint conference on Dig-
ital libraries, page 87. ACM, 2001.
[21] C. Chekuri, M. Goldwasser, P. Raghavan, and E. Upfal. Web search using
automatic classification. In Proceedings of the Sixth International Confer-
ence on the World Wide Web. Citeseer, 1997.
[22] C. Chen, S. Hsu, Y. Li, and C. Peng. Personalized intelligent m-learning
system for supporting effective english learning. In Systems, Man and Cy-
bernetics, 2006. SMC’06. IEEE International Conference on, volume 6,
pages 4898–4903. IEEE, 2006.
[23] M. Chen and A. Hauptmann. Discriminative fields for modeling semantic
concepts in video. In Large Scale Semantic Access to Content (Text, Image,
Video, and Sound), pages 151–166. LE CENTRE DE HAUTES ETUDES
INTERNATIONALES D’INFORMATIQUE DOCUMENTAIRE, 2007.
[24] Y. Chen, T. Suel, and A. Markowetz. Efficient query processing in geo-
graphic web search engines. In Proceedings of the 2006 ACM SIGMOD
international conference on Management of data, page 288. ACM, 2006.
[25] P. Chirita, C. Firan, and W. Nejdl. Summarizing local context to person-
alize global web search. In Proceedings of the 15th ACM international con-
ference on Information and knowledge management, pages 287–296. ACM,
2006.
[26] P. Chirita, C. Firan, and W. Nejdl. Personalized query expansion for the
web. In Proceedings of the 30th ACM SIGIR conference on Research and
development in information retrieval (SIGIR ’07), pages 7–14. ACM, 2007.
[27] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam,
and S. Slattery. Learning to construct knowledge bases from the world wide
web. Artificial Intelligence, 118(1):69–113, 2000.
[28] W. Dakka and P. G. Ipeirotis. Automatic extraction of useful facet hierar-
chies from text databases. In Data Engineering, 2008. ICDE 2008. IEEE
24th International Conference on, pages 466–475. IEEE, 2008.
[29] Z. Dou, R. Song, J. Wen, and X. Yuan. Evaluating the Effectiveness of
Personalized Web Search. IEEE Transactions on Knowledge and Data En-
gineering, 21(8):1178–1190, 2009.
[30] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to
knowledge discovery in databases. AI magazine, 17(3):37, 1996.
[31] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances
in knowledge discovery and data mining. 1996.
[32] R. Feldman and J. Sanger. The text mining handbook: advanced approaches
in analyzing unstructured data. Cambridge University Press, 2006.
[33] D. Florescu, D. Kossmann, and I. Manolescu. Integrating keyword search
into XML query processing. Computer Networks, 33(1-6):119–135, 2000.
[34] M. Fontoura, V. Josifovski, R. Kumar, C. Olston, A. Tomkins, and S. Vas-
silvitskii. Relaxation in text search using taxonomies. Proceedings of the
VLDB Endowment, 1(1):672–683, 2008.
[35] S. Fox, K. Karnawat, M. Mydland, S. Dumais, and T. White. Evaluating
implicit measures to improve web search. ACM Transactions on Informa-
tion Systems (TOIS), 23(2):168, 2005.
[36] W. Frawley, G. Piatetsky-Shapiro, and C. Matheus. Knowledge discovery
in databases: An overview. AI magazine, 13(3):57, 1992.
[37] B. Gaines and M. Shaw. Supporting the creativity cycle through visual
languages. In AAAI Spring Symposium: AI and Creativity, pages 155–162,
1993.
[38] E. Garcia and M. Sicilia. User interface tactics in ontology-based informa-
tion seeking. PsychNology Journal, 1(3):242–255, 2003.
[39] S. Gauch, J. Chaffee, and A. Pretschner. Ontology-based personalized
search and browsing. Web Intelligence and Agent Systems, 1(3):219–234,
2003.
[40] F. Giunchiglia, D. Soergel, V. Maltese, and A. Bertacco. Mapping large-
scale knowledge organization systems. Technical report, University of
Trento, 2009.
[41] C. Govindarajulu and B. J. Reithel. Beyond the information center: an
instrument to measure end-user computing support from multiple sources.
Information & management, 33(5):241–250, 1998.
[42] B. Grau. A possible simplification of the semantic web architecture. In
Proceedings of the 13th international conference on World Wide Web, pages
704–713. ACM, 2004.
[43] J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: cur-
rent status and future directions. Data Mining and Knowledge Discovery,
15(1):55–86, 2007.
[44] J. Han and Y. Fu. Dynamic generation and refinement of concept hierar-
chies for knowledge discovery in databases. In Proc. Aaai, volume 94, pages
157–168, 1994.
[45] J. Han, S. Nishio, H. Kawano, and W. Wang. Generalization-based data
mining in object-oriented databases using an object cube model. Data &
Knowledge Engineering, 25(1):55–97, 1998.
[46] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu. Freespan:
frequent pattern-projected sequential pattern mining. In Proceedings of the
sixth ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 355–359. ACM, 2000.
[47] J. Han, J. Pei, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu.
Prefixspan: Mining sequential patterns efficiently by prefix-projected pat-
tern growth. In Proceedings of the 17th International Conference on Data
Engineering, pages 215–224, 2001.
[48] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without
candidate generation: A frequent-pattern tree approach. Data mining and
knowledge discovery, 8(1):53–87, 2004.
[49] M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search
engines. In ACM SIGIR Forum, volume 36, page 22. ACM, 2002.
[50] D. Hiemstra. A probabilistic justification for using tf×idf term weighting in
information retrieval. International Journal on Digital Libraries, 3(2):131–
139, 2000.
[51] S. Jaroszewicz and D. Simovici. Interestingness of frequent itemsets using
bayesian networks as background knowledge. In Proceedings of the tenth
ACM SIGKDD international conference on Knowledge discovery and data
mining, pages 178–186. ACM, 2004.
[52] X. Jiang and A. Tan. Learning and inferencing in user ontology for per-
sonalized Semantic Web search. Information Sciences, 179(16):2794–2808,
2009.
[53] T. Joachims. A probabilistic analysis of the rocchio algorithm with tfidf for
text categorization. Technical report, DTIC Document, 1996.
[54] Y. Kalfoglou and M. Schorlemmer. Ontology mapping: the state of the art.
The knowledge engineering review, 18(01):1–31, 2003.
[55] K. Keenoy and M. Levene. Personalisation of web search. Intelligent Tech-
niques for Web Personalization, pages 201–228, 2005.
[56] I. Keleberda, V. Repka, and Y. Biletskiy. Building learner’s ontologies to
assist personalized search of learning objects. In Proceedings of the 8th
international conference on Electronic commerce: The new e-commerce:
innovations for conquering current barriers, obstacles and limitations to
conducting successful business on the internet, page 573. ACM, 2006.
[57] D. Kelly and J. Teevan. Implicit feedback for inferring user preference: a
bibliography. In ACM SIGIR Forum, volume 37, pages 18–28. ACM, 2003.
[58] L. Khan and F. Luo. Ontology construction for information selection. In
Proceedings of the 14th 2002 IEEE International Conference on Tools with
Artificial Intelligence, pages 122–127, 2002.
[59] J. King, Y. Li, X. Tao, and R. Nayak. Mining world knowledge for analysis
of search engine content. Web Intelligence and Agent Systems, 5(3):233–
253, 2007.
[60] A. J. Knobbe and E. K. Ho. Pattern teams. In Knowledge Discovery in
Databases: PKDD 2006, pages 577–584. Springer, 2006.
[61] H. Kum, J. Pei, W. Wang, and D. Duncan. Approxmap: Approximate
mining of consensus sequential patterns. In Proceedings of SIAM Int. Conf.
on Data Mining, 2003.
[62] S. Lawrence. Context in web search. Bulletin of the Technical Committee
on, page 25, 2000.
[63] M. Lee, K. Tsai, and T. Wang. A practical ontology query expansion
algorithm for semantic-aware learning objects retrieval. Computers & Ed-
ucation, 50(4):1240–1257, 2008.
[64] H. Leighton and J. Srivastava. First 20 precision among world wide web
search services(search engines). Journal of the American Society for Infor-
mation Science, 50(10):870–881, 1999.
[65] Y. Li, A. Algarni, and N. Zhong. Mining positive and negative patterns
for relevance feature discovery. In Proceedings of the 16th ACM SIGKDD
conference on Knowledge discovery and Data mining (KDD ’10), pages
753–762, 2010.
[66] Y. Li and N. Zhong. Mining ontology for automatically acquiring web user
information needs. IEEE Transactions on Knowledge and Data Engineer-
ing, 18(4):554–568, 2006.
[67] Y. Li, X. Zhou, P. Bruza, Y. Xu, and R. Lau. A two-stage text mining
model for information filtering. In Proceedings of the 17th ACM conference
on Information and Knowledge Management (CIKM ’08), pages 1023–1032,
2008.
[68] F. Liu, C. Yu, and W. Meng. Personalized web search for improving re-
trieval effectiveness. IEEE Transactions on knowledge and data engineering,
16(1):28–40, 2004.
[69] H. Liu and H. Motoda. Feature selection for knowledge discovery and data
mining. Springer, 1998.
[70] X. Long and T. Suel. Three-level caching for efficient query processing in
large web search engines. World Wide Web, 9(4):369–395, 2006.
[71] A. Maedche. Ontology learning for the semantic web. Springer Netherlands,
2002.
[72] A. Maedche and S. Staab. Learning ontologies for the semantic web. IEEE
Intelligent Systems, 16(2):72–79, 2001.
[73] M. Mampaey, N. Tatti, and J. Vreeken. Tell me what i need to know: suc-
cinctly summarizing data with itemsets. In Proceedings of the 17th ACM
SIGKDD international conference on Knowledge discovery and data min-
ing, pages 573–581. ACM, 2011.
[74] F. Manola and E. Miller. RDF Primer. W3C Recommendation (2004).
World Wide Web Consortium (W3C),
http://www.w3.org/TR/2004/REC-rdf-primer-20040210/ (last access date:
30 Aug 2004).
[75] W. Mischo. Library of Congress Subject Headings. Cataloging & Classifi-
cation Quarterly, 1(2):105–124, 1982.
[76] N. Nanas, V. Uren, and A. De Roeck. A comparative evaluation of term
weighting methods for information filtering. In Proceedings of the 4th
IEEE International Workshop on natural language and information sys-
tems (NLIS ’04), pages 13–17, 2004.
[77] M. Naphade, L. Kennedy, J. Kender, S. Chang, J. Smith, P. Over, and
A. Hauptmann. A light scale concept ontology for multimedia understand-
ing for trecvid 2005. IBM Research Report RC23612 (W0505-104), 2005.
[78] R. Navigli and P. Velardi. An analysis of ontology-based query expan-
sion strategies. In Proceedings of the 14th European Conference on Ma-
chine Learning, Workshop on Adaptive Text Extraction and Mining, Cavtat-
Dubrovnik, Croatia, pages 42–49, 2003.
[79] R. Navigli, P. Velardi, and A. Gangemi. Ontology learning and its appli-
cation to automated terminology translation. IEEE Intelligent Systems,
18(1):22–31, 2003.
[80] N. Noy. Semantic integration: a survey of ontology-based approaches. SIG-
MOD record, 33(4):65–70, 2004.
[81] D. Oberle, R. Volz, B. Motik, and S. Staab. An extensible ontology soft-
ware environment. Handbook on Ontologies, International Handbooks on
Information Systems, pages 311–333, 2004.
[82] P. Ogilvie and J. Callan. Hierarchical language models for xml compo-
nent retrieval. In Advances in XML Information Retrieval, pages 224–237.
Springer, 2005.
[83] J. Pei, J. Han, and W. Wang. Constraint-based sequential pattern mining:
the pattern-growth methods. Journal of Intelligent Information Systems,
28(2):133–160, 2007.
[84] U. Priss. Formal concept analysis in information science. Annual review of
information science and technology, 40:521, 2006.
[85] S. Robertson and I. Soboroff. The trec 2002 filtering track report. In
Proceedings of the 10th Text REtrieval Conference (TREC ’01), pages 26–
37, 2002.
[86] S. Robertson, S. Walker, and M. Beaulieu. Experimentation as a way of
life: Okapi at trec. Information Processing & Management, 36(1):95–108,
2000.
[87] D. Rose and D. Levinson. Understanding user goals in web search. In
Proceedings of the 13th international conference on World Wide Web, pages
13–19. ACM, 2004.
[88] G. Salton and C. Buckley. Term-weighting approaches in automatic text
retrieval. Information processing & management, 24(5):513–523, 1988.
[89] G. Salton and M. McGill. Introduction to modern information retrieval.
McGraw-Hill New York, 1983.
[90] M. Sanderson and B. Croft. Deriving concept hierarchies from text. In
Proceedings of the 22nd annual international ACM SIGIR conference on
Research and development in information retrieval, pages 206–213. ACM,
1999.
[91] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative
filtering recommendation algorithms. In Proceedings of the 10th interna-
tional conference on World Wide Web, pages 285–295. ACM, 2001.
[92] V. Schickel-Zuber and B. Faltings. Inferring user's preferences using
ontologies. In Proceedings of the national conference on artificial intelligence,
volume 21, page 1413. AAAI Press, 2006.
[93] F. Sebastiani. Machine learning in automated text categorization. ACM
computing surveys (CSUR), 34(1):1–47, 2002.
[94] S. Sendhilkumar and T. Geetha. Personalized ontology for web search per-
sonalization. In Proceedings of the 1st Bangalore annual Compute confer-
ence, page 18. ACM, 2008.
[95] C. Shahabi and Y. Chen. Web information personalization: Challenges
and approaches. Databases in Networked Information Systems, pages 5–15,
2003.
[96] X. Shen, B. Tan, and C. Zhai. Context-sensitive information retrieval using
implicit feedback. In Proceedings of the 28th annual international ACM
SIGIR conference on Research and development in information retrieval,
page 50. ACM, 2005.
[97] Y. Shen, Y. Li, Y. Xu, R. Iannella, A. Algarni, and X. Tao. An ontology-
based mining approach for user search intent discovery. In Proceedings of the
16th Australasian Document Computing Symposium (ADCS 2011), pages
39–46. Royal Melbourne Institute of Technology, 2011.
[98] A. Sieg, B. Mobasher, and R. Burke. Web search personalization with
ontological user profiles. In Proceedings of the 16th ACM conference on In-
formation and Knowledge Management (CIKM ’07), pages 525–534, 2007.
[99] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6):970–974, 1996.
[100] I. Soboroff and S. Robertson. Building a filtering test collection for TREC 2002. In Proceedings of the 26th ACM SIGIR conference on Research and development in information retrieval (SIGIR ’03), pages 243–250, 2003.
[101] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. Advances in Database Technology (EDBT ’96), pages 1–17, 1996.
[102] S. Staab and R. Studer. Handbook on ontologies. Springer Verlag, 2004.
[103] P. Tan, V. Kumar, and J. Srivastava. Selecting the right interesting-
ness measure for association patterns. In Proceedings of the eighth ACM
SIGKDD international conference on Knowledge discovery and data min-
ing, pages 32–41. ACM, 2002.
[104] X. Tao, Y. Li, and N. Zhong. A personalized ontology model for web infor-
mation gathering. IEEE Transactions on Knowledge and Data Engineering,
23(4):496–511, 2011.
[105] X. Tao, Y. Li, N. Zhong, and R. Nayak. Ontology mining for personalized-
web information gathering. In Proceedings of the IEEE/WIC/ACM Inter-
national Conference on Web Intelligence, pages 351–358. IEEE Computer
Society, 2007.
[106] X. Tao, Y. Li, N. Zhong, and R. Nayak. An ontology-based framework for knowledge retrieval. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT ’08), volume 1, 2008.
[107] S. Tata and J. Patel. Estimating the selectivity of tf-idf based cosine simi-
larity predicates. ACM SIGMOD Record, 36(2):7–12, 2007.
[108] N. Tatti. Maximum entropy based significance of itemsets. Knowledge and
Information Systems, 17(1):57–77, 2008.
[109] J. Trajkova and S. Gauch. Improving ontology-based user profiles. In
Proceedings of Conference Adaptivity, Personalization and Fusion of Het-
erogeneous Information (RIAO ’04), volume 4, pages 380–389, 2004.
[110] F. Van Harmelen and D. McGuinness. OWL Web Ontology Language overview. World Wide Web Consortium (W3C) recommendation, http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.
[111] C. Van Rijsbergen. Information Retrieval, chapter 7, pages 111–143. Butterworths, London, 2nd edition, 1979.
[112] A. Varzi. Spatial reasoning and ontology: parts, wholes, and locations.
Handbook of Spatial Logics, pages 945–1038, 2007.
[113] E. Voorhees. Query expansion using lexical-semantic relations. In Pro-
ceedings of the 17th annual international ACM SIGIR conference on Re-
search and development in information retrieval, pages 61–69. Springer-
Verlag New York, Inc., 1994.
[114] B. Wang, R. McKay, H. Abbass, and M. Barlow. A comparative study
for domain ontology guided feature extraction. In Proceedings of the 26th
Australasian computer science conference, volume 16, pages 69–78, 2003.
[115] T. Wang, B. Parsia, and J. Hendler. A survey of the web ontology landscape.
The Semantic Web-ISWC 2006, pages 682–694, 2006.
[116] R. White, J. Jose, and I. Ruthven. Comparing explicit and implicit feedback techniques for web retrieval: TREC-10 interactive track report. NIST Special Publication, pages 534–538, 2002.
[117] R. White, I. Ruthven, and J. Jose. The use of implicit evidence for relevance
feedback in web retrieval. Advances in Information Retrieval, pages 449–
479, 2002.
[118] S. Wu, Y. Li, and Y. Xu. Deploying approaches for pattern refinement in
text mining. In Proceedings of the 6th IEEE Conference on Data Mining
(ICDM ’06), pages 1157–1161, 2006.
[119] S. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen. Automatic pattern-taxonomy
extraction for web mining. In Proceedings of IEEE/WIC/ACM Interna-
tional Conference on Web Intelligence (WI ’04), pages 242–248, 2004.
[120] D. Xin, X. Shen, Q. Mei, and J. Han. Discovering interesting patterns
through user’s interactive feedback. In Proceedings of the 12th ACM
SIGKDD international conference on Knowledge discovery and data min-
ing, pages 773–778. ACM, 2006.
[121] X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset patterns:
a profile-based approach. In Proceedings of the eleventh ACM SIGKDD
international conference on Knowledge discovery in data mining, pages 314–
323. ACM, 2005.
[122] H. Yang and C. Liu. A new standard of on-line customer service process:
Integrating language-action into blogs. Computer Standards & Interfaces,
31(1):227–245, 2009.
[123] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML ’97), pages 412–420. Morgan Kaufmann, 1997.
[124] Y. Yao. On modeling data mining with granular computing. In Proceedings of the 25th Annual International Computer Software and Applications Conference (COMPSAC 2001), pages 638–643. IEEE, 2001.
[125] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and
redundancy. The Journal of Machine Learning Research, 5:1205–1224, 2004.
[126] D. Yuan, D. Liu, S. Shen, and P. Yan. Improved semantic retrieval method
based on domain ontology. In Proceedings of the 6th international confer-
ence on Fuzzy systems and knowledge discovery-Volume 1, pages 207–211.
IEEE Press, 2009.
[127] L. Zadeh. Web intelligence and world knowledge: the concept of Web IQ (WIQ). In IEEE Annual Meeting of the Fuzzy Information Processing Society (NAFIPS ’04), volume 1, pages 1–3. IEEE, 2004.
[128] M. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372–390, 2000.
[129] M. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1):31–60, 2001.
[130] M. J. Zaki. Generating non-redundant association rules. In Proceedings of
the sixth ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 34–43. ACM, 2000.
[131] L. Zhang, Y. Yu, J. Zhou, C. Lin, and Y. Yang. An enhanced model
for searching in semantic portals. In Proceedings of the 14th international
conference on World Wide Web, page 462. ACM, 2005.
[132] N. Zhong. Representation and construction of ontologies for Web in-
telligence. International Journal of Foundations of Computer Science,
13(4):555–570, 2002.
[133] N. Zhong, Y. Li, and S. Wu. Effective pattern discovery for text mining.
IEEE Transactions on Knowledge and Data Engineering, 24(1):30–44, 2012.
[134] X. Zhou, S. Wu, Y. Li, Y. Xu, R. Lau, and P. Bruza. Utilizing search intent in topic ontology-based user profile for web mining. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI ’06), pages 558–564, 2006.
[135] A. Zhu, J. Pedro, and T. Cunha. Pruning the Volterra series for behavioral modeling of power amplifiers using physical knowledge. IEEE Transactions on Microwave Theory and Techniques, 55(5):813–821, 2007.