INFORMATION RETRIEVAL ACROSS MULTIPLE
INFORMATION SOURCES USING A KNOWLEDGE BASED
METHODOLOGY
A THESIS
SUBMITTED TO THE DEPARTMENT OF CIVIL AND ENVIRONMENTAL
ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF ENGINEER
Siddharth Taduri
March 2012
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/jx742kr9947
© 2012 by Siddharth S Taduri. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
Approved for the department.
Kincho Law, Adviser
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this thesis in electronic format. An original signed hard copy of the signature page is on file in University Archives.
ABSTRACT
Recent years have seen tremendous growth in research and development in
science and technology, and an increasing emphasis on obtaining Intellectual
Property (IP) protection for one’s innovations. Information pertaining to IP for science and
technology is siloed into many diverse sources and consists of laws, regulations,
patents, court litigations, scientific publications, and more. Although a great deal of
legal and scientific information is now available online, the scattered distribution of
the information, combined with the enormous sizes and complexities, makes any
attempt to gather relevant IP-related information on a specific technology a daunting
task. In this thesis, we develop a knowledge-based software framework to facilitate
retrieval of patents and related information across multiple diverse and uncoordinated
information sources in the US patent system. The document corpus covers issued US
patents, court litigations, scientific publications, and patent file wrappers in the
biomedical technology domain.
A document repository is populated with issued US patents, court cases,
scientific publications, and file wrappers in XML format. Parsers are developed to
automatically download documents from the information sources, extract metadata
and textual content from the downloaded documents, and populate the XML
repository. A text index is built over the repository using Apache Lucene to
facilitate search and retrieval of documents.
Based on the document repository, the underlying methodology to search across
multiple information sources in the patent system is discussed. The methodology is
divided into two major parts. First, we develop a knowledge-based query expansion
methodology to tackle domain terminological inconsistencies in the documents.
Relevant knowledge is retrieved from external sources such as domain ontologies.
Since our goal is to retrieve a collection of relevant documents across multiple
sources, we develop a patent system ontology to provide interoperability between the
different types of documents and to facilitate information integration. We discuss the
Information Retrieval (IR) framework which combines the knowledge-based query
expansion methodology with the patent system ontology to provide a multi-domain
search methodology. A visualization tool based on term co-occurrence is developed
that can be used to browse the document repository through class hierarchies of
domain ontologies.
The knowledge-based query expansion methodology is evaluated through formal
measures such as precision and recall. A simple term-based search is used as a
baseline reference for comparison. Results from related works are also used for
comparison. A series of common questions asked during patent prior art
searches and infringement analysis are generated to evaluate the patent system
ontology. A summary of the results and analysis is provided.
ACKNOWLEDGEMENTS
First and foremost, I would like to express my deepest thanks to my advisor Prof.
Kincho H. Law for providing me with a wonderful opportunity in the form of this
project. His continued support, patience, and belief got me through my graduate
studies at Stanford. He is a role model to me and continues to inspire decisions I make
in life.
I would like to thank Prof. Jay Kesan, School of Law at the University of Illinois
at Urbana-Champaign, and Dr. Gloria Lau, Consulting Associate Professor at Stanford
University, for their constant guidance and comments on this project. Their experience
has been an invaluable resource to me. I would also like to thank my uncle Dr.
Sudarsan Rachuri for helping me make informed career choices over the years.
My stay at Stanford has made me realize the importance of family more than ever
before. I would like to thank all my family, especially my parents, sister, and brother-
in-law whose motivation immensely helped me get here. I wish I could show my work
to my late grandfather, who spoke about technology and entrepreneurship years ago
when I could barely spell the words.
My office mates, former and present members of the Engineering Informatics
Group, have made the long hours spent at the office enjoyable. I would like to thank
Vladimir Fedorov, Zan Chu, Kay Smarsly, Baryam Aygun, and Jinkyoo Park. My
close friends Varun Sheth, Khushnuma Irani, Smit Shah, Gautham Sista, Saurabh
Saraf, Reuben Joseph, Siddharth Ahuja, and Siddharth Kumar have been the closest to
family I have had here and I would like to thank them for their constant
encouragement.
I would also like to thank the university and library staff, especially Kim Vonner,
Brenda Sampson, and Jill Nomura for all the help they have offered. This research is
partially supported by the National Science Foundation, Grant Number 0811460, and
by the Information Technology Laboratory at the National Institute of Standards and
Technology. Any opinions and findings are those of the author, and do not necessarily
reflect the views of the National Science Foundation or the National Institute of
Standards and Technology.
TABLE OF CONTENTS
Abstract ...................................................................................................................... iv
Acknowledgements ...................................................................................................... vi
Table of Contents ....................................................................................................... viii
List of Tables .............................................................................................................. xiii
List of Figures ............................................................................................................. xv
Chapter 1. Introduction ............................................................................................. 1
1.1 Motivation and Problem Statement ............................................................ 1
1.2 Goals of this Research ................................................................................ 4
1.3 Background and Related Research ............................................................. 6
1.3.1 Background on the Patent System ............................................... 6
1.3.2 Related Work ............................................................................... 7
1.4 Thesis Outline............................................................................................. 9
Chapter 2. Document Repository ............................................................................ 12
2.1 Introduction .............................................................................................. 12
2.2 Use Case ................................................................................................... 14
2.3 Document Collection and Parsing ............................................................ 16
2.3.1 Patents ........................................................................................ 17
2.3.2 Court Cases ................................................................................ 20
2.3.3 Publications ................................................................................ 24
2.3.3.1 Identifying Ground Truth from TREC Corpus .......... 27
2.3.4 File Wrappers ............................................................................. 28
2.4 Evaluation and Accuracy.......................................................................... 32
2.4.1 Evaluation of the Extracted Patent Data .................................... 33
2.5 Text Index................................................................................................. 34
2.5.1 Vector Space Model ................................................................... 35
2.5.2 TF-IDF ....................................................................................... 36
2.5.3 Fields and Schema ..................................................................... 37
2.5.4 Solr ............................................................................................. 38
2.6 Related Work ............................................................................................ 38
2.6.1 Interoperability, Information Frameworks and Semantic
Web ............................................................................................ 39
2.6.2 Digital Repositories ................................................................... 39
2.6.3 Document Parsing and Information Extraction ......................... 40
Chapter 3. Methodology ........................................................................................... 42
3.1 Introduction .............................................................................................. 42
3.2 Bio-Ontologies ......................................................................................... 45
3.2.1 Query Expansion: General Form ............................................... 50
3.2.2 Effects of choosing the right ontology ....................................... 54
3.2.3 Effects of Indexing Parameters .................................................. 56
3.2.4 Scope of the Query Terms ......................................................... 58
3.2.5 Interactive Model for Visualization ........................................... 59
3.3 Patent Ontology ........................................................................................ 60
3.3.1 Defining Scope of the Ontology ................................................ 62
3.3.2 Conceptualization ...................................................................... 65
3.3.3 Populating the Ontology ............................................................ 70
3.3.4 Using the Declarative Syntax: Expressing Queries and
Developing Rules ....................................................................... 71
3.3.4.1 Expressing Competency Questions as
SPARQL queries ....................................................... 71
3.3.4.2 Expressing Heuristics as Rules .................................. 72
3.4 IR Framework........................................................................................... 74
3.4.1 Implementation Details .............................................................. 78
3.5 Related Work ............................................................................................ 79
3.5.1 Knowledge-Based IR ................................................................. 80
3.5.2 Other Approaches to IR ............................................................. 80
3.5.3 Ontology Development and Interoperability ............................. 81
Chapter 4. Performance Evaluation ....................................................................... 83
4.1 Introduction .............................................................................................. 83
4.2 Background and Related Work ................................................................ 84
4.2.1 Evaluation Metrics ..................................................................... 84
4.2.2 SPARQL .................................................................................... 85
4.3 Knowledge-Based Methodology using Bio-Ontologies........................... 86
4.3.1 Baseline ...................................................................................... 87
4.3.2 Query Expansion ........................................................................ 89
4.3.2.1 Query Expansion for Retrieval of Patent
Documents ................................................................. 91
4.3.2.2 Query Expansion for Retrieval of Scientific
Publications ............................................................... 95
4.4 Evaluating Patent System Ontology and IR Framework........................ 101
4.4.1 Use Case Scenario: Patent Prior Art Search ............................ 102
4.4.2 Use Case Scenario: File Wrapper Example ............................. 107
4.4.3 Other Benefits of the Patent System Ontology ........................ 112
4.5 Summary ................................................................................................ 113
Chapter 5. Conclusion and Future Work ............................................................. 116
5.1 Summary ................................................................................................ 116
5.2 Future Work ........................................................................................... 118
5.2.1 Digital Repositories ................................................................. 118
5.2.2 User Relevancy Feedback ........................................................ 119
5.2.3 Query Expansion, Semantic Indexing and Other
Methodologies .......................................................................... 120
5.2.4 Scaling to More Applications, More Data Sources, and
More Subject Domains ............................................................ 121
Bibliography .............................................................................................................. 122
LIST OF TABLES
Number Page
Table 2.1: Patent XML Element Descriptions ....................................................... 21
Table 2.2: Field-by-Field Accuracy of Extracted Patent Data .............................. 34
Table 3.1: Summary of the Selected Biomedical Ontologies ................................ 47
Table 3.2: Effect of the Distance between Search Clauses ................................... 59
Table 3.3: Expressing Competency Questions in SPARQL .................................. 72
Table 3.4: Expressing SWRL rules ....................................................................... 73
Table 4.1: Baseline Reference: Rank of Core Patents ........................................... 88
Table 4.2: Baseline Reference for Evaluating the Query Expansion
Methodology ......................................................................................... 89
Table 4.3: Change in Average Rank of Core Patents with Level of
Expansion ............................................................................................. 93
Table 4.4: Precision and Average Rank of Core Patents for Fielded
Search on Patent Documents ................................................................ 95
Table 4.5: Pre-Processed Queries to Evaluate Query Expansion on
Scientific Publications .......................................................................... 97
Table 4.6: Precision for Results Obtained by Querying Patent System
Ontology for Documents Related to a Set of Inventors,
Assignees or US Classification .......................................................... 106
LIST OF FIGURES
Number Page
Figure 2.1: Sample Patent Document ..................................................................... 19
Figure 2.2: Sample Patent XML Document ............................................................ 20
Figure 2.3: Sample Court Case Document .............................................................. 23
Figure 2.4: Sample Court Case XML Document .................................................... 24
Figure 2.5: Sample Publication in XML ................................................................. 26
Figure 2.6: Contents of a File Wrapper ................................................................... 30
Figure 2.7: Sample Rejection Letter (Office Action) ............................................. 31
Figure 2.8: Sample Interference Document ............................................................ 32
Figure 2.9: Sample File Wrapper in XML .............................................................. 33
Figure 2.10: Cosine Similarity in VSM .................................................................... 36
Figure 3.1: The Importance of Domain Knowledge in Retrieving
Scientific Publications .......................................................................... 48
Figure 3.2: The Importance of Domain Knowledge in Retrieving Patent
Documents ............................................................................................ 49
Figure 3.3: Query Expansion along MeSH Hierarchy to Retrieve Relevant
Documents ............................................................................................ 51
Figure 3.4: Relations in Domain Ontologies .......................................................... 52
Figure 3.5: General Form of the Expanded Query .................................................. 54
Figure 3.6: Comparison between Multiple Biomedical Ontologies ....................... 55
Figure 3.7: Visualizing Concept Co-occurrences using MINOE ........................... 60
Figure 3.8: Conceptual View of Patent Documents ................................................ 66
Figure 3.9: Conceptual View of Court Case ........................................................... 66
Figure 3.10: Events Contained in a File Wrapper ..................................................... 67
Figure 3.11: Excerpt from the Patent System Ontology: Rejection class ................. 68
Figure 3.12: Top Level Ontology for the Patent System .......................................... 69
Figure 3.13: Cross-Referencing between Documents in the Patent System ............. 69
Figure 3.14: Populating the Patent System Ontology ............................................... 70
Figure 3.15: Expressing Heuristics through Rules in Patent System
Ontology ............................................................................................... 74
Figure 3.16: IR Framework ....................................................................................... 75
Figure 3.17: Example to Illustrate IR Framework .................................................... 77
Figure 3.18: Current Implementation of the IR Framework Methodology .............. 79
Figure 4.1: Average Precision and Recall for Query Expansions on Patent
Documents ............................................................................................ 91
Figure 4.2: Comparison between use of Multiple Ontologies vs.
Individual Ontologies ........................................................................... 94
Figure 4.3: Effect of Depth of Query Expansion on Retrieval of Scientific
Publications .......................................................................................... 98
Figure 4.4: Performance of Query Expansion on Individual Topics ...................... 98
Figure 4.5: Number of Query Terms with Increasing Depth of Query
Expansion ............................................................................................. 99
Figure 4.6: SPARQL Query to Retrieve Court Cases Related to
Erythropoietin ..................................................................................... 103
Figure 4.7: SPARQL Query to Retrieve Patents Involved in Court Cases
Related to Erythropoietin ................................................................... 104
Figure 4.8: SPARQL Query to Extract US Patent Classification, Names of
Assignees and Inventors from Patent Documents .............................. 105
Figure 4.9: SPARQL Query to Extract Patent Documents Related to a Set
of Inventors, Assignees and/or US Patent Classification ................... 106
Figure 4.10: Querying Patent System Ontology for Backward Citations ............... 107
Figure 4.11: SPARQL Query to Display Contents of a File Wrapper,
Ordered by the Date ............................................................................ 108
Figure 4.12: SPARQL Query to Extract the Text of Claims from the
Original Patent Application ................................................................ 109
Figure 4.13: Class View of Patent Examiner’s Restriction in File Wrapper
for US Patent 5,955,422 ..................................................................... 110
Figure 4.14: Example to Illustrate a Simple Rule-Based Similarity Measure ........ 112
Chapter 1.
INTRODUCTION
1.1 MOTIVATION AND PROBLEM STATEMENT
Recent years have seen tremendous growth in research and development in
science and technology, and an increasing emphasis on obtaining Intellectual
Property (IP) protection for one’s innovations. IP is an important asset of any
organization. In a study of over 9,000 European patents granted between 1993 and
1997, the median value of a patent was estimated to be EUR 300,000, with 10% of
the owners reporting a value of EUR 10 million or more.1 Clearly, any company or
inventor would want to protect the rights to use, make, or sell their invention.
Throughout the lifetime of a patent, from its initial filing through issuance to
disputes and litigations, the patent system is constantly searched for information.
Information pertaining to IP and the patent
system for science and technology is siloed into many diverse sources and consists of
laws, regulations, patents, court litigations, scientific publications, and more. Although
a great deal of legal and scientific information is now available online, the scattered
distribution of the information, combined with the enormous sizes and complexities,
makes any attempt to gather relevant IP-related information on a specific technology a
1 Study on Evaluating the Knowledge Economy – What are Patents Actually Worth?
http://ec.europa.eu/internal_market/indprop/docs/patent/studies/patentstudy-report_en.pdf
(Accessed on 03/01/2012).
daunting task. Currently, the task of gathering IP-related information is performed
manually and is both laborious and expensive. This burden falls disproportionately
on smaller firms, start-ups, and individual inventors who have very limited
resources. In this
thesis, we develop a methodology to facilitate retrieval of patents and related
information across multiple diverse and uncoordinated information sources in the US
patent system. The following scenarios illustrate some of the issues faced with the
current patent system:
A company looking to patent its technology on medical imaging devices, for
example, is required to perform an initial patentability search and establish the
usefulness, novelty, and non-obviousness of the technology [1]. The
patentability search involves a thorough study of prior art including scientific
literature and patent databases, competitor analysis, existing litigations to
similar technologies, and regulations issued by government agencies such as
the Federal Drug Agency (FDA) (or any agency enforcing laws with respect to
medical imaging devices and related technologies).
Similar to the patent applicant, a patent examiner performs a patentability search
when examining an application. As of 2009, the United States Patent and
Trademark Office (USPTO) employed 6,242 patent examiners and received
456,106 utility patent applications.2 Roughly, this translates to around 73
applications per examiner annually, or approximately 1.4 applications per
examiner per week. Although a patent examiner is generally well versed in the
technological domain of the patent application, this workload imposes a serious
time constraint during the review process. Hence, each application receives less
time, potentially leading to incomplete examination and possibly infringement or
invalidation at a later stage.
2 The USPTO’s annual statistics can be accessed at
http://www.uspto.gov/web/offices/ac/ido/oeip/taf/reports.htm (Accessed on 03/01/2012)
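The per-examiner rates above are simple division over the quoted USPTO figures, and can be sanity-checked directly:

```python
# 2009 USPTO figures quoted in the text above
examiners = 6242
utility_applications = 456106

per_examiner_per_year = utility_applications / examiners
per_examiner_per_week = per_examiner_per_year / 52

print(round(per_examiner_per_year, 1))  # ~73.1 applications per examiner per year
print(round(per_examiner_per_week, 1))  # ~1.4 applications per examiner per week
```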
To protect their IP, companies may perform an infringement analysis to ensure that
a particular patent’s rights are not being infringed. The consequences of an
infringement can be severe and result in heavy losses. For example, Microsoft Inc.
paid a settlement of US $521 million to Eolas Inc. over a single patent in 2007.3
Other notable settlements include the litigations between Medtronic and
Michelson (US $1.57 billion)4 and between Kodak and Polaroid (US $909
million)5. Infringement analysis involves a thorough search of the issued patents
database, the patent application database, prior court litigations, regulations, and
any form of documented evidence to help assert the infringement or invalidate
a patent’s claim as a defensive measure.
Irrespective of the scenario, whether a company intends to patent its technology or
to perform an infringement analysis, or a patent examiner intends to perform a
patentability search, several questions arise:
What are the issued patents in related technologies?
What is the legal scope of similar patents?
Who are the competitors?
Have any similar patents been challenged in court?
How can one work around the existing body of knowledge?
Is there any scientific literature, or are there regulations, which can potentially be
used to challenge and invalidate a patent’s claims?
3 The details of the settlement between Eolas Inc. and Microsoft Inc. can be viewed at
http://en.wikipedia.org/wiki/Eolas (Accessed on 03/01/2012).
4 The details of the settlement between Medtronic and Michelson can be viewed at
http://www.nytimes.com/2005/04/23/business/23medronic.html (Accessed on 03/01/2012).
5 The details of the settlement between Kodak and Polaroid can be viewed at
http://articles.latimes.com/1990-10-13/business/fi-1997_1_instant-photography (Accessed on
03/01/2012).
These questions cannot be answered from any single information source. An
integration framework is needed to enable the retrieval of relevant information from
diverse sources. In this thesis, we explore a knowledge-based approach to address two
fundamental information integration issues – (a) the lack of interoperability among the
information sources in the current patent system; and (b) the varying information
needs by the users of the patent system. The work presented will positively impact
small businesses and independent researchers, as well as professionals such as
lawyers and patent examiners, with the potential to influence the use of IT in the
current patent system.
1.2 GOALS OF THIS RESEARCH
The objective of this research is to develop a methodology that can facilitate the
retrieval of patent related information from heterogeneous sources. To limit the scope
of the research, we focus on the technology space in biomedicine; i.e. patents,
scientific publications, laws and regulations that broadly fall under the area of
biomedicine. The heterogeneous nature of the information sources results in
differing language conventions, terminology, and publication formats. In fact, the
documents belonging to the various information sources are almost entirely different.
In order to understand the challenges that are currently faced in gathering and
searching the patent system, our first step is to construct a document repository. We
study the current publication formats, structure, and style of language and identify the
critical elements in each information source. Our corpus consists of patent documents,
court litigations, and scientific publications related to a biomedical use case –
‘erythropoietin’. The corpus also includes a patent file wrapper, which is a collection
of all documents and communication between the patent applicant and the patent
office during the application phase of a patent. Since XML is a stable format for
representing structured and semi-structured information, the documents are
appropriately parsed and stored as XML files. The repository is made searchable
using Apache Lucene, a text indexing and search library [5].
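As a rough illustration of the parse-and-store step, the sketch below serializes fields extracted by a parser into an XML record using only the Python standard library. The element names (`patent`, `number`, `title`, `text`) are hypothetical stand-ins, not the thesis’s actual repository schema:

```python
import xml.etree.ElementTree as ET

def patent_record_to_xml(metadata, body_text):
    """Serialize extracted patent fields into a simple XML record.

    The schema here (a <patent> root with one child element per metadata
    field plus a <text> element) is illustrative only.
    """
    root = ET.Element("patent")
    for field, value in metadata.items():
        ET.SubElement(root, field).text = value
    ET.SubElement(root, "text").text = body_text
    return ET.tostring(root, encoding="unicode")

# Example record; the patent number is taken from a figure caption in the thesis,
# the remaining field values are invented for illustration.
xml_doc = patent_record_to_xml(
    {"number": "5955422", "title": "Production of erythropoietin"},
    "A purified and isolated DNA sequence ...",
)
```

A text index (Lucene, in the thesis) would then be built over fields of such records rather than over raw downloaded pages.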
Terminological variations such as synonymy and polysemy are a common source
of problems that hinder the effectiveness of traditional term-based Information
Retrieval (IR) methods. We develop a knowledge-based method that uses external
knowledge sources such as domain ontologies to provide the required semantics to
resolve terminological inconsistencies and improve semantic interoperability between
information sources. The study examines current trends, sources, and applications of
domain ontologies. While the primary focus of this research is the use of biomedical
ontologies, multiple domain ontologies spanning both legal and technical domains are
needed in order to achieve information interoperability.
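To make the idea concrete, here is a minimal sketch of ontology-driven query expansion. The tiny ontology fragment is hand-made for illustration; the thesis uses large biomedical ontologies such as MeSH, and the synonym/narrower structure below is a simplification of what such ontologies provide:

```python
# Hand-made ontology fragment: each concept lists its synonyms and its
# narrower (child) concepts. Real biomedical ontologies are far larger.
ONTOLOGY = {
    "erythropoietin": {
        "synonyms": ["EPO", "epoetin"],
        "narrower": ["epoetin alfa", "darbepoetin"],
    },
    "epoetin alfa": {"synonyms": ["Epogen"], "narrower": []},
    "darbepoetin": {"synonyms": ["Aranesp"], "narrower": []},
}

def expand_query(term, depth=1):
    """Expand a query term with its synonyms and, up to `depth` levels,
    its narrower concepts -- so a search for 'erythropoietin' can also
    match documents that only mention 'EPO' or 'Epogen'."""
    entry = ONTOLOGY.get(term.lower())
    if entry is None:
        return [term]
    terms = [term] + list(entry["synonyms"])
    if depth > 0:
        for child in entry["narrower"]:
            terms += expand_query(child, depth - 1)
    return terms
```

The depth parameter corresponds to the "level of expansion" whose effect on retrieval the thesis evaluates later.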
An important step in achieving interoperability is to allow the information sources
to communicate with one another. To achieve this, the information sources must use a
standardized and structured representation for documents. We develop a Patent
System Ontology (PSO) to standardize the representation of the information sources
and achieve interoperability. While the documents are vastly diverse, they are
implicitly cross-referenced, and these cross-references can be used as relevancy
measures between documents. For example, a court document that involves a
particular patent document reveals a high relevancy between the two documents.
Such relevancy
measures are central to our method for multi-source IR and will be discussed in detail
in this thesis.
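A minimal sketch of how such implicit cross-references can serve as a relevancy signal follows. The document identifiers and the links between them are invented for illustration (the patent number is borrowed from a figure caption in the thesis):

```python
# Invented cross-reference data: which documents a given document names
# (a court case involving a patent, a patent citing a publication).
CROSS_REFS = {
    "case:123": ["patent:5955422"],
    "patent:5955422": ["pub:lin1985"],
}

def related_documents(doc_id, hops=2):
    """Collect documents reachable within `hops` cross-reference steps.

    Reachable documents are treated as relevant to the starting document,
    which is the intuition behind using cross-references as a relevancy
    measure across heterogeneous sources.
    """
    seen = {doc_id}
    frontier = [doc_id]
    for _ in range(hops):
        frontier = [target for doc in frontier
                    for target in CROSS_REFS.get(doc, [])
                    if target not in seen]
        seen.update(frontier)
    return seen - {doc_id}
```

In the thesis these links live in the Patent System Ontology and are traversed with SPARQL rather than a hand-rolled graph walk.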
We design an Information Retrieval (IR) framework which integrates the patent
system ontology and the domain ontologies to retrieve a set of related documents
across multiple sources in an iterative manner. Since the potential user base can range
from lawyers to technical professionals, understanding the user’s intent from a single
query becomes challenging. We incorporate user feedback into the framework in order
to capture the user’s true information needs.
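The feedback loop can be sketched as simple query refinement: terms that are frequent in documents the user marks relevant, and absent from the current query, are added for the next iteration. This is a toy, Rocchio-style sketch under those assumptions, not the thesis’s exact algorithm:

```python
from collections import Counter

def refine_query(query_terms, relevant_docs, top_k=2):
    """Extend the query with the top_k most frequent terms from the
    user-marked relevant documents (toy relevance feedback)."""
    counts = Counter(
        term
        for doc in relevant_docs
        for term in doc.lower().split()
        if term not in query_terms
    )
    return list(query_terms) + [term for term, _ in counts.most_common(top_k)]
```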
A fully functional tool is developed based on the proposed IR framework. We
discuss the requirements of such a tool in detail and include several features, such as
faceting, tag clouds, and co-occurrence graphs, to provide a good user experience.
We also extend the visualization module of MINOE, a software tool originally
developed to explore regulations in ocean ecosystems [37], to browse the document
repository through the hierarchies of biomedical ontologies.
The research primarily revolves around the use of knowledge-based
methodologies and information modeling. IR relies heavily on natural language
understanding, data and text mining, and machine learning, which are evolving
quickly and producing promising results. This research provides ample scope to
incorporate such emerging techniques into the framework to further enhance the
quality of the results.
1.3 BACKGROUND AND RELATED RESEARCH
1.3.1 BACKGROUND ON THE PATENT SYSTEM
This section provides some basic but necessary background on the patent system.
The patent system is a two-stage system: the first stage covers the acquisition of
patents, and the second covers their enforcement. In the acquisition phase, a patent
application is examined by the USPTO and is finally issued or rejected based on the
patent examiner’s decision. The prosecution history, also known as the file wrapper,
is documented for that issued patent or application. The various documents involved
in the acquisition phase include patent applications, file wrappers, issued patents,
and any form of prior art such as scientific publications.
The enforcement stage of the patent system comes into play once the patent is issued. In case of infringement of patent claims, the alleged infringer can be tried in court in a patent litigation. The enforcement stage revisits the steps taken in the acquisition stage and can invalidate an entire patent based on its findings. The documents involved in the enforcement stage include patent applications, issued patents, file wrappers, court cases, other forms of prior art including scientific publications, and the appropriate chapters of the United States Code (U.S.C.) and the Code of Federal Regulations (C.F.R.).
1.3.2 RELATED WORK
The patent system contains a wealth of technology related information, distributed
under a regulatory system. Many government agencies are moving toward digital
libraries to publish and archive information [133]. Information Technology (IT) is becoming indispensable to government operations and to facilitating access to government data.
Continuing development of IR methodologies is necessary to keep up with the
information growth. Furthermore, with information being created and managed by
different organizations and agencies, establishing interoperability between the
information is essential [56,57]. Since the patent system covers technical and legal
domains and involves a variety of information sources, a thorough literature review
would require studying the current state-of-the-art IR methods for each source, and in
each domain. Our attempt is to abstract methodologies and recent related works that
are most applicable to facilitating IR in the patent system.
Almost all information sources and documents contain metadata. For example, all documents have a Title, a Date, and so on. The metadata is rather generic and is not tied to a
specific domain. For this reason, IR methods such as link analysis, citation analysis,
bibliographic ranking, and other metadata related approaches are commonly used in
both legal texts and technical texts [43,79,100]. However, such IR mechanisms based
simply on metadata are not sufficient and are typically used in conjunction with term-
based IR models such as the Vector Space Model (VSM) [111]. The VSM, which
represents each unique word in a corpus as a separate dimension, suffers from the
‘curse-of-dimensionality’ in the sense that the high number of dimensions causes data
sparseness and other computational issues [84]. To overcome the ‘curse-of-
dimensionality’ of the VSM, Latent Semantic Indexing (LSI) and its variants such as
the Probabilistic LSI have gained interest in both legal and technical IR communities
[8,29,60,84,135]. Other alternative models include the Okapi Probabilistic Retrieval
Framework and the Divergence From Randomness (DFR) probabilistic model
[4,97,108].
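As a concrete illustration of the dimensionality reduction LSI performs, the following sketch (not code from the thesis) truncates the SVD of a toy term-document matrix; the vocabulary and counts are invented for illustration:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# "epo" is a synonym of "erythropoietin"; documents 2 and 3 use
# both terms together and act as a bridge between them.
A = np.array([
    [1.0, 0.0, 1.0, 1.0],   # erythropoietin
    [0.0, 1.0, 1.0, 1.0],   # epo
    [1.0, 1.0, 0.0, 0.0],   # anemia
])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# LSI: truncate the SVD to the top-k latent dimensions.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_latent = (np.diag(s[:k]) @ Vt[:k, :]).T   # one row per document

sim_raw = cosine(A[:, 0], A[:, 1])             # surface-term similarity
sim_latent = cosine(docs_latent[0], docs_latent[1])
```

Documents 0 and 1 share only the surface term "anemia" (raw cosine 0.5), but in the truncated latent space the dimension separating the two synonyms is discarded and their similarity rises to nearly 1, which is the synonymy-smoothing effect described above.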
The combination of the large amount of information with the lack of standard terminology among the diverse information sources renders simple term-based models ineffective; they cannot sufficiently capture user context and information needs. LSI attempts to capture the relationships between terms in similar
context, addressing issues such as synonymy and polysemy, but the method is not
sufficient to capture the domain content of the information. Domain knowledge,
created by experts of a specific domain, can be valuable to conceptualize and express
the semantics, i.e. the meaning of terms and their relationships. Thus, newer methods
often incorporate the semantics of a domain through external knowledge sources such
as ontologies, taxonomies, vocabularies and thesauri [35,88]. Knowledge-based
approaches are commonly applied to technical information sources such as
publications [35], and are slowly moving into the legal domain as well. In fact, legal
ontologies and general language ontologies such as WordNet are becoming popular in
IR applications [39].
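As an illustration of how such external knowledge can be injected at query time, the following sketch expands a query using synonym and hyponym relations from a hand-made ontology fragment. The fragment and its relation names are invented for illustration; a real system would draw these relations from resources such as WordNet or BioPortal:

```python
# A tiny hand-made ontology fragment (illustrative only).
ONTOLOGY = {
    "erythropoietin": {
        "synonyms": ["epo", "epoetin"],
        "hyponyms": ["epoetin alfa"],
    },
}

def expand_query(terms, ontology):
    """Add synonyms and narrower terms for any query term found
    in the ontology, keeping the original terms first."""
    expanded = list(terms)
    for t in terms:
        entry = ontology.get(t.lower())
        if entry:
            for rel in ("synonyms", "hyponyms"):
                for related in entry.get(rel, []):
                    if related not in expanded:
                        expanded.append(related)
    return expanded

query = expand_query(["erythropoietin", "anemia"], ONTOLOGY)
```

The expanded query now also matches documents that mention only "epo" or "epoetin alfa", addressing the synonymy and hyponymy problems discussed above.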
Most research in IR has focused on a single domain (such as biomedicine) and a
single information source (such as scientific publications). The nature of the problem
we are addressing demands that information be retrieved from multiple sources and
domains. The problem of IR from multiple diverse information sources has been studied [8,9,84]. In order to facilitate multi-source IR, both structural and semantic interoperability are required [115,131]. Semantic interoperability can be achieved through the use of several domain, legal, and general-purpose ontologies [57,103]. Studies have made use of top-level ontologies to provide structural interoperability between sources [131].
Automatic ontology learning techniques aim to extract relevant concepts and their
relationships from a corpus. Recent interest in ontologies has led to research in the
area of automatic ontology learning [103,139]. Unlike biomedicine, most technical domains lack domain ontologies that sufficiently cover their sub-topics. Hence, automatic ontology learning techniques can potentially be used to learn concepts and relationships where domain knowledge is sparse.
Patents and patent statistics provide valuable information. For example, they can be used as indicators of technological growth and change, knowledge flows, and the position of companies and organizations in a technological space [52]. Similarly, data analysis, data mining, and machine learning are relevant in IR
research. Some of the more important methodologies include NLP techniques such as
feature extraction and statistical parsing [14,40,75]. Feature extraction has been used
to identify concepts such as genes, person names, etc., that can provide important
information that can be incorporated into search mechanisms. Although it is believed that NLP techniques cannot be easily applied to legal texts [21], studies are attempting to parse patent claims and other legal corpora to extract important phrases, identify dependencies between terms, and facilitate machine translation [114].
1.4 THESIS OUTLINE
In this research, we address the problem faced in accessing information across
multiple diverse yet related sources in the US patent system. We discuss our
methodology to improve information source interoperability through the use of
ontologies. We present an IR framework to improve retrieval of related documents
from the patent system. The work also presents a preliminary study on user relevancy
and user experience through the implementation of a fully functional tool built upon
the IR framework. The thesis is organized as follows:
Chapter 2 discusses the development of the document repository. We
specifically deal with four types of documents – (1) issued patents; (2) court
cases; (3) patent file wrappers; and (4) scientific publications. We present a
detailed description of the challenges faced in gathering information and the current state-of-the-art tools used to access and parse relevant information from the documents. A thorough evaluation of the document repository is provided to
ensure the accuracy of the parsed information. The chapter also discusses the
text indexes and schemas which are used throughout this study for inspection
and analysis.
Chapter 3 discusses our methodology in three parts. The first part explores the
use of domain ontologies to tackle terminological inconsistencies and
incorporate domain specific semantics to improve access and retrieval within a
single information source. The effects of ontology selection, document
structure, and indexing schemes on the methodology are explained. The second
part discusses our Patent System Ontology (PSO), designed to improve
structural interoperability between information sources. One of the key
contributions of the PSO is the ability to express cross-referenced information
and user heuristics using declarative syntax. In the third part, we combine the
application of domain ontologies and the PSO to illustrate a powerful
methodology for improving information access and retrieval in the patent
system. We briefly discuss user relevancy feedback techniques and attempt to
illustrate the importance of good user experience through a well-designed tool.
Chapter 4 presents an evaluation and analysis of our methodology. Through
several use case scenarios, the practical applications and the potential impacts
of this multi-disciplinary research are discussed. The analysis provides a solid
foundation for potential future work and studies the requirements to develop a
valuable exploratory tool.
Chapter 5 summarizes the contents of this thesis and discusses the broader
impacts of the research.
Chapter 2.
DOCUMENT REPOSITORY
2.1 INTRODUCTION
The complexity of the patent system makes retrieval of relevant information a
challenging task. The existence of multiple information silos results in heterogeneity
in almost every aspect of the documents – in structure, in semantics, in format and in
system [115]. Information pertaining to one type of document may be available from
several sources within the same silo. For example, the information silo representing scientific literature could comprise repositories such as IEEE Xplore, ACM, and PubMed [2,64,105]. Similarly, patent documents can be accessed from multiple
sources such as the USPTO and Google Patents [50,129]. Modern-day applications demand a high degree of integration between information sources to facilitate cross-domain Information Retrieval (IR). As explained in Chapter 1, there is a lack of a standardized framework that facilitates information integration and the development of
tools to improve accessibility and retrieval of documents. The first step toward developing such a framework involves studying state-of-the-art publishing standards and accessibility tools, which is best learned by constructing a representative document repository that includes the diverse information sources we are addressing. In this chapter, we describe the development of the document repository from various information silos in the patent system, which will also serve as our experimental data set for the development and evaluation of our IR framework.
Our main goal is to develop a document repository containing a collection of related documents that encompasses (a) issued patent documents; (b) scientific publications; (c) court litigations; and (d) USPTO file wrappers. To our knowledge,
currently no such data set spanning the patent system readily exists. The chapter presents a thorough discussion exposing the structural diversity, inconsistencies in publication standards, and accessibility of these information sources, and lays the foundation for the development of our methodology discussed in Chapter 3.
The rest of the chapter is organized as follows: Section 2.2 describes our use case
and the contents of the repository. Section 2.3 discusses the challenges associated with
interfacing and accessing the information sources and our methodology to collect the
documents. The documents are typically lengthy and contain a large amount of information. In practice, applications seldom use the entire content of a document.
By discussing examples of end user applications, relevant portions of the documents
are identified. We parse this information and convert the documents to a common
structured format. We choose XML to store the parsed information due to the
abundance of supporting software libraries and parsing tools.6 Section 2.4 presents a
formal evaluation of the data set to ensure usability. The document repository is
implemented using Apache Lucene, a widely used text indexing and search library. Using a Java interface, the XML files created in Section 2.3 are used to build and search the
document repository. Section 2.5 describes the text indexes implemented using
Lucene, which serve as the basis for the preliminary evaluations of our methodology.
Section 2.6 summarizes related work and discusses potential future directions.
6 For a list of XML parsers in Java, see – http://en.wikipedia.org/wiki/Java_XML (Accessed on
03/01/2012).
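The repository itself is built with Apache Lucene through a Java interface; the core data structure such an index relies on can be sketched as a simple inverted index. The snippet below is an illustrative Python sketch, not the thesis implementation, and its document identifiers are invented:

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to the set of document ids containing it,
    the core structure behind a Lucene-style text index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, *terms):
    """Conjunctive (AND) query over the inverted index."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {
    "US5955422": "production of erythropoietin",
    "case-927F2d1200": "amgen erythropoietin patent infringement",
}
hits = search(build_index(docs), "erythropoietin", "patent")
```

A real Lucene index adds tokenization rules, term statistics, and scoring on top of this structure, but lookups follow the same term-to-documents principle.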
2.2 USE CASE
Recent advances in the biomedical domain have led to the creation of several
external and manually curated knowledge-bases and ontologies, far more than most
other disciplines. This prompted us to choose our use case in the biomedical domain,
since it will give us immediate access to the existing knowledge to implement our
knowledge-based approach. These recent advancements are also reflected in the patent system, as evident from the increased number of patent applications, scientific publications, and court activities. For example, in 2010, among the 219,614 patents granted by the USPTO, 21,840 were roughly classified as chemical patents (an increase of 5,672 patents from the previous year);7 MEDLINE, a database which consists of over 21 million citations from over 5,000 journals, accepted 38 new journal titles between June 2011 and October 2011.8
We build the document repository around the concept of erythropoietin, a hormone responsible for the production of red blood cells in the human body. The synthetic production of erythropoietin has enabled the treatment of chronic conditions such as anemia. Epogen, the product brand of synthetic erythropoietin manufactured by the pharmaceutical giant Amgen Inc., is protected by five core patents, namely US 5,547,933, US 5,618,698, US 5,621,080, US 5,756,349, and US 5,955,422. These patents have been central to many related court cases involving other pharmaceutical companies such as Hoechst Marion Roussel and Transkaryotic Therapies, and they heavily cite scientific literature from top journals.
In order to compensate for terminological inconsistencies caused by synonymy, hyponymy, and abbreviations, we identified 43 concepts related to
7 USPTO statistics can be accessed at http://www.uspto.gov/web/offices/ac/ido/oeip/taf/stchem.htm (Accessed on 03/01/2012).
8 Information regarding newly accepted journal titles in MEDLINE can be accessed at http://www.nlm.nih.gov/bsd/lstrc/new_titles.html (Accessed on 03/01/2012).
erythropoietin (“the 43 concepts” hereafter) by searching bio-ontologies in BioPortal for synonyms, subclasses, and superclasses [95]. As of January 2010, we downloaded the top 50-100 documents for each of the 43 concepts from the USPTO issued patent database, collecting a total of 1150 patent documents. These documents cover various aspects of erythropoietin, such as its production, usage, and related procedures. Using the 43 concepts to gather the data set thus provides coverage broad enough for the erythropoietin use case. Among these 1150 patent documents, we identified 135 highly relevant patent documents (“the 135 patents” hereafter) by following the forward and backward citations of the five core patents. The 135 patents serve as our ground truth for the experiments that follow.
In order to gather court documents, we searched several court litigations dating back to the 1980s. The repository contains 30 court documents (“the 30 court documents” hereafter) which directly or indirectly involve Amgen Inc. and the five core patents. This search was performed on Google Scholar using erythropoietin and the 43 concepts [51].
PubMed is a comprehensive index of over 5,000 biomedical journals, covering over 20 million MEDLINE citations [85,105]. In building our approach toward multiple
information source retrieval, we also wish to study the application of biomedical
ontologies in each individual document domain. For this purpose, we would like to
have a comprehensive biomedical dataset that we can experiment on. The Text
Retrieval Conference (TREC) organized by the National Institute of Standards and
Technology (NIST) is a well-known and prestigious competition that produces high
quality datasets every year.9 The TREC 2007 genomics data set consists of over
9 Information regarding the Text Retrieval Conference (TREC) can be accessed at -
http://trec.nist.gov/ (Accessed on 03/01/2012).
162,000 scientific publications from 49 prominent biomedical journals.10 The data set provides a well-defined ground truth for experimentation, with around 36 topics representing varying information needs. However, in building the dataset for our use case of multi-domain retrieval, we must first identify documents related to the use case. We listed over 3000 publications (“the 3000+ publications” hereafter) by following citations from the 135 patents; these serve as the ground truth in the publication domain. Out of the 3000+ publications, we identified around 1737 in the TREC dataset.
Section 2.3.3 provides a more detailed explanation of how this mapping is made.
A patent document is the outcome of years of negotiations between the patent
office and the applicant. All the negotiations, including the original application, office
actions, amendments, and the final issued patent document are bundled together in a
file history or a file wrapper. File wrappers provide very detailed information about the patent, including the original claims, the final claims, and the added and deleted citations. Such information is critical in defining the scope and validity of the patent, especially during the litigation or enforcement phase. Due to the logistics involved in gathering file wrappers (described in detail in Section 2.3.4), our corpus currently includes only one file wrapper, for the core patent U.S. 5,955,422.
All in all, this document repository represents the unique problem in IR involving
multiple information sources in the patent system and provides an experimental
platform for developing and evaluating our methodology.
2.3 DOCUMENT COLLECTION AND PARSING
A quick study of the information sources reveals several inconsistencies in terms
of publication standards, document structure, and accessibility. For example – (1)
10 The 2007 TREC Genomics track can be accessed at -
http://ir.ohsu.edu/genomics/2007data.html (Accessed on 03/12/2012).
PubMed provides scientific publications in well-defined XML files while USPTO
provides issued patents as HTML files; (2) PubMed provides APIs or web services to
access information while USPTO still lacks such interfaces; and so on. In this section,
we explore available sources for documents, their publication standards and available
web services to programmatically access data. Whether we deal with patents or scientific publications, each document contains a wealth of information. Applications seldom use all of the information available in a document, but rather use specific, much smaller portions such as the metadata or simply the title. Viewed from an application's (or user's) point of view, it becomes clear which aspects of the documents are crucial. This gives us an estimate of how much metadata and textual information we need to parse from the documents to make the data set useful.
We study the structure of the documents and develop parsers to extract information
from the documents. We place the extracted data from the documents in well-defined
XML files in order to maintain a consistent format.
2.3.1 PATENTS
There are over 41 different patent issuing authorities across the world, including
the European, Japanese, and German Patent Offices [38,44,68]. The Derwent World
Patents Index (DWPI) is one of the largest patent databases with documents indexed
from 41 patent-issuing authorities [35]. HeinOnline, LexisNexis and WestLaw are
other libraries for IP related legal information [58,78,134]. Google now makes all
USPTO products freely available online [49]. Thomson Innovation and Dialog LLC
provide tools to help in information mining of patent documents and other scientific
literature through services such as Delphion and Web of Science [123,124]. Our
current focus involves only the USPTO. The USPTO maintains a public database for
issued patents, patent applications, copyrights, and trademarks. There are currently
over 7 million patents issued in the US. In 2009 alone, 485,312 patent applications
were filed with the USPTO. Proprietary websites like Delphion do not allow
automated downloading or crawling of patent documents. Moreover, since any new document is first published by the USPTO, we use the USPTO as our source for patent documents.
To our knowledge, the USPTO does not provide a standard API to access and download documents. However, patents can be downloaded as HTML pages using a simple script based on wget.11 Currently, we do not download images or figures, as our methodology focuses primarily on text. It must be noted that the USPTO maintains full text only for documents issued after 1973. If necessary, patents issued prior to 1973 are available, but only as image files. The wget script we developed takes two forms of input: (a) a list of patent numbers we wish to download; and (b) a list of keywords. We manually downloaded the five core patents and generated the list of patents we ultimately wished to download by parsing the backward and forward citations. This gave us the list of the 135 patents which form the ground truth for the use case. Next, we used the 43 concepts as a list input to the script and downloaded the top 50-100 documents for each concept. Including the 135 patents, the script downloaded a total of 1150 patent documents. Upon downloading the HTML files, the full text is parsed and stripped of all HTML tags using available HTML parsers.12
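The thesis uses an off-the-shelf HTML parser for this step; the tag-stripping idea can be sketched with the Python standard library (illustrative only, not the parser used in the thesis):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content, discarding all HTML tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

def strip_tags(html):
    extractor = TextExtractor()
    extractor.feed(html)
    return extractor.text()

page = "<html><body><b>United States Patent</b> 5,955,422</body></html>"
plain = strip_tags(page)
```

The stripped text is then handed to the field-extraction script described below.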
A patent document is essentially a combination of two distinct sections: one that is entirely technical and one that is entirely legal. Applications dealing with patent documents have specific requirements that concern smaller portions of the documents. For example, a common strategy involves filtering documents based on only the abstract and technology class until a manageable list of a few hundred documents is obtained [113]. At this stage, the claims and the full technical description may be examined more closely. Patent claim invalidations strongly
11 Wget is a tool to retrieve information from web servers - http://www.gnu.org/software/wget/ (Accessed on 03/01/2012).
12 The HTML parser used in our research can be downloaded from http://htmlparser.sourceforge.net/ (Accessed on 03/01/2012).
emphasize the claims, their limitations, and the priority dates. Citations, including both patents and scientific literature, can hold important information. Infringement analysis often requires specific information such as priority dates and application information. Additionally, patent documents contain valuable metadata such as inventors, assignees, and technology classifications, which can act as filters to quickly narrow a search to appropriate results.
Although the patent documents are not explicitly marked up, fields in the documents are clearly delimited by section headers (see Figure 2.1). Using these section headers as markers, we carefully coded a regular-expression-based script to parse and extract the various fields. Since the documents we are parsing span several years, there are subtle variations among them that can cause parsing inaccuracies. Moreover, some documents contain information that is missing in others; for example, some contain an additional ‘Assistant Examiner’ field. This requires a regular expression, or a set of regular expressions, which can handle
United States Patent 5,955,422: “Production of Erythropoietin”
September 21, 1999
Abstract Disclosed are novel polypeptides possessing part or all of the primary structural
conformation … of mammalian erythropoietin ("EPO") … polynucleotides in a
heterologous cellular or viral sample prepared from, e.g., DNA present in a
plasmid or viral-borne cDNA or genomic DNA "library“…
Inventors: Lin; Fu-Kuen (Thousand Oaks, CA)
Assignee: Kirin-Amgen, Inc. (Thousand Oaks, CA)
…
Claims 1. A pharmaceutical composition comprising a therapeutically
effective amount of human erythropoietin …
Description ….
Figure 2.1: Sample Patent Document
multiple such cases. The regular expressions are implemented in both Java and Perl.
Once the HTML tags are stripped out using a standard HTML parser, the text is passed as input to the script, which extracts the information and converts it into a fully marked-up XML document. Figure 2.2 shows a sample patent in the resulting XML format. Since patent documents are very lengthy, only a small portion of the document is shown in the figure. A full list of the extracted fields is displayed in Table 2.1.13,14
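The section-header-driven extraction can be sketched as follows. The patterns and field names here are simplified illustrations, not the actual Java/Perl expressions used in the thesis, which must handle many more per-year variations (such as the optional ‘Assistant Examiner’ field):

```python
import re

# Illustrative patterns keyed on the section headers that delimit
# fields in the USPTO full text (simplified for demonstration).
PATTERNS = {
    "Title": re.compile(r'United States Patent[^:]*:\s*"([^"]+)"'),
    "Inventor": re.compile(r"Inventors?:\s*([^\n]+)"),
    "Assignee": re.compile(r"Assignee:\s*([^\n]+)"),
}

def parse_patent(text):
    """Extract whichever known fields appear in the stripped text."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1).strip()
    return fields

sample = (
    'United States Patent 5,955,422: "Production of Erythropoietin"\n'
    "Inventors: Lin; Fu-Kuen (Thousand Oaks, CA)\n"
    "Assignee: Kirin-Amgen, Inc. (Thousand Oaks, CA)\n"
)
fields = parse_patent(sample)
```

The extracted dictionary is then serialized into the XML layout of Figure 2.2.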
2.3.2 COURT CASES
Court documents can be obtained from several sources. Public Access to Court
13 The International Patent Classification system codes can be accessed at http://www.wipo.int/classifications/ipc/en/ (Accessed on 03/01/2012).
14 The United States Patent Classification codes can be accessed at http://www.uspto.gov/web/patents/classification/ (Accessed on 03/01/2012).
<Patent> <Title>Production of erythropoietin</Title>
<Assignee>Kirin-Amgen, Inc.</Assignee>
…. <Inventor>Lin Fu-Kuen</Inventor>
…. <Citation>3033753</Citation>
<InwardCitation>7645898</InwardCitation>
<InwardCitation>7645733</InwardCitation>
…. <Pub> The Polycythemias: Diagnosis and Treatment, </Pub>
….
<Claim> A process for the preparation of an in vivo biologically active
erythropoietin product comprising the steps of…
</Claim>
….
</Patent>
Figure 2.2: Sample Patent XML Document
Electronic Records (PACER) is an electronic system to access the databases of the 94
District Courts and 13 Courts of Appeals, including the Court of Appeals for the Federal Circuit (CAFC) [99]. PACER is an initiative toward developing a centralized system for accessing court data and contains the most up-to-date information. DocketX is a privately owned company which has taken up the task of converting all PACER documents into full text [34]; its services are currently available under a paid subscription. Other sources for case documents include LexisNexis, WestLaw, and Google Scholar, which may also provide additional supporting materials such as case analyses [51,78,134].
Unlike the USPTO, PACER poses several challenges which make it hard to automatically fetch documents and docket information.15 First, PACER does not provide a keyword-based search; documents must be searched using specific metadata
15 A court docket is the official summary of the proceedings of a case.
Table 2.1: Patent XML Element Descriptions
Field                  Description
Patent Number          Unique document identifier provided by the USPTO
Date of Issue          The date from which the patent is considered active
Inventor               The inventor of the patent
Inventor Location      The inventor's location, often used in knowledge transfer research studies [67]
Assignee               The individual who or company which owns the patent
Assignee Location      Location of the patent owner
Title                  The title of the patent document
Abstract               The abstract of the patent document
Examiner               The examiner who examined the patent application
IPC Classification     Technology class as per the International Patent Classification system
US Classification      Technology class as per the USPTO classification system
Claims                 Statements indicating the legal scope of the invention
Technical Description  The remaining portion of the patent document
such as the parties involved, case numbers, or case types.16 Second, the case documents are available in image form and are sometimes quite illegible. This makes full-text extraction very cumbersome, as an immense amount of time must be spent on manual curation even after applying modern Optical Character Recognition (OCR) techniques. Moreover, each court database must be searched manually for the specified search criteria. Since PACER is a paid service, this can be both time-consuming and economically infeasible.
Sources such as LexisNexis and Google Scholar provide a keyword-based search, and case documents are available in several formats from which full text can be easily extracted. A search for erythropoietin and its related concepts resulted in around 30 documents. However, neither LexisNexis nor Google Scholar offers APIs or web services that can be used to automate downloading a large number of case documents. Hence, we manually downloaded the 30 court cases as text documents. Ideally, we would also include docket information, which is critical to some applications and users, but it is not currently held in our corpus.
As of today, there are millions of active patents in various technology classes, and many patent infringement cases are filed every year. For example, the number of patent infringement appeals in fiscal year 2011 increased to 426, a 7.5% increase over the average of the past four years.17 This clearly establishes the importance of court documents in both the patent acquisition and enforcement stages. Information such as the plaintiff, defendant, name of the court, case title, and case type are important fields for any application dealing with court cases. “Designing around an existing patent” typically uses information such as the patents involved, important
16 PACER uses Nature of Suit codes to classify cases. Code 830 represents patent infringement cases and must be used when searching PACER.
17 Statistics related to court litigations can be accessed at http://www.cafc.uscourts.gov/the-court/statistics.html (Accessed on 03/01/2012).
scientific literature citations, and the names of inventors. This information is available in the body of the court cases.
Unfortunately, the information contained in the body of a court case is not standardized across patent litigations. As a result, the court documents downloaded from LexisNexis are considerably less structured than the patent documents downloaded from the USPTO (see Figure 2.3). Since we are dealing with a small number of documents (around 30), we manually parsed and marked up the data into XML files. Currently, the marked-up fields include (a) Case Title and Number; (b) Plaintiff;
927 F.2d 1200 (1991)
AMGEN, INC., Plaintiff/Cross-Appellant,
v.
CHUGAI PHARMACEUTICAL CO., LTD., and Genetics Institute, Inc.,
Defendants-Appellants. Nos. 90-1273, 90-1275.
United States Court of Appeals, Federal Circuit. March 5, 1991.
Suggestion for Rehearing Declined May 20, 1991.
…
…
Before MARKEY, LOURIE and CLEVENGER, Circuit Judges.
…
THE PATENTS On June 30, 1987, the United States Patent and Trademark Office (PTO) issued to
Dr. Rodney Hewick U.S. Patent 4,677,195, entitled "Method …” … claims of the
'195 patent are:
1. Homogeneous erythropoietin characterized by a molecular weight of
about 34,000 Daltons … 280 nanometers.
3. A pharmaceutical composition for the treatment of anemia …
homogeneous erythropoietin … vehicle.
4. Homogeneous erythropoietin … 34,000 Daltons on SDS PAGE …
280 nanometers.
…
DISCUSSION
…
Figure 2.3: Sample Court Case Document
(c) Defendant; (d) Court Type and Name; (e) Case Type; (f) Date of proceeding/hearing or decision; (g) Presiding Judge; (h) Patents Involved; and (i) General Case Body (see Figure 2.4).
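The manual markup step can be sketched as follows. This illustrative snippet (not the thesis tooling) serializes a curated record into the Case XML layout of Figure 2.4, simplified to a single defendant and a subset of the fields:

```python
import xml.etree.ElementTree as ET

def case_to_xml(case):
    """Serialize a manually curated court-case record into a
    Case XML document (field set simplified for illustration)."""
    root = ET.Element("Case")
    for tag in ("Title", "CaseNum", "Plaintiff", "Defendant",
                "Court", "Date", "Judge", "Body"):
        value = case.get(tag)
        if value:
            ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")

xml = case_to_xml({
    "Title": "Amgen, Inc. v. Chugai Pharmaceutical Co., Ltd.",
    "CaseNum": "706 F. Supp. 94",
    "Plaintiff": "Amgen, Inc.",
    "Defendant": "Chugai Pharmaceutical Co., Ltd.",
    "Court": "United States District Court, D. Massachusetts.",
    "Date": "January 31, 1989.",
    "Judge": "YOUNG, District Judge",
})
```

Real cases often name multiple defendants, as Figure 2.4 shows; the curated records repeat the Defendant element accordingly.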
2.3.3 PUBLICATIONS
In the biomedical domain, PubMed is the most comprehensive and up-to-date library, indexing over 5000 biomedical journals from areas such as medicine, nursing, pharmacy, dentistry, healthcare, biochemistry, and bioinformatics. The National Center for Biotechnology Information (NCBI) uses Entrez to search and retrieve data from several databases including PubMed, nucleotide databases, protein structures, and many more [92]. Such databases can be very valuable in providing additional knowledge that can be applied to searching scientific publications. Our current focus is on retrieving scientific publications from PubMed. There are alternatives to search
<Case> <Title>Amgen, Inc. v. Chugai Pharmaceutical Co., Ltd.</Title>
<CaseNum>706 F. Supp. 94</CaseNum>
<Plaintiff>Amgen, Inc.</Plaintiff>
…. <Defendant>Chugai Pharmaceutical Co., Ltd.</Defendant>
<Defendant>Genetics Institute, Inc.</Defendant>
…. <Misc>Civ. A. No. 87-2617-Y.</Misc>
<Court>United States District Court, D. Massachusetts.</Court>
<Date>January 31, 1989.</Date>
….. <Judge>YOUNG, District Judge</Judge>
<Body> This action involves the alleged infringement of several patents covering
erythropoietin, a protein which circulates in the blood and stimulates the
production of red blood …
</Body>
</Case>
Figure 2.4: Sample Court Case XML Document
PubMed other than Entrez. GoPubMed is a search engine which searches PubMed
with the help of annotations from the Gene Ontology (GO) [7,35]. HubMed is an
alternative interface which provides a number of features for browsing and
searching the PubMed repository [36].
We must note that PubMed is only an index of biomedical publications. Hence, the
full text of an article may not be readily available. Access to the full text
of biomedical publications can be very important in determining relevancy. For this
purpose, we download the latest TREC Genomics dataset (2007), which has been
widely used in the TREC competitions organized by NIST. The TREC dataset (fully
downloaded and indexed on our local computer) contains over 162,000 documents
from 49 journals (dated after 1994). These are supported with their respective
MEDLINE citations and are referred to by their unique PubMed IDs (PMIDs) (see
Figure 2.5). The services and databases managed by NLM have very well defined
Document Type Definitions (DTDs).18,19 Citations conforming to the DTD can
alternatively be downloaded in XML format via Entrez.
MeSH descriptors are typically a group of concepts in the MeSH vocabulary
which describe the topic or set of topics that a scientific article refers to [90]. In
some sense, this can be viewed as a classification scheme for the publications
according to the MeSH ontology. MeSH descriptors are valuable and could play an
important role during the search and retrieval process [120]. In our data set, we choose
to work with a smaller subset of the MEDLINE DTD. Using standard XML parsers,
we specifically extract: (a) the list of authors; (b) article title; (c) journal title; (d) PMID;
(e) abstract; (f) MeSH descriptors; and (g) MeSH qualifiers. Currently we do not
index the publication-to-publication citations, although they would provide yet another
18 The Document Type Definitions for files hosted by NLM can be accessed at -
http://www.nlm.nih.gov/databases/dtd/ (Accessed on 03/01/2012). 19
The descriptions of the DTD elements for databases hosted by the NLM can be accessed at -
http://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html (Accessed on 03/01/2012).
valuable set of information to enhance the search and retrieval process. However, if
needed, the Entrez DTD provides the missing publication-to-publication citation
information. Since the services offered by the National Library of Medicine (NLM)
provide well defined DTDs, updating our local index with newly parsed elements,
should we decide to do so in the future, would be trivial.
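For illustration, the fields listed above can be pulled from a MEDLINE citation with a standard XML parser. The sketch below uses an abbreviated record following Figure 2.5; it is illustrative, not our production parser:

```python
import xml.etree.ElementTree as ET

CITATION = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID>10022466</PMID>
    <Article PubModel="Print">
      <Journal>
        <Title>The Journal of clinical endocrinology and metabolism</Title>
      </Journal>
      <ArticleTitle>About the use of an ACTH 1-39 test</ArticleTitle>
      <Abstract><AbstractText>Sample abstract.</AbstractText></Abstract>
      <AuthorList CompleteYN="Y">
        <Author ValidYN="Y"><LastName>Grino</LastName><Initials>M</Initials></Author>
      </AuthorList>
    </Article>
    <MeshHeadingList>
      <MeshHeading>
        <DescriptorName MajorTopicYN="Y">Corticotropin</DescriptorName>
        <QualifierName MajorTopicYN="N">blood</QualifierName>
      </MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>"""

def parse_citation(xml_text):
    """Extract the subset of MEDLINE fields used in our index."""
    root = ET.fromstring(xml_text)
    cit = root.find("MedlineCitation")
    art = cit.find("Article")
    return {
        "pmid": cit.findtext("PMID"),
        "article_title": art.findtext("ArticleTitle"),
        "journal_title": art.findtext("Journal/Title"),
        "authors": [a.findtext("LastName") for a in art.iter("Author")],
        "abstract": art.findtext("Abstract/AbstractText"),
        "mesh_descriptors": [d.text for d in cit.iter("DescriptorName")],
        "mesh_qualifiers": [q.text for q in cit.iter("QualifierName")],
    }

rec = parse_citation(CITATION)
print(rec["pmid"])   # 10022466
```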
<PubmedArticle> <MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID>10022466</PMID>
<DateCreated> <Year>1999</Year> <Month>02</Month> <Day>25</Day>
</DateCreated>
…. <Article PubModel="Print">
<Journal>
…. <JournalIssue CitedMedium="Print">
<Volume>84</Volume> <Issue>2</Issue>
….
</JournalIssue> <Title>The Journal of clinical endocrinology and metabolism</Title>
<ISOAbbreviation>J. Clin. Endocrinol. Metab.</ISOAbbreviation>
</Journal> <ArticleTitle>About the use … of an ACTH 1-39 ….</ArticleTitle>
…. <AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Grino</LastName>
<ForeName>M</ForeName>
<Initials>M</Initials>
</Author>
….
</AuthorList>
….
<MeshHeadingList>
<MeshHeading> <DescriptorName MajorTopicYN="Y">Corticotropin</DescriptorName>
</MeshHeading>
….
Figure 2.5: Sample Publication in XML
2.3.3.1 Identifying Ground Truth from TREC Corpus
The TREC corpus provides an excellent experimentation platform. However, we
must first identify which of the 3000+ publications cited by our patents also exist
in the TREC corpus. In PubMed, publications use PMIDs to cite other publications.
However, patent documents from the USPTO do not follow the same citation standards.
Hence, in order to identify the publications that co-exist in the TREC corpus, we
need to parse each citation string in the patent documents and somehow identify its
PMID. PubMed provides a citation matcher tool which allows us to map whatever
information we have to a specific citation and hence a unique PMID. However, this is
not an easy task, since the citation strings parsed from the patent documents are
not consistent enough to use this tool. For example, consider the following citation
strings retrieved from multiple patent documents:
1. Hansen, Jan E. et al. 1997. "O-GLYCBASE Version 2.0: A Revised Database
of O-Glycosylated Proteins." Nucleic Acid Research. vol. 25, No. 1, pp. 278-
282. cited by other
2. Daubas et al., Nucleic Acids Research, 16(4) 1251-1271 (1988).
3. Altschul et al., "Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs", Nucleic Acids Res. 25:3389-3402 (1997). cited by
other
The citations do not follow consistent patterns and hence simple regular-expression-based
parsers will not perform well. Moreover, the citations are often incomplete. For
example, some citations are missing the title of the article while others may have
an incomplete author list. In addition, some citation strings use full journal titles
while others simply use abbreviations.
The TREC corpus only contains more recent articles (post-1994) while the 135
patents cite articles as early as the 1970s. We begin by listing the citation strings of
the 3000+ publications in a text file and filter out the citation strings that are not in the
TREC corpus. Our first filtering criterion removes any citation string which does not
belong to one of the 49 journals in the TREC corpus. However, due to the inconsistent
use of abbreviations and full journal titles as explained earlier, we must first convert
all citation strings to a consistent format. NLM provides standard abbreviations for
each journal.20
For each of the 49 journals available in the TREC corpus, we extract
the standard abbreviations and convert the citation strings to a consistent format as
shown below:
1. Daubas et al., Nucleic Acids Research, 16(4) 1251-1271 (1988).
2. Altschul et al., "Gapped BLAST…database search programs", Nucleic Acids
Research, …
Our second filtering criterion removes all citation strings which represent
publications dated prior to 1994. Since the TREC corpus is complete with all articles
for the 49 journals post-1994, we assume that every remaining citation string is
available in the TREC corpus. This procedure results in a total of 1737 publications
from the TREC corpus, which serve as the ground truth for evaluating our methodology
within the ‘erythropoietin’ use case.
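The two filtering steps can be sketched as follows. The journal-abbreviation table here is a tiny illustrative subset (in practice it would be built from NLM's journal list), and the third citation string is an invented example; the first two are the patent citations shown above:

```python
import re

# Illustrative subset of the NLM journal-abbreviation table.
JOURNAL_ABBREV = {
    "nucleic acids research": "Nucleic Acids Res",
    "nucleic acid research": "Nucleic Acids Res",  # misspelling seen in patents
}
TREC_JOURNALS = {"Nucleic Acids Res"}  # stand-in for the 49 TREC journals

def normalize(citation):
    """Rewrite full journal titles to the standard NLM abbreviation."""
    low = citation.lower()
    for full, abbrev in JOURNAL_ABBREV.items():
        if full in low:
            i = low.index(full)
            return citation[:i] + abbrev + citation[i + len(full):]
    return citation

def keep(citation):
    """Filter 1: journal in the TREC set. Filter 2: not dated prior to 1994."""
    c = normalize(citation)
    if not any(j in c for j in TREC_JOURNALS):
        return False
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", c)]
    return bool(years) and max(years) >= 1994

citations = [
    'Daubas et al., Nucleic Acids Research, 16(4) 1251-1271 (1988).',
    'Altschul et al., "Gapped BLAST", Nucleic Acids Res. 25:3389-3402 (1997).',
    'Smith et al., Journal of Theoretical Biology, 12:1-10 (1999).',  # invented
]
kept = [c for c in citations if keep(c)]
# only the Altschul citation survives both filters
```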
2.3.4 FILE WRAPPERS
In 2003, the USPTO introduced the Image File Wrapper (IFW) system to replace
the paper based system. IFWs are publicly available on the Patent Application
Information Retrieval (PAIR) service offered by the USPTO.21
Google has recently
started indexing these documents and provides a web service to download these files
20 The standard journal title abbreviations defined by NLM can be accessed at -
ftp://ftp.ncbi.nih.gov/pubmed/J_Medline.txt (Accessed on 03/01/2012). 21
The Patent Application Information Retrieval system can be accessed at -
http://portal.uspto.gov/external/portal/pair (Accessed on 03/01/2012).
[49]. The major challenge with both PAIR and Google is that the files are
available only as images, which means that additional processing and robust OCR
algorithms are required to extract text from them. In addition, the PAIR system blocks
automatic downloads and crawlers by enforcing CAPTCHA verification. Currently, for
file wrappers prior to 2003, a third-party agent is the best solution for converting
the paper-based file wrappers to text-readable form. IFW Insight is a tool which has
indexed over 1,000 IFWs and allows one to navigate and search for critical
information contained within them [65]. However, the IFWs indexed by this tool are not
relevant to our use case.
Due to the challenges in obtaining file wrappers, we currently include only one
file wrapper in our corpus, for US 5,955,422. The file wrapper contains around 50
documents (office actions, amendments, etc.) and totals around 500 pages. We
received the file wrapper as an OCR’ed text file, which means the text can be copied
and extracted, though with some inaccuracies. Nevertheless, the file wrapper is very
useful for our preliminary experimental study.
Every patent application goes through a different cycle over a varying time
frame. The time frame can be lengthy, and the recorded communications between the
patent applicant and the examiner often lack structure or order. In fact, file
wrappers differ so much that some contain special documents such as an interference
(see Figure 2.6). The first challenge in parsing file wrappers is to deal with such
non-structured information.
In order to understand how file wrappers can be useful, let us examine an
example. In infringement analysis, to determine whether a patent is infringed, it is
important to understand the scope of the claims.22
This in turn requires an
understanding of how the patent evolved from its original patent application. This
22 The word ‘scope’ is used to represent the extent of legal protection the patent claims offer.
involves studying how the claims, citations (both patent and scientific literature), and
technical descriptions, etc., changed with every amendment or office action. File
wrappers play a crucial role as they contain the information needed for this purpose.
For example (see Figure 2.7), the examiner’s rejection letter shows the following
changes in the claims of patent U.S. 5,955,422:
(1) Of the original 60 claims, none were pursued further.
(2) Three additional claims were filed (claims 61-63), of which only claims 61 and
62 were accepted.
Knowing why claims were rejected could provide key information for anyone
performing an infringement analysis. Other information contained in the file wrapper
includes added or deleted references, the laws and regulations that were in force at
the time, and so on.
Figure 2.6: Contents of a File Wrapper
Figure 2.8 shows a sample interference document which brings
out a strong relation between the two patents U.S. 5,955,422 and U.S. 4,879,272 that
is otherwise not obvious from either patent. A pharmaceutical company entering the
drug market for ‘erythropoietin’ will find this information very valuable. It is
worth noting that the interference document (shown in Figure 2.8) is very different
from the rejection (shown in Figure 2.7). Several miscellaneous documents, such as
the fee structure, are ignored in our model. Furthermore, each of these documents
(rejections, interferences, etc.) is generally in the form of a letter in which
important information, such as restricted claims, allowed claims, rejected claims,
and the corresponding arguments, is expressed in a mixed form within the text (see
Figures 2.7 and 2.8). The second challenge in parsing file wrappers is thus twofold:
(a) modeling each of these documents individually; and (b) extracting relevant
information from unstructured text. Since we are dealing with a single file wrapper,
we manually parse the information in order to facilitate some amount of experimentation.
Office Action – Rejection Date: 11-06-1991 During a telephone conversation with Mr. Kokulis on March 25, 1992 a
provisional election was made with traverse to prosecute the invention of Group
VII, claims 61-63. Affirmation of this election must be made by applicant in
responding to this Office action. Claims 1-60 are withdrawn from further
consideration by the Examiner, 37 CFR 1.142(b), as being drawn to a non-elected
invention.
Claim 63 is rejected under 35 U.S.C. S 112, second paragraph, as being
indefinite for failing to particularly point out and distinctly claim the subject matter
which applicant regards as the invention.
Claim 63 is vague and indefinite in the recitation of "recombinant
erythropoietin". The specification discusses several different recombinant systems
for production of EPO. It appears that different recombinant systems produce
different modifications of the protein. It is not clear that all different modifications
are intended to be encompassed by the claims.
Claims 61 and 62 are allowed.
Figure 2.7: Sample Rejection Letter (Office Action)
Specifically, we extract information such as claims and citations from (a) original
patent application; (b) amendments; (c) rejections; and (d) interference documents.
Figure 2.9 shows a sample XML representation of the file wrapper.
2.4 EVALUATION AND ACCURACY
In Section 2.3, we described our methodology to download and parse documents
to extract relevant information. The extracted information is reconstructed into XML
files using appropriate field mark-ups. In order to ensure the usability of the data, a
formal analysis of the quality of the data is required. We discuss potential sources of
Office Action – Interference Date: 11-20-1992 The cases involved in this interference are:
Junior Party Patentees: Naoto Shimoda and Tsutoiau Kawaguchi
…. Serial No.: 06/784,640 filed 10/04/85, Patent No. 4,879,272 issued 11/07/89
For: Method and Composition for Preventing the Absorption of a Medicine
Assignees: Chugai Seiyaku Kabushiki Kaisha, Ukina, Kita-Tokyo, Japan
...
Senior Party Applicant: Fu-Kuen Lin
…. For: PRODUCTION OF ERYTHROPOIETIN
Serial No. 007/609,741
Assignees: Amgen, Inc., Thousand Oaks, California, A Corporation of Delaware
….
Count 1 “An erythropoietin-containing, pharmaceutically acceptable composition wherein
human serum albumin is mixed with erythropoietin.”
The claims of the parties which correspond to Count 1 are:
Lin: Claims 61-63
Shimoda et al.: Claims 3-4
….
Figure 2.8: Sample Interference Document
errors and suggest possible solutions to reduce the errors. Section 2.4.1 evaluates the
automatic parser discussed in Section 2.3.1.
2.4.1 EVALUATION OF THE EXTRACTED PATENT DATA
Our document repository has been manually constructed from scratch. Hence, we
do not have any pre-labeled ground truth against which the extracted patent data can
<FileWrapper> <AppNumber>957013</AppNumber>
<Date>11-06-90</Date>
<Examiner>Sharon Nolan</Examiner>
<Assignee>Kirin-Amgen Inc.</Assignee>
<Inventor>Fu-Kuen Lin</Inventor>
<Application> <Number>957013</Number>
<Claim number=“3”>A polypeptide according to claim 1 wherein the
exogenous DNA sequence is a cDNA sequence</Claim>
<Description> … </Description>
….
</Application>
<Rejection> <Date>11-06-91</Date>
<RejectedClaims> <Claim>A composition according to claim 61 containing a
therapeutically effective amount of recombinant
erythropoietin.</Claim>
</RejectedClaims>
<AcceptedClaims> <Claim>61</Claim>
<Claim>62</Claim>
</AcceptedClaims>
<WithdrawnClaims> <Claim>A purified and isolated polypeptide having part or all of the
primary structural conformation and ….</Claim>
…. <Claim>An improvement in the method for detection of a specific
single stranded polynucleotide of unknown sequence in a
heterogeneous cellular….</Claim>
</WithdrawnClaims>
….
</Rejection>
</FileWrapper>
Figure 2.9: Sample File Wrapper in XML
be evaluated. For the purpose of this evaluation, we randomly choose 50 patents out of
the total 1150 patents in the repository (roughly 1/20th). These 50 patents are manually
marked up with the ground truth and also stored as XML files with the exact same
structure as the automatically parsed patent documents. An evaluation script compares
the automatically parsed XML files field by field with the manually marked-up files.
The precision and recall for the extracted data are shown in Table 2.2.23
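A field-by-field comparison of this kind can be sketched as below. The field values are illustrative; multi-valued fields are compared as sets, micro-averaged over the document pairs:

```python
def field_scores(parsed_docs, truth_docs, field):
    """Micro-averaged precision/recall for one field over paired documents.

    Each document maps the field name to a set of extracted values.
    """
    tp = fp = fn = 0
    for parsed, truth in zip(parsed_docs, truth_docs):
        p, t = parsed.get(field, set()), truth.get(field, set())
        tp += len(p & t)   # extracted and correct
        fp += len(p - t)   # extracted but wrong
        fn += len(t - p)   # in the ground truth but missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Toy example: one spurious inventor extracted, one true inventor missed.
parsed = [{"inventor": {"Fu-Kuen Lin"}}, {"inventor": {"R. Hewick", "X. Spurious"}}]
truth  = [{"inventor": {"Fu-Kuen Lin", "N. Shimoda"}}, {"inventor": {"R. Hewick"}}]
p, r = field_scores(parsed, truth, "inventor")
# p = r = 2/3 on this toy data
```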
2.5 TEXT INDEX
An inverted text index, similar to the index at the back of a book, maps
every unique token (usually a word or some grouping of characters in text) to all
of its occurrences in the corpus. The most basic indexes store only the list of
documents each word appears in and support simple boolean queries using logical
operators such as AND, OR, and NOT. Depending on the information need and
complexity, indexes can get quite complex, sometimes growing even larger than the
original text documents by storing additional data to support more complex queries.
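A toy version of such an index, with boolean queries expressed as set operations, might look like the following (the documents are illustrative snippets):

```python
from collections import defaultdict

def build_index(docs):
    """Map every token to the set of document ids it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = {
    1: "homogeneous erythropoietin composition",
    2: "erythropoietin production method",
    3: "pharmaceutical composition for anemia",
}
index = build_index(docs)

# Boolean queries reduce to set operations over posting lists:
both = index["erythropoietin"] & index["composition"]    # AND
either = index["erythropoietin"] | index["composition"]  # OR
not_epo = set(docs) - index["erythropoietin"]            # NOT
```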
Full-text search along with metadata search is at the heart of many IR tools. Apache
Lucene is a free text-mining library, written entirely in Java, that provides a large
23 Precision is the fraction of retrieved instances that are relevant. Recall is the fraction of
relevant instances that are retrieved [M2].
Table 2.2: Field-by-Field Accuracy of Extracted Patent Data
Field Precision Recall
Inventor 0.96 1.0
Assignee 0.96 1.0
Title 1.0 1.0
Abstract 1.0 1.0
Examiner 1.0 0.96
Claims 1.0 1.0
Technical Description 1.0 1.0
variety of functions to create, modify, and search text indexes [5]. It is based on the
Vector Space Model (VSM) and supports a scoring function based on term frequency
and inverse document frequency (tf-idf) [84,111]. This section describes the
development of the text indexes used throughout this research. We provide some
necessary background related to text indexes and Lucene. Sections 2.5.1 and 2.5.2
give brief introductions to the VSM and tf-idf, respectively. Section 2.5.3 introduces
the notion of fields and how they serve specific information needs. Section 2.5.4
introduces Apache Solr, a search library built on top of Lucene, and describes the
indexes we developed.
2.5.1 VECTOR SPACE MODEL
The VSM is an algebraic model most commonly used in IR for the representation of
documents [111]. Each document is represented as a vector in n dimensions, with each
dimension representing one unique token in the vocabulary. Such a representation
allows for computing the similarity of documents with each other. A query can also be
represented as a document, which enables computing its similarity with documents in
order to perform IR.
Several similarity measures can be used to score documents, such as Euclidean
distance, Manhattan distance, and Jaccard similarity. Unlike distance-based measures,
cosine similarity measures the angle between document vectors and is not affected
by the mere length of the document. Cosine similarity is the most preferred scoring
measure in IR:
sim(q, d) = ( Σ_{i=1..n} q_i · d_i ) / ( √(Σ_{i=1..n} q_i²) · √(Σ_{i=1..n} d_i²) )
where q and d are the VSM representation of the query and document respectively;
and n is the number of dimensions. A simple example is shown in Figure 2.10.
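In code, cosine similarity over sparse term-count vectors reduces to a few lines. This is a sketch of the measure itself, not Lucene's scoring implementation; the query and document vectors are illustrative:

```python
import math

def cosine(q, d):
    """Cosine of the angle between two sparse term-count vectors (dicts)."""
    dot = sum(q[t] * d.get(t, 0) for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

query = {"erythropoietin": 1, "anemia": 1}
doc   = {"erythropoietin": 2, "anemia": 1, "treatment": 1}
score = cosine(query, doc)

# Length invariance: doubling every count in the document does not
# change the angle, hence not the score.
doubled = {t: 2 * v for t, v in doc.items()}
assert abs(cosine(query, doc) - cosine(query, doubled)) < 1e-9
```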
2.5.2 TF-IDF
Term frequency (tf) is the number of times the term occurs in a document. It is
based on the notion that the most frequent words of a document describe its major
theme or content:
tf_t,d = C_t,d / T_d
where tf_t,d is the term frequency of the term t in document d, C_t,d is the count of the
term in the document, and T_d is the total number of words in the document.
Common words such as ‘the’, ‘if’, ‘and’, ‘hello’, etc., also known as stop words,
are used very frequently across all documents and do not provide any true information
content. Hence, such terms should receive a low score, irrespective of their high
frequencies. The inverse document frequency (idf) measures the general importance of
a term in the corpus, penalizing terms that appear frequently across a large number of
documents, including stop words:
idf_t = log( N / df_t )
where t represents the term, N is the total number of documents in the corpus, and df_t
is the document frequency of the term.
Figure 2.10: Cosine Similarity in VSM
Tf-idf is very widely used to score documents
against a query [84]. Several modifications of tf-idf have been introduced including
Okapi BM25 and BM25F [108]. Lucene uses query boosting and document boosting
in addition to tf-idf to score documents against a query.24
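The two formulas above combine directly into a score; the corpus below is a toy illustration, not Lucene's exact scoring:

```python
import math

def tf(term, doc_tokens):
    """Term frequency: count of the term over the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: log of N over the document frequency."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    "the erythropoietin composition".split(),
    "the production of erythropoietin".split(),
    "the pharmaceutical composition".split(),
    "the court decision".split(),
]
# 'the' appears in every document, so idf = log(4/4) = 0 and its weight vanishes;
# 'erythropoietin' is rarer and gets a positive weight where it occurs.
```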
2.5.3 FIELDS AND SCHEMA
A Field is an arbitrary portion of the document that could include textual
information or metadata. Fields may also overlap in content and are defined for
specific application needs. For example, an application that only searches for authors
of publications would clearly benefit from an index which defines a field whose
vocabulary consists of Authors in the corpus. Similarly an application which defines
specific search criteria over the Abstract of a publication would require an index
defined over only the Abstracts of publications in the corpus. Indexes can support
multiple applications by defining more than one field. For example, both applications
described above could be used over an index which defines two fields – one over the
Authors of publications and the other over the Abstract. Additionally, fields can be
scored independently, hence improving the focus of the search. Each field can be
indexed with different parameters, such as different stop word lists, tokenizers,25
and filters. The storage options for each field can be independently specified. Usually,
shorter fields such as titles and metadata are stored in the index in order to be retrieved
during the searching phase. Lengthier fields are often indexed, but not stored to save
24 Details on Apache Lucene’s scoring function can be found at -
http://lucene.apache.org/java/3_0_0/scoring.html (Accessed on 03/01/2012). 25
Tokenization is the process of parsing characters in a stream based on a certain pattern. For
example, the white space tokenizer identifies tokens that are separated by white spaces.
space. Since fields are advantageous for these reasons, they need to be predefined
in an index schema. For each document type, i.e., patents, court cases, and
publications, we have built a text index based on the XML schemas discussed in
Section 2.3. These indexes are used for IR throughout the rest of this research.
2.5.4 SOLR
Apache Solr is a search library based on Lucene [6]. Solr provides several added
functionalities which are commonly seen today across many existing search
engines on the web. These functionalities include aggregations, faceting, dynamic
fields, and so on. Faceting is the process of grouping search results based on a
particular property they share in common. For example, a search for books on
Amazon allows the user to filter books or view the search results through the
Author facet. This functionality is extremely useful, especially when querying
large amounts of data, to quickly filter the results to a relevant set. We especially use the
dynamic fields feature of Solr to create a common schema for all documents. All text
based fields are configured with a suffix “_text” and all metadata fields with “_meta”.
Creating a schema this way allows us to arbitrarily modify the fields for current
documents and add new information sources without having to modify other code
interfacing with the schema.
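In Solr's schema.xml, this convention amounts to two dynamic-field declarations along the following lines; the field types shown (text_general, string) and the indexed/stored flags are assumptions for illustration, not necessarily the values we configured:

```xml
<!-- schema.xml: any field name ending in _text is treated as searchable text -->
<dynamicField name="*_text" type="text_general" indexed="true" stored="false"/>
<!-- metadata fields are stored verbatim for display and faceting -->
<dynamicField name="*_meta" type="string" indexed="true" stored="true"/>
```

With such a schema, a new source can introduce fields like claims_text or court_meta without any schema change.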
2.6 RELATED WORK
The discussion in this chapter focuses on: (1) the current publication
standards and challenges faced in accessing information from the patent system; and
(2) the unstructured nature of documents, which makes additional parsing techniques a
necessity for extracting relevant information. There is a great deal of research in
the areas of information interoperability, management, and extraction that is closely
related to the development of document repositories. This section provides a brief
overview of existing research related to these areas and discusses possible future
extensions to our document repositories.
2.6.1 INTEROPERABILITY, INFORMATION FRAMEWORKS AND SEMANTIC WEB
Interoperability between various entities in the government is a very important
factor [56,112]. Due to the need for interoperability, many governments are adopting
interoperability frameworks which support a wide range of document formats such as
PDF, HTML for web publishing, XML for semi-structured representation, PNG and
JPEG for images, and standard web services such as REST [56]. The problem of
interoperability is rather more general and not limited to the government sources
alone. While existing interoperability frameworks deal mainly with system
heterogeneities, the ‘linked online data’ community strongly believes that the internet
is transforming into a web of data as opposed to simply a web of documents [13]. The
goal of the semantic web is to make the information computer understandable, rather
than simply computer readable [13]. Several governments are realizing the importance
of semantics and are strongly supporting ontologies and external knowledge entities in
their interoperability frameworks [57]. One future direction is to study the impacts of
such frameworks on improving access to the information in the patent system. In the
context of scientific publications, Berners-Lee and Hendler claim – “In the next few
years, we expect that tools for publishing papers on the web will automatically help
users to include more of this machine-readable markup in the papers they produce”
[10]. Future research can also explore techniques to improve publishing of legal and
government data in the patent system with the help of automated tools to annotate
data.
2.6.2 DIGITAL REPOSITORIES
Academic institutions are increasingly using digital repositories such as DSpace
and Fedora to publish, access and archive educational material [17,27,82,125]. Such
repositories can be used to manage any form of digital data including documents.
Branin claims that such repositories are slowly being adopted by non-academic
institutions, such as smaller government entities, as well [17]. Studying how such
repositories can help advance the current state of information management in the
patent system could be a fruitful area of research for several reasons. Firstly, digital
repositories such as DSpace and Fedora comply with standards for repository
interoperability such as the OAI-PMH.26 Additionally, they support the use of
ontologies such as Dublin Core27 for metadata, as well as domain knowledge based on
OWL/RDF, which can be layered on top of the digital repositories to improve retrieval
[82,125]. While DSpace, Fedora, and the like are still very much evolving, they offer
a lot of potential for growth and integration with existing database and text indexing
technologies and thus provide a very strong platform for building document
repositories.
2.6.3 DOCUMENT PARSING AND INFORMATION EXTRACTION
Feature extraction and document parsing involve several subtasks based on
Natural Language Processing and related fields [84]. The addition of these extracted
features can potentially enhance the quality of the document repository by aiding in
browsing and retrieval [77]. Named Entity Recognition (NER) has been used to
categorize terms in text into biomedical entities such as genes and drugs [40]. The
information in documents not only exists as terms or shorter phrases, but also in the
form of longer sentences and fields such as the claims of a patent. Difficulties in
parsing patent claims and potential solutions to the same have been discussed
[114,116,130]. In general, identifying claims in text can provide important information
about that document. Blake discusses a methodology based on statistical parsing of
26 Open Archives Initiative – Protocol for Metadata Harvesting.
http://www.openarchives.org/OAI/openarchivesprotocol.html (Accessed on 03/01/2012). 27
Dublin Core Metadata Initiative Specifications – http://dublincore.org/specifications/
(Accessed on 03/01/2012).
sentences to identify scientific claims in publications [14]. Ultimately, the vocabulary
used among the various information sources can be immense and so is the scope for
feature extraction. Techniques such as NER and statistical parsing can be further
enhanced and trained on the data from this repository to improve feature extraction.
Chapter 3 explains our knowledge-based approach which dynamically annotates
knowledge to the documents based on the information need.
Chapter 3.
METHODOLOGY
3.1 INTRODUCTION
The patent system comprises many information sources which collectively
provide a valuable source of knowledge for any technology-related task. However, the
diversity among the information sources makes information retrieval from the patent
system challenging. Firstly, technology (domain) specific terminological
inconsistencies drastically affect search: traditional term-based search methodologies
do not account for the use of synonyms, abbreviations, hyponyms, etc. Secondly,
there is little or no interoperability between sources, since each information source
is managed by independent and disjoint organizations and agencies. Lastly, most
current methodologies tackle terminological inconsistencies
and information source interoperability as separate issues. An integrated framework
for IR would require combining both the methodologies to search and integrate
multiple sources, while keeping in mind the user’s context and underlying information
need. In this chapter, we will discuss three distinct methodologies addressing the
above issues – (a) knowledge based approach using domain knowledge to tackle
terminological issues; (b) developing a Patent System Ontology (PSO) to provide a
shared vocabulary between information sources and interoperability; and (c) an
information retrieval framework that combines methodologies from (a) and (b) along
with user feedback to search and integrate information across multiple sources.
Terminological inconsistencies are very typical, especially in domain-specific text.
These inconsistencies are caused by the variant usage of a term, i.e., its synonyms,
abbreviations, parent concepts, etc. For example, the terms ‘Whale’ and ‘Cetacea’ are
synonymous.28
While a domain expert may understand the meaning of ‘Cetacea’, it is
harder for one who is not an expert in animal nomenclature. Similarly, legal
terminology may not be well understood by technical experts. Domain ontologies are
sources of knowledge, developed by experts in the field to produce a shared
vocabulary within a technical domain. Gruber defines ontologies as – “formal, explicit
specification of a shared conceptualization” [53]. Several studies have looked at using
domain knowledge to improve IR [8,35,45,46,47,63,84,88,132]. However, the
terminological usage varies significantly between information sources. In fact, domain
knowledge from several areas, e.g. technical and legal, is simultaneously required to
achieve a high level of semantic interoperability in the patent system. Our knowledge-based
methodology builds on existing developments and addresses the above issue of
applying domain knowledge to different information sources. Specifically, we use
biomedical ontologies to enhance a user’s query with related terms and
discuss how the technique can be modified in order to improve precision and recall.29
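As a minimal illustration of this kind of query expansion (the synonym table below is a toy stand-in for relations drawn from a real biomedical ontology such as MeSH):

```python
# Toy synonym table; in the framework these relations come from
# biomedical ontologies rather than a hand-written dictionary.
SYNONYMS = {
    "erythropoietin": ["epo", "epoetin"],
    "anemia": ["anaemia"],
}

def expand_query(query):
    """OR each query term with its known synonyms, AND the terms together."""
    clauses = []
    for term in query.lower().split():
        variants = [term] + SYNONYMS.get(term, [])
        clauses.append("(" + " OR ".join(variants) + ")")
    return " AND ".join(clauses)

q = expand_query("erythropoietin anemia")
# (erythropoietin OR epo OR epoetin) AND (anemia OR anaemia)
```

Expanding every term this way raises recall; the precision-preserving refinements are the subject of Section 3.2.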
The various types of documents in the patent system such as patents, court cases,
and scientific publications are very strongly inter-related, even though they are
semantically, syntactically, and structurally very different. For example, a patent
litigation document frequently refers to related patent numbers, scientific publications,
patent inventors and assignees, and domain experts such as authors and editors of
prominent journals. These cross-references are seen across all other document types in
28 Wikipedia article on Whales – http://en.wikipedia.org/wiki/Whale (Accessed on 03/01/2012).
29 The measures Precision and Recall are defined in Section 4.2.
the patent system and implicitly provide strong relevancy measures between
documents. Since our goal is not to produce the best search results for a query from
each information source in isolation, but rather to provide a set of strongly related
documents, we develop a Patent System Ontology (PSO) to formalize the
representation of documents and explicitly state the cross-references (or relations)
between them.
The use of domain knowledge and the patent system ontology provide the basis for
searching and integrating multiple sources. IR is seldom a one-step process, but in fact
a multi-stage process. Information from the results of one search forms the query for
another search and so on. For example, the search for prominent court cases could
potentially lead to a more focused search of the patents involved. Moreover, it is hard
to disambiguate the context of the user query in a single step and thus, user input must
also be given significance. We develop an IR framework which combines the use of
domain knowledge, the patent system ontology, and user feedback to provide a
powerful multi-domain search.
The rest of this chapter is organized as follows: Section 3.2 presents our
knowledge-based methodology to expand the user’s query in order to provide higher
recall. We realize that this alone is not sufficient to produce high-precision results,
and thus discuss strategies to provide high coverage yet acceptable precision. Section 3.3
presents the patent system ontology, which provides a structured and standardized
representation for the information sources in order to facilitate information source
interoperability. We present a detailed discussion regarding the development of the
ontology and its advantages. Section 3.4 presents the IR framework; an iterative
methodology to search and integrate information across multiple sources in the patent
system. The implementation details of our tool are briefly discussed in this section.
A plethora of related research forms the basis for our work; this related
research is discussed in Section 3.5.
3.2 BIO-ONTOLOGIES
Biomedicine and related fields are rapidly advancing, giving rise to an exponential
growth in information and data. This rise in information is exposing the lack of
standards for terminology, representation and information exchange within sub-
domains. This affects both researchers and applications which rely on the generated
biomedical data. For example, if researchers are allowed to coin their own term for an
existing concept each time they write about it, it would be impossible to maintain a
shared vocabulary and understanding between researchers in the domain. Over the
past decade, bio-ontologies have extensively been developed and used in the field of
biology. Bodenreider and Stevens argue that although ontologies initially started out
as primarily a Computer Science (CS) effort to help annotate biological data, the
ontologies have been increasingly adopted by the biologists themselves to annotate
and share biomedical data [15]. Also, unlike fields such as physics or chemistry,
biological data is seldom represented in pure mathematical form. Hence, sharing
knowledge has been the driving force for this transition from a pure CS effort to a
combined effort with biologists playing an equally important role [15]. The resulting
domain knowledge is being used by a wide range of applications including
genome/genotype/phenotype tagging [76], information retrieval [35,63], and cross-
database searching [70,103]. Such a wide range of applications clearly establishes
the significance of biomedical ontologies in such a rapidly advancing field. In this
section, we will explain how applications benefit from the use of bio-ontologies
through examples and discuss our methodology in using biomedical ontologies for
information retrieval in the patent system.
There are several initiatives and groups which develop and maintain biomedical
ontologies aimed at providing a shared vocabulary and advancing research in the
domain. The Gene Ontology (GO) provides a controlled vocabulary of terms for gene
and gene product characteristics [7]. On the other hand, the Symptom Ontology (SO)
covers purely signs and symptoms [122]. The ontologies vary drastically in their
domains (e.g. genes vs. symptoms), size (e.g. the GO has 35786 concepts while the
SO has 934 classes), and representation languages (OWL, OBO, RDF, etc.). This
results in inconsistencies between available biomedical ontologies. For example, if an
application needs to use two ontologies with completely different representation
languages, the application will most likely have to support two entirely different APIs.
These inconsistencies are resolved by BioPortal, an online open repository of over 250
biomedical ontologies in various forms such as OBO, OWL, RDF and Protégé Frames
[95]. BioPortal provides a thorough list of web services in order to query the ontology,
abstracting the various underlying formats to a standard API. The web services
provide a convenient and programmatic access to the biomedical ontologies that can
be conveniently integrated into several applications, avoiding the need to separately
index each of the ontologies. BioPortal has grown from 72 ontologies in 2008, to 134
ontologies in 2009, and continues to show an increasing number of ontologies being
added to the repository and is clearly the largest repository available online [95]. Our
work uses BioPortal for querying biomedical ontologies. However, based on the use
case and the data set that we are working with, we limit our usage of ontologies to a
much smaller subset to keep the results and methodology tractable. The selected
ontologies are summarized in Table 3.1.
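The programmatic access that BioPortal's web services provide can be sketched as follows. This is a minimal Python sketch, assuming the current data.bioontology.org JSON search API; the endpoint, parameter names, and the prefLabel/synonym response fields are assumptions (the services available at the time of this work differed), and the canned response below is illustrative, not real BioPortal output.

```python
import urllib.parse

BIOPORTAL_SEARCH = "https://data.bioontology.org/search"  # assumed endpoint

def build_search_url(term, ontology, api_key):
    """Build a BioPortal term-search URL restricted to a single ontology."""
    params = {"q": term, "ontologies": ontology, "apikey": api_key}
    return BIOPORTAL_SEARCH + "?" + urllib.parse.urlencode(params)

def extract_synonyms(response, term):
    """Collect prefLabel and synonym fields for classes matching `term`."""
    labels = set()
    for cls in response.get("collection", []):
        if cls.get("prefLabel", "").lower() == term.lower():
            labels.add(cls["prefLabel"])
            labels.update(cls.get("synonym", []))
    return sorted(labels)

# Offline demonstration with a canned response (illustrative values only):
canned = {"collection": [{"prefLabel": "chronic kidney disease",
                          "synonym": ["esrd", "end stage renal disease"]}]}
print(extract_synonyms(canned, "chronic kidney disease"))
# ['chronic kidney disease', 'end stage renal disease', 'esrd']
```

The single abstracted entry point is what lets an application swap ontologies without re-indexing each one separately.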
In order to understand the importance of bio-ontologies, let us consider an
example. Suppose we want to find any study on ‘chronic kidney disease’. In order to
limit our focus, we will add a geographic constraint on the study, such that any
reported results must be correlated with Tyrol, Austria. Hence, the formulated query
will look like: {‘chronic kidney disease’ AND Tyrol}. A search for this query on the
local TREC corpus retrieves zero documents. In this example (see Figure 3.1), we use
the National Drug File Ontology (NDF) to extract the semantics of the phrase ‘chronic
kidney disease’ to include synonyms such as ‘esrd’, ‘end stage renal disease’, ‘end
stage kidney disease’ and so on. The new query is represented following the PubMed
representation as follows: {‘chronic kidney disease’ [NDF] AND Tyrol} where
[NDF] indicates that the preceding term or phrase is expanded using the NDF
ontology. Upon further examination, we realize that the phrase ‘chronic kidney
disease’ is actually never used in the text of the document; however, its synonyms,
‘esrd’ and ‘end stage renal disease’ are used. This clearly shows that without the
synonymy knowledge from NDF, this document would have never been retrieved. We
also observe that the terms ‘DM’, ‘DM-2’ and ‘type 2 diabetes mellitus’ are used
synonymously. This shows inconsistent terminological usage not only between
documents and authors, but also within the same document. This example presents
a very restricted case. However, if we were to relax the constraints by moving up the
hierarchy in NDF to ‘kidney diseases’, we arrive at a slightly broader set of 3
publications in the TREC corpus. In fact, we could move to a geographically broader
region and query for ‘kidney diseases’ with correlation to Austria, which would result
in many more results.

Table 3.1: Summary of the Selected Biomedical Ontologies

Ontology | Number of Classes | Details
Medical Subject Headings (MeSH) | 229698 | National Library of Medicine’s controlled vocabulary and classification [N1].
National Cancer Institute Thesaurus (NCI Thesaurus) | 89129 | Clinical care and health care [G4].
National Drug File (NDF) | 40104 | Classification of drugs, ingredients and their clinical use [B5].
Gene Ontology (GO) | 35786 | Provides a controlled vocabulary for genes and gene product characteristics [A1].
COSTART | 1641 | Maintained by the Food and Drug Administration (FDA) for controlling adverse reaction terminology [C4].
Symptom Ontology | 934 | Provides a controlled vocabulary for signs and symptoms, and their relationships [S5].
International Classification of Diseases (ICD-9) | 21669 | Standard classification for diseases [W4].
The application of biomedical ontologies is not limited to biomedical publications
alone. Figure 3.2 illustrates the use of the NCI Thesaurus for information retrieval in
patent documents. Following the previous example in Figure 3.1, we search for the
concept ‘epor’ in the claims of all patents in our repository. The search results in zero
patents being retrieved, as in the previous example. NCI provides knowledge that
‘epor’ is synonymous to ‘erythropoietin receptor’ and ‘epo-r’. The new query thus
retrieves a total of 20 patents, each of which mentions the concept ‘epor’ in their
claims. Another interesting observation is that one of the retrieved patents is titled
“Use of cytokine receptors …”; ‘cytokine receptor’ is a parent concept of ‘epor’
according to the NCI Thesaurus. This expansion of the user query forms the basis of
our methodology.
Figure 3.1: The Importance of Domain Knowledge in Retrieving Scientific
Publications
The rest of this section is organized as follows: Section 3.2.1 discusses the general
form of the expanded query. Query expansion has been reported to perform
erratically; the techniques sometimes improve performance and at other times degrade
it [59]. Having established in the previous examples that synonymy is an important
aspect of query expansion irrespective of the type of document, we must understand
the causes of such erratic behavior of synonymy and related
expansions. Section 3.2.2 discusses the effects of choosing the correct source for query
expansions. Section 3.2.3 discusses how different indexing parameters such as scoring
functions can affect the search. Section 3.2.4 discusses the effects of varying the
granularity of the query, i.e. at the sentence level, paragraph level or the whole
document level. The effects of querying different fields in the documents such as the
Title, Abstract, etc., are also discussed. We realize that automatic expansion
techniques may not always produce good results. Hence, Section 3.2.5 discusses an
extension of a co-occurrence visualization based tool, MINOE [37], to allow users to
navigate ontology hierarchies and manually include search terms as an exploratory
model.

Figure 3.2: The Importance of Domain Knowledge in Retrieving Patent Documents
3.2.1 QUERY EXPANSION: GENERAL FORM
Query expansion techniques have been around in IR for quite some time [8,84].
They can be categorized into three general forms based on user assistance, manual
thesaurus, and automatic thesaurus construction [84]. Query expansion techniques
which rely on an external resource such as thesaurus or an ontology are increasingly
being adopted in IR methodologies [11]. In this section, we will focus on using
ontologies to expand the user’s initial query. A mathematical form for the same is
presented.
In addition to synonyms, domain ontologies provide additional relations between
terms in the form of hierarchical categorization into subclasses and super-classes (via
the rdfs:subClassOf relation). Figure 3.3 describes an example where both synonymy
relations and hierarchical relations are used to expand the user query. As an example, we
take the TREC topic 236 – “What [TUMOR_TYPES] are found in zebrafish?” and
attempt to illustrate how ontological relations are used. We assume the baseline query
for this topic is ‘Tumor AND Zebrafish’. For the sake of representation, we follow the
PubMed syntax, where ‘Tumor [MeSH]’ indicates that the term ‘Tumor’ is to be
expanded using the MeSH ontology. In order to extract synonyms from the MeSH
ontology, the baseline query ‘Tumor AND Zebrafish’ can be rewritten as ‘Tumor
[MeSH] AND Zebrafish [MeSH]’, which actually translates into:
Q: {Tumor OR Cancer OR Neoplasm …} AND {Zebrafish OR Danio Rerio …}30
30 The default query expansion uses the OR operator to expand synonyms and the AND operator
between search clauses.
However, this search results in a large collection of documents. We navigate the
MeSH hierarchy to include more specific concepts such as ‘Leukemia’ and
‘Melanoma’ and perform the search, which results in a smaller and possibly more
precise set of documents (see Figure 3.3). For the sake of readability, other sub-classes
of ‘Tumor’ are not displayed. In cases where the user query is more specific, it helps
to move up the hierarchy and include parent concepts as well. This form of vertical
expansion could proceed in both directions resulting in a query which looks like:
Qtumor := {Tumor OR Cancer OR Neoplasm OR Leukemia OR Melanoma OR
Diseases OR …}

Qtumor := {{Tumor OR Cancer OR Neoplasm OR …} OR {Leukemia OR Melanoma
OR …} OR {Diseases OR …} OR …}

Qtumor := {{synonyms} OR {children} OR {parents} OR …}

Figure 3.3: Query Expansion along MeSH Hierarchy to Retrieve Relevant Documents
Alternatively, this can be represented as a vector of terms:

Mtumor = [Tumor, Cancer, Neoplasm, …, Leukemia, Melanoma, …, Diseases, …]
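The grouped expansion above can be sketched in Python as follows; the actual implementation uses Lucene, and the mini-ontology entries here are hypothetical stand-ins for the synonym/child/parent knowledge that MeSH would supply for these concepts.

```python
# A toy ontology fragment; entries are hypothetical stand-ins for MeSH knowledge.
ONTOLOGY = {
    "tumor": {"synonyms": ["cancer", "neoplasm"],
              "children": ["leukemia", "melanoma"],
              "parents": ["diseases"]},
    "zebrafish": {"synonyms": ["danio rerio"], "children": [], "parents": []},
}

def expand_term(term):
    """Grouped expansion: {term + synonyms}, {children}, {parents}."""
    entry = ONTOLOGY.get(term.lower(),
                         {"synonyms": [], "children": [], "parents": []})
    groups = [[term] + entry["synonyms"], entry["children"], entry["parents"]]
    return [g for g in groups if g]  # drop empty groups

def to_boolean_query(terms):
    """OR within each term's expansion, AND between the query clauses."""
    clauses = []
    for t in terms:
        flat = [w for group in expand_term(t) for w in group]
        clauses.append("(" + " OR ".join(flat) + ")")
    return " AND ".join(clauses)

print(to_boolean_query(["Tumor", "Zebrafish"]))
# (Tumor OR cancer OR neoplasm OR leukemia OR melanoma OR diseases) AND (Zebrafish OR danio rerio)
```

Keeping the groups separate, rather than flattening immediately, is what later allows each group to carry its own weight.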
In fact, domain ontologies provide even more knowledge than synonyms and class
hierarchy. For example, NDF provides over 100 related drug names for the disease
‘Anemia’ (see Figure 3.4) via the ‘contraindicated_drug’ property. Although this is
very specific to this ontology, future enhancements to this methodology could include
such information, if it pertains to the user query. For example, if a query specifically
asked for drugs related to the disease ‘Anemia’ from the NDF ontology, the property
‘contraindicated_drug’ would be useful.
Figure 3.4: Relations in Domain Ontologies

Generally, including high level concepts will improve the recall but will hurt the
precision of the search. Hence, we would like to penalize the more general terms and
boost the more specific ones by weighting the query appropriately. This grouping
allows us to assign weights to query terms, giving us a means to control precision.
Therefore,
QTumor = Tumor [MeSH] => W^T · MTumor

where W is a vector of weights in the range [0, 1].
Different documents make different use of technical language. For example, a
court case makes far less use of technical jargon than a scientific publication would. If
the same expansion scheme is applied to both types of documents, the results could be
imprecise. In some cases, it helps to expand to more general terms and in other cases
to more specific terms. Hence, it is important to estimate what form of expansion is
appropriate for different types of documents. Also, we cannot apply the same ranking
schemes or query expansion schemes to all types of documents alike. Therefore, we
define independent weight vectors for the expanded query as appropriate for each
information source (see Figure 3.5). Hence the resulting queries for patents and court
cases are:
QPatent,Tumor = WPat^T · MTumor
QCase,Tumor = WCase^T · MTumor

where WPat and WCase are different weight vectors corresponding to the patent and
court case information sources, respectively.
A similar procedure is followed for the other bio-terms in the query. Ideally, the
weight vectors should be learnt but for now we will heuristically assign weights to the
expanded query.
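The per-source weighting can be sketched as follows. The weight values are heuristic placeholders (not learned), and the term^boost rendering assumes a Lucene-style boost syntax.

```python
# Term vector M_tumor, grouped as [synonyms..., children..., parents...].
M_TUMOR = ["tumor", "cancer", "neoplasm", "leukemia", "melanoma", "diseases"]

# Heuristic weight vectors in [0, 1] (placeholder values, not learned):
# patent text keeps specific expansions strong; court-case text is weighted
# toward the exact user term and penalizes broad expansions.
W_PATENT = [1.0, 1.0, 1.0, 0.8, 0.8, 0.3]
W_CASE   = [1.0, 0.6, 0.6, 0.4, 0.4, 0.1]

def weighted_query(terms, weights):
    """Pair each expansion term with its boost, e.g. Lucene's term^boost."""
    return [f"{t}^{w}" for t, w in zip(terms, weights) if w > 0]

print(weighted_query(M_TUMOR, W_PATENT))
```

The same term vector thus yields a differently weighted query per information source.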
3.2.2 EFFECTS OF CHOOSING THE RIGHT ONTOLOGY
The methodology described in Section 3.2.1 expands the query terms to related
terms using one or more biomedical ontologies. Using multiple ontologies for
expansion can potentially increase the coverage (recall) of the search. Such high
recall may be desirable for some applications. However, as explained earlier in
Section 3.2, each group developing biomedical ontologies focuses on a different
sub-domain; hence, using multiple ontologies can improve recall but lower precision.
In this section, we outline potential
sources of imprecision by comparing several ontologies and discuss how selection of
ontologies can affect the search results.
Figure 3.6 shows a comparison of three different ontologies for the same concept
‘erythropoietin’. Each ontology classifies the concept under different contexts. For
example, NCI thesaurus classifies ‘erythropoietin’ in the context of a ‘protein’ or
‘amino acid’, while NDF additionally classifies ‘erythropoietin’ in the context of a
‘carbohydrate’ and a ‘hormone’.

Figure 3.5: General Form of the Expanded Query

While these higher level contexts are still highly
related, expansion along NDF may result in terms such as ‘carbohydrates’,
‘chemical’, and ‘drug’, which will not be derived from NCI. Moreover, concepts from
different ontologies may contain conflicting information. For example, ‘epoetin alfa’
and ‘erythropoietin’ are synonymous as per NCI thesaurus and have a hyponym-
hypernym relation in NDF. While the knowledge provided by both the ontologies is
correct in the context under which they are classified, choosing one ontology over the
other could alter our search results. Furthermore, choosing both ontologies may
produce a conflict as to whether the term ‘epoetin alfa’ should be considered a
synonym or a hyponym. Depending on its vocabulary, some query terms may not
even be covered under an ontology’s domain. For example, GO classifies
‘erythropoietin receptor binding’ (a synonym of ‘erythropoietin’) as a ‘molecular
function’.
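Detecting such cross-ontology conflicts can be sketched as follows; the relation tables are hypothetical reductions of the NCI/NDF disagreement discussed above.

```python
# Hypothetical relation maps, (term_a, term_b) -> relation, one per ontology;
# the entries condense the NCI/NDF disagreement described in the text.
NCI = {("epoetin alfa", "erythropoietin"): "synonym"}
NDF = {("epoetin alfa", "erythropoietin"): "hyponym"}

def find_conflicts(*ontologies):
    """Return term pairs whose asserted relation differs across ontologies."""
    seen = {}
    for onto in ontologies:
        for pair, rel in onto.items():
            seen.setdefault(pair, set()).add(rel)
    return {pair: rels for pair, rels in seen.items() if len(rels) > 1}

conflicts = find_conflicts(NCI, NDF)
print(sorted(conflicts[("epoetin alfa", "erythropoietin")]))  # ['hyponym', 'synonym']
```

Flagging such pairs lets an application fall back to the weaker relation (hyponym) or defer the decision to the user.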
In the case of certain queries, selection of ontologies is very obvious and easy. For
example, an obvious source for gene names is GO and an obvious source for drug
names is NDF. Additionally, it also helps if the user of an application knows exactly
which ontology to select for query expansion. However, given that (a) not all queries
can be disambiguated easily, and (b) not all users are domain experts, some criterion
for ontology selection becomes important.

Figure 3.6: Comparison between Multiple Biomedical Ontologies

In the description of the TREC data set by
Hersh et al [59], a source of terms is suggested for each of the 36 topics. This is a
useful starting point for query expansion experiments. In
reference to automated ontology evaluation and selection, Sabou et al [110] claim that
ontology selection is generally based on algorithms which compute the popularity
[32], richness of knowledge [3], and topic coverage [22]. Maiga and Williams present
a user-input based ontology evaluation and selection tool [83]. However, for our
problem, where we are given a query and need to choose the appropriate ontology, the
problem of ontology selection strongly depends on the context more than ontology
parameters such as size and popularity. A potential research direction could involve
studying how word sense disambiguation techniques and simple classification models
can help in ontology selection [91,102].
Although this provides an exciting sub-topic for research, it is outside the scope of
this thesis. Thus, we will manually choose ontologies in order to perform query expansion.
In Chapter 4, we perform experiments by manually choosing ontologies and study the
effect of ontologies on information retrieval in the patent system.
3.2.3 EFFECTS OF INDEXING PARAMETERS
Section 2.4 discusses several parameters including choice of tokenizers, stop word
lists, stemmers and scoring functions that can be manipulated when indexing
documents. These parameters could apply to specific fields or the entire document
index and eventually affect Information Retrieval (IR) either positively or negatively.
For example, experiments by Ide et al suggest that morphological expansion provides
better results than using stemming [63]. Also, instead of using the standard English
stop word list, some studies use special stop word lists in order to filter common
words specific to that domain [138]. Other possible variations in indexing techniques
include the usage of different tokenizers. The standard English tokenizer ignores
punctuation, white space and special symbols. However, in the biomedical domain,
several names of genes or drugs are combinations of special symbols, numbers and
characters, such as ‘BRCA-1’ and ‘p53-gene’. Ide et al claim they achieved the best
results for tokenizers which indexed at the most granular level, and then combined all
characters to form the original biomedical term during the querying phase [63]. Most
of the variations discussed so far account for little improvement in the overall
performance of an IR system [59]. Scoring functions are another important parameter
to choose when constructing text indexes. Okapi-BM25 and BM25F are variations of
the original tf-idf scoring model, which have shown improved IR performance
[84,108]. BM25 is defined by [108]:

score(D, Q) = Σi IDF(qi) · tf(qi, D) · (k + 1) / ( tf(qi, D) + k · (1 − b + b · |D| / avgdl) )

where tf(qi,D) represents the term frequency of query term qi in document D, |D| is the
length of the document and avgdl is the average length of all documents. Constants k
and b are usually chosen to be in the range [1.2, 2.0] and 0.75 respectively. The
inverse document frequency is defined by [108]:

IDF(qi) = log( (N − n(qi) + 0.5) / (n(qi) + 0.5) )

where N is the total number of documents in the corpus and n(qi) is the number of
documents containing the query term qi.
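The BM25 scoring function defined above can be transcribed directly into Python; k = 1.5 is chosen from the stated range, and the tiny corpus statistics below are purely illustrative.

```python
import math

def idf(N, n_q):
    """Inverse document frequency, as defined above."""
    return math.log((N - n_q + 0.5) / (n_q + 0.5))

def bm25(query_terms, doc_terms, doc_freqs, N, avgdl, k=1.5, b=0.75):
    """Okapi BM25 score of one document (a list of tokens) for a query."""
    score, dl = 0.0, len(doc_terms)
    for q in query_terms:
        tf = doc_terms.count(q)
        if tf:
            score += idf(N, doc_freqs.get(q, 0)) * tf * (k + 1) / (
                tf + k * (1 - b + b * dl / avgdl))
    return score

# Illustrative statistics: 10 documents, 'epor' occurring in 2 of them.
doc = ["epor", "binds", "erythropoietin", "epor"]
print(round(bm25(["epor"], doc, {"epor": 2}, N=10, avgdl=4.0), 3))  # 1.748
```

Note how the length normalization (the b term) damps the score of long documents, which matters for full-text patents versus short abstracts.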
In our methodology, we implement both scoring functions and study how they affect
the information retrieved, especially in combination with the extracted domain
knowledge. Pérez-Iglesias provides a Lucene implementation of the
BM25 scoring function, which makes it easier to integrate with our work flow and
framework [66].
3.2.4 SCOPE OF THE QUERY TERMS
In this section, we experiment with the scope of the query terms using two
parameters – (1) limiting terms to specific fields such as titles and abstracts; and (2)
the distance between multiple (AND) clauses in the query.
Different fields of a document, such as the title and abstract, provide different depths
of detail about the document. While the abstract may provide an overview, the title
most often tries to capture the major theme in a single sentence. Following this notion,
we assume that the terms appearing in the title can potentially act as strong descriptors
of the document. Certain patent-related applications place special emphasis on the
terms used in the claims, rather than other fields.
Scientific publications available from PubMed do not always contain the full-text. In
fact, the documents are indexed with their descriptors, which are derived from the
MeSH vocabulary. We study the effect of the field of search in our methodology, by
limiting searches to specific fields of the documents such as the title, abstract, MeSH
descriptors for PubMed documents and so on. Based on the results, it would be
possible to derive an interpolated model such that each field is individually weighted
in accordance with its importance for that specific application.
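Such an interpolated model can be sketched as follows; the field weights are hypothetical placeholders that would in practice be tuned per application.

```python
# Hypothetical per-field weights for a patent search application; in practice
# these would be tuned (or learned) for the target application.
FIELD_WEIGHTS = {"title": 3.0, "claims": 2.0, "abstract": 1.0}

def interpolated_score(field_scores, weights=FIELD_WEIGHTS):
    """Combine per-field retrieval scores into a single document score."""
    return sum(weights.get(field, 0.0) * s for field, s in field_scores.items())

# A strong title match outranks an abstract-only match.
print(round(interpolated_score({"title": 0.9, "abstract": 0.1}), 2))  # 2.8
print(round(interpolated_score({"abstract": 0.9}), 2))                # 0.9
```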
When searching for specific content, very short queries (e.g. single term queries)
will not be very effective due to the volume of available information. Adding more
terms to a query in AND clauses is equivalent to adding more constraints, thus,
making the search more specific. However, the tf-idf model can give a high score to
documents which contain the search clauses, even if they are not in relation to one
another. To ensure that the documents more relevant to the query get a higher score,
we impose a distance constraint on the search clauses, following the intuition that, in a
relevant document, the clauses will not be very far from one another. Table 3.2
shows the retrieval results for the TREC topic 231 for different distances between
search clauses. This preliminary experiment validates our hypothesis.
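The distance constraint can be sketched as a positional check over term positions; this is a simplification of what a proximity query in a search engine such as Lucene performs, and the sample document is illustrative.

```python
def within_distance(doc_terms, clause_a, clause_b, max_dist):
    """True if some occurrence of clause_a lies within max_dist term
    positions of some occurrence of clause_b."""
    pos_a = [i for i, t in enumerate(doc_terms) if t == clause_a]
    pos_b = [i for i, t in enumerate(doc_terms) if t == clause_b]
    return any(abs(i - j) <= max_dist for i in pos_a for j in pos_b)

doc = ["chronic", "kidney", "disease", "was", "studied", "in", "tyrol"]
print(within_distance(doc, "disease", "tyrol", 25))  # True
print(within_distance(doc, "disease", "tyrol", 2))   # False
```

Applying this as a post-filter on tf-idf results is what produces the precision jump seen in Table 3.2.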
3.2.5 INTERACTIVE MODEL FOR VISUALIZATION
Information needs come at various abstraction levels. For example, the TREC
topic 231 (“What [TUMOR_TYPES] are found in Zebrafish?”) has a very specific
information need. The expected results are also very precise, usually within a few
sentences. In contrast, a general search for technologies in the medical imaging space
is much broader, resulting in a much larger number of documents. These varying
abstraction levels of information needs are hard to capture in the user’s query. As
an alternative to the automatic query expansion, we developed a visual exploratory
model based on term co-occurrence.
Term co-occurrence is a strong indicator of context and association. A visual
model of term co-occurrence can provide significant information about the query
terms, their association, and other surrounding terms. We extend the visualization
module of such a co-occurrence based model, MINOE, originally designed for
exploring marine ecosystems [37]. In Section 3.2.1, we explained that both vertical
and horizontal (synonym) expansion of terms are useful. As an alternative to the
automatic query expansion, we annotate MINOE’s visual co-occurrence graphs with
domain ontologies, to allow users to manually explore the hierarchies of biomedical
ontologies over the document repository.
Table 3.2: Effect of the Distance between Search Clauses

Distance | Precision | Recall | F-Measure
Entire Document | 0.03 | 0.876 | 0.05
Within 25 terms | 0.574 | 0.876 | 0.69
The user interface for this tool is flexible and has several features (see Figure 3.7).
Each term represents the entire concept including its synonyms and hence, the search
will automatically include all synonyms. The term connections represent an
association (co-occurrence) between two terms. The sizes of the terms and of the
connections on the graph represent the strength of the association. The users can
navigate hierarchies by choosing to add child concepts or parent concepts until a
satisfactory abstraction level is reached. This integration of domain ontologies and
MINOE’s visualization module results in a powerful exploratory tool.
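The co-occurrence statistic underlying such graphs can be sketched as follows; MINOE additionally maps terms to ontology concepts and works at the sentence level, whereas this document-level sketch over illustrative sample texts only conveys the counting.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for each unordered term pair, the documents where both occur."""
    counts = Counter()
    for doc in documents:
        for pair in combinations(sorted(set(doc.lower().split())), 2):
            counts[pair] += 1
    return counts

docs = ["EPOR binds erythropoietin",
        "erythropoietin treats anemia",
        "EPOR binds erythropoietin"]
counts = cooccurrence_counts(docs)
print(counts[("binds", "epor")])             # 2
print(counts[("anemia", "erythropoietin")])  # 1
```

These pair counts drive both the edge thickness in the visualization and the suggestion of surrounding terms to add to the query.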
3.3 PATENT ONTOLOGY
Interoperability between information sources is essential in order to perform multi-
source IR. In this section, we describe a patent system ontology which provides
standardized representation and a shared vocabulary of the information sources to
facilitate interoperability. The ontology will also provide the required declarative
syntax to express multi-source queries, rules, and relevancy metrics.
Figure 3.7: Visualizing Concept Co-occurrences using MINOE
There is a large community working towards the development of ontologies,
knowledge representation, and engineering [16,28,53,54,55,72,81,93,94,127,128].
Several ontology development methodologies have been proposed and implemented
over the years. We reviewed some of the methodologies which are most applicable to
the development of our patent system ontology [20,28,54,93]. In general, the
development of ontologies consists of several steps: conceptualizing the domain,
defining the properties inter-relating the defined classes, instantiating the classes with
physical objects, and verifying the constructed ontology.
their paper Ontology 101, Noy and McGuinness state that ontology development is
essentially an iterative approach where the ontology evolves to satisfy the
requirements of the application it is being designed for [93]. We follow the Ontology
101 development methodology to (1) define the scope and the application of the
ontology; (2) conceptualize each information source and build a hierarchy of classes;
and (3) define properties and relations on each of the classes. The resulting ontology is
instantiated with actual physical documents from the document repository.
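The flavor of the resulting ontology can be sketched with plain triples. The class, property, and instance names below are illustrative simplifications, not the ontology's exact vocabulary; in the actual ontology these are OWL axioms authored in Protégé.

```python
# Minimal triple-store sketch of PSO classes, properties, and instances.
# Names are illustrative simplifications of the actual ontology vocabulary.
triples = set()

def add(s, p, o):
    triples.add((s, p, o))

# Class hierarchy: patents and court cases are both documents.
add("Patent", "subClassOf", "Document")
add("CourtCase", "subClassOf", "Document")

# Instances and an explicit cross-reference between two information sources.
add("patent_5955422", "type", "Patent")
add("case_amgen_v_hmr", "type", "CourtCase")
add("patent_5955422", "citedByCase", "case_amgen_v_hmr")

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("patent_5955422", "citedByCase"))  # {'case_amgen_v_hmr'}
```

Making the cross-reference an explicit triple, rather than a string buried in a document, is precisely what enables queries to traverse from one information source to another.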
It is important to determine the specification language in which the ontology will
be coded. Several specification languages have evolved over the years including frame
based languages such as F-Logic and OIL, and descriptive logic based languages such
as DARPA Agent Markup Language and Ontology Inference Layer (DAML+OIL),
Resource Description Framework (RDF) and Web Ontology Language (OWL)
[98,107]. Description Logic (DL) based languages were developed to overcome the
lack of formal logic-based semantics in frame-based languages. Several factors need to
be considered when choosing a specification language for the ontology, including
expressivity, semantics, reasoning capabilities, availability of tools, re-use and
personal preference. RDF is a widely used language to conceptualize domains. OWL
is a W3C recommendation which is built on top of the semantics of RDF to provide
higher expressivity levels. These allow us to define, among others, disjoint classes,
‘sameAs’ or different individuals, and class property restrictions
[98]. Several tools have also been developed for the construction and modeling of
ontologies such as Protégé and Chimaera [24,104]. Protégé is widely used in the
ontology engineering community. Protégé supports both OWL and RDF, and provides
useful features and plugins allowing us to query and visualize the ontology. Taking
into account the above mentioned considerations, we choose OWL as the specification
language and Protégé-3.4 as our development tool for the patent system ontology.
However, not all OWL axioms scale well and hence, to the extent possible, we
make maximum use of the RDF subset of the OWL axioms.
The rest of this section is organized as follows: Section 3.3.1 presents a list of
competency questions which are used to define the scope of the ontology and perform
a preliminary evaluation in Section 3.3.4. The generated competency questions are
typical application scenarios and directly reflect the potential of the ontology. In
Section 3.3.2, the domains are conceptualized and classes are extracted based on the
competency questions. Relations are defined over the classes and cross-references are
explicitly stated. The resulting ontology is populated with actual instances of physical
documents for further evaluation and use. The current scope of the ontology is limited
to patents, court cases, and file wrappers.
3.3.1 DEFINING SCOPE OF THE ONTOLOGY
Ontologies are typically developed with specific applications as targets. Gruninger
and Fox suggested that a set of competency questions be developed; these are
questions that the ontology is expected to answer [54]. Developing these questions not
only helps define the scope of our ontology but also allows us to verify the usefulness
of the ontology both throughout and after the development phase [93]. In Chapters 1
and 2, we mentioned a few of our target applications such as patent prior art search,
patent claim invalidation, and patent infringement analysis. These applications are not
very different from one another and, in fact, go hand in hand in most scenarios.
Keeping these applications in mind, we define a set of competency questions, some
confined to a single domain such as patents and others spanning multiple domains. The
competency questions in no way limit the use of the ontology to these applications
alone, rather they are examples of questions the ontology must be capable of
answering at the minimum. The list of competency questions presented is not meant to
be an exhaustive list, but to illustrate how the metadata and text fields parsed from the
documents in Chapter 2 are used in the context of the patent-related applications.
Patent Domain:
- Return all patent documents which contain the phrase 'recombinant erythropoietin
receptor' in the claims
- Return all patent documents which contain the phrase 'recombinant erythropoietin
receptor' and at least 3 claims, were issued before 02-02-1999, and are assigned
to Genetics Inc.
Court Case Domain:
- Return all court cases which contain the term 'erythropoietin'
- Return all court cases which involve the company Amgen Inc. either as the
plaintiff or defendant, and from the District Court of Massachusetts
Scientific Publication Domain:
- What percentage of articles in the journal Blood are contributed by authors
located outside the US?
- Return all articles by author John Doe from the journal Nature
Multi-domain:
- Return all patents which contain the term 'erythropoietin' in their claims, and
which are involved in at least one court litigation
- Search the titles of scientific publications for the terms from the claims of patent
5,955,422
The questions can get more complex depending on the requirements of the user.
The results of one query can further be re-filtered with additional constraints:
- Return all court cases with the term 'erythropoietin'. From these court cases,
return the patents involved. From these patents, follow the backward and forward
citations to identify more important patents.
Notice that the last bullet point is the method we followed to identify the 5 core
patents assigned to Amgen, and the 135 patents relevant to our use case. In each of the
questions, the main terms (or objects) indicate that there is some relationship
between them. First, these terms are grouped together into concepts
or classes such that they represent a collection of items corresponding to that term.
Second, relations are drawn between classes such that the competency questions can
be sufficiently expressed as a query using those classes and relationships. This is also
known as a bottom-up approach in constructing an ontology.
Relations in OWL are binary relations, i.e. they can be used to relate exactly two
classes, two individuals or an individual to a value. These can be represented in triple
form as {subject, predicate, object}. The values that the subject and object take on can
be restricted by defining the domain and the range of the relation; where domain refers
to the subject end of the relation and range refers to the object end of the relationship
[61]. OWL additionally allows us to define logical characteristics such as transitivity
and symmetry on these binary relations which enhance the meaning of this relation.
For example, if the '=>' relation is defined as a symmetric relation, then {A => B} can
be used to infer {B => A}; if it is defined as transitive, then {A => B} and {B => C}
can be used to infer {A => C}. Hence, if properly defined, new knowledge can be derived
from existing knowledge. Additionally, we can define necessary and sufficient
conditions on classes which can be used to logically classify instances into classes
[61]. For example, we could define a patent document to be a document with exactly
one Title and Abstract, and at least one Claim. This means, even if we don’t explicitly
state that a certain document with exactly one title and abstract, and at least one claim
is a Patent, it can be inferred. However, as mentioned in Chapter 2, the information
sources are very diverse, and this leads to many issues when extracting information
from the documents. Potential issues include erroneous or missing information; hence, if we
were to define very strict properties, then a patent document could be misclassified
because some information was missing. For this reason, we relax the properties on the
relations in our implementation of the patent ontology.
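To make these property characteristics concrete, the following is a minimal Python sketch (not part of the thesis implementation, which relies on OWL reasoners such as Pellet) of how symmetric and transitive declarations allow new triples to be inferred from existing ones; the relation name and facts are hypothetical:

```python
# Minimal forward-chaining sketch of OWL property characteristics.
# Relation names and facts below are hypothetical illustrations.

def close_over_characteristics(triples, symmetric=(), transitive=()):
    """Repeatedly apply symmetry and transitivity until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in inferred:
            if p in symmetric:
                new.add((o, p, s))              # {A p B} infers {B p A}
            if p in transitive:
                for (s2, p2, o2) in inferred:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))     # {A p B}, {B p C} infer {A p C}
        if not new.issubset(inferred):
            inferred |= new
            changed = True
    return inferred

facts = {("A", "relatedTo", "B"), ("B", "relatedTo", "C")}
closed = close_over_characteristics(facts, transitive={"relatedTo"})
```

With 'relatedTo' declared transitive, the closure additionally contains the triple ("A", "relatedTo", "C"), mirroring how a reasoner derives new knowledge from existing knowledge.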
3.3.2 CONCEPTUALIZATION
Figures 3.8 and 3.9 show a conceptual view of the patent and court case
documents respectively. The relations between two entities (shown as a black line) are
directional from patents and court cases out to other classes, e.g. {Patent, hasTitle,
Title}. The relations are not symmetric and hence the inverse {Title, hasTitle, Patent}
does not hold true. In both Figures 3.8 and 3.9, we notice that the remaining classes
can be grouped under either metadata or textual information. This form of
classification helps to address all the metadata at once, instead of individually calling
out to each one. For example, if an application requests all metadata of a patent,
using the ontology we can return all metadata entities such as Title, Date,
Classification, etc. We can further group metadata and textual information into a
single parent node Information. When the patent and court case hierarchies are
combined, classes which are common to both documents will refer to the same
concept and not two different concepts.
This form of abstraction is not only possible for classes, but also for relations,
made possible by the rdfs:subPropertyOf construct. Court cases and Patents are related
to each of the classes shown in Figures 3.8 and 3.9. These relations, such as ‘hasTitle’,
‘hasAbstract’, and ‘hasPlaintiff’, etc., can also be abstracted into a common parent
relation ‘hasInformation’. This relation has a domain of either Patent or Court Case
and Information as a range.
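In OWL/Turtle syntax, this abstraction could be sketched roughly as follows; the namespace prefix and IRIs are illustrative placeholders, not the actual identifiers used in the patent system ontology:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix :     <http://example.org/patent-system#> .   # hypothetical namespace

:hasInformation a owl:ObjectProperty ;
    rdfs:domain [ a owl:Class ; owl:unionOf ( :Patent :CourtCase ) ] ;
    rdfs:range  :Information .

# Specific relations inherit from hasInformation via rdfs:subPropertyOf
:hasTitle a owl:ObjectProperty ;
    rdfs:subPropertyOf :hasInformation .

:hasPlaintiff a owl:ObjectProperty ;
    rdfs:subPropertyOf :hasInformation ;
    rdfs:domain :CourtCase .
```

Because hasTitle is a subproperty of hasInformation, any triple asserted with hasTitle is also entailed for hasInformation, which is what allows an application to ask for "all information" of a patent in one query.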
Figure 3.9: Conceptual View of Court Case
Figure 3.8: Conceptual View of Patent Documents
File wrappers are not documents themselves, but collections of documents.
This makes modeling file wrappers trickier than other documents such as patents
and court cases. Firstly, a vocabulary of all kinds of documents contained within the
file wrapper must be defined. Since each of these documents refers to a particular
event of communication between the applicant and patent office, we will call it Event
instead of document to avoid confusion between the class Document and a file
wrapper event. The events of importance to us are shown in Figure 3.10. We group
application events and office actions separately to allow representation of queries such
as – “Return all office actions for file wrapper A”. Each file wrapper event must be
individually modeled keeping in mind the information it contains. For example, each
examiner Rejection contains critical information such as – the allowed claims, the
rejected claims, and the withdrawn claims (see Figure 3.11). Similarly, other events
such as Interference, Restriction, and Amendments are also modeled in our patent
system ontology.
The Patent, Court Case, and File Wrapper classes shown in Figures 3.8-3.10 are
different types of documents available from different information sources. The patent
system comprises many such information sources and many such documents. In the
top level ontology for the patent system (shown in Figure 3.12), all types of
documents are abstracted into a single parent class (Document). The Document class
can be sub-classed any number of times to include other forms of documents, such as
regulations and laws, which are currently not in the scope of our study. The classes
Document, Information, and Event correspond to the three root nodes of the patent
system ontology. Additionally, classes such as Inventor, Examiner, Author, and Judge
can be abstracted into a common parent node such as Person.

Figure 3.10: Events Contained in a File Wrapper
As mentioned earlier, information sources in the patent system implicitly cross-
reference one another (see Figure 3.13). These implicit cross-references serve as
relevancy measures when comparing documents from different information silos.
When manually comparing two documents, these cross-references are rather obvious
to the human eye. For example, a human could easily spot a reference to a patent
document in the court case. These references can very quickly help identify relevant
documents to a user query. The true power of the patent system ontology lies in the
ability to integrate information across multiple information sources. The patent system
ontology is extended to relate two classes or individuals from different domains to
explicitly represent the cross-references. Applications built around the patent system
ontology can dynamically derive relevancy based on the pre-defined cross-references.
Figure 3.11: Excerpt from the Patent System Ontology: Rejection class
Figure 3.13: Cross-Referencing between Documents in the Patent System
Figure 3.12: Top Level Ontology for the Patent System
3.3.3 POPULATING THE ONTOLOGY
The ontology is populated with information from actual physical documents from
the document repository. The XML files are parsed and for each parent-child node –
(a) Instances of both parent and of the child are created. If these instances already
exist, they are only updated with new information, if any
(b) The parent and child instances are related to one another through the
appropriate object or data-type property. If the property does not exist, it will
be created (see Figure 3.14).
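Steps (a) and (b) can be sketched as follows; this is a simplified stand-in for the Jena-based implementation, and both the XML layout and the 'has<Child>' property-naming convention are assumptions made for illustration:

```python
# Simplified sketch of ontology population from parsed XML (illustrative only).
import xml.etree.ElementTree as ET

def populate(xml_text):
    """For each parent-child node, create/update instances and relate them
    with a 'has<Child>' property (naming convention assumed for this sketch)."""
    instances = {}   # class name -> set of instance values
    triples = set()  # (subject, property, object)
    root = ET.fromstring(xml_text)

    def visit(parent):
        instances.setdefault(parent.tag, set())
        for child in parent:
            instances.setdefault(child.tag, set())
            value = (child.text or "").strip()
            if value:
                instances[child.tag].add(value)   # update only if new info exists
            # relate parent and child through an object/data-type property
            triples.add((parent.tag, "has" + child.tag, value or child.tag))
            visit(child)

    visit(root)
    return instances, triples

doc = ("<Patent><Title>Erythropoietin receptor</Title>"
       "<Claim>1. A method...</Claim></Patent>")
instances, triples = populate(doc)
```

Running this on the toy document yields triples such as (Patent, hasTitle, "Erythropoietin receptor"), analogous to the parent-child instantiation described above.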
The instantiation is done automatically using the standard Jena and Protégé Java
libraries. Once the instantiation is complete, an OWL reasoner such as Pellet is
triggered to check for consistency and make inferences. For example, an entity in the
class ‘Patent’ will be additionally classified as a ‘Document’, since ‘Patent’ is a
subclass of ‘Document’. The current version of the knowledge-base is populated with
the 1150 U.S. patents and 30 court cases from our corpus. Other patent documents
which may have been found in court cases or through patent citations, but are not
among the original 1150 documents, are instantiated but contain no information about
the patent, since the original document itself is unavailable in our corpus. However, we
ignore any documents which are not a part of our corpus when performing the tests.
The file wrapper for U.S. patent 5,955,422 has also been partially incorporated into the
knowledge-base. Currently, only the first amendment, rejection, interference, and the
original application from the file wrapper are populated.

Figure 3.14: Populating the Patent System Ontology
Triple stores are specialized databases for managing large amounts of information
written in RDF [18,89,96]. Most triple stores also have limited support for OWL. Due
to the size of the ontology, we create a local instance of a triple store (Virtuoso) and
store all the triples in it. Using a triple store allows us to scale our ontology to
millions of instances. Moreover, ontology editors such as Protégé require loading the
ontology each time the application is executed; a triple store provides a persistent
store for the triples and significantly lowers the loading time. Currently, the ontology
can be queried using SPARQL through both the Protégé and Virtuoso interfaces [6,104].
3.3.4 USING THE DECLARATIVE SYNTAX: EXPRESSING QUERIES AND DEVELOPING
RULES
The patent system ontology provides declarative syntax (RDF and OWL) to
express queries and rules to embed heuristics. This section provides examples to
understand how this is possible.
3.3.4.1 Expressing Competency Questions as SPARQL queries
Table 3.3 shows examples of how a natural language question can be represented
in SPARQL to query the ontology, as long as the classes and relations required to
express the query are defined in the ontology. The queries do not always have to
return documents, but can return other classes like Inventors or Examiners as well.
These SPARQL queries will generally be handled at the application level and will be
abstracted from users. Applications can request any information they want from the
ontology. In fact, even the applications do not have to fully know the details of the
ontology. The ontology can be queried for all its relations for a particular class or
between two classes. For example, the query:
SELECT ?rel WHERE {
?pat type Patent .
?pat ?rel Information
}
will return all relations (variable ?rel) which have the class Patent as the domain. In
other words, all relations defined on patents such as hasTitle, hasAbstract,
hasIPCClass, etc., will be returned. Hence, updating the underlying ontology with new
information will automatically update the application using it as well.
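As a toy illustration of how such a triple pattern resolves, consider matching a {subject, predicate, object} pattern against a small set of triples; the data and the match function below are hypothetical simplifications of what a SPARQL engine does:

```python
# Toy triple-pattern matcher (a simplification of SPARQL basic graph matching).
def match(triples, pattern):
    """Match one (s, p, o) pattern; strings starting with '?' are variables."""
    results = []
    for triple in triples:
        binding = {}
        ok = True
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                if binding.get(pat, val) != val:   # variable already bound elsewhere
                    ok = False
                    break
                binding[pat] = val
            elif pat != val:                       # constant must match exactly
                ok = False
                break
        if ok:
            results.append(binding)
    return results

# Hypothetical triples for one patent and one court case
triples = [
    ("pat1", "hasTitle", "Erythropoietin receptor"),
    ("pat1", "hasIPCClass", "A61K"),
    ("case1", "hasPlaintiff", "Amgen Inc."),
]
relations = {b["?rel"] for b in match(triples, ("pat1", "?rel", "?o"))}
```

The ?rel variable binds to every relation asserted on the patent (here hasTitle and hasIPCClass), which is the mechanism that lets an application discover the ontology's relations at runtime.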
3.3.4.2 Expressing Heuristics as Rules
Rules are declarative statements which operate over the entities defined in the
ontology. This provides a way to express relations that are more than simple binary
relations using if-then clauses. The Semantic Web Rule Language (SWRL), which
combines OWL and RuleML, extends the expressivity of OWL [62]. An inference
engine or a reasoner executes the rules and infers new facts in the knowledge base.

Table 3.3: Expressing Competency Questions in SPARQL

Competency Question: Return all court cases which involve the company Amgen Inc.
as the plaintiff and from the District Court of Massachusetts
SPARQL Query:
SELECT ?case WHERE {
  ?case type CourtCase .
  ?case hasPlaintiff "Amgen Inc." .
  ?case hasCourt "District Court…"
}

Competency Question: Return all patents which contain the phrase 'recombinant
erythropoietin receptor' in the claims and IPC class "A61K"
SPARQL Query:
SELECT ?pat WHERE {
  ?pat type Patent .
  ?pat hasClaim ?clm .
  ?clm hasTerm1 "recombinant …" .
  ?pat hasIPCClass "A61K" .
}
SWRL, however, comes at the price of decidability and computational complexity [101].
The use of DL-safe rules has been suggested to keep the complexity reasonable [87]. We use the
Pellet reasoner and Jess inference engine to reason over the developed rules [41,117].
The rules are developed based on similarity heuristics between documents. Examples
of the heuristics for the rules are shown in Table 3.4.
The rules operate over the metadata and cross-referenced properties defined in the
patent system ontology and infer pairs of similar documents. In order to differentiate
between the inferences made by each rule, we define a property
hasSimilarDocument_* for each rule, where * indicates the identifier for the rule. This
allows us to apply several weighting schemes to the rules to distinguish between the
more general and the more important specific rules. To illustrate, consider the example
shown in Figure 3.15, where Patents 1 and 2 are both owned by the same company
‘Amgen’ and invented by the same inventor. According to our rule base, these patents
should be considered similar to one another according to at least two rules. However,
intuitively, a large company such as Amgen is likely to own patents covering a
broader range of topics than a single inventor would. If Amgen has ‘n’ patents, then
we will assume each link contributes a weight of 1/n. Similarly, if Inventor1 has 'm'
patents, then each link has a weight of 1/m. Since n > m, the more general rules would
be assigned a lesser weight. The resulting similarity score between the documents is a
weighted sum over the rules that infer the two documents as similar:

Score(A, B) = Σi Wi × inference(i)

where Wi represents the importance of rule i and inference(i) = 1 if 'A
hasSimilarDocument_i B' holds, and 0 otherwise. For illustration purposes, in this
thesis we simply give all rules equal weights, and the score is equal to the number of
rules that have concluded that the two documents are similar.

Table 3.4: Expressing SWRL Rules

Heuristic or Relevancy Metric: Two patent documents by the same inventor are
potentially similar
SWRL Rule: hasInventor(?pat1, ?inv1) ∧ hasInventor(?pat2, ?inv1) →
hasSimilarDocument_1(?pat1, ?pat2)

Heuristic or Relevancy Metric: Two patents that appear in the same court litigation
are potentially similar; the court case is also related to both patents
SWRL Rule: patentsInvolved(?case, ?pat1) ∧ patentsInvolved(?case, ?pat2) →
hasSimilarDocument_2(?pat1, ?pat2) ∧ hasSimilarDocument_2(?case, ?pat1) ∧
hasSimilarDocument_2(?case, ?pat2) ∧ hasSimilarDocument_2(?pat1, ?case)
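The weighted scoring can be sketched in a few lines of Python; the rule identifiers, inferred pairs, and weights below are hypothetical illustrations, not values from the thesis experiments:

```python
# Sketch of the weighted similarity score Score(A, B) = sum_i W_i * inference(i).
# Rule identifiers, inferred pairs, and weights are hypothetical.

def similarity(doc_a, doc_b, inferences, weights):
    """Sum the weight of every rule i that inferred 'doc_a hasSimilarDocument_i doc_b'."""
    return sum(
        weight for rule, weight in weights.items()
        if (doc_a, doc_b) in inferences.get(rule, set())
    )

# Pairs inferred by each hasSimilarDocument_* rule
inferences = {
    1: {("Patent1", "Patent2")},
    2: {("Patent1", "Patent2")},
}

# Equal weights reduce the score to a simple rule count, as in the text
equal_score = similarity("Patent1", "Patent2", inferences, {1: 1.0, 2: 1.0})

# Down-weighting a more general rule by 1/n for a prolific link (here n = 10)
weighted_score = similarity("Patent1", "Patent2", inferences, {1: 1.0, 2: 0.1})
```

With equal weights the score is simply the number of rules that fired (2 here); with the 1/n scheme the more general rule contributes correspondingly less.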
3.4 IR FRAMEWORK
In IR, the information desired is seldom achieved with a single query. Queries are
typically reformulated several times based on intermediate search results until the
information need is satisfied [119]. This reformulation could include the addition of
synonyms, new search terms, and other constraints. When performing multi-source
search, information obtained from searching one domain is applied to another.

Figure 3.15: Expressing Heuristics through Rules in the Patent System Ontology

The methodologies in Sections 3.2 and 3.3 provide the backbone for automating this
process. While the domain ontologies ensure that the correct semantics are applied for
efficient retrieval, the patent system ontology standardizes domain representation and
integration. In this section, we present an Information Retrieval (IR) framework which
integrates the methodologies from Sections 3.2 and 3.3 in multiple stages to enhance
multi-source IR (see Figure 3.16):
1. Expand Query: In this stage, the user's initial query is expanded according to
the methodology described in Section 3.2.1. Manually selected bio-ontologies
act as the source for concepts and appropriate weight vectors are selected.
2. Search Information Sources: Information sources are independently searched
with applicable restrictions from Section 3.2.4, i.e. the scope of the search. The
required vocabulary and syntax for searching the information sources is
contained in the patent system ontology. For example, the patent system
ontology provides the syntax for searching the titles of documents – hasTitle:
‘erythropoietin’. The information sources are searched independently in this
stage to retrieve highly relevant documents from each source.
Figure 3.16: IR Framework

3. Cross-Reference Information: The cross-referenced information holds the key to
multi-domain retrieval. The cross-references explicitly defined in the patent
system ontology are used as relevancy measures to correlate search results
between information sources. For example, a relation defined in the patent
system ontology – {caseA, patentsInvolved, patentA} will cause the
framework to extract patent numbers from the court case. These patent
numbers can be used to repeat or enhance the search for the patent domain.
Similarly, biomedical terminology can be extracted from one document and
used to search other documents. For example, if the drug ontology is used to
identify drugs in the abstract of publications, the newly identified drugs can be
used to search the patent domain. In fact, they can directly feedback into Step
1, where newly added terms can be expanded using the biomedical ontologies.
Also in Step 2, the new search terms could be limited to searching only the
claims of the patent.
4. User Feedback: Besides the diverse information and knowledge sources, users
of the patent system also come from diverse backgrounds:
scientific/technical, legal, business, and more. The intention of the user must
be captured through the search process in order to ensure that the results
retrieved are indeed relevant to the user. User-relevancy feedback has been an
important part of IR research [8,84]. The user relevancy feedback stage is outside
the scope of this thesis and will not be discussed. However, user feedback is an
important component of the framework and will be included in future
implementation.
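The four stages above can be sketched as a loop; expand_query, search_source, and extract_cross_references below are illustrative stubs with toy data, standing in for the ontology-backed components described in the text:

```python
# Illustrative sketch of the multi-stage IR loop; all data and helpers are toy stubs.

def expand_query(terms):
    # Stage 1: expand with concepts from bio-ontologies (toy synonym table)
    synonyms = {"erythropoietin": {"erythropoietin", "epo"}}
    expanded = set()
    for t in terms:
        expanded |= synonyms.get(t, {t})
    return expanded

def search_source(source, query):
    # Stage 2: search one information source independently (toy inverted index)
    index = {
        "court_cases": {"epo": {"caseA"}},
        "patents": {"erythropoietin": {"5955422"}},
    }
    hits = set()
    for term in query:
        hits |= index[source].get(term, set())
    return hits

def extract_cross_references(doc):
    # Stage 3: follow explicit cross-references, e.g. {caseA, patentsInvolved, patentA}
    refs = {"caseA": {"5547933"}}
    return refs.get(doc, set())

query = expand_query({"erythropoietin"})
patents = search_source("patents", query)
for case in search_source("court_cases", query):
    patents |= extract_cross_references(case)   # feed cross-references back in
```

The court case found via the expanded term contributes an additional patent through its cross-reference, mirroring how results from one silo enhance the search in another; Stage 4 (user feedback) would close the loop.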
To illustrate the methodology and IR framework, consider the example shown in
Figure 3.17. Based on the initial query 'erythropoietin', Stage I of the framework
expands the query based on the Drug Ontology to:
Erythropoietin [Drug Ontology] = QInitial = {{erythropoietin, epo}, {epoetin alfa,
epogen, procrit…}…}
In Stage II of the framework, the expanded query is used to search the TREC
corpus, and associated diseases such as anemia are extracted. These terms can be fed
back to Stage I to re-apply expansion on the new terms, giving us:
Anemia [Disease Ontology] = anemia, {aplastic anemia…}
Similarly,
ESRD [Disease Ontology] = {esrd, chronic kidney disease…}
In Stage III, these new terms are used, in conjunction with the original query, to
search the claims of the US patent documents and retrieve highly relevant patent
documents: {5,955,422, 5,547,933, 5,618,298, 5,620,868, 5,756,349, …}. This process can
continue as long as desired and operates on other information sources as well.
Figure 3.17: Example to Illustrate IR Framework
3.4.1 IMPLEMENTATION DETAILS
In this section, we provide a brief overview of the implementation of the IR
framework and its basic features. The IR framework is implemented entirely in Java
with abstractions of several modules that are critical for the system. Some of the
features of the IR framework include:
- Feature modules for query expansion, such as the one explained in Section 3.2.1
- A generic API for integration with sources of domain knowledge such as BioPortal
- Jena libraries and triple store integration for modifying the patent system
ontology through new constructs, cross-references, or rules
- Solr and Lucene libraries to create, update, and query the text indexes
- Automatic query generation, abstracting the syntactic details from the user
- Automatic UI and search configuration through a pre-defined properties file
The current implementation does not directly interface with the information
sources, but rather with a local copy of the document repository. The workflow
(see Figure 3.18) is divided into two stages. The first stage, the offline phase, consists
of: (a) parsing the document repository; (b) updating the references and rules in the
patent system ontology; and (c) creating or updating the text indexes according to the
patent system ontology. The patent system ontology is not directly queried for the
following reasons:
1. Semantic technologies do not yet scale to large amounts of data.
2. Text mining libraries, such as Lucene, outperform triple store implementations.
The second stage, the online phase, involves the UI which communicates with the
text indexes and fetches domain knowledge dynamically from BioPortal. The tool
implements the four stages of the IR framework described in Section 3.4 in the
backend, while the UI is used to display search results, collect user feedback, etc.
3.5 RELATED WORK
There is a wealth of research in the area of IR and related topics such as
Information Extraction, Document Summarization, Text Mining, Data Mining and
Machine Learning. The methodology discussed in this chapter is based on – (a)
knowledge-based methods, such as query expansion, which make use of domain
ontologies; and (b) using ontologies to achieve interoperability between information
sources to facilitate multi-domain searching. This section summarizes the works
closely related to our methodology.
Figure 3.18: Current Implementation of the IR Framework Methodology
3.5.1 KNOWLEDGE-BASED IR
Several studies in recent years have made use of domain ontologies and
derived knowledge annotations for IR and related tasks [11]. GoPubMed is a search
engine which uses the MeSH and the GO ontologies to annotate and search the
PubMed index to retrieve biomedical publications [35]. The TREC Genomics track
(2003-2007) had several research groups working on information retrieval on a subset
of the PubMed index [59]. Some of the more successful methodologies employ the use
of domain knowledge, especially for synonymy [63,121]. Although the use of
synonymy is reported to be erratic [59], it accounts for a majority of the improvement
in the top performing systems. Domain knowledge has been used to improve retrieval
in the patent document space as well [45-47,88,132]. Mukherjea and Bamba use
knowledge sources to annotate the physical documents to improve recall [88]. Their
ranking mechanism, however, is based on non-semantic measures such as citation
counts. The use of domain knowledge for other related tasks such as summarization,
clustering and visualization has been shown [74,126,132]. The PATExpert project has
developed an ontology for patent documents which focuses on the European patent
system [46,47,132]. However, most of the above methods are tuned to work with a
single information silo, and must be extended to work with multiple information
sources.
3.5.2 OTHER APPROACHES TO IR
Several methods approach document retrieval from a non-semantics perspective.
These methods typically use metadata information to cluster and classify relevant
documents. Citation analysis and link analysis typically focus on the incoming and
outgoing citations of a document [42,100]. Other general metadata-based
methodologies rank documents based on bibliographic information such as the
rank of a journal [31]. Kang et al. cluster patent documents based on their technology
classification to improve retrieval [73]. Xue and Croft explore an automatic query
generation method to retrieve patent documents which extracts noun phrases from pre-
specified fields of the patent document [137]. However, these methodologies are
outperformed by knowledge-based methodologies. Potential future work could explore
how to best combine semantic methodologies with the others.
3.5.3 ONTOLOGY DEVELOPMENT AND INTEROPERABILITY
The nature of the problem we are addressing demands information which is
scattered across many diverse information sources in the patent system.
Interoperability between these information sources is essential to facilitate multi-
domain searching. A variety of ontology-based methods have been proposed for
integrating diverse knowledge domains [86,106,115,131]. While some advocate having
a single unified ontology for all purposes, such an ontology is not scalable.
Furthermore, no single organization is likely to take charge of maintaining it.
Alternative architectures suggest having separate ontologies representing each
knowledge domain and integrating them either directly through the application,
through ontology mappings, or via a top-level ontology [103,131,132].
In our methodology, we develop the patent system ontology, which provides
structural interoperability between the information sources. In some sense, we achieve
semantic interoperability by using domain ontologies to integrate information from
several domains. However, a much higher level of interoperability can be achieved if
legal ontologies and biomedical ontologies are combined. Examples of legal
ontologies include structural ontologies, technology classifications (USPC and
IPC), and so on. We develop the IR framework to combine the patent system
ontology and the domain knowledge from biomedical ontologies as a first step towards
this goal.

31 The United States Patent Classification codes can be accessed at
http://www.uspto.gov/web/patents/classification/ (Accessed on 03/01/2012).
32 The International Patent Classification codes can be accessed at
http://www.wipo.int/classifications/ipc/en/ (Accessed on 03/01/2012).
Chapter 4.
PERFORMANCE EVALUATION
4.1 INTRODUCTION
Performance evaluations help establish which aspects of the system perform well and
give insight into how the methodology can be improved. In this
chapter, we perform a formal evaluation of our methodology against the document
repository described in Chapter 2. In our methodology, the problem of retrieving
information across multiple sources in the patent system is tackled in multiple stages.
First, the query expansion methodology is integrated with domain knowledge to
improve retrieval from a single information source at a time. Next, the patent system
ontology is used to integrate information across multiple sources and retrieve a set of
highly relevant documents. Since both methodologies focus on different stages of the
IR framework, their experimental setups and evaluation criteria differ. Hence, we
evaluate the query expansion methodology and the patent system ontology
independently.
The chapter is divided into three parts: Section 4.2 provides some necessary
background on SPARQL, a language to query RDF ontologies, and formal evaluation
measures used in IR such as precision and recall. Section 4.3 evaluates the
performance of the knowledge-based query expansion methodology on the documents.
CHAPTER 4. PERFORMANCE EVALUATION 84
The results are compared to baseline references that are generated by querying the
document corpus without the use of domain knowledge. Section 4.4 demonstrates the
functionality of the patent system ontology through use case scenarios based on two
applications – (1) patent prior art search, and (2) infringement analysis. A series of
questions that are typical in the applications is generated in order to query the
ontology. Section 4.5 provides a summary of the discussion, abstracting the benefits
and limitations of the methodology and laying a foundation for future
experimentation and potential improvements.
4.2 BACKGROUND AND RELATED WORK
Many formal measures are defined in IR literature to evaluate the performance of
systems. In this section, we provide some background on the various formal measures
that will be used throughout the chapter. Specifically, Section 4.2.1 defines recall,
precision, f-measure, average precision, document mean average precision, and
‘precision @ k’. The patent system ontology is evaluated based on a series of queries
representing two application use cases. The queries are written in SPARQL and
require some understanding of the syntax of the language. Section 4.2.2 provides a
brief overview on the SPARQL language and some common constructs that are used
in this chapter.
4.2.1 EVALUATION METRICS
The most common evaluation metrics in IR are recall and precision measures.
Statistically speaking, recall measures the coverage of the search, or the fraction of
relevant documents retrieved, and can be defined as [84]:

Recall = TP / (TP + FN)

where TP is the number of true positives and FN is the number of false negatives.
Precision measures the fraction of relevant documents out of the total number of
documents retrieved and is defined as [84]:

Precision = TP / (TP + FP)

where FP is the number of false positives. A third measure used in IR, the F-measure,
is the harmonic mean of the precision and recall, and is defined as [84]:

F = 2 × Precision × Recall / (Precision + Recall)
The Average Precision is the mean of precision values calculated at each position
where a relevant result is found. The Mean Average Precision (MAP) is the mean of
the Average Precision for a set of queries over a corpus. The MAP measure is
increasingly being used to evaluate search results [59]. The TREC corpus uses MAP
to evaluate results at the passage and document levels. However, we will use
MAP to only evaluate the results of the document retrieval.
Since most users only view the results from the top 10-30 hits, the precision and
recall measured over the entire set of results are not highly relevant measures. The
precision at smaller numbers of retrieved results is much more relevant. Thus, we report
‘precision @ k’, where k is the number of retrieved results at which the precision is
reported.
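The metrics defined above can be restated directly in Python; the ranked list and relevance judgments below are toy data for illustration:

```python
# Straightforward restatement of the IR metrics defined above (toy data).

def precision_recall_f(retrieved, relevant):
    tp = len(retrieved & relevant)                     # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant result appears."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k):
    """Precision over only the top k retrieved results."""
    return len([d for d in ranked[:k] if d in relevant]) / k

ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
p, r, f = precision_recall_f(set(ranked), relevant)
ap = average_precision(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, 2)
```

MAP over a query set is then simply the mean of average_precision across the queries.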
4.2.2 SPARQL
Over the years, several query languages for RDF graphs have been developed.
Some of the commonly used ontology query languages include RDF Query Language
(RDQL), SPARQL Protocol and RDF Query Language (SPARQL) and Semantic Web
Rule Language (SWRL) [62]. In this section, we present some background on
SPARQL as a means to query RDF graphs. SPARQL is syntactically very similar to
Structured Query Language (SQL), a language commonly used to query relational
databases. Similar to SQL, SPARQL provides many features and clauses, such as
CONSTRUCT, DESCRIBE and ORDER BY amongst many others, enabling the
creation of complex queries. Although SPARQL is a query language for RDF, since
OWL is built over the RDF semantics, SPARQL can be used to query OWL
ontologies as well. The simplicity and ease of use of SPARQL, which is built into the
OWL API,33 has encouraged us to use SPARQL to query the patent system ontology.
The SPARQL queries that have been used in this chapter mainly consist of two
parts – the query variation, and the triples (see Figures 4.6 and 4.7 for example).
SPARQL provides different query variations that can be used to query RDF graphs.
These are SELECT, DESCRIBE, CONSTRUCT and ASK. We use the SELECT
keyword to extract raw values from the graph. The other variations are not used in our
work, but highly useful when dealing with RDF graphs. The query triples are used to
specify information that needs to be extracted. The triples are of the form “?subject
?predicate ?object” where any term with a leading question mark is a variable that
can match multiple entities. For example, “?subject a CourtCase” will return all
entities in the ontology that are of type CourtCase. A detailed description of the
SPARQL query language is available in the W3C documentation [118].
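The "?subject ?predicate ?object" matching described above can be illustrated
with a toy in-memory triple store in pure Python (a simplified sketch with made-up
triples; real systems such as Virtuoso or the OWL API evaluate SPARQL over full
RDF graphs with joins across multiple patterns):

```python
# Toy triple store: each triple is a (subject, predicate, object) tuple.
TRIPLES = [
    ("case_1", "a", "CourtCase"),
    ("case_2", "a", "CourtCase"),
    ("patent_1", "a", "Patent"),
    ("case_1", "patentsInvolved", "patent_1"),
]

def match(pattern, triples=TRIPLES):
    """Return variable bindings for one SPARQL-like triple pattern.

    Any term starting with '?' is a variable that matches any value;
    constant terms must match the triple exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break  # constant mismatch: this triple does not match
        else:
            results.append(binding)
    return results

# "?subject a CourtCase": all entities of type CourtCase.
cases = match(("?subject", "a", "CourtCase"))
```

A full SPARQL engine additionally joins bindings across patterns, so that a
variable such as ?case is constrained consistently across all triples in the query.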
4.3 KNOWLEDGE-BASED METHODOLOGY USING BIO-ONTOLOGIES
Terminological inconsistencies and heavy use of domain-specific semantics render
pure term-based methodologies ineffective. In order to retrieve relevant information, a
strong integration of domain knowledge is important. In addition to domain
33 The Javadocs for the OWL API can be found at
http://owlapi.sourceforge.net/documentation.html (Accessed on 03/01/2012).
knowledge integration, methodologies must be able to construct complex queries that are
capable of improving document retrieval in an automated fashion. The query
expansion methodology described in Section 3.2 attempts to achieve this by extracting
semantically relevant terms from external domain ontologies. In this section, we will
evaluate our query expansion methodology with a reference to existing literature and
baseline results. The results are supported with a thorough analysis.
The document types differ significantly in many respects, and thus the methodology
applies differently to each. For this purpose, the methodology is tested independently
on each document type. We generate the baseline references by using a
simple term based model to query the document corpus. Additionally, for the
publication data set, the results from the 2007 TREC genomics competition are used
as a reference. Our goal is to improve performance with respect to these baseline
values and construct a strong foundation for future improvements.
This section is organized as follows: Section 4.3.1 queries the document corpus
without the use of domain knowledge, to generate the baseline estimates. Section 4.3.2
explains the general experimental setup for integrating domain knowledge and query
expansion. Based on the general experimental setup, Sections 4.3.2.1 and 4.3.2.2
evaluate the methodology on the patent and scientific publication corpora,
respectively.
4.3.1 BASELINE
The first step in the evaluation is to establish the baseline references for
comparison in each document type. In our use case, the keyword ‘erythropoietin’ is
used to search through the patent database and to generate a baseline reference. The
search for ‘erythropoietin’ in the patent data set results in a large number of
documents. We use the 135 ground-truth patents to calculate the precision and
recall measures. The benchmark search results in a recall of 0.67 but with a low
precision of 0.125. As users rarely look beyond the top 10-20 search results [84], this
baseline search alone with a large number of documents is ineffective. In addition to
precision and recall results, we are especially interested in retrieving the five core
patents, since they are important in our use case. We compute the average rank of
these five core patents to further evaluate the effectiveness of our system. An average
rank of 3 would indicate all five core patents are retrieved in the top 5 results. The
average rank of the five core patents was found to be 51.4 out of the 1150 patents. Table
4.1 lists the rank of the five core patents for the baseline search.
For the publication data set, the original topics in TREC are used without
modifications as queries to provide the baseline reference against our methodology. In
addition, the published results from the 2007 TREC genomics competition are also
used as a reference [59]. The baseline document MAP is 0.036, which is better than
the minimum document MAP of 0.032 but worse than the top scores and median
scores of 0.328 and 0.186 respectively, achieved in the TREC competition. The
baseline results are summarized in Table 4.2.
As discussed in Chapter 2, court cases are written for general consumption and
make little use of technical jargon. A search for the term ‘erythropoietin’ alone
retrieves all 30 relevant court cases resulting in a recall (and precision) of 100%. Since
our court case database is currently limited, we focus on evaluating our methodology
on the patent and scientific publication documents.
Table 4.1: Baseline Reference: Rank of Core Patents
Patent Number Rank out of 1150 Patents
5,547,933 49
5,621,080 50
5,618,698 51
5,955,422 53
5,756,349 54
4.3.2 QUERY EXPANSION
Term-based models search the underlying corpus for the terms specified in the
user queries. In Chapter 3, we illustrated that these terms alone are not sufficient to
retrieve documents due to the heavy use of synonymy in the documents. The
terminological inconsistencies are tackled by including synonyms along with the
original query terms to search the documents. The query expansion method explained
in Section 3.2 queries external knowledge sources such as domain ontologies to
extract the required semantics. In order to facilitate expansion, each query term is
treated as a concept, i.e. a collection of terms and phrases that are interchangeably
used in the texts of the documents. For example, the concept ‘erythropoietin’ is a
collection of the terms – {‘epo’, ‘erythropoietin’, ‘epoetin alfa’ …}. Additionally,
related concepts through hierarchical expansions are also included in the query to
provide a broader coverage. However, expanding the original query terms could also
potentially lead to imprecise results. In this section, we describe the experimental
setup to evaluate the knowledge-based query expansion methodology.
The first step in expanding queries is to map the terms in the query to the actual
concepts in the biomedical ontologies. The mapping is done by searching BioPortal
for the query terms and retrieving concept Uniform Resource Identifiers (URI). The
mapping process not only retrieves concepts that have the query term as a preferred
name for the concept, but also those concepts which have the query term listed as a
synonym. For example, the term ‘tumor’ is mapped to the concept ‘neoplasms’ in
Table 4.2: Baseline Reference for Evaluating the Query Expansion
Methodology
Type Recall Average Precision
Patent 0.67 0.125
Publications 0.76 0.0361
Court Case 1 1
MeSH. Once the concept URIs are fetched, the ontologies are traversed hierarchically
to retrieve parent and child concepts as well as the concepts in the several levels above
and below the concept. The newly added hierarchical concepts automatically include
their synonyms. For example, if ‘colony stimulating factors’ is identified as a parent
concept, its synonyms such as ‘csf’ and ‘mgif’ will also be included into the query as
parent concepts. The resulting expanded query will be of the form:

Qterm = term[ALL] = [WSyn, WPar, WGPar, WChi, WGChi]

where term[ALL] is used to indicate that initially ALL ontologies are searched. In
order to vary the depth of expansion, we use several weighting schemes, such as
[WSyn], [WSyn, WPar, WGPar], and [WSyn, WChi, WGChi], where WSyn, WPar,
WGPar, WChi and WGChi represent expansions including only synonyms, parent
concepts up to one level, parent concepts up to two levels, child concepts up to one
level and child concepts up to two levels, respectively. Similarly,
the terms can be expanded all the way up to the roots, or the leaves of the hierarchy.
Queries can consist of many terms, not all of which need to be expanded. Before
automatically expanding the queries, the queries are pre-processed to indicate which
terms need to be expanded. For example, the query “What tumor types are found in
Zebrafish?” is pre-processed to “What [neoplasms][MeSH] are found in Zebrafish
[NDF]?”. The processed query indicates that the term ‘neoplasm’ must be expanded
using the MeSH ontology and the term ‘Zebrafish’ must be expanded using the NDF.
This forms the basis for our experiments that are to follow. In the process of
experimentation, several modifications based on different weighting schemes and
different ontologies, etc., will be studied.
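The expansion procedure described above can be sketched as follows. The
ontology fragment below is a hypothetical stand-in for the structures retrieved from
BioPortal, and the concept names and relations are illustrative only:

```python
# Hypothetical ontology fragment keyed by concept name. Each concept lists
# its synonyms, parent concepts, and child concepts (illustrative values,
# not the actual contents of MeSH, NDF, NCI Thesaurus or GO).
ONTOLOGY = {
    "erythropoietin": {
        "synonyms": ["epo", "epoetin alfa"],
        "parents": ["colony stimulating factors"],
        "children": ["recombinant erythropoietin"],
    },
    "colony stimulating factors": {
        "synonyms": ["csf", "mgif"],
        "parents": [],
        "children": ["erythropoietin"],
    },
    "recombinant erythropoietin": {
        "synonyms": [],
        "parents": ["erythropoietin"],
        "children": [],
    },
}

def expand(term, parent_depth=0, child_depth=0):
    """Expand a query term into a set of concept names and synonyms.

    parent_depth and child_depth control how many hierarchy levels are
    added (1 = immediate parents/children, 2 = up to grandparents/
    grandchildren, and so on).
    """
    terms = {term}
    concept = ONTOLOGY.get(term)
    if concept is None:
        return terms  # unmapped term: no expansion possible
    terms.update(concept["synonyms"])
    if parent_depth > 0:
        for parent in concept["parents"]:
            terms |= expand(parent, parent_depth - 1, 0)
    if child_depth > 0:
        for child in concept["children"]:
            terms |= expand(child, 0, child_depth - 1)
    return terms
```

With parent_depth=1, the expansion of ‘erythropoietin’ picks up the parent
concept together with its synonyms (‘csf’, ‘mgif’), mirroring the behavior
described in the text.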
4.3.2.1 Query Expansion for Retrieval of Patent Documents
The query ‘[erythropoietin][ALL]’ is chosen as the starting point for expansion.
We aim to retrieve the relevant documents from the set of 1150 patents in our
repository. A search for the term ‘erythropoietin’ in BioPortal returns results from 4
ontologies – MeSH, NDF, NCI Thesaurus and GO. All four ontologies are used to
expand the query to up to two levels of parent and child concepts using the weighting
schemes described earlier. Figure 4.1 shows the recall and the average precision for
the query expansions on patent documents. As shown in Figure 4.1, using synonymy
Figure 4.1: Average Precision and Recall for Query Expansions on Patent
Documents
(i.e. the concept ‘erythropoietin’) alone does not improve recall. In fact, the average
precision drops to 0.114 when compared to the baseline reference of 0.124. However,
the hierarchical expansions show improvements in both recall and precision. The
addition of only one level of parents, or children improves recall to 0.97 and precision
to 0.131. While the results improve with the addition of immediate parent and child
concepts, adding concepts any farther away in the hierarchy does not change the results
significantly. This is mainly due to the fact that the ontologies provide very few terms
beyond immediate parents and children.
All forms of hierarchical query expansions result in about the same precision and
recall. Although it is difficult to distinguish between the different hierarchical results,
the expansions have an effect on the average rank of the five core patents. As we
expand to parent concepts that are farther away from the original concept, the average
rank of the five core patents deteriorates (e.g. above 450 for grandparents). Intuitively,
this makes sense because as we traverse higher up in the hierarchy to parent concepts,
we are generalizing the search. On the other hand, the average rank improves to
around 67 as we add child concepts. While this is still worse than the baseline search,
we attempt to improve the average rank with further experimentation. Table 4.3 shows
the average rank of the five core patents for the expanded queries.
The current weighting schemes give equal weights to all concepts. Having
achieved a high recall, we attempt to further improve the precision by applying
different weights to the terms. Based on the results, the expansion to child concepts
showed most improvement over the baseline reference. In order to study the effect of
weighting, we experiment with one level each of parent concepts and child concepts.
We define three heuristic weighting functions to analyze how they affect the search
results. These functions, denoted W1, W2 and W3, assign unequal relative weights to
the original concept, its synonyms, and the expanded parent and child concepts.
However, the use of different weighting functions only has a marginal effect on the
results (e.g. for W3, precision goes up from 0.1310 to 0.1314). Ideally, these
weighting vectors should be automatically learnt from the corpus.
The four bio-ontologies used for expansion may share some terminology.
However, since they cover different sub-domains, the terminology may be classified
differently. This could lead to potential conflicts from the use of multiple ontologies.
For example, the NDF ontology states ‘epoetin alfa’ as a child concept of
‘erythropoietin’, whereas the NCI Thesaurus states they are synonyms. One way of
resolving conflict is to give one level of concepts preference over the other. In our
expansion, we gave precedence to the concept that is closest to the leaves (i.e. child
concepts) in the tree hierarchy. A second way to resolve conflicts is to selectively use
ontologies. This may reduce the overall coverage, but in turn have a positive effect on
precision. We compare the use of individual ontologies versus the use of multiple
ontologies (see Figure 4.2). The domain knowledge provided by NDF improves the
Table 4.3: Change in Average Rank of Core Patents with Level of
Expansion
Level of Expansion Average Rank of Core Patents
Synonymy 133
Parents 428
Grandparents 469
Children 67
Grand Children 67
Parents and Children 232
Grandparents and Grandchildren 270
precision to 0.161 when compared to the other ontologies. The recall drops from 0.97
to 0.95 which is acceptable in most cases. Clearly the NDF performs better than the
other ontologies. Upon examination, we realize that the terms extracted from the NDF
include industry standard drug names such as Epogen, which are commonly seen
across relevant documents. This implies that the selection of ontologies is an important
aspect in the expansion of queries. The average rank of the core patents also improves
to less than 50.
The low values of precision can be attributed to the fact that the concept
‘erythropoietin’ is used in many different contexts. This implies that ‘erythropoietin’
Figure 4.2: Comparison between use of Multiple Ontologies vs. Individual
Ontologies
itself is a general term, and hence a search for it in the patent database would return all
documents covering a wide range of aspects including its production, its composition,
etc. Since the ground truth is defined by following forward and backward citations to
the five core patents, the ground truths themselves cover a wide range of topics
related to erythropoietin. While the query expansion using biomedical ontologies
improves recall by a significant amount, it is difficult to generate a query which covers
all 135 documents with a high precision. However, a more specific query can be
constructed by adding more clauses, in order to retrieve a subset of documents and as
a result improve precision. By adding more keywords and restrictions such as fields to
search (Title, Claims, etc., instead of the entire document) the size of the search results
will tend to be more manageable. Since the expected results are fewer, we measure the
precision @ 15 (precision at the first 15 retrieved documents). These results are
summarized in Table 4.4.
4.3.2.2 Query Expansion for Retrieval of Scientific Publications
The TREC data set provides 36 topics over which the methodologies are
evaluated. Each topic is a question asking for a list of specific entity types. The 14
Table 4.4: Precision and Average Rank of Core Patents for Fielded Search
on Patent Documents

Query                                           Precision @ 15   Avg. Rank of Core Patents
‘Erythropoietin’ in All Fields                  0.18             49.4
‘Production of Erythropoietin’ in All Fields    0.23             47.4
‘Production of Erythropoietin’ in Title         0.50             3
‘Production of Erythropoietin’ in Abstract      0.31             19.4
‘Production of Erythropoietin’ in Claims        0.12             6.5
‘Production of Erythropoietin’ in Description   0.21             41.8
entity types, such as Proteins, Genes, and Diseases, are based on terminology from
different biomedical sources such as MeSH and GO [7,90]. The rules allow us to
modify the original query, but the interaction with knowledge sources must be
automated. For the analysis presented in this section, we restrict ourselves to
all entity types that can be extracted from the MeSH ontology. This results in a total of
10 topics from the original 36 specified in the TREC data set. The resulting 10 topics
are pre-processed to clearly specify the terms that must be used for expansion. The
terms that are to be expanded are renamed to match the exact concept name used in
MeSH to avoid any errors in mapping. For example, the entity type ‘Proteins’ is
renamed to ‘Amino Acids, Peptides or Proteins’. All other noun phrases are used to
query, but not expanded. Since the entity types are fairly general, we only expand to
the subclasses. In order to study the effect of the depth of expansion on retrieval, we
extract terms up to 7 levels of subclasses, starting at the entity type. Table 4.5
summarizes the modified queries, the selected knowledge sources, and the entity
types.
We developed a query parser and constructor which is responsible for query
formulation and ensures the automatically generated queries are syntactically correct.
The expanded terms are arranged in a series of ‘OR’ boolean clauses and replace the
original term in the query that was expanded. For example, in the query “[Tumor]
AND Zebrafish”, if [Tumor] is expanded to ‘Neoplasm’, ‘Leukemia’, and ‘nerve
sheath tumor’, then the original query will be automatically transformed as follows:

“[Tumor] AND Zebrafish” → (“Neoplasm” OR “Leukemia” OR “Nerve Sheath
Tumor”) AND (Zebrafish)
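A query constructor implementing this transformation might look like the sketch
below (the function and marker names are illustrative, not the actual parser built
for this work; the quoting rule for multi-word phrases is the key detail):

```python
def quote(term):
    # Multi-word expansions must be searched as phrases, so they are
    # enclosed in double quotes; single words are left bare.
    return f'"{term}"' if " " in term else term

def substitute(query, marker, expansions):
    """Replace a [marker] in the query with an OR-group of its expansions."""
    group = "(" + " OR ".join(quote(t) for t in expansions) + ")"
    return query.replace(f"[{marker}]", group)

q = substitute("[Tumor] AND Zebrafish", "Tumor",
               ["Neoplasm", "Leukemia", "nerve sheath tumor"])
# q is now: (Neoplasm OR Leukemia OR "nerve sheath tumor") AND Zebrafish
```

Without the quote() step, ‘nerve sheath tumor’ would be split into three
independent OR terms, which is exactly the failure mode discussed next.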
Subtle modifications in the way the query is generated can result in unexpected
behavior. For example, if “nerve sheath tumor” was not enclosed in quotes as a
phrase query, the search would include the terms ‘nerve’, ‘sheath’ and ‘tumor’ in
separate OR clauses. This is different from searching for the phrase “nerve sheath
tumor” and may result in a low precision. The parser ensures that phrases are properly
enclosed in quotes to avoid inaccuracies. Figure 4.3 summarizes the results. The best
performance is observed at a depth of 3 with a document MAP of 0.199. These results
Table 4.5: Pre-Processed Queries to Evaluate Query Expansion on Scientific
Publications

Topic 200
  Original: What serum [PROTEINS] change expression in association with high
  disease activity in lupus?
  Pre-processed: [Amino Acids, Proteins or Peptides][MeSH] AND lupus

Topic 203
  Original: What [CELL OR TISSUE TYPES] express receptor binding sites for
  vasoactive intestinal peptide (VIP) on their cell surface?
  Pre-processed: [Cells OR Tissues][MeSH] AND (receptor binding sites) AND
  (vasoactive intestinal peptide VIP) AND (cell surface)

Topic 204
  Original: What nervous system [CELL OR TISSUE TYPES] synthesize
  neurosteroids in the brain?
  Pre-processed: [Cells OR Tissues][MeSH] AND (nervous system) AND
  neurosteroid AND brain

Topic 211
  Original: What [ANTIBODIES] have been used to detect protein PSD-95?
  Pre-processed: [Antibodies] AND PSD-95

Topic 215
  Original: What [PROTEINS] are involved in actin polymerization in smooth
  muscle?
  Pre-processed: [Amino Acids, Proteins or Peptides][MeSH] AND "smooth muscle"

Topic 217
  Original: What [PROTEINS] in rats perform functions different from those of
  their human homologs?
  Pre-processed: [Amino Acids, Proteins or Peptides][MeSH] AND (rat AND human
  AND homolog AND function)

Topic 219
  Original: In what [DISEASES] of brain development do centrosomal genes play a
  role?
  Pre-processed: [Brain Diseases][MeSH] AND (centrosome "brain development")

Topic 220
  Original: What [PROTEINS] are involved in the activation or recognition
  mechanism for PmrD?
  Pre-processed: [Amino Acids, Proteins or Peptides][MeSH] AND (involved in the
  activation or recognition mechanism for PmrD)

Topic 226
  Original: What [PROTEINS] make up the murine signal recognition particle?
  Pre-processed: [Amino Acids, Proteins or Peptides][MeSH] AND (murine AND
  signal AND particle AND recognition)

Topic 231
  Original: What [TUMOR TYPES] are found in Zebrafish?
  Pre-processed: [Tumor][MeSH] AND Zebrafish
are a significant improvement over the baseline queries and are comparable to the
average performance in the 2007 TREC competition.
Upon further examination, we observe that some queries perform better than
others (see Figure 4.4). This is because many of the terms actually appearing in
Figure 4.3: Effect of Depth of Query Expansion on Retrieval of Scientific
Publications
Figure 4.4: Performance of Query Expansion on Individual Topics
the text of the publications are not available under the selected concept for expansion.
For example, in the query, “[Brain Diseases][MeSH] AND (centrosome "brain
development”)”, the ground truth contains – ‘Schizophrenia’. MeSH classifies this
under a different parent and not ‘[Brain Diseases]’. Since we only extract subclasses
of the concept ‘Brain Diseases’, the term ‘Schizophrenia’ is never retrieved from
MeSH. Hence, choosing appropriate domain knowledge and mapping the query terms
to the correct concepts becomes important. For example, if we used ‘Central Nervous
System Diseases’ as our starting concept for expansion, the term ‘Schizophrenia’
would have been retrieved, improving search results. However, this will drastically
increase the number of query terms resulting in long querying times. Figure 4.5 shows
the increase in the number of query terms as the depth of expansion increases. Our
goal is to choose a depth of expansion that gives us good results, and yet provides
reasonable query times. If we consider only those queries for which appropriate
domain knowledge is available, a high MAP can be achieved.
In most disciplines, journals share only the metadata and Abstracts of publications
publicly instead of the full-text. In order to see how our methodology would perform
Figure 4.5: Number of Query Terms with Increasing Depth of Query
Expansion
on the PubMed index (which includes Abstracts, Article Titles, and other metadata
such as Journal Titles, Date of Publication, etc.), we restrict the query to only
Abstracts of the documents. The search results in a Document MAP of 0.100. While
this value is lower than the searches performed on full-text, it is still a significant
improvement over the baseline values.
The proximity of the query terms to one another is also an important factor to be
considered. Generally, if concepts in a query are very far apart in a document, the
document is less likely to be relevant to the query. We modify the query parser to
generate proximity queries such that the original boolean query is modified as follows:

(“Neoplasm” OR “Leukemia” OR “Nerve Sheath Tumor”) AND (Zebrafish) →
“Neoplasm Zebrafish”~100 OR “Leukemia Zebrafish”~100 OR “Nerve Sheath
Tumor Zebrafish”~100
where “Neoplasm Zebrafish”~100 implies that the terms ‘neoplasm’ and ‘zebrafish’
must be within 100 words of each other. The proximity queries perform extremely
well for some queries, but decrease the overall document MAP to 0.052.
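The proximity rewriting can be sketched as follows (a hypothetical helper that
assumes Lucene-style "..."~N proximity syntax, as in the example above):

```python
def proximity_queries(expansions, anchor, window=100):
    """Rewrite an OR-group of expansions into per-term proximity clauses.

    Each expanded term must occur within `window` words of the anchor
    term, mirroring the "Neoplasm Zebrafish"~100 form shown above.
    """
    clauses = [f'"{term} {anchor}"~{window}' for term in expansions]
    return " OR ".join(clauses)

pq = proximity_queries(["Neoplasm", "Leukemia", "Nerve Sheath Tumor"],
                       "Zebrafish")
```

Note the trade-off this encodes: the window constraint filters out documents
where the concepts are unrelated, but also discards relevant documents that
mention the concepts far apart, which is consistent with the drop in MAP.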
There are various other characteristics of scientific publications that can be
exploited to improve retrieval of documents. The MeSH descriptors are indexed along
with scientific publications to indicate the general theme of the topic. Especially in the
absence of full-text, MeSH descriptors have been shown to improve retrieval in
conjunction with searching Abstracts [63]. Other forms of experimentation may
include expanding more than one set of terms in the query. For example, in the query
“[Tumor][MeSH] AND Zebrafish”, the term ‘Zebrafish’ could also be expanded to
include synonyms such as ‘Danio Rerio’ such that the query becomes:
“[Tumor][MeSH] AND [Zebrafish][MeSH]”
4.4 EVALUATING PATENT SYSTEM ONTOLOGY AND IR FRAMEWORK
Our goal is to facilitate the retrieval of a collection of relevant documents across
multiple information sources in the patent system. The diversity of the information
sources, combined with little or no interoperability between them, poses a
serious challenge for such retrieval. Our patent system ontology, described in Section
3.3, provides a standardized representation for the various types of documents and
explicitly relates them based on the cross-references. As a result, the patent system
ontology facilitates information integration across multiple types of documents. This
section evaluates the patent system ontology based on its capability to answer a series
of questions, generated to represent two use case scenarios – (1) a patent prior art
search, and (2) infringement analysis. The queries, partly borrowed from the
competency questions described in Section 3.3.1, are translated into equivalent
SPARQL queries. The ontology is queried through a Virtuoso SPARQL end point
[96]. The main focus is on illustrating the use of cross-references, although formal
measures such as precision and recall are provided where applicable. Since the current
implementation of the patent system ontology does not include scientific publications,
we constrain the evaluation to patent documents, court cases and file wrappers.
The standardized terminology to represent documents in the patent system
ontology can potentially serve as a backbone for applications. Applications can query
the patent system ontology for required terminology, or guidelines. For example, an
application can request the patent system ontology to explain the contents of a patent
document. This would result in a response that indicates the various metadata and
textual fields contained in a patent document, and their relationships with other
documents. Additionally, the declarative syntax can be used to express heuristics in
the form of rules to represent similarity measures, or guidelines for applications to
follow. We evaluate a simple rule-based methodology to express similarity heuristics
via the patent system ontology through an example.
The rest of this section is organized as follows: Sections 4.4.1 and 4.4.2 describe
the use case scenarios and evaluate the patent system ontology through a series of
well-constructed queries. Section 4.4.3 illustrates the rule-based similarity measures
through an example.
4.4.1 USE CASE SCENARIO: PATENT PRIOR ART SEARCH
A patent prior art search is required during both the acquisition phase and the
enforcement phase of the patent system. For example, a patent examiner may want to
do a prior art search in order to examine a patent application, or an inventor may need
to determine the patentability of an invention. The prior art search is done to ensure
the patentability of the invention, i.e. that it is novel and non-obvious. Patent prior art can
be any printed publication in the form of patents, scientific publications, or even PhD
theses. However, for this example, we will limit the prior art to issued patents and
court litigations.
Patent prior art searches are driven by heuristics and strategies that vary from user
to user. However, most users follow a general outline. The search is based on
exploring and learning information from results and constantly refining the query.
Typically, the first step in patent prior art research is to search using a keyword that
broadly relates to the information need. Considering the volume of patent documents,
a search for a broad keyword could result in several thousands of patents. For
example, a search for the concept ‘protein’ returns over 100,000 documents. This is
also seen from the results of Section 4.3.1, where the concept ‘erythropoietin’
covered a large collection of documents. It is possible to reduce the search space by
adding more terms or constraints to the query such as field restrictions, date
restrictions, etc. With a reduced search space, it is possible to scan the abstracts of
some patent documents and identify the important technology classes. Searching for
the keywords under those specific classes will result in more patent documents which
may or may not be relevant. After identifying some relevant patents, some of the
possible next steps could be to follow the forward and backward citations, study the
patents of the most relevant inventor or assignee, etc., to get more relevant results. At
every stage, new keywords can be added and this process is typically repeated until the
results start to converge. The search is then independently applied to the patent
application database, scientific publications, etc.
Patents which have been involved in court cases have an obvious importance and
provide a good starting point for conducting the patent prior art search. In this
exercise, we choose to first search the court case documents and then extract relevant
patents.
Step – I: Search for all court cases containing the term ‘erythropoietin’
The SPARQL query shown in Figure 4.6 searches all documents of the type court
case for the concept ‘erythropoietin’. To perform this search, first the bodies of the
court cases are retrieved. We use the FILTER REGEX clause to search the extracted
text via the ‘resourceVal’ property to retrieve only those court cases which contain the
term ‘erythropoietin’. Ideally, in the IR framework, the term based search will be
handled by the knowledge-based query expansion method. As mentioned in Section
4.3.1, all 30 court documents are returned for the baseline query ‘erythropoietin’.
SELECT DISTINCT ?case
WHERE {
?case a CourtCase .
?case hasBody ?body .
?body resourceVal ?text .
FILTER REGEX (?text, "erythropoietin", "i") .
}
Figure 4.6: SPARQL Query to Retrieve Court Cases Related to
Erythropoietin
Hence, for the purpose of demonstrating the patent system ontology, we continue to
use SPARQL’s FILTER REGEX clause.
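Queries like the one in Figure 4.6 can also be generated programmatically. The
sketch below builds the query string for an arbitrary search term (a simple,
hypothetical string template; a production system would additionally escape the
term for the regular-expression and string-literal contexts):

```python
def court_case_query(term):
    """Build a SPARQL query for court cases whose body mentions `term`.

    Mirrors Figure 4.6: bind the case body text via hasBody/resourceVal
    and apply a case-insensitive FILTER REGEX on it.
    """
    return (
        "SELECT DISTINCT ?case\n"
        "WHERE {\n"
        "  ?case a CourtCase .\n"
        "  ?case hasBody ?body .\n"
        "  ?body resourceVal ?text .\n"
        f'  FILTER REGEX (?text, "{term}", "i") .\n'
        "}"
    )

q = court_case_query("erythropoietin")
```

The resulting string can then be submitted to a SPARQL endpoint such as the
Virtuoso endpoint used in this evaluation.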
Step – II: Enlist the patents involved in these court cases
The query in Figure 4.7 requests all the patents that have been involved in the
30 court cases related to ‘erythropoietin’ via the ‘patentsInvolved’ property. 11 patent
documents are retrieved with a precision of 0.72. It must be noted that not all 11 patents
may be present in our corpus of 1150 patents. Hence during our instantiation process,
no further information about these patents such as inventors, assignees, etc., may be
available.
Step – III: Identify the U.S. class, inventors and assignees of these patents
For all the patents retrieved in Step-II that are available in the knowledge base, we
identify the most prominent inventors, the assignees, and technology classes the
SELECT DISTINCT ?pat
WHERE {
?case a CourtCase .
?case hasBody ?body .
?body resourceVal ?text .
FILTER REGEX (?text, "erythropoietin", "i") .
?case patentsInvolved ?pat .
}
Results
5411868
5621080
5547933
5618698
5756349
5955422
5441868
4703008
4677195
5547993
5322837
Figure 4.7: SPARQL Query to Retrieve Patents Involved in Court Cases
Related to Erythropoietin
patents fall under. This is done by adding SPARQL query triples that request for
individuals along the ‘hasUSClass’, ‘hasInventor’ and ‘hasAssignee’ properties on the
extracted patents. Figure 4.8 summarizes these results. By removing the DISTINCT
clause, it is possible to get an estimate of which of these results occur the most. The
figure shows the top 5 occurring technology classes, inventors, and assignees for the
query triples that are added.
Step – IV: Extract patents with specified technology class, inventors or assignees
The extracted technology classes, inventors, and assignees are used to query the
patent corpus to extract additional patents that were not initially retrieved through the
term-based search. The query is shown in Figure 4.9 and the results are summarized in
Table 4.6. The new patents retrieved based on the inventors result in a higher
precision when compared to the technology classes or assignees. This is because the
set of inventors is specific, while the technology classes and assignees cover a broader
SELECT ?usclass ?inv ?assignee
WHERE {
?case a CourtCase .
?case hasBody ?body .
?body resourceVal ?text .
FILTER REGEX (?text, "erythropoietin", "i") .
?case patentsInvolved ?pat .
?pat hasUSClass ?usclass .
?pat hasAssignee ?assignee .
?pat hasInventor ?inv .
}
Results
US Class    Inventor             Assignee
514/8       Lin, Fu-Kuen         Kirin-Amgen, Inc.
530/350     Hewick, Rodney, M.   Amgen, Inc.
536/23.51   Seehra, Jasbir, S.   Kiren-Amgen, Inc.
435/325     Seenra, Jasbir, S.   Genetics Institute, Inc.
435/69.6
Figure 4.8: SPARQL Query to Extract US Patent Classification, Names of
Assignees and Inventors from Patent Documents
range of topics.
Step – V: Search backward citations of the patents
Alternatively, the backward US patent citations are extracted for each of the 11
patents returned by the query shown in Figure 4.10. Many of these patents have
overlapping backward citations; with the DISTINCT clause, the resulting list contains
around 40 patents. This query results in a precision of 0.93, with a recall of 0.29. If
we also search the forward citations, we will generate a larger list of
SELECT DISTINCT ?pat
WHERE {
{ ?pat hasInventor Lin_Fu-Kuen . }
UNION
{ ?pat hasInventor Seenra_Jasbir_S . }
UNION
…
{ ?pat hasAssignee Genetics_Institute_Inc . }
UNION
{ ?pat hasAssignee Kiren-Amgen_Inc . }
…
}
Figure 4.9: SPARQL Query to Extract Patent Documents Related to a Set of
Inventors, Assignees and/or US Patent Classification
Table 4.6: Precision for Results Obtained by Querying Patent System
Ontology for Documents Related to a Set of Inventors, Assignees or US
Classification
Query                      Precision
Top 5 Technology Classes   0.183
Inventors                  0.8
Assignees                  0.256
Combined                   0.186
patents, some of which may be highly relevant. Since the ground truths for the patent
set were developed by following the forward and backward citations, this query is
expected to yield high-precision results; it is discussed here for demonstration purposes.
The knowledge base can be searched incrementally based on the results obtained
in Figures 4.6-4.10. Furthermore, the court cases and scientific publications can be
searched for cross-referenced entities, such as the newly retrieved patents or inventor
names.
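The incremental search across cross-references can be sketched as follows, using a tiny in-memory stand-in for the knowledge base. All identifiers below are invented; in the framework itself, each hop is a SPARQL query over the patent system ontology.

```python
# Tiny in-memory stand-in for the knowledge base; all identifiers are
# invented. In the framework itself, each hop is a SPARQL query over the
# patent system ontology.
cases = {"CaseA": {"text": "erythropoietin dispute", "patents": ["P1", "P2"]}}
patents = {
    "P1": {"inventors": ["Inventor-X"]},
    "P2": {"inventors": ["Inventor-X"]},
    "P3": {"inventors": ["Inventor-X"]},   # not named in any court case
    "P4": {"inventors": ["Inventor-Y"]},
}

def incremental_search(term):
    # Steps I-II: cases mentioning the term, and the patents they involve.
    seed = {p for c in cases.values() if term in c["text"] for p in c["patents"]}
    # Step III: inventors on those patents.
    invs = {i for p in seed for i in patents[p]["inventors"]}
    # Step IV: additional patents by the same inventors.
    extra = {pid for pid, pd in patents.items() if set(pd["inventors"]) & invs}
    return seed, extra - seed

seed, extra = incremental_search("erythropoietin")
```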
4.4.2 USE CASE SCENARIO: FILE WRAPPER EXAMPLE
In this section, we build on the previous example of the patent prior art search to
illustrate the process of infringement analysis. An infringement analysis is necessary
to enforce the rights of a patent and prevent others from infringing the inventor’s
rights. It is typically conducted by three different parties: (1) the company whose
patent is allegedly infringed, (2) the company accused of infringing the patent, and
(3) the court. Literal infringement is the type of infringement in which a claim of one
patent states exactly the same limitations as a claim in another patent. Literal
infringement cases are easy to resolve, but extremely rare [113]. When the claims of
two patents do not literally infringe, it is important to determine the scope of each
limitation of the claim under the ‘doctrine of equivalents’ [113]. For
SELECT DISTINCT ?pat2
WHERE {
?case a CourtCase .
?case hasBody ?body .
?body resourceVal ?text .
FILTER REGEX (?text, "erythropoietin", "i") .
?case patentsInvolved ?pat .
?pat hasCitation ?pat2 .
}
Figure 4.10: Querying Patent System Ontology for Backward Citations
this, the patent’s entire file history has to be studied, with the focus set on the wording
of the claims and how they evolved. As in the previous example, a series of questions
is developed to represent an infringement analysis use case.
Figure 4.7 shows the list of patents involved in these court cases. Among these
patents, US patent 5,955,422 is identified as a frequently occurring patent, which
also happens to be one of Amgen’s core patents. We choose to study the file wrapper
of US patent 5,955,422 to analyze the evolution of its claims.
Step-I: Enlist the contents of the file wrapper
The query shown in Figure 4.11 displays all the events contained within the file
wrapper. This list is obtained via the ‘contains’ property, and the results are ordered
by the date on which they occurred. Notice that the initial application (07/609741)
and the final issued patent (5,955,422) are both part of the file wrapper.
One of the important aspects of a litigation is to determine the priority date34 of a
patent. The patent system ontology enables us to view the nature of the application,
34 The priority date of a patent application is the date used to establish the novelty and
non-obviousness of the invention. Priority dates can also date back to parent applications.
SELECT DISTINCT ?doc
WHERE {
ont:FileWrapper_5955422 ont:contains ?doc .
?doc ont:hasDate ?date .
}
ORDER BY ?date
Results
Type                Name
Patent Application  07_609741
Applicant Event     07/609741_Amendment_1
Applicant Event     07/609741_Interference_1
Office Action       07_609741_Rejection_1
Applicant Event     07/957073_Amendment_1
Issued Patent       5955422
Figure 4.11: SPARQL Query to Display Contents of a File Wrapper, Ordered
by the Date
i.e. whether it was filed as a continuation, continuation-in-part, divisional, or a fresh
application, and determine the original priority date that applies to the claims.35 If the
application is a continuation or a divisional, a more complex query makes it possible
to trace back to the parent application and its priority date.
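Such a trace reduces to following parent links until an application with no parent is reached. A minimal sketch, with invented application numbers and filing dates:

```python
# Hypothetical sketch of tracing a priority date: each continuation or
# divisional records its parent application, and the chain is followed to
# the root. Application numbers and filing dates are invented.
parent = {"DIV-2": "CONT-1", "CONT-1": "ROOT-0"}
filing_date = {"ROOT-0": "1990-01-15", "CONT-1": "1992-06-01", "DIV-2": "1994-03-10"}

def priority_date(app):
    # Walk parent links until the original (root) application is reached.
    while app in parent:
        app = parent[app]
    return filing_date[app]
```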
Step – II: Extract the initial claims
The initial claims as filed by the applicant are generally very different from those
finally allowed. When determining the scope of the claims, the differences between
the initial claims and the final accepted claims provide important information. The
scope of the claims is determined by the added limitations36 which make the claims
acceptable. The issued patent by itself does not contain the original claims. However,
from a file wrapper, this information can be extracted as shown in Figure 4.12.
Step – III: Study the examiner’s rejection
Figure 4.13 provides a snapshot of the subsequent rejection by the examiner as an
35 For definitions of legal terms, please refer – http://www.uspto.gov/main/glossary/ (Accessed on 03/01/2012).
36 Limitations are individual clauses which form a single patent claim.
SELECT DISTINCT ?claim ?text
WHERE {
07_609741 hasClaim ?claim .
?claim resourceVal ?text
}
Claim  Claim Text
1      A purified and isolated polypeptide having part or all of the primary
       structural conformation … naturally occurring erythropoietin and
       characterized by being the … of an exogenous DNA sequence.
2      A polypeptide according to claim 1 further characterized by being free
       of association with any mammalian protein.
10     A polypeptide according to claim 1 which has the in vivo biological
       activity of naturally occurring erythropoietin.
Figure 4.12: SPARQL Query to Extract the Text of Claims from the Original
Patent Application
instance of the Rejection class (taken from Protégé). The rejection provides
information regarding claims that are allowed, withdrawn, or disallowed under a
restriction.37 Restrictions can be viewed via the ‘hasRestriction’ property. Since the
document is in the form of a letter, the text of the restriction is stored and accessible
via the ‘resourceVal’ property. The actual text of the rejection letter is also included
under the ‘resourceVal’ annotation property. This facilitates searching for information
that is not explicitly modeled, such as relevant U.S.C. codes or other regulations that
may have led to the rejection or restriction. From Figure 4.13,
37 For the definition of legal terms, please refer - http://www.uspto.gov/main/glossary/ (Accessed on 03/01/2012).
Figure 4.13: Class View of Patent Examiner’s Restriction in File Wrapper for
US Patent 5,955,422
we see that out of the original 63 claims, claims 1-60 are withdrawn, and of the
remaining 3 claims, claims 61-62 are accepted and claim 63 is rejected.
Step – IV: Compare rejected claims and accepted claims
The claims that are allowed can be accessed via the ‘allowedClaim’ property. The
text of the claims can also be viewed as shown in Figure 4.12. The difference in the
two claims is very subtle. Claim 62 states –
“A preparation according to claim 61 containing a therapeutically effective amount of
erythropoietin.”,
and claim 63 states –
“A composition according to claim 61 containing a therapeutically effective amount of
recombinant erythropoietin.”
However, claim 63 is rejected on the grounds of being too vague. In a similar fashion,
we can compare the text of the claims at every stage of the prosecution of the
application, including the final claims, to identify the added limitation which made
the claims acceptable to the examiner.
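A word-level diff makes such subtle differences explicit. The sketch below compares the two claims quoted above using Python's standard difflib; it is an illustration, not part of the thesis framework.

```python
import difflib

# Word-level diff of claim 62 (accepted) and claim 63 (rejected), quoted above.
claim62 = ("A preparation according to claim 61 containing a "
           "therapeutically effective amount of erythropoietin.")
claim63 = ("A composition according to claim 61 containing a "
           "therapeutically effective amount of recombinant erythropoietin.")

def word_diff(a, b):
    # Keep only the words added ("+ ") or removed ("- ") between the claims.
    return [tok for tok in difflib.ndiff(a.split(), b.split())
            if tok.startswith(("+ ", "- "))]
```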
This process of querying the file wrapper can continue as long as desired. The true
potential of the ontology becomes visible when complex queries spanning more than
one information domain are presented. The ontology takes advantage of the highly
cross-referenced information and provides the semantics required to jump from one
domain to another with ease. This is, however, still a daunting task to perform
manually. The semantics allow methodologies to automatically process the
information and answer complex queries, and the fine granularity of the ontology can
support different applications and users.
4.4.3 OTHER BENEFITS OF THE PATENT SYSTEM ONTOLOGY
The standardized representation of the patent system ontology allows us to use
declarative syntax such as SPARQL and SWRL to query the ontology and define
rules. In Section 3.3.4, we described how SWRL rules can be defined to specify
similarity heuristics over the patent system ontology. We defined 10 rules for
similarity which operate over the metadata and cross-references. In this section, we
present an example to illustrate how these rules can be used to infer document
similarity. We use the Jess rule engine to perform forward chaining over these rules.
Three related patent documents, patent 5,955,422 (Doc1), patent 4,677,195 (Doc2),
and patent 4,999,291 (Doc3), are shown in Figure 4.14. To compute similarity
between the documents, we use the Abstract of Doc1 to query the patent index.
Without the use of domain knowledge, the similarity score of Doc1 and Doc3 is low.
Using the query expansion methodology, the similarity score between the
Figure 4.14: Example to Illustrate a Simple Rule-Based Similarity Measure
documents improves to 0.59, giving Doc3 a higher score than Doc2. However, Doc1
and Doc2 are highly related and have been challenged together in several court
litigations. This relationship between the documents is not captured through the
bio-ontologies. The rule-based inferences identify the relationship and give the two
documents a similarity score of 0.2 (i.e., 2 out of 10 rules infer that Doc1 and Doc2
are similar). Hence, an appropriate linear combination of the two scores will rank
Doc2 higher than Doc3 in the results.
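The linear combination mentioned here can be sketched as a weighted sum of the text-based and rule-based scores. The weight of 0.2 and Doc2's text similarity of 0.05 below are invented for illustration, not values from the thesis; they simply show how a rules-favoring weight ranks Doc2 above Doc3.

```python
# Sketch of combining text-based similarity with the rule-based score
# (fraction of the 10 similarity rules that fired). The weight w = 0.2 and
# Doc2's text similarity are illustrative choices, not thesis values.
def combined_score(text_sim, rules_fired, n_rules=10, w=0.2):
    return w * text_sim + (1 - w) * (rules_fired / n_rules)

# Doc2: low text similarity, but 2 of 10 rules infer similarity with Doc1.
doc2 = combined_score(text_sim=0.05, rules_fired=2)
# Doc3: text similarity 0.59 after query expansion, but no rules fired.
doc3 = combined_score(text_sim=0.59, rules_fired=0)
```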
4.5 SUMMARY
This chapter evaluates the performance of our methodology. We provide a brief
background on formal evaluation measures such as precision, recall, and F-measure.
In addition, we discuss the average precision and ‘precision @ k’ measures,
commonly used in the IR literature and the TREC competitions, as alternatives. To
help the reader better understand the evaluation of the patent system ontology, some
background on SPARQL, a query language for RDF, is also provided. The
knowledge-based query expansion methodology and the patent system ontology are
evaluated independently.
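For reference, the evaluation measures named above can be written compactly as follows. This is a generic sketch of the standard definitions, not the thesis's evaluation code.

```python
# Generic sketches of the measures discussed above. `retrieved`/`ranked`
# are lists of document ids; `relevant` is the set of relevant ids.
def precision(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(relevant)

def precision_at_k(ranked, relevant, k):
    return precision(ranked[:k], relevant)

def average_precision(ranked, relevant):
    # Mean of precision @ k over the ranks k at which relevant docs appear.
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0
```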
To evaluate the query expansion methodology, first, we generated baseline
references for each type of document. For patents and court cases, we used the query
‘erythropoietin’ to query the document corpus, without any added knowledge to
generate the baseline reference. For publications, we used the original TREC queries
without any modification to generate the baseline references. Additionally, we use the
results from the 2007 TREC genomics competition as a reference. Since our court case
corpus is limited, we continue the analysis on patents and scientific publications.
The queries are generally expanded to several levels of concepts above and below
the original terms. Using weighting functions, the expanded query is evaluated at
different levels, such as parents only, children only, and both parents and children. In
order to automatically query the domain ontologies, we pre-process queries to specify
which terms need to be expanded. Once the pre-processing of the queries is complete,
the knowledge base is queried for related concepts. A query parser ensures that the
expanded query is syntactically correct.
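The expansion step might look roughly like the sketch below, which assembles a weighted boolean query using Lucene's term^boost syntax. The related terms and the 0.8 boost are invented for illustration, not taken from MeSH or from the thesis implementation.

```python
# Sketch of query expansion: OR the original term with related concepts,
# boosting the expansions below the original term via Lucene's term^boost
# syntax. The related terms and the 0.8 weight are illustrative.
def expand(term, related, weight=0.8):
    parts = ['"%s"' % term]
    parts += ['"%s"^%.1f' % (r, weight) for r in related]
    return "(" + " OR ".join(parts) + ")"

query = expand("erythropoietin", ["epoetin alfa", "EPO"])
```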
For patent documents, the effect of expansion is clearly seen in terms of recall,
but only a little improvement over the baseline precision is observed. Weighting the
terms in the query also showed little improvement in the results. Since the five core
patents are important to our use case, we also compute their average rank. The
average rank of the core patents improves when subclasses of ‘erythropoietin’ are
added to the expanded query. Traversing to superclasses in the hierarchies of the
ontologies makes the search more general, degrading the average rank of the core
patents. Generally, the precision for this data set is observed to be low, irrespective of
the query, because the data set covers a broad range of topics related to
‘erythropoietin’. To narrow the scope of the search, we add additional terms to the
query as constraints and restrict the queries to specific fields of the patent document,
such as the Abstract or the Title. The ‘precision @ 15’ and the average rank of the
core patents both improved significantly over the baseline reference for this search.
For publications, we report results using the document MAP measure, which is
also used in the TREC competitions. This allows us to compare our methodology
with the results from those competitions. The scope of our evaluation covers only the
queries whose terms can be extracted from the MeSH ontology. The query expansion
methodology shows a significant improvement in the document MAP, and the results
are comparable to other related works demonstrated in TREC. However, there
remains plenty of scope for improvement: some queries perform better than others,
and the reason for this inconsistent performance is the insufficient domain knowledge
provided by the domain ontologies. Hence, the selection of appropriate domain
ontologies is critical for good performance.
Expanding to deeper levels of subclasses below the original concept in the hierarchies
shows only a marginal improvement in results, but a drastic growth in the number of
query terms, increasing the time to execute the query.
Since PubMed indexes only the metadata and Abstracts of scientific publications,
the full text of most publications is not readily available. We therefore also evaluated
our methodology using only the Abstracts of documents instead of the full text. The
results are poorer than those of full-text retrieval, but significantly better than the
baseline results; more experimentation is required to improve performance using
Abstracts alone.
The patent system ontology provides a standardized representation for the different
types of documents, enabling information to be integrated. Our main focus was to
illustrate the use of the cross references to relate documents from multiple sources.
The patent system ontology is evaluated through two use case scenarios – a patent
prior art search, and an infringement analysis example. A list of questions is generated
based on typical questions that arise in the use case scenarios. These questions are
translated into SPARQL queries against the patent system ontology. The
cross-references provide strong relevancy measures, helping us quickly identify
important documents. In addition, the use of metadata such as technology
classifications and inventor and assignee names shows improvement in the results.
The patent system ontology can also be used to express heuristics through SWRL
rules. We discuss an example of expressing simple similarity heuristics through rules.
The example shows that good heuristics can be used to improve similarity rankings
between relevant documents. In general, in addition to the interoperability provided
by the patent system ontology, the declarative syntax allows additional knowledge
related to the patent system to be encoded into the ontology. Applications built
around the patent system ontology can then derive additional information, such as
similarity heuristics, dynamically from the ontology.
Chapter 5.
CONCLUSION AND FUTURE WORK
Advancements in computer science and information technology have enabled us to
address serious issues with respect to information growth and management in the
science and technology space. In this chapter, we will provide a brief summary of our
methodology to retrieve information from multiple diverse information sources in the
patent system. Based on the developed framework, potential future directions are
discussed.
5.1 SUMMARY
There has been tremendous growth in research and development in science and
technology. Intellectual Property (IP) related information for science and technology
is distributed across several heterogeneous information silos. The scattered
distribution of information, combined with its enormous size and complexity, makes
any attempt to collect IP-related information for a particular technology a daunting
task. Hence,
there is a need for a software framework which facilitates semantic and structural
interoperability between the diverse and un-coordinated information sources in the
patent system. Such a framework would form a basis for information integration and
retrieval across multiple sources. This thesis presents a methodology and a framework
toward improving retrieval of information from the patent system.
We developed a repository of documents comprising (a) issued patents; (b)
federal and district patent litigations; (c) scientific publications; and (d) file wrappers.
Specifically, we developed the repository around a use case in the biomedical
domain, erythropoietin, a hormone responsible for the production of red blood cells.
The document repository consists of 1150 issued patents, around 30 court litigation
documents, 162,000+ scientific publications from the TREC 2007 Genomics data
set38, and the file wrapper for US patent 5,955,422. Common challenges faced in
collecting documents from the information sources include: (1) varying publication
formats such as HTML, XML and image files; (2) incompatible or missing interfaces
and web services to access information; and (3) unstructured representation of
information. Parsers were developed to automatically download and extract the
important information. The extracted metadata and textual information are stored in
well-marked-up XML files. The repository is made searchable through text indexes
and a search interface constructed using Apache Lucene and Apache Solr,
respectively [5,6].
Based on the document repository, we discuss the underlying methodology to
integrate and search information across multiple information sources. First, we discuss
a knowledge-based query expansion methodology to enhance document retrieval
within a single information source. Domain knowledge is extracted from BioPortal, a
library of over 250 biomedical ontologies that have been created and maintained by
subject experts. Next, a patent system ontology is developed to improve structural
interoperability between information sources. The ontology provides the necessary
semantics for integrating information across multiple sources. Finally, the IR
framework provides an iterative multi-domain search methodology that combines the
knowledge-based query expansion methodology and the patent system ontology.
Several examples are presented to illustrate the methodology.
38 For details regarding the 2007 TREC Genomics data set, see
http://ir.ohsu.edu/genomics/2007data.html (Accessed on 03/01/2012).
The knowledge-based query expansion methodology is evaluated for patent
documents and scientific publications. Results are communicated through standard
measures such as recall and precision. The expanded queries showed improved
performance over the unexpanded queries for both types of documents and are
comparable to other successful implementations demonstrated in TREC [59]. The
patent system ontology is evaluated through queries spanning multiple information
sources. The queries simulate real-world patent-related applications, namely prior art
searches and infringement analysis. Results show that the methodology not only
improves retrieval, but also allows custom search strategies to be encoded into the IR
framework through the patent system ontology. Several limitations of the
methodology were identified during the analysis and are discussed where appropriate.
5.2 FUTURE WORK
The development of our methodology is an important step toward intelligent
applications for multi-domain IR. In this section, we discuss potential research
directions based on our work. First, research in the application of digital repositories to
manage information in the patent system is suggested. We then explain the importance
of user relevancy feedback and related techniques to enhance user interaction with the
search process. Several related methodologies for IR are discussed that can be built on
top of the current methodology. Finally, extensions for the current methodology are
suggested as future research directions to scale to other technology domains and
information sources.
5.2.1 DIGITAL REPOSITORIES
Our methodology heavily relies on the textual content and metadata available from
the information sources. For the purpose of prototyping the methodologies, we
downloaded documents from the sources to a local repository. If we were to directly
interface with the information sources, their diversity and limited interoperability
would pose a major limitation to our methodology. Digital repository software tools
such as Fedora
and DSpace are increasingly being adopted by institutions to publish and preserve
internal research and data [82,125]. Digital repositories support standard protocols for
information exchange and representation such as the Open Archives Initiative –
Protocol for Metadata Harvesting (OAI-PMH) and the Dublin Core Metadata
Initiative (DCMI), which facilitate easy integration with other information on the
internet.39,40
A great deal of information, especially in the US patent system, is still preserved
and expressed through images and other multimedia. Several researchers have
attempted to study retrieval of images and other forms of non-textual content [80,132].
In addition to documents, digital repositories support many forms of digital media,
such as images, audio, and other multimedia. Additionally, domain ontologies can be
superimposed on top of the repositories, enabling knowledge-based methodologies
such as ours to be easily integrated with them. Many information sources in the
scientific publication domain, such as IEEE and ACM, already make use of such
digital repositories [2,64]. Future research could study the impact of digital
repositories as tools for information management in the government.
5.2.2 USER RELEVANCY FEEDBACK
Information in the patent system is consumed by a wide range of users of both
technical and legal expertise, from lawyers and patent examiners, to technical
organizations. The information needs for users vary from one another. For example, a
39 Open Archives Initiative – Protocol for Metadata Harvesting.
http://www.openarchives.org/OAI/openarchivesprotocol.html (Accessed on 03/01/2012).
40 Dublin Core Metadata Initiative Specifications.
http://dublincore.org/specifications/ (Accessed on 03/01/2012).
lawyer performing an invalidity search may be interested in the legal aspects of the
documents, while a technical startup company performing a patentability search may
be interested in learning and applying the technology. Understanding such diverse
information needs from short queries is a hard problem. Hence, inputs to the IR
system must include contextual information about the user and relevancy feedback in
addition to standard queries.
User relevancy feedback has been studied and shows promise toward improving
retrieval [71,109,119]. In addition, a good user interface and user experience are
important to capture user feedback. We implement one such feature, faceting, which
is the process of aggregating the results over a defined property of the system. For
example, if a property ‘Inventors’ is defined for patent documents, a facet on
‘Inventors’ will show the number of documents associated with each individual of
type ‘Inventor’. Faceting along such properties provides implicit information (quick
statistics in terms of counts) regarding the prominent entities in the results and
enables users to quickly narrow down to relevant results. Other features include tag
clouds [26] and co-occurrence graphs [69]. Although user relevancy feedback is
outside the scope of the current research, we implemented faceting as a feature in the
search tool (Section 3.4.1) using the Apache Solr libraries [6].
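Faceting itself reduces to counting result documents per property value. A minimal sketch over invented result records (the search tool obtains the same counts from Solr's faceting support rather than computing them in application code):

```python
from collections import Counter

# Invented result records for illustration; not actual corpus documents.
results = [
    {"title": "doc1", "inventors": ["Inventor-X"]},
    {"title": "doc2", "inventors": ["Inventor-X", "Inventor-Y"]},
    {"title": "doc3", "inventors": ["Inventor-Y"]},
]

def facet(results, field):
    # Count how many result documents carry each value of `field`.
    counts = Counter()
    for doc in results:
        counts.update(doc.get(field, []))
    return counts
```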
5.2.3 QUERY EXPANSION, SEMANTIC INDEXING AND OTHER METHODOLOGIES
Query expansion techniques modify user queries by appending new terms that are
derived from either external sources or from the search results [11,84]. As a result,
query expansion techniques are not bound to any technology domain, as long as
relevant domain knowledge is available. A limitation of this approach is that queries
can often become lengthy, which may result in undesired side effects such as delays
in retrieving results or overloading of the servers that host the text indexes. As an
alternative, some studies have explored indexing the terms along with the domain
knowledge, so that queries return the same results without having to consult external
knowledge sources, thus reducing query time [88]. However, this methodology lacks
the flexibility of query expansion methodologies to dynamically choose domain
knowledge, being limited to the domain knowledge already indexed. Every time a
new domain ontology is required, the entire index has to be rebuilt, consuming
additional space and effort. A potential future direction could study how these two
methodologies can be integrated into a hybrid approach in which the more common
domain ontologies are indexed along with the documents, while other ontologies are
queried dynamically when needed. Other related research in the areas of natural
language processing [14,33], distributed semantic computing [25], and the application
of tensor algebra [12] would also be valuable additions to the current prototype.
5.2.4 SCALING TO MORE APPLICATIONS, MORE DATA SOURCES, AND MORE
SUBJECT DOMAINS
The scope of this thesis involves IP-related information for a biomedical use case.
However, the patent system covers a wide range of technology areas, such as
environmental engineering and mechanical devices. The domain knowledge for other
technologies may not be as advanced and complete as in the biomedical domain.
Hence, for technologies with little or no domain knowledge already available,
automatic ontology learning is a promising field of study for learning the required
domain knowledge [103,139]. Furthermore, the patent system involves several other
information sources, such as laws, regulations, and other agency repositories like the
FDA drug database, that are also valuable sources of knowledge. Future research
directions can explore how our methodology will scale to other subject domains and
other information sources.
BIBLIOGRAPHY
1. 35 U.S.C. Sec. 103 (United States Code). “Conditions for Patentability; Non-
Obvious Subject Matter,” 2010.
2. ACM Digital Library. http://dl.acm.org/ (Accessed on 03/01/2012)
3. Alani, H. and Brewster, C., “Ontology Ranking based on the Analysis of
Concept Structures,” In Proceedings of the Third International Conference on
Knowledge Capture (K-CAP 05), Banff, Canada, 2005.
4. Amati, G. and Van Rijsbergen, C., J., “Probabilistic Models of Information
Retrieval Based on Measuring the Divergence from Randomness,” ACM
Trans. Inf. Syst., 20 (4):357-389, October 2002.
5. Apache Lucene. http://lucene.apache.org/
6. Apache Solr. http://lucene.apache.org/solr/
7. Ashburner, M., Ball, C., A., Blake, J., A., Botstein, D., Butler, H., Cherry, J.,
M., Davis, A., P, Dolinski, K., Dwight, S., S., Eppig, J., T., Harris, M., A.,
Hill, D., P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J., C.,
Richardson, J., E., Ringwald, M., Rubin, G., M. and Sherlock, G., “Gene
Ontology: Tool for the Unification of Biology,” The Gene Ontology
Consortium., Nature Genetics, 25 (1):25-29, May 2000.
8. Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval, ACM
Press, 1999.
9. Baron, J., R. and Thompson, P., “The Search Problem Posed by Large
Heterogeneous Data Sets in Litigation: Possible Future Approaches to
Research,” Proceedings of the 11th International Conference on Artificial
Intelligence and Law (ICAIL 2007), Stanford, CA, Jun 4-8, 2007.
10. Berners-Lee, T., Hendler, J. and Lassila, O., “The Semantic Web,” Sci. Am.,
284 (5):34–43, 2001.
11. Bhogal, J., Macfarlane, A. and Smith, P., “A Review of Ontology Based Query
Expansion,” Information Processing and Management, 43 (4):866-886, July
2007.
12. Biswas, A., Mohan, S. and Mahapatra, R., “Semantic Technologies for
Searching e-Science Grids,” In H. Chen et.al (eds), Semantic e-Science, Annals
of Information Systems, 11:141-187, 2010.
13. Bizer, C., Heath, T. and Berners-Lee, T., “Linked Data - The Story So Far,”
International Journal on Semantic Web and Information Systems, 5 (3), 2009.
14. Blake, C., “Beyond Genes, Proteins, and Abstracts: Identifying Scientific
Claims from Full-Text Biomedical Articles,” Journal of Biomedical
Informatics, 43 (2):173-189, April 2010.
15. Bodenreider, O. and Stevens, R., “Bio-Ontologies: Current Trends and Future
Directions,” Brief Bioinform, 7 (3):256–274, September 2006.
16. Bodenreider, O., “The Unified Medical Language System (UMLS): Integrating
Biomedical Terminology,” Nucleic Acids Research, 32(1):267-270, January
2004.
17. Branin, J., J., “Institutional Repositories,” In Drake, M. A. (Ed.), Encyclopedia
of Library and Information Science, Boca Raton, FL: Taylor & Francis Group,
LLC, pp. 237-248, 2005.
18. Broekstra, J., Kampman, A. and Harmelen, F., V., “Sesame: A Generic
Architecture for Storing and Querying RDF and RDF Schema”, The Semantic
Web – ISWC 2002, Lecture Notes in Computer Science, 2342:54-68, 2002.
19. Brown, S., H., Elkin, P., L., Rosenbloom, S., T., Husser, C., Bauer, B., A.,
Lincoln, M., J., Carter, J., Erlbaum, M. and Tuttle, M., S., “VA National Drug
File Reference Terminology: A Cross-Institutional Content Coverage Study,”
Stud. Health Technol. Inform., 107(1):477-81, 2004.
20. Bruijn, J., D. et al., “State-of-the-art Survey on Ontology Merging and
Aligning,” V1. SEKT-project report D4.2.1 (WP4), IST-2003-506826, 2003.
21. Bruninghaus, S. and Ashley, K., D., “Improving the Representation of Legal
Case Texts with Information Extraction Methods,” Proceedings of the 8th
International Conference on Artificial Intelligence and Law (ICAIL), St. Louis,
Missouri, pp. 42-51, 2001.
22. Buitelaar, P., Eigner, T. and Declerck, T., “OntoSelect: A Dynamic Ontology
Library with Support for Ontology Selection,” In Proceedings of the Demo
Session at the International Semantic Web Conference, Hiroshima, Japan,
2004.
23. Center for Drug Evaluation and Research, Office of Epidemiology and
Biostatistics, “COSTART: Coding Symbols for Thesaurus of Adverse
Reaction Terms,” 4th ed. Bethesda, MD: US Food and Drug Administration,
Publication PB93-209138, 1993.
24. Chimaera Website. http://www.ksl.stanford.edu/software/chimaera (Accessed
on 03/01/2012).
25. Cohen, T. and Widdows, D., “Empirical Distributed Semantics: Methods and
Biomedical Applications,” Journal of Biomedical Informatics, 42:390-405,
2009.
26. Collins, C., Viegas, F., B. and Wattenberg, M., "Parallel Tag Clouds to
Explore and Analyze Faceted Text Corpora," IEEE Symposium on Visual
Analytics Science and Technology, pp. 91-98, October 2009.
27. Crow, R., “The Case for Institutional Repositories: A SPARC Position Paper,”
The Scholarly Publishing and Academic Resources Coalition, Washington,
DC, 2001.
28. De Nicola, A., Missikoff, M. and Navigli, R., "A Software Engineering
Approach to Ontology Building,” Information Systems, 34(2):258-275, 2009.
29. Deerwester, S., Dumais, S., Landauer, T., Furnas, G. and Harshman, R.,
“Indexing by Latent Semantic Analysis,” J. Amer. Soc. Info. Sci., 41:391-407,
1990.
30. Derwent World Patent Index. http://thomsonreuters.com/products_services/legal/legal_products/a-z/derwent_world_patents_index/ (Accessed on 03/01/2012).
31. Devezas, J., L., Nunes, S. and Ribeiro, C., “FEUP at TREC 2010 Blog Track:
Using H-Index for Blog Ranking,” In The Nineteenth Text REtrieval
Conference Proceedings (TREC 2010), 2010.
32. Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R., S., Peng, Y., Reddivari, P.,
Doshi, V. and Sachs, J., “Swoogle: A Search and Metadata Engine for the
Semantic Web,” In Proceedings of the thirteenth ACM International
Conference on Information and Knowledge Management (CIKM '04), ACM,
New York, NY, USA, pp. 652-659, 2004.
33. Dingare, S., Finkel, J., Nissim, M., Manning, C. and Grover, C., “A
System For Identifying Named Entities in Biomedical Text: How Results From
Two Evaluations Reflect on Both the System and the Evaluations,” The 2004
BioLink Meeting: Linking Literature, Information and Knowledge for Biology,
ISMB, 2004.
34. DocketX. https://www.docketx.com/ (Accessed on 03/01/2012).
35. Doms, A. and Schroeder, M., “GoPubMed: Exploring Pubmed with the Gene
Ontology,” Nucleic Acids Research, 33:783-786, July 2005.
36. Eaton, A., D., “HubMed: A Web-Based Biomedical Literature Search
Interface,” Nucl. Acids Res., 34(2):W745-W747, 1 July 2006.
37. Ekstrom, J. A., Lau, G. T., Spiteri, D., Cheng, J. C. P. and Law, K. H.,
"MINOE: A Software Tool to Evaluate Ocean Management in the Context of
Ecosystems,” Coastal Management, 38(5):457-473, first published 21 July 2010 (iFirst).
38. European Patent Office. http://www.epo.org/ (Accessed on 03/01/2012).
39. Fellbaum, C., “WordNet,” Theory and Applications of Ontology: Computer
Applications, pp. 231-243, 2010.
40. Finkel, J., Dingare, S., Nguyen, H., Nissim, M., Manning, M., and Sinclair, G.,
“Exploiting Context for Biomedical Entity Recognition: From Syntax to the
Web,” Joint Workshop on Natural Language Processing in Biomedicine and
its Applications, Coling, 2004.
41. Friedman-Hill, E., “Jess, the Rule Engine for the Java Platform,”
http://herzberg.ca.sandia.gov/jess/ (Accessed on 03/01/2012).
42. Fujii, A., “Enhancing Patent Retrieval by Citation Analysis,” In Proceedings of
the 30th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, New York, pp. 793-794, 2007.
43. Garfield, E., “New International Professional Society Signals the Maturing of
Scientometrics and Informetrics,” The Scientist, 9 (16), 1995.
44. German Patent Office. http://www.dpma.de/english/index.html (Accessed on
03/01/2012).
45. Ghoula, N., Khelif, K. and Dieng-Kuntz, R., "Supporting Patent Mining by
using Ontology-Based Semantic Annotations,” IEEE/WIC/ACM International
Conference on Web Intelligence (WI'07), pp. 435-438, 2007.
46. Giereth, M., Brugmann, S., Stabler, A., Rotard, M. and Ertl, T., “Application
of Semantic Technologies for Representing Patent Metadata,” First
International Workshop on Applications of Semantic Technologies, 2006.
47. Giereth, M., Koch, S., Kompatsiaris, Y., Papadopoulos, S., Pianta, E., Serafini,
L. and Wanner, L., “A Modular Framework for Ontology-Based
Representation of Patent Information,” Proceeding of the 2007 Conference on
Legal Knowledge and Information Systems: JURIX 2007, 165:49-58, 2007.
48. Golbeck, J., Fragoso, G., Hartel, F., Hendler, J., Parsia, B. and Oberthaler, J.,
“The National Cancer Institute’s Thesaurus and Ontology,” Journal of Web
Semantics, 1(1), 2003.
49. Google and USPTO. http://www.google.com/googlebooks/uspto.html
(Accessed on 03/01/2012).
50. Google Patents. http://www.google.com/patents (Accessed on 03/01/2012).
51. Google Scholar. http://scholar.google.com/ (Accessed on 03/01/2012).
52. Griliches, Z., “Patent Statistics as Economic Indicators: A Survey,” Journal of
Economic Literature, 28(4):1661–1707, 1990.
53. Gruber, T., R., “Toward Principles for the Design of Ontologies used for
Knowledge Sharing,” Int. J. Hum.-Comput. Stud., 43(5-6):907-928, November
1995.
54. Gruninger, M. and Fox, M., S., “Methodology for the Design and Evaluation
of Ontologies,” In: Proceedings of the Workshop on Basic Ontological Issues
in Knowledge Sharing, IJCAI-95, Montreal, 1995.
55. Guarino, N., “Formal Ontology and Information Systems,” In Proceedings of FOIS '98, Trento, Italy, IOS Press, Amsterdam, pp. 3-15, 1998.
56. Guijarro, L., “Interoperability Frameworks and Enterprise Architectures in e-
Government Initiatives in Europe and the United States,” Government
Information Quarterly, 24 (1):89-101, January 2007.
57. Guijarro, L., “Semantic Interoperability in eGovernment Initiatives,” Comput.
Stand. Interfaces, 2008.
58. Hein Online IP Library. http://heinonline.org/ (Accessed on 03/01/2012).
59. Hersh, W. and Voorhees, E., “TREC Genomics Special Issue Overview,”
Information Retrieval, Special Issue on TREC Genomics Track: Guest Editor:
Ellen Voorhees, 12(1):1-15, 2009.
60. Hofmann, T., “Probabilistic Latent Semantic Indexing,” In Proceedings of the
22nd Annual ACM Conference on Research and Development in Information
Retrieval, Berkeley, California, pp. 50-57, 1999.
61. Horridge, M. A Practical Guide To Building OWL Ontologies Using Protégé 4
and CO-ODE Tools. The University of Manchester, March 2011.
62. Horrocks, I., Patel-Schneider, P., F., Boley, H., et al., “SWRL: A Semantic Web
Rule Language Combining OWL and RuleML,” W3C Member Submission, 21
May 2004.
63. Ide, N., C., Russell, F., L. and Demner-Fushman, D., “Essie: A Concept-Based
Search Engine for Structured Biomedical Text,” J Am Med Inform Assoc.,
14:253-263, 2007.
64. IEEE Xplore Digital Library. http://ieeexplore.ieee.org/Xplore/guesthome.jsp
(Accessed on 03/01/2012).
65. IFW Insight. http://ifwinsight.com/ (Accessed on 03/01/2012).
66. Pérez-Iglesias, J., Pérez-Agüera, J., R., Fresno, V. and Feinstein, Y., Z.,
“Integrating the Probabilistic Models BM25/BM25F into Lucene,” arXiv preprint, 30 Nov 2009.
67. Jaffe, A., B., Trajtenberg, M. and Henderson, R., “Geographic Localization of
Knowledge Spillovers as Evidenced by Patent Citations,” The Quarterly
Journal of Economics, 108 (3):577-598, 1993.
68. Japan Patent Office. http://www.jpo.go.jp/ (Accessed on 03/01/2012).
69. Jensen, L., J., Saric, J. and Bork, P., “Literature Mining for the Biologist: From
Information Retrieval to Biological Discovery,” Nature Reviews Genetics,
7:119-129, February 2006.
70. Jonquet, C., Musen, M. A. and Shah, N. H., “A System for Ontology-Based
Annotation of Biomedical Data,” International Workshop on Data Integration
in The Life Sciences 2008, DILS'08, Evry, France, Springer-Verlag, 5109,
Lecture Notes in BioInformatics, pp. 144-152, 2008.
71. Jordan, C. and Watters, C., “Extending the Rocchio Relevance Feedback
Algorithm to Provide Contextual Retrieval,” Advances in Web Intelligence,
Lecture Notes in Computer Science, 3034:135-144, 2004.
72. Kalfoglou, Y. and Schorlemmer, M., “Ontology Mapping: The State of the
Art,” Knowl. Eng. Rev., 18(1):1-31, Jan 2003.
73. Kang, I., Na, S., Kim, J. and Lee, J., “Cluster-Based Patent Retrieval,”
Information Processing and Management, 43(5):1173-1182, Sep 2007.
74. Khelif, K., Hedhili, A. and Collard, M., “Semantic Patent Clustering for
Biomedical Communities,” Proceedings of the 2008 IEEE/WIC/ACM
International Conference on Web Intelligence and Intelligent Agent
Technology, 1:419-422, 2008.
75. Klein, D. and Manning, C., D., “Accurate Unlexicalized Parsing,” Proceedings
of the 41st Meeting of the Association for Computational Linguistics, pp. 423-
430, 2003.
76. Klein, T., E., Chang, J., T., Cho, M., K., et al., “Integrating Genotype and
Phenotype Information: An Overview of the PharmGKB Project,”
Pharmacogenomics, 1:167–70, 2001.
77. Lau, G., T. A Comparative Analysis Framework for Semi-structured
Documents, with Applications to Government Regulations. Ph.D. Thesis,
Department of Civil and Environmental Engineering, Stanford University,
Stanford, CA, August 2004.
78. LexisNexis. http://www.lexisnexis.com/en-us/home.page (Accessed on
03/01/2012).
79. Li, H., Councill, I., Lee, W., C. and Giles, C., L., “CiteSeerX: An Architecture
and Web Service Design for an Academic Document Search Engine,” In
Proceedings of the 15th International Conference on World Wide Web (WWW
'06), ACM, New York, NY, USA, pp. 883-884, 2006.
80. Liu, Y., Zhang, L., G. and Ma, W., Y., “A Survey of Content-Based Image
Retrieval with High-Level Semantics,” Pattern Recognition, 40 (1):262-282,
January 2007.
81. Lopez, M., F., Perez, A., G. and Juristo, N., “METHONTOLOGY: from
Ontological Art towards Ontological Engineering,” In Proceedings of the AAAI
‘97 Spring Symposium, Stanford, USA, pp. 33-40, March 1997.
82. Smith, M., et al., “DSpace: An Open Source Dynamic Digital Repository,”
D-Lib Magazine, 9 (1), January 2003.
83. Maiga, G., and Williams, D., “A Flexible Approach for User Evaluation of
Biomedical Ontologies,” International Journal of Computing and ICT
Research, 2:62-74, December 2008.
84. Manning, C., D., Raghavan, P. and Schutze, H. An Introduction to Information
Retrieval. Cambridge University Press, 2009.
85. MEDLINE. http://www.nlm.nih.gov/pubs/factsheets/medline.html (Accessed
on 03/01/2012).
86. Mitra, P., Wiederhold, G. and Decker, S., “A Scalable Framework for
Interoperation of Information Sources,” Proceedings of the 1st International
Semantic Web Working Symposium (SWWS `01), Stanford University,
Stanford, CA, July 29-Aug 1, 2001.
87. Motik, B., Sattler, U. and Studer, R., “Query Answering for OWL-DL with
Rules,” Web Semantics: Science, Services and Agents on the World Wide Web,
3(1):41-60, Rules Systems, July 2005.
88. Mukherjea, S. and Bamba, B., “BioPatentMiner: An Information Retrieval
System for Biomedical Patents,” In Proceedings of the Thirtieth International
Conference on Very Large Data Bases (VLDB), 30:1066-1077, 2004.
89. Mulgara Triplestore. http://www.mulgara.org/ (Accessed on 03/01/2012).
90. National Library of Medicine, "Medical Subject Headings (MeSH) Fact
sheet,” May 2005.
91. Navigli, R., “Word Sense Disambiguation: A Survey,” ACM Comput. Surv.,
41(2), Article 10, 69 pages, 2009.
92. NCBI Entrez Cross Database Search. http://www.ncbi.nlm.nih.gov/Entrez/ (Accessed on 03/01/2012).
93. Noy, N., F. and McGuinness, D., “Ontology Development 101: A Guide to
Creating your First Ontology,” Stanford Knowledge Systems Laboratory
Technical Report KSL-01-05 and Stanford Medical Informatics Technical
Report SMI-2001-0880, March 2001.
94. Noy, N., F., “Semantic Integration: A Survey of Ontology-Based Approaches,”
SIGMOD Rec., 33(4):65-70, Dec 2004.
95. Noy, N., F., Shah, N., H., Whetzel, P., L., Dai, B., Dorf, M., Griffith, N.,
Jonquet, C., Rubin, D., L., Storey, M., A., Chute, C., G. and Musen, M., A.,
“BioPortal: Ontologies and Integrated Data Resources at the Click of a
Mouse,” Nucl. Acids Res., 37(2):W170-W173, 2009.
96. OpenLink Virtuoso. http://virtuoso.openlinksw.com/ (Accessed on
03/01/2012).
97. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C. and Johnson, D.,
“Terrier Information Retrieval Platform,” In Proceedings of ECIR 2005, Vol.
3408:517-519, Lecture Notes in Computer Science, Springer, 2005.
98. Dean, M. and Schreiber, G. (Eds.). OWL Web Ontology Language Reference.
W3C Recommendation, 10 February 2004.
99. PACER. http://www.pacer.gov/ (Accessed on 03/01/2012).
100. Page, L., Brin, S., Motwani, R. and Winograd, T., “The PageRank Citation
Ranking: Bringing Order to the Web,” Technical Report Stanford InfoLab,
1999.
101. Parsia, B., Sirin, E., Grau, B., C., Ruckhaus, E. and Hewlett, D., “Cautiously
Approaching SWRL,” Technical Report, University of Maryland, 2005.
102. Pedersen, T., “A Simple Approach to Building Ensembles of Naive Bayesian
Classifiers for Word Sense Disambiguation,” In Proceedings of the 1st North
American Chapter of the Association for Computational Linguistics
Conference (NAACL 2000), Association for Computational Linguistics,
Stroudsburg, PA, USA, pp. 63-69, 2000.
103. Perez, A., G., Lopez, M., F. and Corcho, O., “Ontological Engineering: With
Examples from the Areas of Knowledge Management,” E-Commerce and the
Semantic Web, Advanced Information and Knowledge Processing, Springer-
Verlag, New York, Inc., Secaucus, NJ, USA, 2007.
104. Protégé Website. http://protege.stanford.edu/ (Accessed on 03/01/2012).
105. PubMed. http://www.ncbi.nlm.nih.gov/pubmed/ (Accessed on 03/01/2012).
106. Ray, S., “Interoperability Standards in the Semantic Web,” Journal of
Computing and Information Science in Engineering, ASME, 2:65-69, March
2002.
107. Resource Description Framework (RDF) Model and Syntax, W3C
Recommendation, 22 February 1999.
108. Robertson, S., E., Walker, S., Jones, S., Hancock-Beaulieu, M. and Gatford,
M., “Okapi at TREC-3,” In Proceedings of the Third Text REtrieval
Conference (TREC 1994), Gaithersburg, USA, November 1994.
109. Rocchio, J., J., “Relevance Feedback in Information Retrieval,” In Salton, G.
(Ed.), The SMART Retrieval System: Experiments in Automatic Document
Processing, Prentice-Hall, Englewood Cliffs, NJ, pp. 313-323, 1971.
110. Sabou, M., Lopez, V., Motta, E. and Uren, V., “Ontology Selection: Ontology
Evaluation on the Real Semantic Web,” In Proceedings of the 4th
International EON Workshop, Evaluation of Ontologies for the Web, colocated
with WWW2006, 2006.
111. Salton, G., Wong, A. and Yang, C., S., "A Vector Space Model for Automatic
Indexing," Communications of the ACM, 18 (11):613–620, 1975.
112. Scholl, H., J. and Klischewski, R., “E-Government Integration and
Interoperability: Framing the Research Agenda,” International Journal of
Public Administration, 30 (8-9), 2007.
113. Schox, J. Not So Obvious: A Guide to Patent Law and Strategy for Inventors
and Entrepreneurs. 2010.
114. Sheremetyeva, S., “Natural Language Analysis of Patent Claims,” In
Proceedings of the ACL Workshop on Patent Corpus Processing, Sapporo,
2003.
115. Sheth, A., P., “Changing Focus on Interoperability in Information Systems:
From System, Syntax, Structure to Semantics,” In Interoperating Geographic
Information Systems, pp. 5-30, 1998.
116. Shinmori, A., Okumura, M., Marukawa, Y. and Iwayama, M., “Patent Claim
Processing for Readability: Structure Analysis and Term Explanation,” In
Proceedings of the ACL-2003 Workshop on Patent Corpus Processing,
Sapporo, Japan, Association for Computational Linguistics, Stroudsburg, pp.
56–65, 2003.
117. Sirin, E., Parsia, B., Grau, B., C., Kalyanpur, A. and Katz, Y., “Pellet: A
Practical OWL-DL Reasoner,” Web Semantics: Science, Services and Agents
on the World Wide Web, 5 (2):51-53, Software Engineering and the Semantic
Web, June 2007.
118. Prud'hommeaux, E. and Seaborne, A. SPARQL Query Language for RDF.
W3C Recommendation. http://www.w3.org/TR/rdf-sparql-query/ (Accessed on 03/01/2012).
119. Spink, A., “A User-Centered Approach to Evaluating Human Interaction with
Web Search Engines: An Exploratory Study,” Information Processing and
Management, 38(3):401-426, May 2002.
120. Stave, C., D. Field Guide to MEDLINE: Making Searching Simple. National
Library of Medicine (US), Philadelphia, PA: Lippincott Williams & Wilkins,
2003.
121. Strohman, T., Metzler, D., Turtle, H. and Croft, W., B., "Indri: A Language
Model-Based Search Engine for Complex Queries," Proceedings of
International Conference on New Methods in Intelligence Analysis, 2004.
122. Symptom Ontology. http://symptomontologywiki.igs.umaryland.edu/wiki/index.php/Main_Page (Accessed on 03/01/2012).
123. Thomson Delphion. http://www.delphion.com/ (Accessed on 03/01/2012).
124. Thomson Web of Science. http://thomsonreuters.com/products_services/science/science_products/a-z/web_of_science/ (Accessed on 03/01/2012).
125. Thornton, S., Wayland, R. and Payette, S., “The Fedora Project: An Open-
Source Digital Object Repository Management System,” D-Lib Magazine, 9
(4), April 2003.
126. Trappey, A., J., C., Trappey, C., V. and Wu, C., Y., “Automatic Patent
Document Summarization for Collaborative Knowledge Systems and
Services,” Journal of Systems Science and Systems Engineering, 18 (1):71-94,
2009.
127. Uschold, M. and Gruninger, M., “Ontologies: Principles, Methods, and
Applications,” Knowledge Engineering Review, 11 (2):93-155, 1996.
128. Uschold, M., “Creating, Integrating, and Maintaining Local and Global
Ontologies,” Proceedings of the First Workshop on Ontology Learning (OL-
2000) in conjunction with the 14th European Conference on Artificial
Intelligence (ECAI-2000), Berlin, Germany, 2000.
129. USPTO. http://www.uspto.gov/ (Accessed on 03/01/2012).
130. Verberne, S., D’hondt, E., Oostdijk, N. and Koster, C., H., “Quantifying the
Challenges in Parsing Patent Claims,” In Proceedings of the 1st International
Workshop on Advances in Patent Information Retrieval (AsPIRe 2010), Milton
Keynes, UK, pp 14–21, 2010.
131. Wache, H., Vogele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann
H. and Hubner, S., “Ontology-Based Integration of Information - A Survey of
Existing Approaches,” In Proceedings of IJCAI-01 Workshop: Ontologies and
Information Sharing, Seattle, WA, pp. 108-117, 2001.
132. Wanner, L., Baeza-Yates, R., Brugmann, S., Codina, J., Diallo, B., Escorsa, E.,
Giereth, M., Kompatsiaris, Y., Papadopoulos, S., Pianta, E., Piella, G.,
Puhlmann, I., Rao, G., Rotard, M., Schoester, P., Serafini, L. and Zervaki, V.,
“Towards Content-Oriented Patent Document Processing,” World Patent
Information, 30(1):21-23, March 2008.
133. West, D., M. Digital Government: Technology and Public Sector Performance.
Princeton University Press, Princeton, NJ, 2005.
134. WestLaw. http://www.westlaw.com (Accessed on 03/01/2012).
135. Wiemer-Hastings, P., “Latent Semantic Analysis,” In Encyclopedia of
Language and Linguistics, Elsevier, Oxford, UK, 2nd edition, pp. 706-709,
2004.
136. World Health Organization, “Manual of the International Statistical
Classification of Diseases, Injuries, and Causes of Death, 9th Revision,”
Geneva, Switzerland, 1977.
137. Xue, X. and Croft, W., B., “Automatic Query Generation for Patent Search,”
In Proceedings of the 18th ACM Conference on Information and Knowledge
Management, Hong Kong, China, pp. 2037-2040, Nov 2009.
138. Yang, Y., “Noise Reduction in a Statistical Approach to Text Categorization,”
In Proceedings of the 18th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR '95), Edward A.
Fox, Peter Ingwersen, and Raya Fidel (Eds.), ACM, New York, NY, USA, pp.
256-263, 1995.
139. Zheng, W. and Blake, C., “Bootstrapping Location Relations from Text,”
American Society for Information Science and Technology Annual Meeting,
Pittsburgh, PA, 2010.