Combining the Best of Two Worlds: NLP and IR for Intranet Search

Suma Adindla and Udo Kruschwitz

School of Computer Science and Electronic Engineering, University of Essex

Wivenhoe Park, Colchester, CO4 3SQ, UK

{sadind,udo}@essex.ac.uk 

Abstract—Natural language processing (NLP) is becoming much more robust and applicable in realistic applications. One area in which NLP has still not been fully exploited is information retrieval (IR). In particular we are interested in search over intranets and other local Web sites. We see dialogue-driven search which is based on a largely automated knowledge extraction process as one of the next big steps. Instead of replying with a set of documents for a user query the system would allow the user to navigate through the extracted knowledge base by making use of a simple dialogue manager. Here we support this idea with a first task-based evaluation that we conducted on a university intranet. We automatically extracted entities like person names, organizations and locations as well as relations between entities and added visual graphs to the search results whenever a user query could be mapped into this knowledge base. We found that users are willing to interact and use those visual interfaces. We also found that users preferred such a system that guides a user through the result set over a baseline approach. The results represent an important first step towards full NLP-driven intranet search.

Keywords-natural language processing; information retrieval; dialogue; domain knowledge; visualization;

I. MOTIVATION

Imagine we could interact with a university intranet search

engine just like with a human person in a natural dialogue.

The search engine would automatically extract knowledge

from the Web site so that a searcher can be assisted in finding

the information required. A student who asks for a particular

course can be directed to the most recent lecture notes or the

contact details of the lecturer. An external searcher typing in

“PhD NLE” could be assisted by allowing him to explore

the space of experts and projects available in the area of 

natural language engineering. Obviously, this information

can change any day and the idea is to always have the most

up-to-date facts and relations available to assist a searcher.

Currently, we do not have systems which support this type of interaction. However, our aim is to automatically acquire

knowledge (a domain model) from the document collection

and employ that in an interactive search system.

One motivation for a system that guides a user through

the search space is the problem of “too many results”. Even

queries in document collections of limited size often return

a large number of documents, many of them not relevant

to the query. Part of the problem is the fact that both on

the Web and in intranet search, queries tend to be short, and short queries always pose ambiguity and uncertainty issues

for information retrieval systems [1]. Some form of dialogue

based on feedback from the system could be very useful

in helping the user find the right results. We assume this combination of NLP and IR is particularly promising and

scalable in smaller domains like university intranets or local

Web sites. Obviously, most queries can be answered by a

standard search engine but the use of NLP tools to extract knowledge can help address ambiguous queries as well as

those where there might be only a single relevant document

(which is common in an intranet setting).

In order to employ a dialogue system we would ideally

have access to a domain knowledge base because dialogue

systems work well with structured knowledge bases. Web

sites do have some internal structure but unlike product

catalogues and online shopping sites they are not fully

structured and the first question we face is: How can we

acquire suitable knowledge from a document collection to

support system-guided search? We are not interested in

manually extracting such knowledge, but we would like

to automate that process so that we can apply the same approach to a new document collection without expensive

manual customization. A related question is: What kind of 

knowledge should this knowledge base (the domain model)

contain?

We propose a system that guides a user in the search

process which relies on a database automatically populated

by processing the document collection and extracting pieces

of knowledge from these documents. Along with named

entities, relations that exist between those entities are es-

sential in various practical applications [2]. We use NLP

techniques to parse all sentences in the input documents,

extract relations (such as subject-verb-object triples) and

then map user queries against these relations. Such relations often involve named entities (as objects or subjects or both).

Named entities have been found to play a key role in Web

and intranet search, e.g. named entities can be used to deal

with page ranking problems [3], and they have also been found

to play a key role in corporate and university search logs

[4], [5].

Although we are using a knowledge base, we do not move

away from the standard search paradigm. While displaying

the results to a user, we combine the domain model with


the results of a local search engine. Our work is a first step

towards a full dialogue search system. Here we use visual

graphs to present the relations and corresponding terms for

a given query. What we are investigating here is the general

validity of our approach.

II. RELATED WORK

Early work on dialogue systems focused on human com-

puter interaction, e.g. ELIZA [6]. Since then a variety of 

task oriented applications have been developed in various

domains. Initially, many of these dialogue systems assisted

the users in travel domains. Examples include ATIS [7]

and PHILIPS [8]. Generally speaking, one can distinguish

different types of dialogue systems: systems backed by a well-defined structured database can be considered Type I, while systems which lack a knowledge base or deal with unstructured data are Type II [9]. Dialogue systems based on ontologies would fall under Type I [10]; our work starts from Type II.

Another possible way of guiding the user in the navigation

of search results is through faceted search. A lot of research has been done in this area and we can also see commercial online sites and even libraries1 supporting this feature. However,

the difference is that faceted search systems typically rely

on well-structured databases, in other words they make use

of rich structure in the knowledge bases [11].

Question-answering (QA) systems are related to our work 

as they tend to rely on NLP techniques similar to those we apply,

although the main idea of a QA system is to return an

answer rather than a list of documents. The first question

answering systems were only natural language interfaces to

structured databases [12]. Progress in Information Extraction

has more recently contributed much to the success of factoid

based question answering systems. To support interaction, anelement of dialogue has been added to a number of question

answering systems, e.g. [13]. An example of a high-quality interactive question answering system is HITIQA [14].

We extract knowledge entirely from unstructured data

available on (e.g.) a university website. This makes our work 

different from the above mentioned dialogue systems.

In recent years, Web search algorithms have matured

significantly by adapting to the users’ information needs.

An example is named entity recognition. Named entities are

becoming increasingly popular in Web search. A study has

shown that 71% of Web queries contain named entities

and identifying entities in a query further improves retrieval

performance [15]. Analysing the search logs we have been collecting at the University confirms this observation. The

sort of entities people search for might not coincide with

typically identified ones such as dates, organisations and

locations. In our logs we found that queries like person names, room numbers, labs, course titles etc. were routinely searched for. In addition, 10% of our search queries

1http://search.trln.org/search.jsp

Figure 1. Overview of the system components

consist of person names. This evidence supports our work and also suggests the need for query type identification. Like on the Web, we can categorize user queries

into some general types: information needs, browsing, transactional, etc. [16]. The use of Web search engines has witnessed quite a bit of progress in that respect; in comparison, intranet users still experience poor search results [17].

It has however been shown that understanding a query type

(who, where, when) would be quite useful in an intranet

domain [4].

Another area that is worth exploring in information re-

trieval is visualization. Information visualization is an im-

portant aspect in information retrieval systems. Also, visual

interfaces are excellent tools for interacting and exploring

search results. Various studies have been conducted to test

the significance of visualization for information retrieval

systems [18]; the authors of [19] suggest the use of various visualization methods for information organization. For information retrieval systems, the presentation of search results is still a

challenging issue [20]. One example of a search system which supports entity-level search is EntityCube2 [21]. Another example is Google’s3 Wonder Wheel.

III. TOWARDS DIALOGUE-DRIVEN INTRANET SEARCH

Our system consists of two parts: offline knowledge

extraction and an online mapping process that maps the

query into the extracted knowledge. With the help of NLP

tools and information extraction techniques, we process the

document collection automatically to build a domain model.

In an offline extraction process we extract named entities

and predicate argument structures from all documents of 

the local Web site at hand. We thus turn the university

Web document collection into a usable knowledge base

by populating it with named entities and simple facts. To

identify entities (person, organization and location names),

we use the ANNIE IE system that is part of the GATE4 NLP

2http://entitycube.research.microsoft.com/index.aspx
3http://www.googlewonderwheel.com/
4http://gate.ac.uk/


toolkit. We use GATE, but any similar NLP tool could be

employed. For extracting simple facts, we use the Stanford

parser and our extraction methodology is similar to [22].

The extracted facts are represented as subject-verb-object triples.
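To make this extraction step concrete, the sketch below shows how such triples could be obtained from a dependency parse. It uses spaCy purely as a stand-in for the GATE/ANNIE and Stanford parser pipeline described above; the function name, the subject/object heuristic and the example sentence are illustrative assumptions rather than our actual implementation.

    import spacy

    # Illustrative only: spaCy stands in for the ANNIE/Stanford-parser pipeline
    # used in the system; any comparable parser could be substituted.
    nlp = spacy.load("en_core_web_sm")

    def extract_entities_and_triples(text):
        """Return named entities and simple subject-verb-object triples for one document."""
        doc = nlp(text)
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        triples = []
        for sent in doc.sents:
            for token in sent:
                if token.pos_ != "VERB":
                    continue
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
                for subj in subjects:
                    for obj in objects:
                        triples.append((subj.text, token.lemma_, obj.text))
        return entities, triples

    ents, triples = extract_entities_and_triples(
        "Udo Kruschwitz teaches Information Retrieval at the University of Essex.")
    # ents    -> e.g. [('Udo Kruschwitz', 'PERSON'), ('the University of Essex', 'ORG')]
    # triples -> e.g. [('Kruschwitz', 'teach', 'Retrieval')]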

Along with triplet relations, we have also extracted de-

pendency/predicate relations from the sentences. We will

consider different ways of aiding users by suggesting various

query options. This knowledge can then be used to guide the

dialogue manager and our extracted relations are similar to

the ones presented by [23]. Figure 1 shows an overview of 

the system architecture.

In the second part, we try to map a user query against the

knowledge base. The key component of our system is the

dialogue manager and it is also responsible for the online

mapping process. Whenever a user submits a query, the

dialogue manager tries to map the query against the domain

model and simultaneously submits it to the search engine. For

any user query which can be found in the knowledge base

the dialogue manager allows the user to navigate through

the knowledge base by presenting the relations that map

the user query in some way (e.g. if the user query is a

named entity that has relations with other named entities in

the database). When displaying search results to a user, we

combine the extracted domain knowledge with the results

of a local search engine. Figure 2 shows a screenshot of 

our dialogue system. Results from the search engine are

presented alongside a graph of extracted knowledge related

to the query. In the figure the query is shown in the centre

and the edges represent the dialogue manager's suggestions for that query. We use various colour codes to illustrate

different types of terms. The green ones are the entities and

the red colour terms indicate the relations. When a user

clicks on any one of those entities the corresponding search

box automatically updates with the clicked entity. With

this interface also a user could interact during information

searching. We have used JIT5 for visualization purposes.

Semantic graphs are starting to be used in assisted search,

e.g. in question answering [24].
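As an illustration of how matched relations could be turned into the colour-coded graph of Figure 2, the sketch below assembles a simple node/edge structure (query term in the centre, green entity nodes, red relation labels) that a front-end toolkit such as JIT could then render. The data layout and colour values are assumptions made for illustration; they are not the exact format of our implementation or of the JIT library.

    # Illustrative sketch: build a node/edge structure for the visual graph.
    # Colours follow the convention described above (green = entities, red = relations);
    # the concrete dictionary layout is an assumption, not JIT's actual input format.
    def build_graph(query, matched_triples):
        nodes = {query: {"id": query, "label": query, "color": "#999999"}}  # query node in the centre
        edges = []
        for subj, rel, obj in matched_triples:
            other = obj if subj == query else subj       # entity on the other side of the query
            nodes.setdefault(other, {"id": other, "label": other, "color": "#00aa00"})
            edges.append({"source": query, "target": other, "label": rel, "color": "#cc0000"})
        return {"nodes": list(nodes.values()), "edges": edges}

    graph = build_graph("Udo Kruschwitz",
                        [("Udo Kruschwitz", "teaches", "Information Retrieval"),
                         ("Udo Kruschwitz", "supervises", "PhD projects")])
    # graph["nodes"] now holds one grey query node and two green entity nodes;
    # graph["edges"] holds two red-labelled edges ("teaches", "supervises").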

 A. Domain Model

In the first stage we identified named entities such as per-

son names, organization names, and locations. We capture

simple facts (relations between entities) from the sentences by

using the Stanford parser and populate the database with this

knowledge. This gives us a structured database, a network of related terms and the corresponding relations. A user query

can then be matched against any part of this knowledge base.

If the user query was a person name (very common in our

domain), the dialogue manager would come up with various

suggestions (department, role, contact details, other people,

projects etc.). By using this piece of information, we frame

5http://thejit.org/ 

questions, generate answers and vice versa as shown in the

motivating example, similar to [25] but more generic.
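As a hypothetical illustration of this step, the sketch below frames a few of the relations matched for a person-name query as follow-up questions; the template wording and relation labels are invented for the example and do not reproduce our actual rules.

    # Hypothetical sketch: turn matched relations for a person-name query into
    # follow-up questions the dialogue manager could suggest. Templates are invented.
    SUGGESTION_TEMPLATES = {
        "teaches":    "Which courses does {name} teach?",
        "works in":   "Which department does {name} work in?",
        "supervises": "Which projects does {name} supervise?",
    }

    def frame_questions(name, matched_relations):
        questions = []
        for subj, verb, obj in matched_relations:
            template = SUGGESTION_TEMPLATES.get(verb)
            if template and subj == name:
                questions.append((template.format(name=name), obj))  # (question, known answer)
        return questions

    print(frame_questions("Udo Kruschwitz",
                          [("Udo Kruschwitz", "teaches", "Information Retrieval"),
                           ("Udo Kruschwitz", "works in", "the School of CSEE")]))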

We have two separate tables, one for the entities and the other for relations. During query mapping, we extract the terms that match the user query from both tables. With

the proposed dialogue system (which will eventually go

beyond a graphical interaction) a user could also engage

in a dialogue, but the user is obviously not required to do

so.
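A minimal sketch of this two-table layout and of the lookup performed at query time is given below, using SQLite for illustration; the table and column names (and the substring matching) are assumptions, not a description of our actual database.

    import sqlite3

    # Minimal sketch of the domain model as two tables, with assumed names/columns.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE entities  (name TEXT, type TEXT, source_url TEXT);
        CREATE TABLE relations (subject TEXT, verb TEXT, object TEXT, source_url TEXT);
    """)
    con.execute("INSERT INTO entities VALUES (?, ?, ?)",
                ("Udo Kruschwitz", "PERSON", "http://example.org/staff/kruschwitz"))
    con.execute("INSERT INTO relations VALUES (?, ?, ?, ?)",
                ("Udo Kruschwitz", "teaches", "Information Retrieval",
                 "http://example.org/staff/kruschwitz"))

    def map_query(query):
        """Return the entities and relations that match the user query."""
        like = "%" + query + "%"
        entities = con.execute(
            "SELECT name, type FROM entities WHERE name LIKE ?", (like,)).fetchall()
        relations = con.execute(
            "SELECT subject, verb, object FROM relations "
            "WHERE subject LIKE ? OR object LIKE ?", (like, like)).fetchall()
        return entities, relations

    print(map_query("Kruschwitz"))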

IV. EVALUATION

We conducted this first evaluation to explore the potential

of the outlined idea.

The methodology we employed to evaluate our dialogue

system is a task-based evaluation. We followed the TREC6

interactive track guidelines for comparing two systems. Here

we are comparing our system against a baseline system. The

two systems can be characterized as follows:

1) System A is the baseline system which is the search

engine currently installed at the local university Web

site.

2) System B is our system that works based on the

automatically extracted domain model to guide a user

in the navigation of search results. Here the domain

entities and relations are represented in visual graphs

with various colour codes.

Both systems index the same document collection (they

both use Nutch as a backend system). System A is the

UKSearch system [26], [27]. This system also suggests

some query modification terms to refine and relax the query

(presented as a flat list of links). We assume this is a

fair comparison because it has been shown in previous

experiments on the same university Web site that users clearly prefer a search engine that makes suggestions over a

Google-style search engine [27]. We therefore consider the

current search engine a sensible baseline to compare against.

We will now explain the experimental procedure and

later we will discuss the results. While conducting the

evaluation, users were not told anything about the underlying

differences between the two systems. For the task-based

evaluation we used the questionnaires introduced by the

TREC-9 interactive series. These four questionnaires were

employed:

1) Entry Questionnaire

2) Postsearch Questionnaire

3) Postsystem Questionnaire
4) Exit Questionnaire

 A. Procedure

According to TREC interactive track guidelines at least

16 participants and 8 search tasks are required to conduct

an evaluation that compares two systems in a task-based

6http://www-nlpir.nist.gov/projects/t9i/qforms.html


Figure 2. Dialogue system screenshot

evaluation. We recruited 16 students from the university

population (actual target users in this context). Search tasks

were designed based on query logs obtained on the uni-

versity’s intranet search engine. The terms in parentheses

are the queries found in the logs based on which the tasks

were constructed. These terms were not included in the

instructions for the subjects. Two sample tasks are:

• Task 1 (course) Imagine you are an undergraduate

student and wish to study for a master in economics at

Essex. Find a document that provides details of various

MSc programs in economics.

• Task 5 (car parking): Imagine you are attending a

seminar at the university. Please find a document which

gives details about visitor car parking areas and appli-

cable charges.

We explained the experimental setup and showed one

example on both systems before the evaluation process.

Initially, users started with the entry questionnaire. Each

subject was then asked to perform 4 search tasks on System

A and the remaining four on System B (or the other way round). Tasks, systems and subjects were permuted based on the Latin square matrix used by [28]. Subjects were

given 5 minutes to perform each task. After performing

each search task the users had to fill in the postsearch

questionnaire. Along with the questionnaire, they were asked

to submit the answer and rate their task success. When all

the four tasks were finished on one system, users were given

the postsystem questionnaire to be filled in. In the end, users

filled in the exit questionnaire.
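For illustration, the sketch below produces one possible counterbalanced assignment of tasks and system order across the 16 subjects. It only mirrors the spirit of the Latin square design taken from [28]; the exact matrix used in our study is not reproduced here.

    # Illustrative only: rotate task blocks and system order across 16 subjects.
    # This mirrors the spirit of the Latin-square counterbalancing, not its exact matrix.
    TASKS = list(range(1, 9))        # the eight search tasks
    SYSTEMS = ("A", "B")

    def assignment(subject_id):
        offset = subject_id % 4                          # rotate which tasks come first
        tasks = TASKS[offset:] + TASKS[:offset]
        first, second = SYSTEMS if subject_id % 2 == 0 else SYSTEMS[::-1]
        return [(task, first) for task in tasks[:4]] + [(task, second) for task in tasks[4:]]

    for subject in range(16):
        print(subject, assignment(subject))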

V. RESULTS AND DISCUSSION

Of our 16 participants, 13 were male and 3 were female, studying in various departments, with a range of ages (between 19 and 32) and experience (e.g. online search experience between 4 and 12 years). Most of the users were

postgraduate students studying for a Masters degree or a PhD

and the remaining were undergraduates. For the question on

searching behavior, the majority of subjects (13) selected 5

and the remaining ones selected 4 (where 5 indicates “daily”

and 4 indicates “weekly”). Among our participants, 8 users

agreed that they enjoy carrying out information searches.

After completion of each task, users filled in the post-

search questionnaire. The following questions with 5-point

Likert scale ratings were used for both systems (where 1

indicates “not at all” and 5 indicates “extremely”). To study significance, t-tests were conducted for comparison wherever necessary:

1) Are you familiar with this topic?
2) Was it easy to get started on this search?

3) Was it easy to do the search on this topic?

4) Are you satisfied with your search results?

5) Did you have enough time to do an effective search?

For the question Was it easy to get started on this

search? users preferred System B (but without statistical significance). Regarding the question Was it easy to do the

search on this topic? users also found it easier to search on


System | Familiar | Easy to Start | Easy to Search | Satisfaction | Enough Time
A      | 3.32     | 4.09          | 4.09           | 4.53         | 4.68
B      | 3.09     | 4.26          | 4.45           | 4.57         | 4.68

Table I. Postsearch Questionnaire

Parameter              | System A | System B | No Difference
Easier to learn to use | 4        | 6        | 6
Easier to use          | 1        | 11       | 4
Best                   | 4        | 10       | 2

Table II. Exit Questionnaire (System Preference)

System B. This difference is statistically significant when all 8 search tasks are considered (p < 0.01), which clearly indicates the usefulness of the domain model suggestions. Also

for the next question Are you satisfied with your search

results? users were more satisfied with the results returned

by System B. Table I summarizes the results. For most of the

above questions System B was slightly better than System A

(though not always statistically significant). The table also

illustrates that sufficient time was allocated for the tasks.
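As an aside, a comparison of this kind can be reproduced with a standard t-test over the per-task ratings; the sketch below uses SciPy and made-up ratings, and assumes a paired test purely for illustration.

    from scipy import stats

    # Illustrative only: the ratings are invented and a paired t-test is assumed.
    ratings_system_a = [4, 4, 5, 3, 4, 4, 5, 4]   # one "easy to search" rating per task
    ratings_system_b = [5, 4, 5, 4, 5, 4, 5, 4]

    t_stat, p_value = stats.ttest_rel(ratings_system_a, ratings_system_b)
    print("t = %.2f, p = %.3f" % (t_stat, p_value))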

Users filled in the postsystem questionnaire after complet-

ing four search tasks on one system. When we compared

the values on both systems, the differences between them

were marginal.

Finally, users submitted an exit questionnaire. For the

question Which of the two systems did you find easier to use? 11 users picked System B and only one user opted

for System A. Furthermore, 10 users selected System B as

the best system overall, 4 users selected System A and 2

users did not find any difference between them. Table II

demonstrates that System B scored overall better than the

baseline system. The results of the exit questionnaire clearly

demonstrate the potential that this type of guided search

offers in the context of a university intranet.

We also asked for additional user feedback in the exit

questionnaire. Users liked the idea of visualizing search

terms in a graph and most of the users did in fact select the

query options suggested by System B despite the varying

quality of the extracted knowledge (this noise was also commented on by a user). However, in this evaluation we did

not target the quality of the extracted knowledge (which is a

separate issue). In our evaluation we presented the first 10

entities that matched the user query from the database and

did not make use of any frequency or ranking parameter.

One user commented: “It is useful to know not to always stick to the same search engine as there are others that could be just as useful.”

This evaluation is our first validation of the outlined idea

that promotes the use of deep NLP in order to extract facts

and relations from document collections which can then be

used to guide a user who searches this collection. Users

were overall more satisfied with a system that makes use

of extracted facts and relations when communicating results

to the user. We found that the idea of NLP-based search

in document collections appears to be a promising route

based on this simple task-based evaluation reported here.

Furthermore, we see a lot of potential in combining NLP

techniques with state-of-the-art visualization methods.

VI. CONCLUSIONS AND FUTURE WORK

We presented a task-based evaluation assessing the use-

fulness of incorporating a dialogue component in a search

system. We particularly targeted local Web sites such as

university intranets. The sort of dialogue system we applied

makes use of small pieces of knowledge extracted from the

document collection (and linked in a simple term network)

that can then be mapped against the query. We found that the general idea of such guided search offers a lot of potential.

This work can obviously only be a first step. There are a

number of limitations in such a study and we will take the

findings as a guideline for future work. We will investigate

a variety of routes. First of all, the system we investigated

used a visual representation. We will continue doing so but

will enrich the dialogue by adding more of a real NLP

dialogue paraphrasing the knowledge found in the database.

We also aim at putting a prototype of the search engine

online so that we address a number of limitations that user

studies such as the one presented here face. Finally, the

knowledge extraction process is still not perfect and we are

still working on finding the right balance between the quality and the quantity of relations and entities extracted from the

documents.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for

very helpful feedback on an earlier version of the pa-

per. This work is partially supported by the AutoAdapt7

research project. AutoAdapt is funded by EPSRC grants

EP/F035357/1 and EP/F035705/1.

REFERENCES

[1] B. Sun, P. Liu, and Y. Zheng, “Short query refinement with query derivation,” in Proceedings of the 4th Asia Information Retrieval Conference on Information Retrieval Technology. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 620–625.

[2] D. Bollegala, Y. Matsuo, and M. Ishizuka, “Unsupervised extraction of semantic relations between entities on the web,” in 19th Int’l World Wide Web Conf. (WWW 2010), Raleigh, North Carolina, USA, 2010, pp. 151–160.

7http://autoadaptproject.org


[3] A. Caputo, P. Basile, and G. Semeraro, “Boosting a semantic search engine by named entities,” in Proceedings of the 18th International Symposium on Foundations of Intelligent Systems. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 241–250.

[4] H. Li, Y. Cao, J. Xu, Y. Hu, S. Li, and D. Meyerzon, “A new approach to intranet search based on information extraction,” in Proceedings of CIKM’05, Bremen, Germany, 2005, pp. 460–468.

[5] R. Sutcliffe, K. White, and U. Kruschwitz, “Named entity recognition in an intranet query log,” in Proceedings of the LREC Workshop on Semitic Languages, Valletta, Malta, 2010, pp. 43–49.

[6] J. Weizenbaum, “ELIZA - a computer program for the study of natural language communication between man and machine,” Commun. ACM, vol. 9, no. 1, pp. 36–45, January 1966.

[7] S. Seneff, L. Hirschman, and V. W. Zue, “Interactive problem solving and dialogue in the ATIS domain,” in Proceedings of the Workshop on Speech and Natural Language. Stroudsburg, PA, USA: Association for Computational Linguistics, 1991, pp. 354–359.

[8] H. Aust, M. Oerder, F. Seide, and V. Steinbiss, “The Philips automatic train timetable information system,” Speech Commun., vol. 17, no. 3-4, pp. 249–262, November 1995.

[9] Y.-C. Pan and L.-S. Lee, “Type-II dialogue systems for information access from unstructured knowledge sources,” in IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto, Japan, December 2007.

[10] P. Quaresma and I. P. Rodrigues, “Using dialogues to access semantic knowledge in a web IR system,” in Proceedings of the 6th International Conference on Computational Processing of the Portuguese Language, 2003, pp. 201–205.

[11] O. Ben-Yitzhak, N. Golbandi, N. Har’El, R. Lempel, A. Neumann, S. Ofek-Koifman, D. Sheinwald, E. Shekita, B. Sznajder, and S. Yogev, “Beyond basic faceted search,” in Proceedings of the International Conference on Web Search and Web Data Mining. New York, NY, USA: ACM, 2008, pp. 33–44.

[12] B. F. Green, A. K. Wolf, C. Chomsky, and K. Laughery, “Baseball: an automatic question-answerer,” in Papers Presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference. New York, NY, USA: ACM, 1961, pp. 219–224.

[13] S. Gandhe, A. S. Gordon, and D. Traum, “Improving question-answering with linking dialogues,” in Proceedings of the 11th International Conference on Intelligent User Interfaces. New York, NY, USA: ACM, 2006, pp. 369–371.

[14] S. Small and T. Strzalkowski, “HITIQA: High-quality intelligence through interactive question answering,” Natural Language Engineering, vol. 15, no. 1, pp. 31–54, 2009.

[15] J. Guo, G. Xu, X. Cheng, and H. Li, “Named entity recognition in query,” in 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’09). New York, USA: ACM, 2009.

[16] M. Kellar, C. Watters, and M. Shepherd, “A field study characterizing web-based information-seeking tasks,” Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 999–1018, 2007.

[17] H. Zhu, S. Raghavan, S. Vaithyanathan, and A. Löser, “Navigating the intranet with high precision,” in Proceedings of the 16th International Conference on World Wide Web. New York, NY, USA: ACM, 2007, pp. 491–500.

[18] S. Koshman, “Testing user interaction with a prototype visualization-based information retrieval system,” Journal of the American Society for Information Science and Technology, vol. 56, no. 8, pp. 824–833, 2005.

[19] X. Yuan, X. Zhang, and A. Trofimovsky, “Testing visualization on the use of information systems,” in Proceedings of the Third Symposium on Information Interaction in Context (IIiX ’10). New York, USA: ACM, 2010.

[20] O. Turetken and R. Sharda, “Visualization of web spaces: state of the art and future directions,” SIGMIS Database, vol. 38, no. 13, pp. 51–81, July 2007.

[21] J. Zhu, Z. Nie, X. Liu, B. Zhang, and J.-R. Wen, “StatSnowball: a statistical approach to extracting entity relationships,” in WWW 2009, Madrid, Spain, 2009.

[22] D. Rusu, L. Dali, B. Fortuna, M. Grobelnik, and D. Mladenić, “Triplet extraction from sentences,” in Proceedings of the 10th International Multiconference Information Society (IS 2007), 2007.

[23] A. Akbik and J. Bross, “Wanderlust: Extracting semantic relations from natural language text using dependency grammar patterns,” in Workshop on Semantic Search in Conjunction with the 18th Int. World Wide Web Conference (WWW2009), Madrid, Spain, 2009.

[24] L. Dali, D. Rusu, B. Fortuna, D. Mladenić, and M. Grobelnik, “Question answering based on semantic graphs,” in Workshop on Semantic Search in Conjunction with the 18th Int. World Wide Web Conference (WWW2009), Madrid, Spain, 2009.

[25] P. Adolphs, X. Cheng, T. Klüwer, H. Uszkoreit, and F. Xu, “Question answering biographic information and social network powered by the semantic web,” in Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC’10). Valletta, Malta: European Language Resources Association (ELRA), 2010, pp. 19–21.

[26] U. Kruschwitz and H. Al-Bakour, “Users want more sophisticated search assistants: Results of a task-based evaluation,” Journal of the American Society for Information Science and Technology, vol. 56, no. 13, pp. 1377–1393, November 2005.

[27] U. Kruschwitz, Intelligent Document Retrieval: Exploiting Markup Structure, ser. The Information Retrieval Series. Springer, 2005, vol. 17.

[28] W. Hersh and P. Over, “TREC-9 interactive track report,” in Proceedings of the Ninth Text Retrieval Conference (TREC-9), 2001, pp. 41–50.
