
Web Recommender Project Final Report

Wei Chen, Yue (Jenny) Cui

Motivation

People use the web to browse for information. One problem is that there is too much information on the web, and it usually takes time to search for what one needs. It would therefore be helpful to make the web-browsing experience more convenient, fast, and accurate. An existing solution to this problem is the search engine: in a typical scenario, a user types in a query, and the search engine returns relevant pages. Using a search engine to retrieve relevant pages is not fully automatic; it requires user effort to formulate and type in a query.

Our goal is to develop a tool that automatically generates queries for the user while s/he is reading a web page. We can then use these queries to recommend relevant web pages to the user.

Problem Statement

What is a Web Recommender?

A web recommender is a web-browsing tool which recommends relevant web pages to the user while s/he is reading a page.

Why is it important?

A web recommender provides a convenient way to browse the web. It automatically recommends relevant information, so it requires less effort to formulate and type in queries. At the same time, it retains the benefits of state-of-the-art search engines.

Why is it hard?

Making queries from a web page is a keyword summarization problem, which is still an active research topic. Also, search engines are not perfect: they can return dead links and irrelevant pages. Furthermore, it is often hard to define what it means to be relevant; relevance depends on the reader's goals. All of these are issues related to web recommendation, and we do not attempt to solve all of them. In this particular project, we focus on the first issue: extracting queries from a web page.

Link to Vision Statement

Goals for this project (solution)

We have three goals for this project:

(1) Provide a software framework for Web Recommendation
(2) Provide basic recommendation algorithms
(3) Propose an evaluation prototype

The first goal defines the basic functionality of this software. The second goal provides three kinds of services: first, it offers a basic service to the web recommender; second, it offers baselines for future research on Web Recommendation; finally, it can be used as a tutorial for teaching people how to develop their own algorithms on top of our software framework.

Link to Vision Statement

Link to Domain Model

Requirements

Functional Requirements

(1) Given a web page as input, the system should be able to find a list of relevant web pages.

(2) The system should provide three recommendation algorithms.

a. Baseline algorithm: uses simple string processing techniques

b. HTML-Structure-based algorithm: uses HTML structure features

c. Semantics-based algorithm: uses NLP techniques (named entity recognizer) to extract features

(3) The system should provide a simple GUI for evaluation.

Non-functional Requirements

Recommendation results should be retrievable within 5 seconds.

Link to Requirement Analysis

Design

Our design has three components: a general software framework design, algorithm design, and evaluation task design.

Software Framework Design

Class Diagram:

http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/ClassDiagramFinal

Page 3: Web Rec Final Report

The main algorithm of WebRecommender is implemented in the method recommend(). The util package provides tools for HTML parsing, basic text processing, and the NLP tools needed by the recommendation algorithms. QueryFilter is used for key-term selection. QueryFormulator can be used for combining multiple queries. (A simplified sketch of these abstractions is shown below.)

Sequence Diagram illustrates an example message flow:

http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/SequenceDiagram
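For illustration, here is a minimal Java sketch of the main abstractions named in the class diagram. The method signatures are simplified assumptions for this report, not the actual interfaces; see the linked diagrams for the authoritative design.

import java.util.List;

interface QueryFilter {
    // Key-term selection: keep only the candidate terms worth querying.
    List<String> filter(List<String> candidateTerms);
}

interface QueryFormulator {
    // Combine multiple key terms (or partial queries) into one search query.
    String formulate(List<String> keyTerms);
}

abstract class WebRecommender {
    protected QueryFilter queryFilter;
    protected QueryFormulator queryFormulator;

    // Main entry point: given the URL of the page being read,
    // return the URLs of the recommended pages.
    public abstract List<String> recommend(String inputPageUrl);
}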

Algorithm Design

We designed three algorithms: a baseline algorithm, an HTML-structure-based algorithm, and a semantics-based algorithm. The algorithms are described below.

Baseline Algorithm

1. Strip off HTML tags (e.g. </html>)

2. Remove non-word tokens (e.g. “/**/”)

3. Remove stop words (e.g. “the”)
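A minimal Java sketch of these three steps, using a regex-based tag stripper and a small illustrative stop-word list (both are simplifying assumptions, not the exact implementation):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BaselineQueryExtractor {
    // Illustrative stop-word list; a real one would be much larger.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and", "to", "in", "is"));

    public List<String> extractKeyTerms(String html) {
        String text = html.replaceAll("(?s)<[^>]*>", " ");          // 1. strip HTML tags
        List<String> terms = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (!token.matches("\\w+")) continue;                    // 2. remove non-word tokens
            if (STOP_WORDS.contains(token.toLowerCase())) continue;  // 3. remove stop words
            terms.add(token);
        }
        return terms;
    }
}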

HTML Structure-based Algorithm

1. Parse HTML page

2. Extract text content from the <title> and <a> nodes

3. Remove stop words (e.g. “the”)

4. Select the 10 most frequent words
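A rough sketch of this algorithm, using simple regular expressions in place of a full HTML parser (an assumption made only to keep the example short and self-contained):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StructureQueryExtractor {
    public List<String> extractKeyTerms(String html, Set<String> stopWords) {
        // 1-2. Collect the text content of <title> and <a> elements.
        StringBuilder text = new StringBuilder();
        Matcher m = Pattern.compile("(?is)<(title|a)\\b[^>]*>(.*?)</\\1>").matcher(html);
        while (m.find()) {
            text.append(m.group(2).replaceAll("<[^>]*>", " ")).append(' ');
        }
        // 3. Count word frequencies, skipping stop words.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toString().split("\\W+")) {
            if (token.isEmpty() || stopWords.contains(token.toLowerCase())) continue;
            counts.merge(token, 1, Integer::sum);
        }
        // 4. Keep the 10 most frequent words.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.comparingByValue(Comparator.reverseOrder()));
        List<String> keyTerms = new ArrayList<>();
        for (int i = 0; i < Math.min(10, entries.size()); i++) {
            keyTerms.add(entries.get(i).getKey());
        }
        return keyTerms;
    }
}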

Semantic-based Algorithm

1. Strip off HTML tags (e.g. </html>)

2. Tag the page using Stanford named entity tagger

3. Remove non-word tokens (e.g. “/**/”)

4. Remove stop words (e.g. “the”)

5. Select named entities with highest frequency (top 5)
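A sketch of steps 2 and 5, following the common usage pattern of the Stanford NER CRFClassifier. The classifier model path is an assumed example, and steps 1, 3, and 4 are the same stripping and filtering shown in the baseline sketch above.

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SemanticQueryExtractor {
    public List<String> extractKeyTerms(String plainText) {
        // 2. Tag the (already tag-stripped) text with the Stanford named entity tagger.
        AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(
                "classifiers/english.all.3class.distsim.crf.ser.gz"); // assumed model path
        Map<String, Integer> counts = new HashMap<>();
        for (List<CoreLabel> sentence : classifier.classify(plainText)) {
            for (CoreLabel token : sentence) {
                String tag = token.get(CoreAnnotations.AnswerAnnotation.class);
                if (!"O".equals(tag)) {                   // token is part of a named entity
                    counts.merge(token.word(), 1, Integer::sum);
                }
            }
        }
        // 5. Keep the 5 most frequent named-entity tokens.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.comparingByValue(Comparator.reverseOrder()));
        List<String> keyTerms = new ArrayList<>();
        for (int i = 0; i < Math.min(5, entries.size()); i++) {
            keyTerms.add(entries.get(i).getKey());
        }
        return keyTerms;
    }
}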

Example Query Comparison

Input page: http://en.wikipedia.org/wiki/Entropy

Table 1. Example query comparison

Algorithm | Output Query
Baseline | Entropy, free, encyclopedia, Jump, search, article
HTML-Structure | ISBN, edit, entropy, thermodynamics, Entropy, energy, system, law, heat, thermodynamic
Semantic | ISBN, University, Press, Boltzmann, John

Evaluation Design

Evaluation Form

We designed an evaluation form which consists of three fields: input page, recommended page, and relevance score. We ask our evaluators to score each recommended page. The relevance score has two values: 1 means “relevant” and 0 means “irrelevant”. The form also contains two fields that are hidden from the evaluator: the algorithm used to produce the recommended page and the rank of the page. These two fields are used for evaluation and are never shown to the evaluators.

Evaluation Criteria

We used a modified Average Precision to aggregate relevance scores. Standard average precision is calculated as the sum of the precision at each relevant position divided by the total number of relevant pages. In our modified version, we replace the number of relevant pages in the denominator with the total number of retrieved pages.

An example of the calculation of modified average precision is shown in our final project presentation:

link to Final Presentation
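As a hypothetical illustration (not from our actual data): suppose 5 pages are retrieved and only the pages at ranks 1 and 3 are judged relevant. The precision at rank 1 is 1/1 and at rank 3 is 2/3, so the modified average precision is (1/1 + 2/3) / 5 ≈ 0.33.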

Test Data Selection

Our criterion for test data selection is that it has to span multiple dimensions. The dimensions we considered include:

1. Popular vs. Unpopular (e.g., “Harry Potter” vs. “Wei Chen”)

2. Ambiguous vs. Unambiguous (e.g., “Entropy” vs. “Sushi”)

3. New vs. Old (e.g., “Waterboarding” vs. “Entropy”)

4. Procedural vs. Conceptual (e.g., “How to” vs. “Entropy”)

5. Technological vs. Mass media (e.g., “Entropy” vs. “Harry Potter”)

Based on the test data selection criteria, we selected 5 input pages from 5 topics:

1. “Harry Potter” http://en.wikipedia.org/wiki/Harry_potter

2. “Waterboarding” http://en.wikipedia.org/wiki/Waterboarding

3. “Wei Chen@CMU homepage” http://www.cs.cmu.edu/~weichen/

4. “Entropy (thermodynamics)” http://en.wikipedia.org/wiki/Entropy

5. “How to make Sushi” http://www.wikihow.com/Make-Sushi

The modified average precision is computed as:

ModifiedAveP = (1/N) * Σ_{r=1}^{N} P(r) · rel(r)

where N is the total number of retrieved pages, P(r) is the precision at rank r, and rel(r) is 1 if the page at rank r is relevant and 0 otherwise.

Evaluation GUI

Link to GUI Demo

Our evaluation GUI is composed of three functional areas: the top panel, where the user types in the URL of the input web page; the left panel, where the URLs of the recommended web pages are displayed; and the content panel, which displays the web page the user selects. The top panel includes an internet address bar and the recommend button. The user types the URL of a web page into the address bar; pressing the enter key shows the input web page in the large content panel behind it, and clicking the recommend button displays the URLs of the recommended web pages in the left panel of the GUI.

Evaluation and Results

One important question we want to answer in this project is how well each of our algorithms performs, so we need to design an experiment which can measure user satisfaction fairly. Our first hypothesis is that the performance of our algorithms will differ across different kinds of topics, though at the design stage we were not sure how large the variation would be.

Our second hypothesis is that users will disagree on how useful the recommended web pages are, because if a user changes his goal, he will change his evaluation criteria at the same time. To avoid non-standard criteria, we limit our evaluation criteria to the relevancy of the recommended pages, and in our ReadMe file we specify the definition of relevancy for each of the topics. By doing this we believe we can measure user satisfaction with each of our algorithms.

Experimental Design

link to example evaluation form

We have three algorithms: the baseline, semantic, and structure algorithms. We chose 5 topics for our experiment. It is important that the web pages our algorithms recommend contain the information our user needs, but it is equally important that they appear at the top of the list of recommended web pages. Combining each algorithm with the 5 topics gives 15 categories in total (e.g., (baseline, topic 1) is one category). We use the top five recommended web pages from each algorithm, so each rater evaluates 75 recommended web pages in total (3 algorithms × 5 topics × 5 pages). Whenever a rater thinks a recommended web page is relevant, he or she enters a 1 in the score column of the evaluation form.

Participants

We had a total of 5 participants: three female and two male. All of the raters hold at least a master's degree in computer science. One of them is a native English speaker; the other four are not.

Experimental procedure

All raters read the ReadMe file, which gives the definition of relevancy for each topic, before performing the evaluation.

Results

Link to a presentation of evaluation results

Our results show that our first hypothesis is correct. Topics that are popular and have more resources on the web receive better scores. The topic “Harry Potter” has the highest relevancy score, and all three of our algorithms recommended satisfying web pages. We think the reason is that there are so many web pages about Harry Potter that it is easy to find relevant ones. The topic “Waterboarding” has the highest number of invalid web pages. We think the reason is that waterboarding is a typical news topic: most of the time there are few web pages that talk about it, but once it becomes a news headline, many resources are added to the web; after some time, when it is no longer in the headlines, many of those resources are probably deleted, which could cause the invalid links. The topic “How to make Sushi” has the lowest relevancy score. We think the reason is that it is about a specific procedure, which makes the definition of relevancy stricter.

Among our three algorithms, the structure algorithm has the best performance in this experiment. The difference between the baseline algorithm and the structure algorithm is significant (p < 0.001). The difference between the baseline and semantic algorithms is not significant.

The structure algorithm has the best performance on the topic “entropy”, with a relevancy score of 1. This is a really promising result, because if the target users of the web recommender are people in academia, they would use it to find technical information. For example, we could combine the web recommender with Wikipedia; then users would get more comprehensive information on the topics they are interested in. The structure algorithm also has very good performance on the topic “How to make Sushi”, whereas the baseline and semantic algorithms have their worst performance there. We think the reason is that the structure algorithm uses key terms extracted from anchor tags, and these anchor tags point to other relevant web pages. So the key terms extracted from anchor tags are much more relevant than the key terms we extract from other parts of the web page.

Error Analysis

Table 2. Key terms and score of all the categories of topic and algorithm

Topic | Algorithm | Key Terms | Score
Entropy | Baseline | Entropy free encyclopedia Jump search article | 0.519
Entropy | Semantic | ISBN University Press Boltzmann John | 0.6926
Entropy | Structure | ISBN edit entropy thermodynamics Entropy energy system law heat thermodynamic | 1
Harry Potter | Baseline | Harry Potter free encyclopedia Jump search | 0.9032
Harry Potter | Semantic | Harry Potter Voldemort BBC Rowling | 0.9686
Harry Potter | Structure | Potter Harry Rowling Witch Deathly Goblet Magic witchcraft Film Hallows | 0.982
Waterboarding | Baseline | Waterboarding free encyclopedia Jump search Cambodia Khmer | 0.507
Waterboarding | Semantic | CIA United York Bush States | 0.1738
Waterboarding | Structure | Torture News York Waterboarding Times Press CIA ISBN torture Washington | 0.8564
Wei Chen | Baseline | Wei Chen graduate student Language Technologies Carnegie Mellon research advisor | 0.7444
Wei Chen | Semantic | Chen Wei University NMF Johns | 0.457
Wei Chen | Structure | States Natural Language Mental Fahlman Word Jack Lingual AAAI Wei | 0.713
How to make Sushi | Baseline | Make 10 steps wikiHow Manual Edit RSS Create account log prepared | 0.1274
How to make Sushi | Semantic | RL Commons Article Creative Nicole | 0
How to make Sushi | Structure | Sushi Make edit Ads Roll wikiHow Article make Rice Show | 0.7444

Overall, the semantic algorithm's performance is not as good as we expected; we expected it to be at least as good as the structure algorithm. The semantic algorithm scores zero on the topic “How to make Sushi”. Looking into the causes, we find that the named entity recognizer we use can only identify names of persons and organizations, so it does not include the important key word “Sushi”. Looking at the other topics, we find that the semantic algorithm actually identifies important named entities which are relevant to the topic and useful to the algorithm, but using only these named entities is not sufficient. We think that if we combine the key words in the title of the web page with the named entities we extract as the input to our query, we would get much better results in the future.

For the topic “entropy”, both the semantic and structure algorithms score better than the baseline algorithm. We think the reason is that in the baseline algorithm there are some “noise” key terms which hurt its performance and make it return some irrelevant web pages. It is also a very promising sign that the semantic and structure algorithms make a difference in the recommendation results.

All three algorithms perform very well on the topic “Harry Potter”. We think there are two reasons for this: first, the definition of relevancy for a popular topic is much broader; anything about Harry Potter will be considered relevant, whether it is about the book, the movie, the author, or the actors. Second, there are so many web pages about Harry Potter on the web, so it is easier to find 5 relevant ones.

The reason for the error pages for the topic “Waterboarding” is that some of the links are invalid. Because “Waterboarding” is a time-sensitive topic, the content of the recommended web pages could have been deleted by the time of evaluation. Looking at the links, we see that they are usually links to user-generated content pages such as forums.

For the topic “Wei Chen”, there are very few relevant web pages. The structure algorithm returns only 4 web pages as the result, but two of them are relevant. So we think the major reason for the error pages is the scarcity of relevant web pages on the web.

For the topic “How to make Sushi”, we think it is a difficult case for a web recommender. One problem is caused by the key word “make”: the algorithm returned pages about how to make something other than sushi. The other problem is that a lot of the content on this topic is user generated, so some of the recommended pages are from forums and were invalid at the time of evaluation.

Conclusion

This experiment gave us a lot of feedback about the algorithms used in the Web Recommender. We now know how topics affect the recommendation results of each algorithm. We can also conclude from the experiment that our algorithms make a significant difference in the recommendation results, and we can probably predict for which kinds of topics the web recommender will be most useful.

Software Engineering Techniques used in this project

We followed the standard software engineering process in this project: requirement analysis, design, implementation, and evaluation. We used an iterative development process in the design, implementation, and evaluation phases. Table 3 summarizes the iterations in each phase, along with the main changes we went through.

Table 3. Highlights of software engineering process

Iteration 1
Design: (1) Initial design of framework; (2) Composite-pattern-based evaluation design
Implementation: (1) Initial implementation of framework; (2) Implemented evaluation component based on composite pattern
Evaluation: (1) Pilot study; (2) Weighted average relevance score

Iteration 2
Design: (1) Added query formulator and query filter; (2) Simplified evaluation design
Implementation: (1) Implemented query formulator and query filter; (2) Implemented simplified version of evaluation GUI
Evaluation: (1) 5 raters, 5 input pages; (2) Modified average precision

What changed over the semester?

As Table 3 shows, we made changes in each of the development phases. Major changes are documented in several meeting notes.

Changes in Main Framework:
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes02-02-2009
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes02-11-2009
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes02-18-2009
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes03-04-2009

Changes in Evaluation Component:
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes04-06-2009
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes04-20-2009
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes04-22-2009

Our evaluation GUI went through several rounds of changes:

Stage 1: Planned to use a relational database to store and retrieve evaluation results.
Stage 2: Discarded the idea of a relational database; used the composite pattern to implement aggregation of evaluation scores. Link to composite pattern based design
Stage 3: Discarded the composite pattern; simplified the evaluation GUI and implemented it. Link to the GUI Demo
Stage 4: The GUI was found to be slow; used Excel files to store and calculate evaluation scores. link to example evaluation form

What would we change if we did the project over again?

1. We would improve our risk analysis: one tricky thing about risks is that they are unexpected. We did not expect that speed would be a problem for our GUI.

2. Evaluation took more time than we had thought. We would allow more time for evaluation, because we need time for a pilot study before conducting the experiment; then we could carry out a detailed and systematic analysis of the algorithms and improve them based on that analysis.

3. We would improve our time management: we should start evaluation early so that we can improve our algorithms based on the evaluation results.

Acknowledgements

We owe many thanks to Dr. Nyberg, Dr. Tomasic, Shilpa, and Hideki for valuable comments and suggestions on our project throughout the semester. We thank our raters for the evaluation task. We also thank our classmates for many helpful discussions.