Features and Algorithms Paper by: XIAOGUANG QI and BRIAN D. DAVISON Presentation by: Jason Bender.


WEB PAGE CLASSIFICATION

Features and Algorithms

Paper by: XIAOGUANG QI and BRIAN D. DAVISON

Presentation by: Jason Bender


Outline

Introduction to Classification
Background
○ Classification Types
○ Classification Methods
Applications
Features
Algorithms
Evolution of Websites


What is web page classification?

The process of assigning a web page to one or more predefined category labels (e.g., news, sports, business…)

Classification is generally posed as a supervised learning problem

A set of labeled data is used to train a classifier, which is then applied to label future examples
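The train-then-apply loop described above can be sketched in a few lines. This is only an illustration, assuming a toy word-count Naive Bayes model with add-one smoothing; the training pages, labels, and function names are made up for the example and do not come from the paper.

```python
import math
from collections import Counter, defaultdict

def train(labeled_pages):
    """Build a toy word-count model from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of training pages
    vocabulary = set()
    for text, label in labeled_pages:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return {"word_counts": word_counts,
            "label_counts": label_counts,
            "vocabulary": vocabulary}

def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    words = text.lower().split()
    total_pages = sum(model["label_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, float("-inf")
    for label, n_pages in model["label_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_pages / total_pages)
        for word in words:
            # Add-one smoothing so unseen words do not zero out a label.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled data standing in for real web pages.
TRAINING_PAGES = [
    ("team wins the championship game", "sports"),
    ("player scores in final match", "sports"),
    ("stocks fall as market closes", "business"),
    ("company reports quarterly profit", "business"),
]

model = train(TRAINING_PAGES)
predicted = classify(model, "market profit rises")
```

Here `train` plays the role of the labeled-data step and `classify` the role of labeling a future example; a real system would use far more data and richer features.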


Background - Classification Types

The supervised learning problem can be broken into sub-problems:
○ Subject Classification
○ Functional Classification
○ Sentiment Classification
○ Other types of Classification


Subject Classification

Concerned with the subject or topic of the web page

Judging whether a page is about arts, business, sports, etc…

Functional Classification

The role that the page is playing

Deciding whether a page is a personal homepage, course page, admissions page, etc…


Sentiment Classification

Focuses on the opinion that is presented in a web page

Other types of Classification

Such as genre classification and search engine spam classification


Background - Classification Methods

○ Binary vs. Multiclass
○ Single Label vs. Multi Label
○ Soft vs. Hard
○ Flat vs. Hierarchical


Binary vs. Multiclass Classification


Single-Label vs. Multi-Label Classification


Soft vs. Hard Classification


Flat vs. Hierarchical Classification


Applications

Why is classification important and how can we use it efficiently?


Constructing, maintaining, or expanding web directories

Web directories provide an efficient way to browse for information within a predefined set of categories

Example: Open Directory Project
○ Currently constructed by human effort
○ 78,940 editors of ODP


Improving the quality of search results

A big problem with search results is search ambiguity


Helping question answering systems

Classification systems can be used to help improve the quality of answers

Example: Wolfram Alpha

Other applications

Contextual advertising


Features

What features can we extract from a web page to use to help classify it?


Features - Introduction

Because of features such as the hyperlink <a> … </a>, webpage classification is vastly different from other forms of classification such as plaintext classification.

Features are organized into two groups:
○ On-page features – directly located on the page
○ Neighbor features – found on related pages


On Page Features

Textual Contents & Tags

Bag-of-words
○ N-gram feature
  - Rather than analyzing individual words, group them into sequences of n consecutive words.
  - Ex: New York vs. new ….. ….. York
  - Yahoo! has used a 5-gram feature

HTML tags – title, heading, metadata, main text

URL
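As a rough illustration of the textual features listed above, the sketch below builds word n-grams and separates text by HTML field (title, headings, main text). It uses only Python's standard `html.parser`; the `FieldExtractor` class, the field names, and the sample page are assumptions made for this example, and metadata and URL features are left out for brevity.

```python
from html.parser import HTMLParser

def word_ngrams(text, n):
    """Return all runs of n consecutive lowercase words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

class FieldExtractor(HTMLParser):
    """Collects text separately for <title>, headings, and everything else."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "heading": [], "main": []}
        self._stack = []  # currently open tags

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop back to (and including) the matching open tag.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "title" in self._stack:
            self.fields["title"].append(text)
        elif self.HEADINGS & set(self._stack):
            self.fields["heading"].append(text)
        else:
            self.fields["main"].append(text)

# Hypothetical page used only for illustration.
html_doc = (
    "<html><head><title>New York Travel Guide</title></head>"
    "<body><h1>Visiting New York</h1>"
    "<p>New hotels opened near York Street.</p></body></html>"
)
extractor = FieldExtractor()
extractor.feed(html_doc)

title_bigrams = word_ngrams(" ".join(extractor.fields["title"]), 2)
main_bigrams = word_ngrams(" ".join(extractor.fields["main"]), 2)
```

The bigram "new york" shows up in the title but not in the main text, where "New" and "York" are separate words; this is the distinction the "New York vs. new … York" example points at, and a plain bag-of-words would miss it.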


On Page Features

Visual Analysis

Each page has two representations
○ Text via HTML
○ Visual via the browser

Each page can be represented as a visual adjacency multigraph


Features of Neighbors

What happens when a page’s features are missing or are unrecognizable?


Features of Neighbors

Assumptions

If page1 is in the neighborhood of many “sports” pages, then there is an increased probability that page1 is also a “sports” page.

Linked pages are more likely to have terms in common
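One simple way to act on the first assumption is to turn the labels of a page's neighbors into a rough label distribution. The sketch below is a hypothetical illustration, not a method taken from the paper; the function name and the sample labels are invented.

```python
from collections import Counter

def neighbor_label_distribution(neighbor_labels):
    """Fraction of neighbors carrying each label (empty dict if no neighbors)."""
    counts = Counter(neighbor_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels of the pages surrounding page1.
labels_of_neighbors = ["sports", "sports", "news", "sports"]
distribution = neighbor_label_distribution(labels_of_neighbors)
most_likely = max(distribution, key=distribution.get)
```

In practice such a distribution would be one more feature alongside the page's own content, not a final decision by itself.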


Features of Neighbors

Neighbor Selection

Focus on pages within 2 steps of the target

6 types: parent, child, sibling, spouse, grandparent, and grandchild
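The six neighbor types can be read off a link graph. The sketch below assumes the usual definitions (a parent links to the target, a child is linked from it, siblings share a parent, spouses share a child, grandparents and grandchildren are two steps away); the `links` dictionary and page names are hypothetical.

```python
def neighbors_within_two_steps(links, target):
    """links maps each page to the set of pages it links to."""
    def children(page):
        return set(links.get(page, ()))

    def parents(page):
        return {src for src, dsts in links.items() if page in dsts}

    parent = parents(target)
    child = children(target)
    result = {
        "parent": parent,
        "child": child,
        "sibling": {s for p in parent for s in children(p)},
        "spouse": {s for c in child for s in parents(c)},
        "grandparent": {g for p in parent for g in parents(p)},
        "grandchild": {g for c in child for g in children(c)},
    }
    # The target itself is never counted as its own neighbor.
    return {kind: pages - {target} for kind, pages in result.items()}

# Hypothetical link graph: page -> pages it links to.
links = {
    "root": {"hub"},
    "hub": {"target", "sib"},
    "target": {"kid"},
    "co": {"kid"},
    "kid": {"leaf"},
}
neighbors = neighbors_within_two_steps(links, "target")
```

Scanning every page to find parents is fine for a toy graph; a real crawler would keep a reverse index of inbound links.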


Features of Neighbors

○ Labels
○ Anchor Text
○ Surrounding Anchor Text

By using the anchor text, surrounding text, and page title of a parent page in combination with text from target page, classification can be improved.
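A minimal sketch of the anchor-text part of this idea: collect the text of links in a parent page that point at the target, then append it to the target's own text before classification. The `AnchorTextCollector` class, the URLs, and the sample HTML are assumptions for illustration; surrounding text and parent titles are omitted here.

```python
from html.parser import HTMLParser

class AnchorTextCollector(HTMLParser):
    """Collects the text of <a> elements whose href equals target_url."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.anchor_texts = []
        self._inside_target_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self._inside_target_link = True
            self.anchor_texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside_target_link = False

    def handle_data(self, data):
        if self._inside_target_link:
            self.anchor_texts[-1] += data

# Hypothetical parent page and target page.
parent_html = (
    '<p>Latest scores on the '
    '<a href="http://example.com/t">football results page</a> today.</p>'
    '<a href="http://example.com/other">other</a>'
)
target_text = "Results and tables"

collector = AnchorTextCollector("http://example.com/t")
collector.feed(parent_html)

# Text handed to the classifier: the target's own text plus inbound anchor text.
combined_text = target_text + " " + " ".join(collector.anchor_texts)
```

Anchor text is useful here because it is a short description of the target written by someone else, which can help when the target page itself has little text.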


Features of Neighbors

Implicit Links

Connections between pages that appear in the results of the same query and are both clicked by users
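Implicit links of this kind could be derived from a click log roughly as follows. The log format, a list of `(query, clicked_url)` pairs, and the sample entries are assumptions made for the sketch.

```python
from collections import defaultdict
from itertools import combinations

def implicit_links(click_log):
    """Pair up URLs that were clicked for the same query."""
    clicked_by_query = defaultdict(set)
    for query, url in click_log:
        clicked_by_query[query].add(url)
    links = set()
    for urls in clicked_by_query.values():
        # Sort so that each unordered pair is stored once, as (a, b) with a < b.
        for a, b in combinations(sorted(urls), 2):
            links.add((a, b))
    return links

# Hypothetical click log entries.
click_log = [
    ("jaguar", "a.com"),
    ("jaguar", "b.com"),
    ("python", "c.com"),
    ("jaguar", "a.com"),
]
pairs = implicit_links(click_log)
```

The resulting pairs can then be treated like hyperlinks when selecting a page's neighbors.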


Algorithms

What are the algorithmic approaches to webpage classification?
○ Dimension reduction
○ Relational learning
○ Hierarchical classification
○ Information combination


Dimension Reduction

Boost classification by emphasizing certain features that are more useful in classification

Feature Weighting
○ Reduces the dimensions of the feature space
○ Reduces computational complexity
○ Classification is more accurate as a result of the reduced space


Dimension Reduction

Methods

Use first fragment

K-nearest neighbor algorithm
○ Weighted features
○ Weighted HTML Tags
○ Metrics
  - Expected mutual information
  - Mutual information
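To make the mutual information metric concrete, the sketch below scores a term against a category using the common pointwise form log2(P(t, c) / (P(t) P(c))) estimated from document counts. This is one standard formulation and may differ in detail from the variants cited in the paper; the tiny corpus and function name are hypothetical.

```python
import math

def mutual_information(docs, term, category):
    """Pointwise mutual information between a term and a category.

    docs is a list of (set_of_words, label) pairs.
    """
    n_total = len(docs)
    n_term = sum(1 for words, _ in docs if term in words)
    n_cat = sum(1 for _, label in docs if label == category)
    n_both = sum(1 for words, label in docs
                 if term in words and label == category)
    if n_both == 0:
        return float("-inf")  # term never occurs in this category
    return math.log2(n_both * n_total / (n_term * n_cat))

# Hypothetical corpus for illustration.
docs = [
    ({"goal", "team"}, "sports"),
    ({"team", "coach"}, "sports"),
    ({"market", "team"}, "business"),
    ({"market", "profit"}, "business"),
]

mi_market = mutual_information(docs, "market", "business")
mi_team = mutual_information(docs, "team", "business")
```

Terms with high scores for some category are kept or weighted up and the rest dropped, which is how the feature space shrinks. Here "market" scores higher for business than "team", which appears in both categories.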


Relational Learning

Relaxation Labeling


Hierarchical Classification

Based on “divide and conquer”

Classification problems are split into a hierarchical set of sub-problems.

Error Minimization

When a lower-level category is uncertain whether a page belongs to it or not, shift the assignment one level up.
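The top-down descent with a back-off to the parent category might look like the sketch below. The category tree, the per-category confidence scores, and the 0.6 threshold are all hypothetical; a real system would obtain the scores from a classifier trained at each node.

```python
def classify_hierarchically(node, page_scores, threshold=0.6):
    """Descend the hierarchy while some child category is confident enough."""
    current = node
    while current["children"]:
        best = max(current["children"],
                   key=lambda c: page_scores.get(c["name"], 0.0))
        if page_scores.get(best["name"], 0.0) < threshold:
            break  # uncertain: keep the assignment one level up
        current = best
    return current["name"]

# Hypothetical category tree.
HIERARCHY = {
    "name": "Top",
    "children": [
        {"name": "Sports", "children": [
            {"name": "Soccer", "children": []},
            {"name": "Tennis", "children": []},
        ]},
        {"name": "Business", "children": []},
    ],
}

# Hypothetical per-category confidences for two pages.
uncertain_page = {"Sports": 0.9, "Business": 0.1, "Soccer": 0.5, "Tennis": 0.4}
confident_page = {"Sports": 0.9, "Business": 0.1, "Soccer": 0.8, "Tennis": 0.1}

label_uncertain = classify_hierarchically(HIERARCHY, uncertain_page)
label_confident = classify_hierarchically(HIERARCHY, confident_page)
```

The first page stops at "Sports" because neither sub-category clears the threshold, while the second descends to "Soccer".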


Information Combination

Combine several methods into one

Information from different sources is used to train multiple classifiers, and the collective work of those classifiers makes the final decision.
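A simple form of this combination is majority voting. In the sketch below, three toy rule-based functions stand in for classifiers trained on different sources (URL, title, body text); they and the sample page are invented for illustration, and voting is only one of several possible combination schemes.

```python
from collections import Counter

def classify_by_url(page):
    return "sports" if "sports" in page["url"] else "business"

def classify_by_title(page):
    return "business" if "market" in page["title"].lower() else "sports"

def classify_by_text(page):
    return "sports" if "team" in page["text"].lower() else "business"

def combine_by_vote(classifiers, page):
    """Each classifier casts one vote; the most common label wins."""
    votes = Counter(classifier(page) for classifier in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical page on which the three sources disagree.
page = {
    "url": "http://example.com/sports/scores",
    "title": "Market news",
    "text": "The team won the match",
}
classifiers = [classify_by_url, classify_by_title, classify_by_text]
final_label = combine_by_vote(classifiers, page)
```

The title alone suggests "business", but the URL and text outvote it, which is the benefit of drawing on more than one source.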


Conclusion

Webpage classification is a type of supervised learning problem aiming to categorize a webpage into a predefined set of categories.

In the future, efforts will most likely be focused on effectively combining content and link information to build a more accurate classifier


Evolution of Websites

Apple in 1998


Evolution of Websites

Apple in 2008


Evolution of Websites

Nike in 2000


Evolution of Websites

Nike in 2008


Evolution of Websites

Yahoo in 1996


Evolution of Websites

Yahoo in 2008


Evolution of Websites

Microsoft in 1998


Evolution of Websites

Microsoft in 2008


Evolution of Websites

MTV in 1998


Evolution of Websites

MTV in 2008


Sources

Web Page Classification: Features and Algorithms, by Xiaoguang Qi & Brian D. Davison

Visual Adjacency Multigraphs – A Novel Approach for a Web Page Classification, by Milos Kovacevic, Michelangelo Diligenti, Marco Gori, and Veljko Milutinovic

The Evolution of Websites: http://www.wakeuplater.com/website-building/evolution-of-websites-10-popular-websites.aspx