Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010


Page 1: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Synchronicity

Time on the Web - Week 3
CS 895 Fall 2010

Martin Klein
mklein@cs.odu.edu

09/15/2010

Page 2: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

http://www.jcdl2007.org

http://www.jcdl2007.org/JCDL2007_Program.pdf

Page 3: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

• Web users experience 404 errors
  – expected lifetime of a web page is 44 days [Kahle97]
  – 2% of web disappears every week [Fetterly03]
• Are they really gone? Or just relocated?
  – has anybody crawled and indexed it?
  – do Google, Yahoo!, Bing or the IA have a copy of that page?
• Information retrieval techniques needed to (re-)discover content

Page 4: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Environment

Web Infrastructure (WI) [McCown07]

• Web search engines (Google, Yahoo!, Bing) and their caches
• Web archives (Internet Archive)
• Research projects (CiteSeer)

Page 6: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

URI – Content Mapping Problem

1. same URI maps to same or very similar content at a later time
   (time A: U1 → C1; time B: U1 → C1)
2. same URI maps to different content at a later time
   (time A: U1 → C1; time B: U1 → C2)
3. different URI maps to same or very similar content at the same or at a later time
   (time A: U1 → C1; time B: U1 → 404, U2 → C1)
4. the content cannot be found at any URI
   (time A: U1 → C1; time B: U1 → ???)

Page 7: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Content Similarity

JCDL 2005

July 2005: http://www.jcdl2005.org/
Today: http://www.jcdl2005.org/

Page 8: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Content Similarity

Hypertext 2006

August 2006: http://www.ht06.org/
Today: http://www.ht06.org/

Page 9: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Content Similarity

PSP 2003

August 2003: http://www.pspcentral.org/events/annual_meeting_2003.html
Today: http://www.pspcentral.org/events/archive/annual_meeting_2003.html

Page 10: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Content Similarity

ECDL 1999

October 1999: http://www-rocq.inria.fr/EuroDL99/
Today: http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html

Page 11: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Content Similarity

Greynet 1999

1999: http://www.konbib.nl/infolev/greynet/2.5.htm
Today: ? ?

Page 12: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Lexical Signatures (LSs)

• First introduced by Phelps and Wilensky [Phelps00]
• Small set of terms capturing “aboutness” of a document, “lightweight” metadata

[Figure: Resource, Abstract, LS: Removal, Hit Rate, Proxy, Cache, Google, Yahoo]

Page 13: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Generation of Lexical Signatures

• Following TF-IDF scheme first introduced by Spaerck Jones and Robertson [Jones88]
• Term frequency (TF):
  – “How often does this word appear in this document?”
• Inverse document frequency (IDF):
  – “In how many documents does this word appear?”
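The TF and IDF questions above can be turned into a small ranking routine. The following is a minimal sketch, not the generation code used in the talk: the tokenizer, the logarithmic IDF form log(N/DF), the function names and the tiny example corpus are all assumptions chosen for illustration, and the document is assumed to be part of the corpus so that DF is never zero.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF weight.

    `corpus` is a list of document strings that includes `document`,
    so every term has a document frequency of at least 1.
    """
    tf = Counter(tokenize(document))
    n_docs = len(corpus)
    doc_term_sets = [set(tokenize(d)) for d in corpus]
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for terms in doc_term_sets if term in terms)
        idf = math.log(n_docs / df)  # assumed IDF form: log(N / DF)
        scores[term] = freq * idf
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical three-document corpus.
corpus = [
    "digital libraries conference digital libraries",
    "conference on hypertext",
    "web archive conference",
]
signature = lexical_signature(corpus[0], corpus, k=2)
```

In this toy corpus "conference" occurs in every document, so its IDF is zero and it drops out of the signature even though it appears in the page.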

Page 14: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

LS as Proposed by Phelps and Wilensky

• “Robust Hyperlink”
• 5 terms are suitable
• Append LS to URL

http://www.cs.berkeley.edu/~wilensky/NLP.html?lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago

• Limitations:
  1. Applications (browsers) need to be modified to exploit LSs
  2. LSs need to be computed a priori
  3. Works well with most URLs but not with all of them
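Appending an LS to a URL, as in the robust hyperlink shown on this slide, can be sketched with the Python standard library. The function name make_robust_hyperlink is hypothetical; only the lexical-signature query parameter and the example terms come from the slide.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def make_robust_hyperlink(url, signature_terms):
    """Append the LS terms to `url` as a `lexical-signature` query parameter."""
    parts = urlsplit(url)
    # Keep any query parameters the URL already carries.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("lexical-signature", " ".join(signature_terms)))
    # urlencode encodes the spaces between terms as '+'.
    return urlunsplit(parts._replace(query=urlencode(query)))


robust = make_robust_hyperlink(
    "http://www.cs.berkeley.edu/~wilensky/NLP.html",
    ["texttiling", "wilensky", "disambiguation", "subtopic", "iago"],
)
```

A client that understands the parameter could strip it off again and use the terms as a search query once the plain URL returns a 404.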

Page 15: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Generation of Lexical Signatures

• Park et al. [Park03] investigated performance of various LS generation algorithms
• Evaluated “tunability” of TF and IDF component
• Weight on TF increases recall (completeness)
• Weight on IDF improves precision (exactness)

Page 16: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Lexical Signatures -- Examples

Rank/Results: 1/1
URL: http://www.cs.berkeley.edu/~wilensky/NLP.html
LS: texttiling wilensky disambiguation subtopic iago
Query: http://www.google.com/search?q=texttiling+wilensky+disambiguation+subtopic+iago

Rank/Results: na/10
URL: http://www.dli2.nsf.gov
LS: nsdl multiagency imls testbeds extramural
Query: http://www.google.com/search?q=nsdl+multiagency+imls+testbeds+extramural

Rank/Results: 1/221,000 (1/174,000 in 01/2008)
URL: http://www.loc.gov
LS: library collections congress thomas american
Query: http://www.google.com/search?q=library+collections+congress+thomas+american

Rank/Results: 1/51 (2/77 in 01/2008)
URL: http://www.jcdl2008.org
LS: libraries jcdl digital conference pst
Query: http://www.google.com/search?q=libraries+jcdl+digital+conference+pst

Page 17: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Synchronicity

404 error occurs while browsing
  look for same or older page in WI (1)
  if user satisfied
    return page (2)
  else
    generate LS from retrieved page (3)
    query SEs with LS
    if result sufficient
      return “good enough” alternative page (4)
    else
      get more input about desired content (5)
        (link neighborhood, user input,...)
      re-generate LS && query SEs ...
      return pages (6)

The system may not return any results at all
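The numbered steps of this workflow can be written down as a single function. This is only a sketch under stated assumptions: every callable passed in (find_copy, user_satisfied, generate_ls, query_search_engines, is_sufficient, gather_more_input) is a hypothetical stand-in for a component the slide names but does not specify, and jumping straight to step (5) when the WI holds no copy is an assumption about the intended behaviour.

```python
def rediscover(uri, find_copy, user_satisfied, generate_ls,
               query_search_engines, is_sufficient, gather_more_input):
    """Sketch of the workflow triggered by a 404; all callables are hypothetical."""
    copy = find_copy(uri)                                  # step (1): look in the WI
    if copy is not None:
        if user_satisfied(copy):
            return [copy]                                  # step (2): return the page
        results = query_search_engines(generate_ls(copy))  # step (3): LS from the copy
        if is_sufficient(results):
            return results                                 # step (4): "good enough" page
    extra = gather_more_input(uri)                         # step (5): neighborhood, user input
    return query_search_engines(generate_ls(extra))        # step (6): may be an empty list


# Toy stand-ins to exercise both paths.
found = rediscover(
    "http://example.org/old",
    find_copy=lambda uri: "old page text",
    user_satisfied=lambda copy: False,
    generate_ls=lambda text: text.split()[:2],
    query_search_engines=lambda ls: ["http://example.org/new"] if "old" in ls else [],
    is_sufficient=bool,
    gather_more_input=lambda uri: "",
)
nothing = rediscover(
    "http://example.org/gone",
    find_copy=lambda uri: None,
    user_satisfied=lambda copy: False,
    generate_ls=lambda text: text.split()[:2],
    query_search_engines=lambda ls: [],
    is_sufficient=bool,
    gather_more_input=lambda uri: "neighbour text",
)
```

The second call shows the case noted on the slide: the function can legitimately come back with no results at all.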

Page 18: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Synchro…What?

Synchronicity
• Experience of causally unrelated events occurring together in a meaningful manner
• Events reveal underlying pattern, framework bigger than any of the synchronous systems
• Carl Gustav Jung (1875-1961)
  – “meaningful coincidence”
  – Deschamps – de Fontgibu plum pudding example

picture from http://www.crystalinks.com/jung.html

Page 19: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

19

404 Errors

Page 20: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

20

404 Errors

Page 21: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

21

“Soft 404” Errors

Page 22: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

22

“Soft 404” Errors

Page 23: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web

(WIDM 2008)

Page 24: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

• LSs are usually generated following the TF-IDF scheme
• TF rather trivial to compute
• IDF requires knowledge about:
  – overall size of the corpus (# of documents)
  – # of documents a term occurs in
• Also not complicated to compute for bounded corpora (such as TREC)
• If the web is the corpus, values can only be estimated

Page 25: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Idea

• Use IDF values obtained from
  1. Local collection of web pages
  2. “screen scraping” SE result pages
• Validate both methods through comparison to baseline
• Use Google N-Grams as baseline
• Note: N-Grams provide term count (TC) and not DF values – details to come

Page 26: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

26

Accurate IDF Values for LSs

Screen scraping the Google web interface

Page 27: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

27

The Dataset

Local universe consisting of copies of URLs from the IA between 1996 and 2007

Page 28: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Dataset

Same as above, follows Zipf distribution

10,493 observations
254,384 total terms
16,791 unique terms

Page 29: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Total terms vs new terms

The Dataset

Page 30: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

LSs Example

Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms

Page 31: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Comparing LSs

1. Normalized term overlap
  • Assume term commutativity
  • k-term LSs normalized by k
2. Kendall Tau
  • Modified version since LSs to compare may contain different terms
3. M-Score
  • Penalizes discordance in higher ranks
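The first measure, normalized term overlap, is simple enough to sketch directly. The function name and the example LSs are assumptions; the definition follows the slide: term order is ignored (commutativity) and the number of shared terms is divided by k for two k-term LSs.

```python
def normalized_term_overlap(ls_a, ls_b):
    """Share of terms two k-term LSs have in common, ignoring term order."""
    k = len(ls_a)
    if k == 0 or k != len(ls_b):
        raise ValueError("both LSs must contain the same non-zero number of terms")
    return len(set(ls_a) & set(ls_b)) / k


# Hypothetical 5-term LSs that share three terms.
overlap = normalized_term_overlap(
    ["wine", "napa", "merlot", "cellar", "tasting"],
    ["wine", "napa", "syrah", "cellar", "vineyard"],
)
```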

Page 32: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Top 5, 10 and 15 terms

LC – local universe

SC – screen scraping

NG – N-Grams

Comparing LSs

Page 33: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Conclusions

• Both methods for the computation of IDF values provide accurate results
  – compared to the Google N-Gram baseline
• Screen scraping method seems preferable since
  – similarity scores slightly higher
  – feasible in real time

Page 34: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Correlation of Term Count and Document Frequency for Google N-Grams

(ECIR 2009)

Page 35: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

• Need for a reliable source to accurately compute IDF values of web pages (in real time)
• Screen scraping has been shown to work, but
  – validation of the baseline (Google N-Grams) is missing
  – N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF: what is their relationship?

Page 36: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Background & Motivation

• Term frequency (TF) – inverse document frequency (IDF) is a well known term weighting concept
  – Used (among others) to generate lexical signatures (LSs)
• TF is not hard to compute, IDF is since it depends on global knowledge about the corpus
  – When the entire web is the corpus IDF can only be estimated!
• Most text corpora provide term count values (TC)

D1 = “Please, Please Me”
D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”
D4 = “Long, Long, Long”

Term  All  Buy  Can’t  Is  Love  Me  Need  Please  You  Long
TC    1    1    1      1   2     2   1     2       1    3
DF    1    1    1      1   2     2   1     1       1    1

TC >= DF but is there a correlation? Can we use TC to estimate DF?
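The difference between TC and DF in the four-title example can be checked with a few lines of Python. This is a sketch: the tokenizer (lowercasing, keeping apostrophes) and the function name are assumptions, while the four documents are the ones from the slide.

```python
import re
from collections import Counter

titles = [
    "Please, Please Me",
    "Can't Buy Me Love",
    "All You Need Is Love",
    "Long, Long, Long",
]


def term_count_and_document_frequency(documents):
    """Return (TC, DF): total occurrences per term and number of documents per term."""
    tc = Counter()
    df = Counter()
    for doc in documents:
        terms = re.findall(r"[a-z']+", doc.lower())
        tc.update(terms)       # every occurrence counts towards TC
        df.update(set(terms))  # each document counts at most once towards DF
    return tc, df


tc, df = term_count_and_document_frequency(titles)
```

"Long" has TC 3 but DF 1 because all three occurrences sit in one document, while "Love" has TC 2 and DF 2.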

Page 37: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Idea

• Investigate relationship between:
  – TC and DF within the Web as Corpus (WaC)
  – WaC based TC and Google N-Gram based TC
• TREC, BNC could be used but:
  – they are not free
  – TREC has been shown to be somewhat dated [Chiang05]

Page 38: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Experiment

• Analyze correlation of list of terms ordered by their TC and DF rank by computing:
  – Spearman’s Rho
  – Kendall Tau
• Display frequency of TC/DF ratio for all terms
• Compare TC (WaC) and TC (N-Grams) frequencies
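Both rank correlation measures named here can be sketched without external libraries. The code below is an illustration under assumptions, not the analysis code behind the results: it uses the textbook Spearman formula for rankings without ties and the tau-a variant of Kendall's measure, and the two example rankings are hypothetical.

```python
from itertools import combinations


def spearman_rho(ranks_x, ranks_y):
    """Spearman's rho for two rankings of the same items, assuming no ties."""
    n = len(ranks_x)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def kendall_tau(ranks_x, ranks_y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(ranks_x, ranks_y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(ranks_x) * (len(ranks_x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks of five terms by TC and by DF; two neighbours are swapped.
tc_ranks = [1, 2, 3, 4, 5]
df_ranks = [1, 3, 2, 4, 5]
rho = spearman_rho(tc_ranks, df_ranks)
tau = kendall_tau(tc_ranks, df_ranks)
```

Real term lists contain many ties (terms with identical counts), so a tie-aware variant would be needed in practice.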

Page 39: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

39

Experiment Results

Investigate correlation between TC and DF within “Web as Corpus” (WaC)

Rank similarity of all terms

Page 40: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

40

Experiment Results

Investigate correlation between TC and DF within “Web as Corpus” (WaC)

Spearman’s ρ and Kendall τ

Page 41: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

41

Experiment Results

Top 10 terms in decreasing order of their TF/IDF values, taken from http://ecir09.irit.fr

Rank  WaC-DF      WaC-TC      Google      N-Grams
1     IR          IR          IR          IR
2     RETRIEVAL   RETRIEVAL   RETRIEVAL   IRSG
3     IRSG        IRSG        IRSG        RETRIEVAL
4     BCS         IRIT        CONFERENCE  BCS
5     IRIT        BCS         BCS         EUROPEAN
6     CONFERENCE  2009        GRANT       CONFERENCE
7     GOOGLE      FILTERING   IRIT        IRIT
8     2009        GOOGLE      FILTERING   GOOGLE
9     FILTERING   CONFERENCE  EUROPEAN    ACM
10    GRANT       ARIA        PAPERS      GRANT

Google: screen scraping DF (?) values from the Google web interface

U = 14
∩ = 6

Strong indicator that TC can be used to estimate DF for web pages!

Page 42: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Experiment Results

Frequency of TC/DF Ratio Within the WaC

[Figure panels: Two Decimals, One Decimal, Integer Values]

Page 43: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

43

Experiment Results

Show similarity between WaC based TC and Google N-Gram based TC

TC frequencies

N-Grams have a threshold of 200

Page 44: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Conclusions

• TC and DF ranks within the WaC show strong correlation
• TC frequencies of WaC and Google N-Grams are very similar
• Together with results shown earlier (high correlation between baseline and two other methods) N-Grams seem suitable for accurate IDF estimation for web pages

Does not mean everything correlated to TC can be used as DF substitute!

Page 45: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Inter-Search Engine Lexical Signature Performance

(JCDL 2009)

Page 46: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Inter-Search Engine Lexical Signature Performance

Martin Klein, Michael L. Nelson
{mklein,mln}@cs.odu.edu

http://en.wikipedia.org/wiki/Elephant

Elephant, Tusks, Trunk, African, Loxodonta
Elephant, Asian, African, Species, Trunk
Elephant, African, Tusks, Asian, Trunk

Page 47: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

47

Page 48: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Revisiting Lexical Signatures to (Re-)Discover Web Pages

(ECDL 2008)

Page 49: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

How to Evaluate the Evolution of LSs over Time

Idea:
• Conduct overlap analysis of LSs
• LSs based on local universe mentioned above
• Neither Phelps and Wilensky nor Park et al. did that
• Park et al. just re-confirmed their findings after 6 months

Page 50: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

50

Dataset

Local universe consisting of copies of URLs from the IA between 1996 and 2007

Page 51: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

10-term LSs generated for http://www.perfect10wines.com

LSs Over Time - Example

Page 52: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

52

LS Overlap Analysis

Rooted: overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years that URL has been observed

Sliding: overlap between two LSs of consecutive years starting with the first year and ending with the last
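The two overlap definitions differ only in which pairs of years are compared, which a short sketch makes explicit. The function names, the normalization by the longer LS and the per-year LSs below are assumptions for illustration; the slide does not give the actual signatures or the exact normalization.

```python
def overlap(ls_a, ls_b):
    """Fraction of shared terms, normalized by the longer of two non-empty LSs."""
    return len(set(ls_a) & set(ls_b)) / max(len(ls_a), len(ls_b))


def rooted_overlap(ls_by_year):
    """Compare the LS of the first observed year with every later year."""
    years = sorted(ls_by_year)
    root = ls_by_year[years[0]]
    return {year: overlap(root, ls_by_year[year]) for year in years[1:]}


def sliding_overlap(ls_by_year):
    """Compare the LSs of each pair of consecutive observed years."""
    years = sorted(ls_by_year)
    return {(a, b): overlap(ls_by_year[a], ls_by_year[b])
            for a, b in zip(years, years[1:])}


# Hypothetical 3-term LSs of one URL over three years.
ls_by_year = {
    2005: ["wine", "napa", "merlot"],
    2006: ["wine", "napa", "syrah"],
    2007: ["wine", "cellar", "syrah"],
}
rooted = rooted_overlap(ls_by_year)
sliding = sliding_overlap(ls_by_year)
```

In this toy data the rooted overlap keeps falling (2/3, then 1/3) while the sliding overlap stays at 2/3, the same qualitative contrast the two definitions are meant to expose.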

Page 53: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

53

Evolution of LSs over Time

Results:
• Little overlap between the early years and more recent ones
• Highest overlap in the first 1-2 years after creation of the LS
• Rarely peaks after that – once terms are gone they do not return

Rooted

Page 54: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

54

Evolution of LSs over Time

Results:
• Overlap increases over time
• Seems to reach steady state around 2003

Sliding

Page 55: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

55

Performance of LSs

Idea:
• Query Google search API with LSs
• LSs based on local universe mentioned above
• Identify URL in result set
• For each URL it is possible that:
  1. URL is returned as the top ranked result
  2. URL is ranked somewhere between 2 and 10
  3. URL is ranked somewhere between 11 and 100
  4. URL is ranked somewhere beyond rank 100 (considered as not returned)

Page 56: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

56

Performance of LSs wrt Number of Terms

Results:
• 2-, 3- and 4-term LSs perform poorly
• 5-, 6- and 7-term LSs seem best
  – Top mean rank (MR) value with 5 terms
  – Most top ranked with 7 terms
  – Binary pattern: either in top 10 or undiscovered
• 8 terms and beyond do not show improvement

Page 57: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

57

Performance - Number of Terms

• Lightest gray = rank 1

• Black = rank 101 and beyond

• Ranks 11-20, 21-30,… colored proportionally

• 50% top ranked, 20% in top 10, 30% black

Rank distribution of 5 term LSs

Performance of LSs wrt Number of Terms

Page 58: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

58

Performance of LSs

Scoring (generalized from Park et al.)
Equation in Section 6.1

• Fair:
  – Gives credit to all URLs equally with linear spacing between ranks
• Optimistic:
  – Bigger penalty for lower ranks
• Scores for the position of a URL in a list of 10:
  – Fair: 10/10, 9/10, 8/10 … 1/10, 0
  – Optimistic: 1/1, 1/2, 1/3 … 1/10, 0
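The two score sequences listed for a list of 10 can be reproduced with the following sketch. It is not the equation from Section 6.1 of the paper, which is not shown in the slides; the function names, the list_length parameter and the use of None for an undiscovered URL are assumptions that merely match the values above.

```python
def fair_score(rank, list_length=10):
    """Linear credit: 10/10 for rank 1 down to 1/10 for rank 10, else 0."""
    if rank is None or rank > list_length:
        return 0.0
    return (list_length - rank + 1) / list_length


def optimistic_score(rank, list_length=10):
    """Reciprocal credit: 1/1, 1/2, 1/3 ... 1/10, else 0."""
    if rank is None or rank > list_length:
        return 0.0
    return 1 / rank


fair = [fair_score(r) for r in (1, 2, 10, 11)]
optimistic = [optimistic_score(r) for r in (1, 2, 10, 11)]
```

The reciprocal form drops much faster than the linear one: rank 2 still earns 0.9 under the fair score but only 0.5 under the optimistic score.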

Page 59: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

59

Fair and optimistic score for LSs consisting of 2-15 terms (mean values over all years)

Performance of LSs wrt Number of Terms

Page 60: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

60

Performance of LSs over Time

Score for LSs consisting of 2, 5, 7 and 10 terms

Fair Optimistic

Page 61: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Conclusions

• LSs decay over time
  – Rooted: quickly after generation
  – Sliding: seem to stabilize
• 5-, 6- and 7-term LSs seem to perform best
  – 7 – most top ranked
  – 5 – fewest undiscovered
  – 5 – lowest mean rank
• 8 terms and beyond hurt performance

Page 62: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure

(JCDL 2010)

Page 63: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

Internet Archive - Wayback Machine

www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com

59 copies

Lexical Signature (TF/IDF)
Charter Aircraft Cargo Passenger Jet Air Enquiry

Title
ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International

Page 64: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

www.aircharter-international.com

Lexical Signature (TF/IDF)
Charter Aircraft Cargo Passenger Jet Air Enquiry

Page 65: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

www.aircharter-international.com

Title
ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International

Page 66: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

If no archived/cached copy can be found...

Tags

Link Neighborhood (LNLS)

[Figure: pages A, B and C?]

Page 67: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

Page 68: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Contributions

• Compare performance of four automated methods to rediscover web pages
  1. Lexical signatures (LSs)
  2. Titles
  3. Tags
  4. LNLS
• Analysis of title characteristics wrt their retrieval performance
• Evaluate performance of combination of methods and suggest workflow for real time web page rediscovery

Page 69: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Experiment - Data Gathering

• 500 URIs randomly sampled from DMOZ
• Applied filters
  – .com, .org, .net, .edu domains
  – English Language
  – min. of 50 terms [Park]
• Results in 309 URIs to download and parse

Page 70: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Experiment - Data Gathering

• Extract title
  – <Title>...</Title>
• Generate 3 LSs per page
  – IDF values obtained from Google, Yahoo!, MSN Live
• Obtain tags from delicious.com API (only 15%)
• Obtain link neighborhood from Yahoo! API (max. 50 URIs)
  – Generate LNLS
  – TF from “bucket” of words per neighborhood
  – IDF obtained from Yahoo! API

Page 71: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

71

LS Retrieval Performance

5- and 7-Term LSs

•Yahoo! returns most URIs top ranked and leaves least undiscovered

•Binary retrieval pattern, URI either within top 10 or undiscovered

LS Retrieval Performance

Page 72: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

72

Title Retrieval Performance

Non-Quoted and Quoted Titles

•Results at least as good as for LSs

•Google and Yahoo! return more URIs for non-quoted titles

•Same binary retrieval pattern

Title Retrieval Performance

Page 73: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

73

Tags Retrieval Performance

•API returns up to top10 tags - distinguish between # of tags queried

•Low # of URIs

Tags Retrieval Performance

Page 74: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

74

LNLS Retrieval Performance

•5- and 7-term LNLSs

•< 5% top ranked

LNLS Retrieval Performance

Page 75: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Combination of Methods

Can we achieve better retrieval performance if we combine 2 or more methods?

[Flowchart boxes: Query LS, Query Title, Query Tags, Query LNLS, Done]

Page 76: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Combination of Methods

Google
      Top    Top10  Undis
LS5   50.8   12.6   32.4
LS7   57.3   9.1    31.1
TI    69.3   8.1    19.7
TA    2.1    10.6   75.5

Yahoo!
      Top    Top10  Undis
LS5   67.6   7.8    22.3
LS7   66.7   4.5    26.9
TI    63.8   8.1    27.5
TA    6.4    17.0   63.8

MSN Live
      Top    Top10  Undis
LS5   63.1   8.1    27.2
LS7   62.8   5.8    29.8
TI    61.5   6.8    30.7
TA    0      8.5    80.9

Page 77: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Combination of Methods

Top Results for Combination of Methods

            Google  Yahoo!  MSN Live
LS5-TI      65.0    73.8    71.5
LS7-TI      70.9    75.7    73.8
TI-LS5      73.5    75.7    73.1
TI-LS7      74.1    75.1    74.1

LS5-TI-LS7  65.4    73.8    72.5
LS7-TI-LS5  71.2    76.4    74.4
TI-LS5-LS7  73.8    75.7    74.1
TI-LS7-LS5  74.4    75.7    74.8

LS5-LS7     52.8    68.0    64.4
LS7-LS5     59.9    71.5    66.7

Page 78: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

78

•Length varies between 1 and 43 terms

•Length between 3 and 6 terms occurs most frequently and performs well [Ntoulas]

Title Characteristics

Length in # of Terms

Title Characteristics

Page 79: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

79

•Length varies between 4 and 294 characters

•Short titles (<10) do not perform well

•Length between 10 and 70 most common

•Length between 10 and 45 seem to perform best

Title Characteristics

Length in # of Characters

Title Characteristics

Page 80: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

80

•Titles whose terms have a mean length of 5, 6 or 7 characters seem to perform best

•More than 1 or 2 stop words hurts performance

Title Characteristics

Mean # of Characters, # of Stop Words

Title Characteristics

Page 81: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

81

Concluding Remarks

Lexical signatures, as much as titles, are very suitable as search engine queries to rediscover missing web pages. They return 50-70% of URIs top ranked.

Tags and link neighborhood LSs do not seem to significantly contribute to the retrieval of the web pages.

Titles are much cheaper to obtain than LSs. The combination of primarily querying titles and 5-term LSs as a second option returns more than 75% of URIs top ranked.

Not all titles are equally good. Titles containing between 3 and 6 terms seem to perform best. More than a couple of stop words hurt the performance.

Page 82: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Is This a Good Title?

(Hypertext 2010)

Page 83: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

Professional Scholarly Publishing 2003
http://www.pspcentral.org/events/annual_meeting_2003.html

Page 84: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

Internet Archive - Wayback Machine

www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com

59 copies

Lexical Signature (TF/IDF)
Charter Aircraft Cargo Passenger Jet Air Enquiry

Title
ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International

Page 85: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

www.aircharter-international.com

Lexical Signature (TF/IDF)
Charter Aircraft Cargo Passenger Jet Air Enquiry

Page 86: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

www.aircharter-international.com

Title
ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International

Page 87: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

http://www.drbartell.com/

Lexical Signature (TF/IDF)
Plastic Surgeon Reconstructive Dr Bartell Symbol University

???

Page 88: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

http://www.drbartell.com/

Title
Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery

Page 89: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

www.reagan.navy.mil

Lexical Signature (TF/IDF)
Ronald USS MCSN Torrey Naval Sea Commanding

Page 90: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

The Problem

www.reagan.navy.mil

Title
Home Page ???

Is This a Good Title?

Page 91: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

91

Contributions

• Discuss discovery performance of web page titles (compared to LSs)

• Analysis of discovered pages regarding their relevancy

• Display title evolution compared to content evolution over time

• Provide prediction model for title’s retrieval potential

Contributions

Page 92: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Experiment - Data Gathering

• 20k URIs randomly sampled from DMOZ
• Applied filters
  – English language
  – min. of 50 terms
• Results in 6,875 URIs
• Downloaded and parsed the pages
• Extract title and generate LS per page (baseline)

          .com   .org  .net  .edu  sum
Original  15289  2755  1459  497   20000
Filtered  4863   1327  369   316   6875

Page 93: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

93

Title (and LS) Retrieval Performance

Titles 5- and 7-Term LSs

•Titles return more than 60% URIs top ranked

•Binary retrieval pattern, URI either within top 10 or undiscovered

Title and LS Retrieval Performance

Page 94: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

94

???

Relevancy of Retrieval Results

•Distinguish between discovered (top 10) and undiscovered URIs

•Analyze content of top 10 results

•Measure relevancy in terms of normalized term overlap and shingles between original URI and search result by rank

Do titles return relevant results besides the original URI?

Relevancy of Retrieval Results
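The shingle comparison between an original page and a search result can be sketched as follows. The slides do not say how the shingle value is computed, so this is an assumption: word-level shingles of width 3 compared with the Jaccard coefficient, a common formulation. The function names and the width are hypothetical.

```python
def shingles(text, w=3):
    """Set of all runs of w consecutive lowercase tokens in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def shingle_similarity(text_a, text_b, w=3):
    """Jaccard coefficient of the two shingle sets (assumed similarity measure)."""
    a, b = shingles(text_a, w), shingles(text_b, w)
    union = a | b
    if not union:
        return 0.0  # both texts are shorter than w tokens
    return len(a & b) / len(union)


same = shingle_similarity("a b c d", "a b c d")
partial = shingle_similarity("a b c d", "a b c e")
```

Unlike plain term overlap, shingles are sensitive to word order, which is why the two measures can disagree on pages that reuse the same vocabulary in a different arrangement.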

Page 95: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

95

Relevancy of Retrieval Results

Term Overlap
Discovered, Undiscovered

High relevancy in the top ranks with possible aliases and duplicates.

Relevancy of Retrieval Results

Page 96: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

96

Relevancy of Retrieval Results

Shingles
Discovered, Undiscovered

More optimal shingles values than top ranked URIs - possible aliases and duplicates.

Relevancy of Retrieval Results

Page 97: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Title Evolution - Example I

www.sun.com/solutions

1998-01-27 Sun Software Products Selector Guides - Solutions Tree
1999-02-20 Sun Software Solutions
2002-02-01 Sun Microsystems Products
2002-06-01 Sun Microsystems - Business & Industry Solutions
2003-08-01 Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions
2004-02-02 Sun Microsystems – Solutions
2004-06-10 Gateway Page - Sun Solutions
2006-01-09 Sun Microsystems Solutions & Services
2007-01-03 Services & Solutions
2007-02-07 Sun Services & Solutions
2008-01-19 Sun Solutions

Page 98: Synchronicity Time on the Web - Week 3 CS 895 Fall 2010 Martin Klein mklein@cs.odu.edu 09/15/2010

Title Evolution - Example II

www.datacity.com/mainf.html

2000-06-19 DataCity of Manassas Park Main Page
2000-10-12 DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives
2001-08-21 DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives
2002-10-16 computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free
2006-03-14 Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB

•Copies from fixed size time windows per year

•Extract available titles of past 14 years

•Compute normalized Levenshtein edit distance between titles of copies and baseline (0 = identical; 1 = completely dissimilar)

How much do titles change over time?
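A minimal sketch of the normalized Levenshtein edit distance used to compare an archived title with the baseline title. The slides do not say how the distance is normalized; dividing by the length of the longer title is an assumption chosen here because it yields the stated range of 0 (identical) to 1 (completely dissimilar). Function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def normalized_edit_distance(title, baseline):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(title), len(baseline))
    if longest == 0:
        return 0.0  # two empty titles are identical
    return levenshtein(title, baseline) / longest
```

For example, comparing "Sun Services & Solutions" with "Services & Solutions" costs four deletions ("Sun" plus one space) over a longer length of 24 characters.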

Title Evolution Over Time

Title Evolution Over Time

Title edit distance frequencies

•Half the titles of available copies from recent years are (close to) identical

•Decay from 2005 on (with fewer copies available)

•4-year-old title: 40% chance to be unchanged

Title Evolution Over Time

Title vs Document

•Y: avg shingle value for all copies per URI

•X: avg edit distance of corresponding titles

•overlap indicated by: green: <10; red: >90

•Semi-transparent: total amount of points plotted

[0,1] - over 1600 times

[0,0] - 122 times

Title Performance Prediction

•Quality prediction of title by

•Number of nouns, articles etc.

•Amount of title terms, characters ([Ntoulas])

•Observation of re-occurring terms in poorly performing titles - “Stop Titles”

home, index, home page, welcome, untitled document

The performance of any given title can be predicted as insufficient if 75% or more of it consists of a “Stop Title”!
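The 75% rule can be sketched as a simple check. This is an illustration under stated assumptions: the slide does not say whether the share is measured in terms or characters (terms are used here), the stop-title list contains only the five examples shown on the slide, and an empty title is treated as insufficient. All names are hypothetical.

```python
import re

# Only the examples listed on the slide; the full list is not given.
STOP_TITLES = ["untitled document", "home page", "welcome", "index", "home"]

def stop_title_fraction(title):
    """Fraction of the title's terms that are covered by a stop title."""
    terms = re.findall(r"[a-z0-9]+", title.lower())
    if not terms:
        return 1.0  # assumption: an empty title carries no usable information
    covered = [False] * len(terms)
    for stop in STOP_TITLES:
        stop_terms = stop.split()
        k = len(stop_terms)
        for i in range(len(terms) - k + 1):
            if terms[i:i + k] == stop_terms:
                for j in range(i, i + k):
                    covered[j] = True
    return sum(covered) / len(terms)

def is_predicted_insufficient(title, threshold=0.75):
    """Apply the 75% rule from the slide to a single title."""
    return stop_title_fraction(title) >= threshold
```

With this reading, "Home Page" is flagged, while "Sun Microsystems Home Page" (half stop-title terms) is not.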

[Ntoulas] A. Ntoulas et al. “Detecting Spam Web Pages Through Content Analysis” In Proceedings of WWW 2006, pp 83-92

Concluding Remarks

The “aboutness” of web pages can be determined from either the content or from the title.

More than 60% of URIs are returned top ranked when using the title as a search engine query.

Titles change more slowly and less significantly over time than the web pages’ content.

Not all titles are equally good. If the majority of a title’s terms are Stop Titles, its quality can be predicted to be poor.

Conclusions

Comparing the Performance of US College Football Teams in the Web and on the Field (Hypertext 2009)

Naming Conventions

Football

Soccer

Motivation

• “Does Authority mean Quality?” [Amento00]

• Link-based web page metrics can be used to estimate experts’ assessment of quality

• Lists compiled by experts are cool!

– Companies, schools, people, places, etc

• “Big 3” search engines play a central role in our lives

– “If I can’t find it in the top 10 it doesn’t exist in the web”

– SEOs

• Do expert rankings of real-world entities correlate with search engine ranking of corresponding web resources?

Background

•Expert ranking of real-world entities:

•Collegiate football programs in the US

•Associated Press (AP) poll

•65 sportswriters and broadcasters

•USA Today Coaches poll

•63 college football head coaches

•Published once a week, top 25 teams, 25-1 point system

• “Big 3” search engines

•Google, Yahoo and MSN Live (APIs)

US College Football Season 2008

•2008 season began on August 28th 2008

•Concluded January 8th 2009

•18 instances of poll data:

•Final polls from 2007 season (as a baseline)

•2008 pre-season polls

•once for each of the 16 weeks of the 2008 season

Mapping Resources to URLs

•Often impossible to distill the canonical URL for a football program

•e.g. Virginia Tech college football returned

•Official school page

•Commercial sports sites

•Wikipedia

•Blogs, Fan sites, etc

Mapping Resources to URIs

Mapping Resources to URLs

•Query 3 search engine APIs for representative URLs

•Query: schoolname+College+Football

•e.g.: Ohio+State+College+Football

•Aggregate the top 8 representative URLs (n = 1 .. 8)

•Temporal aspect in mind:

•Repeat query and renew aggregation weekly

Mapping Resources to URIs

Ordinal Ranking of URLs from SE Queries

We are not interested in computing a search engine’s absolute ranking for a particular URL (PR values)

BUT

We are interested in the order in which a search engine ranks URLs

Ordinal Ranking of URIs from SE Queries

Ordinal Ranking of URLs from SE Queries

•Search engines enforce query restrictions (length, number per day, etc.)

•Build unbiased and overlapping queries

•site and OR operators

•Variation of strand sort

USC, Georgia, Ohio State, Oklahoma, Florida

site:http://usctrojans.cstv.com/sports/m-footbl/usc-m-footbl-body.html OR site:http://uga.rivals.com/ OR site:http://sportsillustrated.cnn.com/football/ncaa/teams/ohiost/ OR site:http://www.soonersports.com/ OR site:http://www.gatorzone.com/
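A small sketch of how such queries might be assembled. The slides only state that the queries are overlapping and use the site and OR operators; the batch size, the amount of overlap, and both function names below are assumptions made for illustration. Sharing URLs between consecutive batches is one way to let the per-query orderings be merged into a single ordinal ranking afterwards.

```python
def build_site_query(urls):
    """Join URLs into one query string using the site and OR operators."""
    return " OR ".join("site:" + url for url in urls)

def overlapping_batches(urls, size, overlap):
    """Split urls into batches of at most `size` items; consecutive batches
    share `overlap` URLs so that their result orderings can be merged."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    batches = []
    for start in range(0, len(urls), step):
        batches.append(urls[start:start + size])
        if start + size >= len(urls):
            break  # the last batch already reaches the end of the list
    return batches

if __name__ == "__main__":
    urls = ["http://www.soonersports.com/", "http://www.gatorzone.com/"]
    print(build_site_query(urls))
```

Keeping each batch short is what keeps the query within a search engine's length restriction; the right size depends on the engine and is not given in the slides.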

Ordinal Ranking of URIs from SE Queries

Weighting Ranked URLs

• If real-world resources are mapped to more than one URL (n > 1)

•Need to accumulate ranking score

•Determine one final overall school score

•Assign weights per URL depending on their rank

P - Position of URL in result set

T - Total number of URLs in the list (n * number of teams)

Weighting Ranked URIs

Correlation Results

Kendall Tau used to test for statistically significant (p<0.05) correlation
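For reference, Kendall's tau can be computed from concordant and discordant pairs as sketched below. This is the tau-b variant, which corrects for ties; the slides do not say which variant was used, and the significance test (p<0.05) is not reproduced here. The function name is hypothetical.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long lists of ranks or scores."""
    if len(x) != len(y):
        raise ValueError("both rankings must have the same length")
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1   # both lists order this pair the same way
        elif dx * dy < 0:
            discordant += 1   # the lists disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    denominator = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denominator == 0:
        raise ValueError("tau is undefined when one list is constant or too short")
    return (concordant - discordant) / denominator
```

A value of 1 means the poll and the search engine order the teams identically, -1 means the orders are reversed, and values near 0 indicate no ordinal agreement.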

[Figure panels: Top 10 AP Poll, Top 10 USA Poll]

Correlation Results

[Figure panels: Top 25 AP Poll, Top 25 USA Poll; annotation: “Inertia”]

Kendall Tau used to test for statistically significant (p<0.05) correlation

n-Values for Correlation

[Figure panels: Top 10 AP Poll, Top 10 USA Poll]

N-Values for Correlation

n-Values for Correlation

[Figure panels: Top 25 AP Poll, Top 25 USA Poll; n=2..6]

N-Values for Correlation

Correlation of Overlapping URLs Over Time

USC, Georgia, Ohio State, Oklahoma, Florida, Missouri, Texas, Texas Tech, Alabama, BYU, Penn State, Utah

• 12 schools occur in all AP polls throughout the season

•Given the “inertia”, by how much does the web trail?

•Can we measure a “delayed correlation”?

•Declare AP ranking for each week as separate “truth values”

•Compute correlation between truth values and search engine ranking

• Expect to see an increased correlation in the weeks following the truth value
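The "delayed correlation" idea can be sketched as follows: fix one week's AP ranking as the truth value and correlate it with the search engine ranking of that week and of every later week. The data layout (one list of ranks per week, schools in a fixed order), the use of the simple tau-a variant without tie correction, and the function names are assumptions for illustration only.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs, no tie correction."""
    pairs = list(combinations(range(len(x)), 2))
    if not pairs:
        raise ValueError("need at least two schools to compare rankings")
    score = 0
    for i, j in pairs:
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)  # +1 concordant, -1 discordant
    return score / len(pairs)

def delayed_correlations(truth_by_week, se_by_week, truth_week):
    """Correlate the poll ranking of `truth_week` with the search engine
    ranking of that week and of each following week."""
    truth = truth_by_week[truth_week]
    return [kendall_tau(truth, se_by_week[week])
            for week in range(truth_week, len(se_by_week))]
```

If the web trails the polls, the returned list would be expected to rise in the entries after the first one before falling off again.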

Correlation of Overlapping URIs Over Time

Correlation of Overlapping URLs Over Time

[Figure: n=8]

Correlation of Overlapping URIs Over Time

Correlation Between Attendance and SE and Polls

[Figure panels: AP, USA Today, Google n=6, Google n=1]

Concluding Remarks

• Inspired by “Does Authority mean Quality?” we asked “Does Quality mean Authority?”

• High correlations for the last season’s final rankings and rankings early in the season

• Correlation decreases because of “inertia”

• No correlation between attendance and search engine rankings

Conclusions

Although authority means quality, quality does not necessarily mean authority - at least not immediately.
