Synchronicity

Time on the Web - Week 3
CS 895 Fall 2010

Martin Klein

09/15/2010

The Problem

http://www.jcdl2007.org

http://www.jcdl2007.org/JCDL2007_Program.pdf


The Problem

• Web users experience 404 errors
  • expected lifetime of a web page is 44 days [Kahle97]
  • 2% of web disappears every week [Fetterly03]
• Are they really gone? Or just relocated?
  • has anybody crawled and indexed it?
  • do Google, Yahoo!, Bing or the IA have a copy of that page?
• Information retrieval techniques needed to (re-)discover content

Web Infrastructure (WI) [McCown07]

• Web search engines (Google, Yahoo!, Bing) and their caches

• Web archives (Internet Archive)
• Research projects (CiteSeer)

The Environment

Digital preservation happens in the WI


Refreshing and Migration in the WI

Google Scholar

CiteSeerX

Internet Archive

http://scholar.google.com/scholar?q=A+Comparison+of+Queueing,+Cluster+and+Distributed+Computing+Systems

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.1154

http://web.archive.org/web/*/http:/techreports.larc.nasa.gov/ltrs/PDF/tm109025.pdf

1. same URI maps to same or very similar content at a later time
2. same URI maps to different content at a later time
3. different URI maps to same or very similar content at the same or at a later time
4. the content cannot be found at any URI

URI – Content Mapping Problem

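[Figure: four timelines from time A to time B, one per case.
 Case 1: U1 → C1 at A, U1 → C1 at B.
 Case 2: U1 → C1 at A, U1 → C2 at B.
 Case 3: U1 → C1 at A; at B, U1 → 404 and U2 → C1.
 Case 4: U1 → C1 at A, U1 → ??? at B.]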

Content Similarity

JCDL 2005: http://www.jcdl2005.org/

July 2005: http://www.jcdl2005.org/

Today

Content Similarity

Hypertext 2006: http://www.ht06.org/

August 2006: http://www.ht06.org/

Today

Content Similarity

PSP 2003: http://www.pspcentral.org/events/annual_meeting_2003.html

August 2003: http://www.pspcentral.org/events/archive/annual_meeting_2003.html

Today

Content Similarity

ECDL 1999: http://www-rocq.inria.fr/EuroDL99/

October 1999: http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html

Today

Content Similarity

Greynet 1999: http://www.konbib.nl/infolev/greynet/2.5.htm

1999 / Today: ? ?

[Figure: the LS “Removal Hit Rate Proxy Cache” is issued as a query to Google and Yahoo]

• First introduced by Phelps and Wilensky [Phelps00]

• Small set of terms capturing “aboutness” of a document, “lightweight” metadata

Lexical Signatures (LSs)

Resource / Abstract

http://www.google.com/search?q=removal+hit+rate+proxy+cache

http://search.yahoo.com/search?p=removal+hit+rate+proxy+cache

• Following TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones88]
• Term frequency (TF):
  – “How often does this word appear in this document?”
• Inverse document frequency (IDF):
  – “In how many documents does this word appear?”


Generation of Lexical Signatures

• “Robust Hyperlink”
• 5 terms are suitable
• Append LS to URL

http://www.cs.berkeley.edu/~wilensky/NLP.html?lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago

• Limitations:
  1. Applications (browsers) need to be modified to exploit LSs
  2. LSs need to be computed a priori
  3. Works well with most URLs but not with all of them
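A small sketch of appending an LS to a URL in the style of the example above. The function name is hypothetical; only the `lexical-signature` parameter name and the `+`-separated terms are taken from the slide.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def make_robust_hyperlink(url, signature_terms):
    """Append the LS terms to `url` as a `lexical-signature` query parameter."""
    parts = urlsplit(url)
    # Keep any query parameters the URL already carries.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("lexical-signature", " ".join(signature_terms)))
    # urlencode() turns the spaces between terms into '+'.
    return urlunsplit(parts._replace(query=urlencode(query)))


robust = make_robust_hyperlink(
    "http://www.cs.berkeley.edu/~wilensky/NLP.html",
    ["texttiling", "wilensky", "disambiguation", "subtopic", "iago"],
)
```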

LS as Proposed by Phelps and Wilensky

• Park et al. [Park03] investigated performance of various LS generation algorithms

• Evaluated “tunability” of TF and IDF component

• Weight on TF increases recall (completeness)
• Weight on IDF improves precision (exactness)

Generation of Lexical Signatures

Rank/Results | URL | LS
1/1 | http://www.cs.berkeley.edu/~wilensky/NLP.html | texttiling wilensky disambiguation subtopic iago
na/10 | http://www.dli2.nsf.gov | nsdl multiagency imls testbeds extramural
1/221,000 (1/174,000 in 01/2008) | http://www.loc.gov | library collections congress thomas american
1/51 (2/77 in 01/2008) | http://www.jcdl2008.org | libraries jcdl digital conference pst

Lexical Signatures -- Examples

http://www.google.com/search?q=texttiling+wilensky+disambiguation+subtopic+iago

http://www.google.com/search?q=nsdl+multiagency+imls+testbeds+extramural

http://www.google.com/search?q=library+collections+congress+thomas+american

http://www.google.com/search?q=libraries+jcdl+digital+conference+pst


Synchronicity

404 error occurs while browsing
  look for same or older page in WI (1)
  if user satisfied
    return page (2)
  else
    generate LS from retrieved page (3)
    query SEs with LS
    if result sufficient
      return “good enough” alternative page (4)
    else
      get more input about desired content (5)
      (link neighborhood, user input, ...)
      re-generate LS && query SEs ...
      return pages (6)

The system may not return any results at all
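The workflow can be sketched as a single function. Every callable passed in (`find_copy`, `user_satisfied`, `make_signature`, `search`, `good_enough`, `more_input`) is a hypothetical stand-in, not part of any real system described here. The sketch also assumes that a missing archived copy leads straight to step (5).

```python
def rediscover(uri, find_copy, user_satisfied, make_signature, search,
               good_enough, more_input):
    """Follow steps (1) to (6); may return an empty list."""
    copy = find_copy(uri)                          # (1) same or older page in the WI
    if copy is not None:
        if user_satisfied(copy):
            return [copy]                          # (2) the copy itself is enough
        results = search(make_signature(copy))     # (3) LS from the retrieved page
        if good_enough(results):
            return results                         # (4) "good enough" alternative
    extra = more_input(uri)                        # (5) link neighborhood, user input, ...
    return search(make_signature(extra))           # (6) re-generate LS and query again


# Hypothetical stand-ins to exercise both ends of the workflow.
def fake_signature(text):
    return text.split()[:5]


def fake_search(terms):
    return ["http://example.org/" + "-".join(terms)] if terms else []


satisfied = rediscover("http://gone.example/", lambda u: "old page", lambda c: True,
                       fake_signature, fake_search, bool, lambda u: "")
fallback = rediscover("http://gone.example/", lambda u: None, lambda c: False,
                      fake_signature, fake_search, bool, lambda u: "neighbor text")
```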


Synchro…What?

Synchronicity
• Experience of causally unrelated events occurring together in a meaningful manner
• Events reveal underlying pattern, framework bigger than any of the synchronous systems
• Carl Gustav Jung (1875-1961)
  • “meaningful coincidence”
  • Deschamps – de Fontgibu plum pudding example

picture from http://www.crystalinks.com/jung.html

404 Errors

“Soft 404” Errors

A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web (WIDM 2008)

• LSs are usually generated following the TF-IDF scheme
• TF rather trivial to compute
• IDF requires knowledge about:
  • overall size of the corpus (# of documents)
  • # of documents a term occurs in
• Also not complicated to compute for bounded corpora (such as TREC)
• If the web is the corpus, values can only be estimated

The Problem

• Use IDF values obtained from
  1. Local collection of web pages
  2. “Screen scraping” SE result pages
• Validate both methods through comparison to baseline
• Use Google N-Grams as baseline
  • Note: N-Grams provide term count (TC) and not DF values – details to come

The Idea


Accurate IDF Values for LSs

Screen scraping the Google web interface

The Dataset

Local universe consisting of copies of URLs from the IA between 1996 and 2007

Same as above, follows Zipf distribution

10,493 observations
254,384 total terms
16,791 unique terms

The Dataset

Total terms vs new terms

The Dataset

Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms

LSs Example

1. Normalized term overlap
   • Assume term commutativity
   • k-term LSs normalized by k
2. Kendall Tau
   • Modified version since LSs to compare may contain different terms
3. M-Score
   • Penalizes discordance in higher ranks
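A minimal sketch of the first measure, normalized term overlap. It treats each k-term LS as an unordered set (term commutativity) and divides the number of shared terms by k. The function name and the sample terms are illustrative only.

```python
def normalized_overlap(ls_a, ls_b):
    """Share of terms that two k-term LSs have in common, ignoring term order."""
    k = len(ls_a)
    if k == 0 or k != len(ls_b):
        raise ValueError("both LSs must contain the same, non-zero number of terms")
    return len(set(ls_a) & set(ls_b)) / k


# Hypothetical 5-term LSs, e.g. one based on local IDF values and one on scraped values.
overlap = normalized_overlap(
    ["wine", "perfect", "cellar", "red", "tasting"],
    ["tasting", "red", "bottle", "vintage", "wine"],
)
```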

Comparing LSs

Top 5, 10 and 15 terms

LC – local universe

SC – screen scraping

NG – N-Grams

Comparing LSs

• Both methods for the computation of IDF values provide accurate results
  • compared to the Google N-Gram baseline
• Screen scraping method seems preferable since
  • similarity scores slightly higher
  • feasible in real time

Conclusions

Correlation of Term Count and Document Frequency for Google N-Grams

(ECIR 2009)

• Need of a reliable source to accurately compute IDF values of web pages (in real time)
• Shown, screen scraping works but
  • missing validation of baseline (Google N-Grams)
  • N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF: what is their relationship?

The Problem


Background & Motivation

• Term frequency (TF) – inverse document frequency (IDF) is a well known term weighting concept
  • Used (among others) to generate lexical signatures (LSs)
• TF is not hard to compute, IDF is since it depends on global knowledge about the corpus
  When the entire web is the corpus IDF can only be estimated!

• Most text corpora provide term count values (TC)

D1 = “Please, Please Me”
D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”
D4 = “Long, Long, Long”

TC >= DF but is there a correlation? Can we use TC to estimate DF?

Term | All | Buy | Can’t | Is | Love | Me | Need | Please | You | Long
TC   | 1   | 1   | 1     | 1  | 2    | 2  | 1    | 2      | 1   | 3
DF   | 1   | 1   | 1     | 1  | 2    | 2  | 1    | 1      | 1   | 1
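The TC and DF rows can be reproduced with a few lines of code. This is only a sketch: the regular-expression tokenizer is an assumption, and the titles are typed with a plain apostrophe.

```python
import re
from collections import Counter

titles = [
    "Please, Please Me",
    "Can't Buy Me Love",
    "All You Need Is Love",
    "Long, Long, Long",
]


def terms(text):
    """Split a title into terms, keeping apostrophes inside words."""
    return re.findall(r"[A-Za-z']+", text)


term_count = Counter()          # TC: total occurrences across all documents
document_frequency = Counter()  # DF: number of documents containing the term
for title in titles:
    words = terms(title)
    term_count.update(words)
    document_frequency.update(set(words))
```

"Please" and "Long" show how TC can exceed DF when a term repeats within one document, while TC >= DF holds for every term.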

• Investigate relationship between:
  • TC and DF within the Web as Corpus (WaC)
  • WaC based TC and Google N-Gram based TC
• TREC, BNC could be used but:
  • they are not free
  • TREC has been shown to be somewhat dated [Chiang05]

The Idea

• Analyze correlation of list of terms ordered by their TC and DF rank by computing:
  • Spearman’s Rho
  • Kendall Tau
• Display frequency of TC/DF ratio for all terms
• Compare TC (WaC) and TC (N-Grams) frequencies
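A sketch of both rank correlation measures for two rankings of the same terms. It uses the textbook forms for rankings without ties (Spearman's rho via squared rank differences, Kendall's tau as concordant minus discordant pairs). The sample ranks are hypothetical and the handling of ties in the actual experiment is not covered.

```python
from itertools import combinations


def spearman_rho(rank_x, rank_y):
    """Spearman's rho for two tie-free rankings of the same items."""
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(rank_x, rank_y):
    """Kendall's tau: (concordant - discordant pairs) / total pairs."""
    pairs = list(combinations(range(len(rank_x)), 2))
    score = 0
    for i, j in pairs:
        product = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
        score += (product > 0) - (product < 0)
    return score / len(pairs)


# Hypothetical ranks of five terms, ordered once by TC and once by DF.
tc_rank = [1, 2, 3, 4, 5]
df_rank = [1, 3, 2, 4, 5]
rho = spearman_rho(tc_rank, df_rank)
tau = kendall_tau(tc_rank, df_rank)
```

With one swapped pair out of ten, tau drops to 0.8 and rho to 0.9; identical rankings give 1.0 for both.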

The Experiment

Experiment Results

Investigate correlation between TC and DF within “Web as Corpus” (WaC)

Rank similarity of all terms

Experiment Results

Investigate correlation between TC and DF within “Web as Corpus” (WaC)

Spearman’s ρ and Kendall τ

Experiment Results

Rank | WaC-DF | WaC-TC | Google | N-Grams
1 | IR | IR | IR | IR
2 | RETRIEVAL | RETRIEVAL | RETRIEVAL | IRSG
3 | IRSG | IRSG | IRSG | RETRIEVAL
4 | BCS | IRIT | CONFERENCE | BCS
5 | IRIT | BCS | BCS | EUROPEAN
6 | CONFERENCE | 2009 | GRANT | CONFERENCE
7 | GOOGLE | FILTERING | IRIT | IRIT
8 | 2009 | GOOGLE | FILTERING | GOOGLE
9 | FILTERING | CONFERENCE | EUROPEAN | ACM
10 | GRANT | ARIA | PAPERS | GRANT

Google: screen scraping DF (?) values from the Google web interface

Top 10 terms in decreasing order of their TF/IDF values taken from http://ecir09.irit.fr

∪ = 14, ∩ = 6

Strong indicator that TC can be used to estimate DF for web pages!

[Figure: Frequency of TC/DF Ratio Within the WaC; panels: Integer Values, Two Decimals, One Decimal]

Experiment Results

Experiment Results

Show similarity between WaC based TC and Google N-Gram based TC

TC frequencies

N-Grams have a threshold of 200

• TC and DF Ranks within the WaC show strong correlation

• TC frequencies of WaC and Google N-Grams are very similar

• Together with results shown earlier (high correlation between baseline and two other methods) N-Grams seem suitable for accurate IDF estimation for web pages

Does not mean everything correlated to TC can be used as DF substitute!

Conclusions

Inter-Search Engine Lexical Signature Performance (JCDL 2009)

Martin Klein, Michael L. Nelson
{mklein,mln}@cs.odu.edu

http://en.wikipedia.org/wiki/Elephant

Elephant, Tusks, Trunk, African, Loxodonta
Elephant, Asian, African, Species, Trunk
Elephant, African, Tusks, Asian, Trunk

Revisiting Lexical Signatures to(Re-)Discover Web Pages

(ECDL 2008)


How to Evaluate the Evolution of LSs over Time

Idea:
• Conduct overlap analysis of LSs
• LSs based on local universe mentioned above
• Neither Phelps and Wilensky nor Park et al. did that
• Park et al. just re-confirmed their findings after 6 months

Dataset

Local universe consisting of copies of URLs from the IA between 1996 and 2007

10-term LSs generated for http://www.perfect10wines.com

LSs Over Time - Example


LS Overlap Analysis

Rooted: overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years that URL has been observed

Sliding: overlap between two LSs of consecutive years starting with the first year and ending with the last
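The two analyses differ only in which pairs of years are compared, which a short sketch makes explicit. The `overlap` helper (shared terms divided by LS length) and the per-year LSs are assumptions for illustration.

```python
def overlap(ls_a, ls_b):
    """Share of terms two LSs have in common, normalized by the longer LS."""
    return len(set(ls_a) & set(ls_b)) / max(len(ls_a), len(ls_b))


def rooted_overlap(ls_by_year):
    """Compare the LS of the first observed year with every later year."""
    years = sorted(ls_by_year)
    first = ls_by_year[years[0]]
    return {year: overlap(first, ls_by_year[year]) for year in years[1:]}


def sliding_overlap(ls_by_year):
    """Compare each observed year with the next observed year."""
    years = sorted(ls_by_year)
    return {(a, b): overlap(ls_by_year[a], ls_by_year[b])
            for a, b in zip(years, years[1:])}


# Hypothetical 4-term LSs of one URL over three observed years.
ls_by_year = {
    1996: ["a", "b", "c", "d"],
    1997: ["a", "b", "c", "x"],
    1998: ["a", "b", "x", "y"],
}
rooted = rooted_overlap(ls_by_year)
sliding = sliding_overlap(ls_by_year)
```

In this toy data the rooted overlap keeps dropping (0.75, then 0.5) while the sliding overlap stays at 0.75, since each year still resembles its predecessor.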

Evolution of LSs over Time

Results:
• Little overlap between the early years and more recent ones
• Highest overlap in the first 1-2 years after creation of the LS
• Rarely peaks after that – once terms are gone they do not return

Rooted

Evolution of LSs over Time

Results:
• Overlap increases over time
• Seem to reach steady state around 2003

Sliding

Performance of LSs

Idea:
• Query Google search API with LSs
• LSs based on local universe mentioned above
• Identify URL in result set
• For each URL it is possible that:
  1. URL is returned as the top ranked result
  2. URL is ranked somewhere between 2 and 10
  3. URL is ranked somewhere between 11 and 100
  4. URL is ranked somewhere beyond rank 100 (considered as not returned)

Performance of LSs wrt Number of Terms

Results:
• 2-, 3- and 4-term LSs perform poorly
• 5-, 6- and 7-term LSs seem best
  • Top mean rank (MR) value with 5 terms
  • Most top ranked with 7 terms
  • Binary pattern: either in top 10 or undiscovered
• 8 terms and beyond do not show improvement


Performance - Number of Terms

• Lightest gray = rank 1

• Black = rank 101 and beyond

• Ranks 11-20, 21-30,… colored proportionally

• 50% top ranked, 20% in top 10, 30% black

Rank distribution of 5 term LSs



Performance of LSs

Scoring (generalized from Park et al.), equation in Section 6.1

• Fair:
  • Gives credit to all URLs equally with linear spacing between ranks
• Optimistic:
  • Bigger penalty for lower ranks
• Scores for the position of a URL in a list of 10:
  • Fair: 10/10, 9/10, 8/10 … 1/10, 0
  • Optimistic: 1/1, 1/2, 1/3 … 1/10, 0
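A sketch of two scoring functions that reproduce the example values listed for a list of 10. The equation from Section 6.1 is not shown on the slide, so the closed forms below are assumptions derived only from those examples.

```python
def fair_score(rank, list_size=10):
    """Linear credit: rank 1 scores 1.0 and each further rank loses 1/list_size."""
    if rank is None or rank > list_size:
        return 0.0
    return (list_size - rank + 1) / list_size


def optimistic_score(rank, list_size=10):
    """Reciprocal rank: the score halves from rank 1 to rank 2, then flattens."""
    if rank is None or rank > list_size:
        return 0.0
    return 1.0 / rank


# Rank None stands for a URL that was not returned at all.
fair = [fair_score(r) for r in (1, 2, 3, 10, 11, None)]
optimistic = [optimistic_score(r) for r in (1, 2, 3, 10, 11, None)]
```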

Fair and optimistic score for LSs consisting of 2-15 terms (mean values over all years)

Performance of LSs over Time

Score for LSs consisting of 2, 5, 7 and 10 terms

Fair Optimistic

• LSs decay over time
  • Rooted: quickly after generation
  • Sliding: seem to stabilize
• 5-, 6- and 7-term LSs seem to perform best
  • 7 – most top ranked
  • 5 – fewest undiscovered
  • 5 – lowest mean rank
• 8 terms and beyond hurt performance

Conclusions

Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure (JCDL 2010)

The Problem

www.aircharter-international.com

Internet Archive - Wayback Machine: http://web.archive.org/web/*/http://www.aircharter-international.com (59 copies)

Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry

Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International

The Problem

If no archived/cached copy can be found...

• Tags
• Link Neighborhood (LNLS)

[Figure: the missing page “?” with neighboring pages A, B and C]

Contributions

• Compare performance of four automated methods to rediscover web pages
  1. Lexical signatures (LSs)
  2. Titles
  3. Tags
  4. LNLS

• Analysis of title characteristics wrt their retrieval performance

• Evaluate performance of combination of methods and suggest workflow for real time web page rediscovery


Experiment - Data Gathering

• 500 URIs randomly sampled from DMOZ

• Applied filters
  – .com, .org, .net, .edu domains
  – English Language
  – min. of 50 terms [Park]

• Results in 309 URIs to download and parse

• Extract title
  – <Title>...</Title>
• Generate 3 LSs per page
  – IDF values obtained from Google, Yahoo!, MSN Live
• Obtain tags from delicious.com API (only 15%)
• Obtain link neighborhood from Yahoo! API (max. 50 URIs)
  – Generate LNLS
  – TF from “bucket” of words per neighborhood
  – IDF obtained from Yahoo! API

Data Gathering


LS Retrieval Performance

5- and 7-Term LSs

•Yahoo! returns most URIs top ranked and leaves least undiscovered

•Binary retrieval pattern, URI either within top 10 or undiscovered


Title Retrieval Performance

Non-Quoted and Quoted Titles

•Results at least as good as for LSs

•Google and Yahoo! return more URIs for non-quoted titles

•Same binary retrieval pattern


Tags Retrieval Performance

•API returns up to top10 tags - distinguish between # of tags queried

•Low # of URIs


LNLS Retrieval Performance

•5- and 7-term LNLSs

•< 5% top ranked


Combination of Methods

Can we achieve better retrieval performance if we combine 2 or more methods?

[Figure: flowchart with the steps Query LS, Query Title, Query Tags and Query LNLS; exits labeled Done]


Google
     | Top  | Top10 | Undis
LS5  | 50.8 | 12.6  | 32.4
LS7  | 57.3 | 9.1   | 31.1
TI   | 69.3 | 8.1   | 19.7
TA   | 2.1  | 10.6  | 75.5

Yahoo!
     | Top  | Top10 | Undis
LS5  | 67.6 | 7.8   | 22.3
LS7  | 66.7 | 4.5   | 26.9
TI   | 63.8 | 8.1   | 27.5
TA   | 6.4  | 17.0  | 63.8

MSN Live
     | Top  | Top10 | Undis
LS5  | 63.1 | 8.1   | 27.2
LS7  | 62.8 | 5.8   | 29.8
TI   | 61.5 | 6.8   | 30.7
TA   | 0    | 8.5   | 80.9


Sequence   | Google | Yahoo! | MSN Live
LS5-TI     | 65.0   | 73.8   | 71.5
LS7-TI     | 70.9   | 75.7   | 73.8
TI-LS5     | 73.5   | 75.7   | 73.1
TI-LS7     | 74.1   | 75.1   | 74.1
LS5-TI-LS7 | 65.4   | 73.8   | 72.5
LS7-TI-LS5 | 71.2   | 76.4   | 74.4
TI-LS5-LS7 | 73.8   | 75.7   | 74.1
TI-LS7-LS5 | 74.4   | 75.7   | 74.8
LS5-LS7    | 52.8   | 68.0   | 64.4
LS7-LS5    | 59.9   | 71.5   | 66.7

Top Results for Combination of Methods
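One reading that would explain why the order of methods matters: a later method is only queried when the earlier one leaves the URI undiscovered. The sketch below assumes exactly that. The data layout, function names and sample ranks are hypothetical, and the slides do not confirm this is how the numbers were produced.

```python
def combined_rank(ranks, order):
    """Rank from the first method in `order` that discovers the URI, else None."""
    for method in order:
        rank = ranks.get(method)
        if rank is not None:
            return rank
    return None


def top_ranked_share(results, order):
    """Percentage of URIs whose combined rank is 1."""
    top = sum(1 for ranks in results.values() if combined_rank(ranks, order) == 1)
    return 100.0 * top / len(results)


# Hypothetical ranks per URI and method; None means undiscovered.
results = {
    "u1": {"TI": 1, "LS5": None},
    "u2": {"TI": 4, "LS5": 1},
    "u3": {"TI": None, "LS5": None},
    "u4": {"TI": 1, "LS5": 1},
}
title_first = top_ranked_share(results, ["TI", "LS5"])
ls_first = top_ranked_share(results, ["LS5", "TI"])
```

For "u2" the title query finds the page at rank 4 and the chain stops there, so the two orders produce different top-ranked shares.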



•Length varies between 1 and 43 terms

•Length between 3 and 6 terms occurs most frequently and performs well [Ntoulas]

Title Characteristics

Length in # of Terms



•Length varies between 4 and 294 characters

•Short titles (<10) do not perform well

•Length between 10 and 70 most common

•Length between 10 and 45 seem to perform best


Length in # of Characters


•Titles whose terms have a mean length of 5, 6 or 7 characters seem to perform best

•More than 1 or 2 stop words hurt performance


Mean # of Characters, # of Stop Words



Concluding Remarks

Lexical signatures, as much as titles, are very suitable as search engine queries to rediscover missing web pages. They return 50-70% URIs top ranked.

Tags and link neighborhood LSs do not seem to significantly contribute to the retrieval of the web pages.

Titles are much cheaper to obtain than LSs. The combination of primarily querying titles and 5-term LSs as a second option returns more than 75% URIs top ranked.

Not all titles are equally good. Titles containing between 3 and 6 terms seem to perform best. More than a couple of stop words hurt the performance.

Conclusions

Is This a Good Title? (Hypertext 2010)

The Problem

Professional Scholarly Publishing 2003
http://www.pspcentral.org/events/annual_meeting_2003.html

The Problem

www.aircharter-international.com

Internet Archive - Wayback Machine: http://web.archive.org/web/*/http://www.aircharter-international.com (59 copies)

The Problem

http://www.drbartell.com/

Lexical Signature (TF/IDF): Plastic Surgeon Reconstructive Dr Bartell Symbol University

???

Title: Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery

The Problem

www.reagan.navy.mil

Lexical Signature (TF/IDF): Ronald USS MCSN Torrey Naval Sea Commanding

Title: Home Page ???

Is This a Good Title?

Contributions

• Discuss discovery performance of web page titles (compared to LSs)

• Analysis of discovered pages regarding their relevancy

• Display title evolution compared to content evolution over time

• Provide prediction model for title’s retrieval potential



• 20k URIs randomly sampled from DMOZ
• Applied filters
  – English language
  – min. of 50 terms
• Results in 6,875 URIs
• Downloaded and parsed the pages
• Extract title and generate LS per page (baseline)

         | .com  | .org | .net | .edu | sum
Original | 15289 | 2755 | 1459 | 497  | 20000
Filtered | 4863  | 1327 | 369  | 316  | 6875

Data Gathering


Title (and LS) Retrieval Performance

Titles 5- and 7-Term LSs

•Titles return more than 60% URIs top ranked

•Binary retrieval pattern, URI either within top 10 or undiscovered


???

Relevancy of Retrieval Results

•Distinguish between discovered (top 10) and undiscovered URIs

•Analyze content of top 10 results

•Measure relevancy in terms of normalized term overlap and shingles between original URI and search result by rank
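A sketch of a shingle-based comparison between the original page and a search result. The slides do not state the shingle size or the similarity measure, so word-level shingles of width 3 and the Jaccard coefficient are assumptions here.

```python
def shingles(text, w=3):
    """Set of all runs of w consecutive words in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def shingle_similarity(text_a, text_b, w=3):
    """Jaccard coefficient of the two shingle sets (0.0 if both are empty)."""
    set_a, set_b = shingles(text_a, w), shingles(text_b, w)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


similarity = shingle_similarity(
    "the quick brown fox jumps",
    "the quick brown fox sleeps",
)
```

Unlike term overlap, shingles are sensitive to word order, so near-duplicates score high while pages that merely share vocabulary do not.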

Do titles return relevant results besides the original URI?


[Figure: Term Overlap; panels: Discovered, Undiscovered]

High relevancy in the top ranks with possible aliases and duplicates.


[Figure: Shingles; panels: Discovered, Undiscovered]

More optimal shingles values than top ranked URIs - possible aliases and duplicates.


Title Evolution - Example I

www.sun.com/solutions

1998-01-27: Sun Software Products Selector Guides - Solutions Tree
1999-02-20: Sun Software Solutions
2002-02-01: Sun Microsystems Products
2002-06-01: Sun Microsystems - Business & Industry Solutions
2003-08-01: Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions
2004-02-02: Sun Microsystems – Solutions
2004-06-10: Gateway Page - Sun Solutions
2006-01-09: Sun Microsystems Solutions & Services
2007-01-03: Services & Solutions
2007-02-07: Sun Services & Solutions
2008-01-19: Sun Solutions

Title Evolution - Example II

www.datacity.com/mainf.html

2000-06-19: DataCity of Manassas Park Main Page
2000-10-12: DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives
2001-08-21: DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives
2002-10-16: computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free
2006-03-14: Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB

•Copies from fixed size time windows per year

•Extract available titles of past 14 years

•Compute normalized Levenshtein edit distance between titles of copies and baseline (0 = identical; 1 = completely dissimilar)
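A sketch of a normalized Levenshtein distance on the 0 to 1 scale described above. Dividing by the length of the longer title is an assumption; the slides do not say which normalization was used.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer length: 0 = identical, 1 = fully dissimilar."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Two of the title versions listed for www.sun.com/solutions.
drift = normalized_distance("Sun Services & Solutions", "Services & Solutions")
```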

How much do titles change over time?

Title Evolution Over Time

Title edit distance frequencies

•Half the titles of available copies from recent years are (close to) identical

•Decay from 2005 on (with fewer copies available)

•4 year old title: 40% chance to be unchanged

Title vs Document
•Y: avg shingle value for all copies per URI
•X: avg edit distance of corresponding titles
•overlap indicated by: green: <10, red: >90
•Semi-transparent: total amount of points plotted

[0,1] - over 1600 times

[0,0] - 122 times



Title Performance Prediction

•Quality prediction of title by

•Number of nouns, articles etc.

•Amount of title terms, characters ([Ntoulas])

•Observation of re-occurring terms in poorly performing titles - “Stop Titles”

home, index, home page, welcome, untitled document

The performance of any given title can be predicted as insufficient if 75% or more of it consists of a “Stop Title”!
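A sketch of the 75% rule. The slide does not say whether the share is measured in terms or in characters; this version counts terms and uses only the five stop titles listed above, both of which are assumptions.

```python
STOP_TITLES = ["home", "index", "home page", "welcome", "untitled document"]


def stop_title_share(title):
    """Largest share of the title's terms covered by a single stop title."""
    tokens = title.lower().split()
    if not tokens:
        return 1.0  # assumption: an empty title counts as entirely uninformative
    best = 0
    for stop_title in STOP_TITLES:
        stop_tokens = stop_title.split()
        for start in range(len(tokens) - len(stop_tokens) + 1):
            if tokens[start:start + len(stop_tokens)] == stop_tokens:
                best = max(best, len(stop_tokens))
    return best / len(tokens)


def predicted_insufficient(title, threshold=0.75):
    """True if the title consists of a stop title to `threshold` or more."""
    return stop_title_share(title) >= threshold


weak = predicted_insufficient("Home Page")
strong = predicted_insufficient(
    "Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery")
```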

[Ntoulas] A. Ntoulas et al. “Detecting Spam Web Pages Through Content Analysis” In Proceedings of WWW 2004, pp 83-92

Concluding Remarks

The “aboutness” of web pages can be determined from either the content or from the title.

More than 60% of URIs are returned top ranked when using the title as a search engine query.

Titles change more slowly and less significantly over time than the web pages’ content.

Not all titles are equally good. If the majority of title terms are Stop Titles, the title’s quality can be predicted to be poor.

Conclusions

Comparing the Performance of US College Football Teams in the Web and on the Field (Hypertext 2009)

Naming Conventions

Football

Soccer


Motivation

• “Does Authority mean Quality?” [Amento00]

• Link-based web page metrics can be used to estimate experts’ assessment of quality

• Lists compiled by experts are cool!

– Companies, schools, people, places, etc

• “Big 3” search engines play a central role in our lives

– “If I can’t find it in the top 10 it doesn’t exist in the web”

– SEOs

• Do expert rankings of real-world entities correlate with search engine ranking of corresponding web resources?


Background

•Expert ranking of real-world entities:

•Collegiate football programs in the US

•Associated Press (AP) poll

•65 sportswriters and broadcasters

•USA Today Coaches poll

•63 college football head coaches

•Published once a week, top 25 teams, 25-1 point system

• “Big 3” search engines

•Google, Yahoo and MSN Live (APIs)


US College Football Season 2008

•2008 season began on August 28th 2008

•Concluded January 8th 2009

•18 instances of poll data:

•Final polls from 2007 season (as a baseline)

•2008 pre-season polls

•once for each of the 16 weeks of the 2008 season


Mapping Resources to URLs

•Often impossible to distill the canonical URL for a football program

•e.g. Virginia Tech college football returned

•Official school page

•Commercial sports sites

•Wikipedia

•Blogs, Fan sites, etc


Mapping Resources to URLs

•Query 3 search engine APIs for representative URLs

•Query: schoolname+College+Football

•e.g.: Ohio+State+College+Football

•Aggregate the top 8 representative URLs (n = 1 .. 8)

•Temporal aspect in mind:

•Repeat query and renew aggregation weekly


Ordinal Ranking of URLs from SE Queries

We are not interested in computing search engine’s absolute ranking for a particular URL (PR values)

BUT

We are determining that a search engine ranks URLs in order


Ordinal Ranking of URLs from SE Queries

•Search engines enforce query restrictions (length, amount per day etc)

•Build unbiased and overlapping queries

•site and OR operators

•Variation of strand sort

USC, Georgia, Ohio State, Oklahoma, Florida

site:http://usctrojans.cstv.com/sports/m-footbl/usc-m-footbl-body.html OR
site:http://uga.rivals.com/ OR
site:http://sportsillustrated.cnn.com/football/ncaa/teams/ohiost/ OR
site:http://www.soonersports.com/ OR
site:http://www.gatorzone.com/
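A sketch of how such a query could be assembled, and how a long URL list could be cut into overlapping batches to respect query length limits. The batch size, the overlap and both function names are assumptions; the slides do not describe how the batches feed into the strand sort variation.

```python
def build_site_query(urls):
    """Join URLs into one query using the site and OR operators."""
    return " OR ".join("site:" + url for url in urls)


def overlapping_batches(items, size, overlap):
    """Cut `items` into batches of `size` that share `overlap` items with the previous batch."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    batches = []
    start = 0
    while True:
        batches.append(items[start:start + size])
        if start + size >= len(items):
            break
        start += step
    return batches


urls = [
    "http://usctrojans.cstv.com/sports/m-footbl/usc-m-footbl-body.html",
    "http://uga.rivals.com/",
    "http://sportsillustrated.cnn.com/football/ncaa/teams/ohiost/",
    "http://www.soonersports.com/",
    "http://www.gatorzone.com/",
]
query = build_site_query(urls)
batches = overlapping_batches(list(range(7)), size=3, overlap=1)
```

Because consecutive batches share a URL, the relative order returned for each batch can be stitched into one overall ordering.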


Weighting Ranked URLs

• If real-world resources are mapped to more than one URL (n > 1)

•Need to accumulate ranking score

•Determine one final overall school score

•Assign weights per URL depending on their rank

P - Position of URL in result set

T - Total number of URLs in the list (n * number of teams)


Correlation Results

Kendall Tau used to test for statistically significant (p<0.05) correlation

Top 10 AP Poll Top 10 USA Poll


Correlation Results


“Inertia”

Kendall Tau used to test for statistically significant (p<0.05) correlation

n-Values for Correlation



n-Values for Correlation

Top 25 AP Poll, Top 25 USA Poll, n=2..6

Correlation of Overlapping URLs Over Time

USC, Georgia, Ohio State, Oklahoma, Florida, Missouri, Texas, Texas Tech, Alabama, BYU, Penn State, Utah

• 12 schools occur in all AP polls throughout the season

•Given the “inertia”, by how much does the web trail?

•Can we measure a “delayed correlation”?

•Declare AP ranking for each week as separate “truth values”

•Compute correlation between truth values and search engine ranking

• Expect to see an increased correlation in the weeks following the truth value

Correlation of Overlapping URLs Over Time

n=8

Correlation between Attendance and SE and Polls

[Figure panels: AP, USA Today, Google (n=6), Google (n=1)]


Concluding Remarks

• Inspired by “Does Authority mean Quality?” we asked “Does Quality mean Authority?”

• High correlations for the last season’s final rankings and rankings early in the season

• Correlation decreases because of “inertia”

• No correlation between attendance and search engine rankings

Conclusions

Although authority means quality, quality does not necessarily mean authority - at least not immediately.
