Synchronicity: Time on the Web - Week 3, CS 895 Fall 2010. Martin Klein, [email protected], 09/15/2010
The Problem
http://www.jcdl2007.org
http://www.jcdl2007.org/JCDL2007_Program.pdf
The Problem
• Web users experience 404 errors
• expected lifetime of a web page is 44 days [Kahle97]
• 2% of the web disappears every week [Fetterly03]
• Are they really gone? Or just relocated?
• has anybody crawled and indexed it?
• do Google, Yahoo!, Bing or the IA have a copy of that page?
• Information retrieval techniques are needed to (re-)discover content
Web Infrastructure (WI) [McCown07]
• Web search engines (Google, Yahoo!, Bing) and their caches
• Web archives (Internet Archive)
• Research projects (CiteSeer)
The Environment
Digital preservation happens in the WI
Refreshing and Migration in the WI
Google Scholar
CiteSeerX
Internet Archive
1. same URI maps to same or very similar content at a later time
2. same URI maps to different content at a later time
3. different URI maps to same or very similar content at the same or at a later time
4. the content cannot be found at any URI
URI – Content Mapping Problem
[Figure: four URI-content timelines (time A → B), one per mapping case above: (1) U1→C1 at A, U1→C1 at B; (2) U1→C1 at A, U1→C2 at B; (3) U1→C1 at A, U2→C1 at B; (4) U1→C1 at A, then U1→404 and C1 at ???]
Content Similarity
JCDL 2005: http://www.jcdl2005.org/
July 2005 vs. today
Content Similarity
Hypertext 2006: http://www.ht06.org/
August 2006 vs. today
Content Similarity
PSP 2003: http://www.pspcentral.org/events/annual_meeting_2003.html
August 2003 vs. today: http://www.pspcentral.org/events/archive/annual_meeting_2003.html
Content Similarity
ECDL 1999: http://www-rocq.inria.fr/EuroDL99/
October 1999 vs. today: http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html
Content Similarity
Greynet 1999: http://www.konbib.nl/infolev/greynet/2.5.htm
1999 vs. today: ??
• First introduced by Phelps and Wilensky [Phelps00]
• Small set of terms capturing “aboutness” of a document, “lightweight” metadata
Lexical Signatures (LSs)
• Following the TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones88]
• Term frequency (TF): “How often does this word appear in this document?”
• Inverse document frequency (IDF): “In how many documents does this word appear?”
Generation of Lexical Signatures
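A minimal sketch of this TF-IDF term selection in Python; the tokenizer, the toy corpus and the helper name lexical_signature are illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter

def lexical_signature(doc, corpus, k=5):
    """Return the k terms of `doc` with the highest TF-IDF weight."""
    tokenize = lambda text: re.findall(r"[a-z]+", text.lower())
    tf = Counter(tokenize(doc))              # "how often in this document?"
    n = len(corpus)
    df = Counter()                           # "in how many documents?"
    for other in corpus:
        df.update(set(tokenize(other)))
    score = {t: f * math.log(n / max(df[t], 1)) for t, f in tf.items()}
    return sorted(score, key=score.get, reverse=True)[:k]

docs = ["texttiling and subtopic segmentation ...",
        "digital libraries conference ...",
        "web archiving and preservation ..."]
print(lexical_signature(docs[0], docs))
```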
• “Robust Hyperlink”
• 5 terms are suitable
• Append the LS to the URL
http://www.cs.berkeley.edu/~wilensky/NLP.html?lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago
• Limitations:
1. Applications (browsers) need to be modified to exploit LSs
2. LSs need to be computed a priori
3. Works well with most URLs but not with all of them
LS as Proposed by Phelps and Wilensky
• Park et al. [Park03] investigated performance of various LS generation algorithms
• Evaluated “tunability” of TF and IDF component
• Weight on TF increases recall (completeness)
• Weight on IDF improves precision (exactness)
Generation of Lexical Signatures
Rank/Results: 1/1
URL: http://www.cs.berkeley.edu/~wilensky/NLP.html
LS: texttiling wilensky disambiguation subtopic iago
Query: http://www.google.com/search?q=texttiling+wilensky+disambiguation+subtopic+iago

Rank/Results: na/10
URL: http://www.dli2.nsf.gov
LS: nsdl multiagency imls testbeds extramural
Query: http://www.google.com/search?q=nsdl+multiagency+imls+testbeds+extramural

Rank/Results: 1/221,000 (1/174,000 in 01/2008)
URL: http://www.loc.gov
LS: library collections congress thomas american
Query: http://www.google.com/search?q=library+collections+congress+thomas+american

Rank/Results: 1/51 (2/77 in 01/2008)
URL: http://www.jcdl2008.org
LS: libraries jcdl digital conference pst
Query: http://www.google.com/search?q=libraries+jcdl+digital+conference+pst
Lexical Signatures -- Examples
Synchronicity
404 error occurs while browsing:
  look for same or older page in WI (1)
  if user satisfied: return page (2)
  else: generate LS from retrieved page (3)
    query SEs with LS
    if result sufficient: return “good enough” alternative page (4)
    else: get more input about desired content (5) (link neighborhood, user input, ...)
      re-generate LS && query SEs ...
      return pages (6)
The system may not return any results at all
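The workflow above, restated as a hedged Python sketch; wi_lookup, generate_ls, query_search_engines and expand_context are hypothetical helpers standing in for the components the slide names.

```python
def synchronicity(uri, user):
    page = wi_lookup(uri)                    # (1) look for same/older page in the WI
    if page and user.satisfied(page):
        return page                          # (2) return the archived/cached copy
    ls = generate_ls(page)                   # (3) generate LS from the retrieved page
    results = query_search_engines(ls)
    if results.sufficient():
        return results.best()                # (4) "good enough" alternative page
    extra = expand_context(uri, user)        # (5) link neighborhood, user input, ...
    ls = generate_ls(page, extra=extra)      #     re-generate LS
    return query_search_engines(ls).pages()  # (6) may well be empty: no results at all
```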
Synchro…What?
Synchronicity
• Experience of causally unrelated events occurring together in a meaningful manner
• Events reveal an underlying pattern, a framework bigger than any of the synchronous systems
• Carl Gustav Jung (1875-1961)
• “meaningful coincidence”
• Deschamps – de Fontgibu plum pudding example
picture from http://www.crystalinks.com/jung.html
404 Errors
“Soft 404” Errors
A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web (WIDM 2008)
• LSs are usually generated following the TF-IDF scheme
• TF is rather trivial to compute
• IDF requires knowledge about:
  • the overall size of the corpus (# of documents)
  • the # of documents a term occurs in
• Also not complicated to compute for bounded corpora (such as TREC)
• If the web is the corpus, values can only be estimated
The Problem
• Use IDF values obtained from:
  1. a local collection of web pages
  2. “screen scraping” SE result pages
• Validate both methods through comparison to a baseline
• Use Google N-Grams as the baseline
• Note: N-Grams provide term count (TC) and not DF values – details to come
The Idea
Accurate IDF Values for LSs
Screen scraping the Google web interface
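A sketch of how such an estimate could look, assuming a hypothetical hit_count(term) helper that parses the engine's reported result count from its result page; the index-size constant is illustrative, not a measured value.

```python
import math

N = 25_000_000_000                 # assumed size of the engine's index (illustrative)

def estimated_idf(term, hit_count):
    df = max(hit_count(term), 1)   # reported result count stands in for DF
    return math.log(N / df)
```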
The Dataset
Local universe consisting of copies of URLs from the IA between 1996 and 2007
Same as above; follows a Zipf distribution
10,493 observations, 254,384 total terms, 16,791 unique terms
Total terms vs new terms
Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms
LSs Example
1. Normalized term overlap
  • assumes term commutativity
  • k-term LSs normalized by k
2. Kendall Tau
  • modified version, since the LSs to compare may contain different terms
3. M-Score
  • penalizes discordance in higher ranks
Comparing LSs
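A sketch of measure 1 (normalized term overlap), assuming simple Python sets; the example LSs are made up for illustration.

```python
def normalized_overlap(ls_a, ls_b):
    """Order-insensitive overlap of two k-term LSs, normalized by k."""
    k = max(len(ls_a), len(ls_b))
    return len(set(ls_a) & set(ls_b)) / k

print(normalized_overlap(
    ["library", "congress", "thomas", "american", "collections"],
    ["library", "congress", "senate", "american", "law"]))   # 0.6
```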
Top 5, 10 and 15 terms
LC – local universe
SC – screen scraping
NG – N-Grams
• Both methods for the computation of IDF values provide accurate results compared to the Google N-Gram baseline
• The screen scraping method seems preferable since:
  • similarity scores are slightly higher
  • it is feasible in real time
Conclusions
Correlation of Term Count and Document Frequency for Google N-Grams (ECIR 2009)
• Need a reliable source to accurately compute IDF values of web pages (in real time)
• Shown that screen scraping works, but the baseline (Google N-Grams) has not been validated
• N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF – what is their relationship?
The Problem
Background & Motivation
• Term frequency (TF) – inverse document frequency (IDF) is a well-known term weighting concept
• Used (among others) to generate lexical signatures (LSs)
• TF is not hard to compute; IDF is, since it depends on global knowledge about the corpus
• When the entire web is the corpus, IDF can only be estimated!
• Most text corpora provide term count (TC) values
D1 = “Please, Please Me”        D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”     D4 = “Long, Long, Long”

Term  All  Buy  Can’t  Is  Love  Me  Need  Please  You  Long
TC     1    1     1     1    2    2    1      2      1     3
DF     1    1     1     1    2    2    1      1      1     1

TC >= DF, but is there a correlation? Can we use TC to estimate DF?
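The four-song toy corpus above, computed directly; a sketch, not the paper's code.

```python
import re
from collections import Counter

docs = ["Please, Please Me", "Can't Buy Me Love",
        "All You Need Is Love", "Long, Long, Long"]
tokenize = lambda s: re.findall(r"[a-z']+", s.lower())

tc = Counter(t for d in docs for t in tokenize(d))        # every occurrence counts
df = Counter(t for d in docs for t in set(tokenize(d)))   # at most once per document

print(tc["long"], df["long"])   # 3 1 -> TC >= DF always holds
```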
• Investigate the relationship between:
  • TC and DF within the Web as Corpus (WaC)
  • WaC-based TC and Google N-Gram-based TC
• TREC and BNC could be used, but:
  • they are not free
  • TREC has been shown to be somewhat dated [Chiang05]
The Idea
• Analyze the correlation of lists of terms ordered by their TC and DF ranks by computing:
  • Spearman’s Rho
  • Kendall Tau
• Display the frequency of the TC/DF ratio for all terms
• Compare TC (WaC) and TC (N-Grams) frequencies
The Experiment
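A sketch of the two rank-correlation measures using SciPy; the rank lists are illustrative, not the experiment's data.

```python
from scipy.stats import spearmanr, kendalltau

tc_ranks = [1, 2, 3, 5, 4, 6]   # terms ranked by TC (illustrative values)
df_ranks = [1, 2, 3, 4, 5, 6]   # the same terms ranked by DF

rho, _ = spearmanr(tc_ranks, df_ranks)
tau, _ = kendalltau(tc_ranks, df_ranks)
print(rho, tau)
```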
Experiment Results
Investigate the correlation between TC and DF within the “Web as Corpus” (WaC)
Rank similarity of all terms
Spearman’s ρ and Kendall τ
Rank  WaC-DF      WaC-TC      N-Grams     Google
1     IR          IR          IR          IR
2     RETRIEVAL   RETRIEVAL   RETRIEVAL   IRSG
3     IRSG        IRSG        IRSG        RETRIEVAL
4     BCS         IRIT        CONFERENCE  BCS
5     IRIT        BCS         BCS         EUROPEAN
6     CONFERENCE  2009        GRANT       CONFERENCE
7     GOOGLE      FILTERING   IRIT        IRIT
8     2009        GOOGLE      FILTERING   GOOGLE
9     FILTERING   CONFERENCE  EUROPEAN    ACM
10    GRANT       ARIA        PAPERS      GRANT
Google: screen scraping DF (?) values from the Google web interface
Top 10 terms in decreasing order of their TF-IDF values, taken from http://ecir09.irit.fr
Union: 14 terms; intersection: 6 terms
Strong indicator that TC can be used to estimate DF for web pages!
Frequency of TC/DF Ratio Within the WaC
[Histogram legend: integer values, two decimals, one decimal]
Experiment Results
Show similarity between WaC-based TC and Google N-Gram-based TC
TC frequencies
N-Grams have a threshold of 200
• TC and DF ranks within the WaC show strong correlation
• TC frequencies of the WaC and the Google N-Grams are very similar
• Together with the results shown earlier (high correlation between the baseline and the two other methods), N-Grams seem suitable for accurate IDF estimation for web pages
• This does not mean that everything correlated with TC can be used as a DF substitute!
Conclusions
Inter-Search Engine Lexical Signature Performance (JCDL 2009)
Martin Klein, Michael L. Nelson {mklein,mln}@cs.odu.edu
http://en.wikipedia.org/wiki/Elephant
LS 1: Elephant, Tusks, Trunk, African, Loxodonta
LS 2: Elephant, Asian, African, Species, Trunk
LS 3: Elephant, African, Tusks, Asian, Trunk
Revisiting Lexical Signatures to (Re-)Discover Web Pages (ECDL 2008)
How to Evaluate the Evolution of LSs over Time
Idea:
• Conduct an overlap analysis of LSs
• LSs based on the local universe mentioned above
• Neither Phelps and Wilensky nor Park et al. did that
• Park et al. just re-confirmed their findings after 6 months
Dataset
Local universe consisting of copies of URLs from the IA between 1996 and 2007
10-term LSs generated for http://www.perfect10wines.com
LSs Over Time - Example
LS Overlap Analysis
Rooted: overlap between the LS of the year of the first observation in the IA and the LSs of all consecutive years that URL has been observed
Sliding: overlap between two LSs of consecutive years, starting with the first year and ending with the last
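A sketch of the two overlap schemes, assuming an ls_by_year mapping (year → LS terms) and the normalized_overlap measure sketched earlier; the names are hypothetical.

```python
def overlaps(ls_by_year, normalized_overlap):
    """Rooted: each year vs. the first year; Sliding: each year vs. the previous."""
    years = sorted(ls_by_year)
    first = years[0]
    rooted = {y: normalized_overlap(ls_by_year[first], ls_by_year[y])
              for y in years[1:]}
    sliding = {y: normalized_overlap(ls_by_year[p], ls_by_year[y])
               for p, y in zip(years, years[1:])}
    return rooted, sliding
```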
Evolution of LSs over Time
Results:
• Little overlap between the early years and more recent ones
• Highest overlap in the first 1-2 years after creation of the LS
• Rarely peaks after that; once terms are gone they do not return
Rooted
Evolution of LSs over Time
Results:
• Overlap increases over time
• Seems to reach a steady state around 2003
Sliding
Performance of LSs
Idea:
• Query the Google search API with LSs
• LSs based on the local universe mentioned above
• Identify the URL in the result set
For each URL it is possible that:
1. the URL is returned as the top-ranked result
2. the URL is ranked somewhere between 2 and 10
3. the URL is ranked somewhere between 11 and 100
4. the URL is ranked beyond rank 100 (considered as not returned)
Performance of LSs wrt Number of Terms
Results:
• 2-, 3- and 4-term LSs perform poorly
• 5-, 6- and 7-term LSs seem best
  • top mean rank (MR) value with 5 terms
  • most top-ranked results with 7 terms
  • binary pattern: either in the top 10 or undiscovered
• 8 terms and beyond do not show improvement
Performance - Number of Terms
• Lightest gray = rank 1
• Black = rank 101 and beyond
• Ranks 11-20, 21-30,… colored proportionally
• 50% top ranked, 20% in top 10, 30% black
Rank distribution of 5 term LSs
Performance of LSs
Scoring (generalized from Park et al.; equation in Section 6.1 of the paper)
• Fair: gives credit to all URLs equally, with linear spacing between ranks
• Optimistic: bigger penalty for lower ranks
• Scores for the position of a URL in a list of 10:
  • Fair: 10/10, 9/10, 8/10, ..., 1/10, 0
  • Optimistic: 1/1, 1/2, 1/3, ..., 1/10, 0
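The two scoring schemes as a sketch; the generalized equation itself is not reproduced in this transcript, so these functions simply mirror the example scores listed above.

```python
def fair(rank, n=10):
    """Linear spacing between ranks: 10/10, 9/10, ..., 1/10, then 0."""
    return (n - rank + 1) / n if 1 <= rank <= n else 0.0

def optimistic(rank, n=10):
    """Bigger penalty for lower ranks: 1/1, 1/2, ..., 1/10, then 0."""
    return 1 / rank if 1 <= rank <= n else 0.0

print(fair(1), fair(10), optimistic(1), optimistic(10))  # 1.0 0.1 1.0 0.1
```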
Fair and optimistic score for LSs consisting of 2-15 terms (mean values over all years)
Performance of LSs wrt Number of Terms
Performance of LSs over Time
Score for LSs consisting of 2, 5, 7 and 10 terms
Fair Optimistic
• LSs decay over time
  • Rooted: quickly after generation
  • Sliding: seems to stabilize
• 5-, 6- and 7-term LSs seem to perform best
  • 7: most top-ranked results
  • 5: fewest undiscovered
  • 5: lowest mean rank
• 8 terms and beyond hurt performance
Conclusions
Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure (JCDL 2010)
The Problem
Internet Archive - Wayback Machine
www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com
Lexical Signature (TF-IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
59 copies
The Problem
If no archived/cached copy can be found...
[Diagram: missing page “?” with its linking pages A, B, C; remaining clues: Tags, Link Neighborhood (LNLS)]
Contributions
• Compare the performance of four automated methods to rediscover web pages:
  1. Lexical signatures (LSs)
  2. Titles
  3. Tags
  4. LNLS
• Analysis of title characteristics wrt their retrieval performance
• Evaluate the performance of combinations of methods and suggest a workflow for real-time web page rediscovery
Experiment - Data Gathering
• 500 URIs randomly sampled from DMOZ
• Applied filters:
  – .com, .org, .net, .edu domains
– English Language
– min. of 50 terms [Park]
• Results in 309 URIs to download and parse
Experiment - Data Gathering
• Extract the title: <Title>...</Title>
• Generate 3 LSs per page
  – IDF values obtained from Google, Yahoo!, MSN Live
• Obtain tags from the delicious.com API (only 15%)
• Obtain the link neighborhood from the Yahoo! API (max. 50 URIs)
  – generate the LNLS
  – TF from a “bucket” of words per neighborhood
  – IDF obtained from the Yahoo! API
LS Retrieval Performance
5- and 7-Term LSs
• Yahoo! returns the most URIs top ranked and leaves the fewest undiscovered
• Binary retrieval pattern: a URI is either within the top 10 or undiscovered
Title Retrieval Performance
Non-Quoted and Quoted Titles
• Results are at least as good as for LSs
• Google and Yahoo! return more URIs for non-quoted titles
• Same binary retrieval pattern
Tags Retrieval Performance
• The API returns up to the top 10 tags; we distinguish between the # of tags queried
• Low # of URIs with tags
LNLS Retrieval Performance
• 5- and 7-term LNLSs
• < 5% top ranked
Combination of Methods
Can we achieve better retrieval performance if we combine 2 or more methods?
Suggested order (sketched below): Query Title → Done? → Query LS → Done? → Query Tags → Done? → Query LNLS
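A sketch of that ordering as a fall-through loop; search and the page attributes are hypothetical stand-ins, not the paper's implementation.

```python
def rediscover(page, search):
    """Try the cheapest query first; stop as soon as one succeeds ("Done")."""
    for query in (page.title,   # titles: cheap and effective
                  page.ls5,     # 5-term lexical signature
                  page.tags,    # delicious tags, when available
                  page.lnls):   # link-neighborhood LS as a last resort
        result = search(query)
        if result:
            return result
    return None
```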
Combination of Methods
Google:
      Top   Top 10  Undis.
LS5   50.8  12.6    32.4
LS7   57.3   9.1    31.1
TI    69.3   8.1    19.7
TA     2.1  10.6    75.5

Yahoo!:
      Top   Top 10  Undis.
LS5   67.6   7.8    22.3
LS7   66.7   4.5    26.9
TI    63.8   8.1    27.5
TA     6.4  17.0    63.8

MSN Live:
      Top   Top 10  Undis.
LS5   63.1   8.1    27.2
LS7   62.8   5.8    29.8
TI    61.5   6.8    30.7
TA     0.0   8.5    80.9
Combination of Methods
            Google  Yahoo!  MSN Live
LS5-TI       65.0    73.8    71.5
LS7-TI       70.9    75.7    73.8
TI-LS5       73.5    75.7    73.1
TI-LS7       74.1    75.1    74.1
LS5-TI-LS7   65.4    73.8    72.5
LS7-TI-LS5   71.2    76.4    74.4
TI-LS5-LS7   73.8    75.7    74.1
TI-LS7-LS5   74.4    75.7    74.8
LS5-LS7      52.8    68.0    64.4
LS7-LS5      59.9    71.5    66.7
Top Results for Combination of Methods
• Length varies between 1 and 43 terms
• Length between 3 and 6 terms occurs most frequently and performs well [Ntoulas]
Title Characteristics
Length in # of Terms
• Length varies between 4 and 294 characters
• Short titles (<10 characters) do not perform well
• Length between 10 and 70 characters is most common
• Length between 10 and 45 characters seems to perform best
Title Characteristics
Length in # of Characters
• Title terms with a mean length of 5, 6 or 7 characters seem most suitable for well-performing titles
• More than 1 or 2 stop words hurt performance
Title Characteristics
Mean # of Characters, # of Stop Words
Concluding Remarks
Lexical signatures, as much as titles, are very suitable as search engine queries to rediscover missing web pages. They return 50-70% of URIs top ranked.
Tags and link neighborhood LSs do not seem to contribute significantly to the retrieval of the web pages.
Titles are much cheaper to obtain than LSs. The combination of primarily querying titles, with 5-term LSs as a second option, returns more than 75% of URIs top ranked.
Not all titles are equally good. Titles containing between 3 and 6 terms seem to perform best. More than a couple of stop words hurt the performance.
Is This a Good Title? (Hypertext 2010)
The Problem
Professional Scholarly Publishing 2003: http://www.pspcentral.org/events/annual_meeting_2003.html
The Problem
http://www.drbartell.com/
Lexical Signature (TF-IDF): Plastic Surgeon Reconstructive Dr Bartell Symbol University
???
Title: Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery
The Problem
www.reagan.navy.mil
Lexical Signature (TF-IDF): Ronald USS MCSN Torrey Naval Sea Commanding
Title: Home Page ???
Is This a Good Title?
Contributions
• Discuss the discovery performance of web page titles (compared to LSs)
• Analyze discovered pages regarding their relevancy
• Show title evolution compared to content evolution over time
• Provide a prediction model for a title’s retrieval potential
Experiment - Data Gathering
• 20k URIs randomly sampled from DMOZ
• Applied filters: English language, min. of 50 terms
• Results in 6,875 URIs
• Downloaded and parsed the pages
• Extract the title and generate an LS per page (baseline)

           .com   .org  .net  .edu  sum
Original   15289  2755  1459  497   20000
Filtered   4863   1327  369   316   6875
Title (and LS) Retrieval Performance
Titles / 5- and 7-Term LSs
• Titles return more than 60% of URIs top ranked
• Binary retrieval pattern: a URI is either within the top 10 or undiscovered
Relevancy of Retrieval Results
• Distinguish between discovered (top 10) and undiscovered URIs
• Analyze the content of the top 10 results
• Measure relevancy in terms of normalized term overlap and shingles between the original URI and each search result, by rank
Do titles return relevant results besides the original URI?
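A sketch of a shingle-based similarity, assuming word 10-grams and Jaccard overlap; w=10 is an assumption, as the transcript does not state the shingle size used.

```python
def shingles(text, w=10):
    """Set of word w-grams of `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def shingle_similarity(a, b, w=10):
    """Jaccard overlap of the two shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```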
Relevancy of Retrieval Results
Term Overlap: Discovered vs. Undiscovered
High relevancy in the top ranks, with possible aliases and duplicates.
Relevancy of Retrieval Results
Shingles: Discovered vs. Undiscovered
More optimal shingle values than the top-ranked URIs: possible aliases and duplicates.
Title Evolution - Example I
1998-01-27: Sun Software Products Selector Guides - Solutions Tree
1999-02-20: Sun Software Solutions
2002-02-01: Sun Microsystems Products
2002-06-01: Sun Microsystems - Business & Industry Solutions
2003-08-01: Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions
2004-02-02: Sun Microsystems - Solutions
2004-06-10: Gateway Page - Sun Solutions
2006-01-09: Sun Microsystems Solutions & Services
2007-01-03: Services & Solutions
2007-02-07: Sun Services & Solutions
2008-01-19: Sun Solutions
www.sun.com/solutions
Title Evolution - Example II
2000-06-19: DataCity of Manassas Park Main Page
2000-10-12: DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives
2001-08-21: DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives
2002-10-16: computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free
2006-03-14: Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB
www.datacity.com/mainf.html
Title Evolution Over Time
How much do titles change over time?
• Copies from fixed-size time windows per year
• Extract the available titles of the past 14 years
• Compute the normalized Levenshtein edit distance between the titles of the copies and the baseline (0 = identical; 1 = completely dissimilar); a sketch follows below
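A sketch of the normalized Levenshtein edit distance named above, using single-row dynamic programming; the normalization by the longer string's length is a common convention, assumed here.

```python
def normalized_edit_distance(a, b):
    """Levenshtein distance of a and b, divided by max(len(a), len(b))."""
    m, n = len(a), len(b)
    d = list(range(n + 1))                     # previous DP row
    for i in range(1, m + 1):
        prev, d[0] = d[0], i                   # prev = diagonal cell
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1,               # deletion
                                   d[j - 1] + 1,           # insertion
                                   prev + (a[i - 1] != b[j - 1]))  # substitution
    return d[n] / max(m, n) if max(m, n) else 0.0

print(normalized_edit_distance("Sun Software Solutions", "Sun Solutions"))
```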
Title edit distance frequencies
• Half the titles of available copies from recent years are (close to) identical
• Decay from 2005 on (with fewer copies available)
• A 4-year-old title has a 40% chance of being unchanged
Title vs. Document
• Y: avg shingle value for all copies per URI
• X: avg edit distance of the corresponding titles
• overlap indicated by color: green: <10, red: >90
• semi-transparent points show the total number of points plotted
  • [0,1] plotted over 1600 times
  • [0,0] plotted 122 times
Title Performance Prediction
• Predict the quality of a title by:
  • the number of nouns, articles etc.
  • the number of title terms and characters [Ntoulas]
• Observation: the same terms re-occur in poorly performing titles – “Stop Titles”:
  home, index, home page, welcome, untitled document
• The performance of any given title can be predicted as insufficient if it consists of 75% or more of a “Stop Title”!
[Ntoulas] A. Ntoulas et al., “Detecting Spam Web Pages Through Content Analysis”, In Proceedings of WWW 2006, pp. 83-92
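A sketch of the prediction rule above, assuming the multi-word stop titles are reduced to a set of single terms (a simplification of the slide's list); the threshold follows the slide.

```python
STOP_TITLE_TERMS = {"home", "index", "page", "welcome", "untitled", "document"}

def title_looks_insufficient(title, threshold=0.75):
    """Flag a title when >= 75% of its terms come from a Stop Title."""
    terms = title.lower().split()
    if not terms:
        return True
    stop = sum(t in STOP_TITLE_TERMS for t in terms)
    return stop / len(terms) >= threshold

print(title_looks_insufficient("Home Page"))   # True: 100% Stop Title terms
```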
Concluding Remarks
The “aboutness” of web pages can be determined from either the content or the title.
More than 60% of URIs are returned top ranked when using the title as a search engine query.
Titles change more slowly and less significantly over time than the web pages’ content.
Not all titles are equally good: if the majority of a title’s terms come from a “Stop Title”, its quality can be predicted to be poor.
Comparing the Performance of US College Football Teams in the Web and on the Field (Hypertext 2009)
Naming Conventions
Football
Soccer
Motivation
• “Does Authority mean Quality?” [Amento00]
• Link-based web page metrics can be used to estimate experts’ assessment of quality
• Lists compiled by experts are cool!
  – companies, schools, people, places, etc.
• The “Big 3” search engines play a central role in our lives
  – “If I can’t find it in the top 10, it doesn’t exist on the web”
  – SEOs
• Do expert rankings of real-world entities correlate with search engine rankings of the corresponding web resources?
Background
• Expert ranking of real-world entities:
  • collegiate football programs in the US
  • Associated Press (AP) poll: 65 sportswriters and broadcasters
  • USA Today Coaches poll: 63 college football head coaches
  • published once a week, top 25 teams, 25-1 point system
• “Big 3” search engines: Google, Yahoo! and MSN Live (APIs)
US College Football Season 2008
• The 2008 season began on August 28, 2008 and concluded January 8, 2009
• 18 instances of poll data:
  • final polls from the 2007 season (as a baseline)
  • 2008 pre-season polls
  • one for each of the 16 weeks of the 2008 season
Mapping Resources to URLs
• Often impossible to distill the canonical URL for a football program
• e.g. “Virginia Tech college football” returned:
  • the official school page
  • commercial sports sites
  • Wikipedia
  • blogs, fan sites, etc.
Mapping Resources to URLs
• Query the 3 search engine APIs for representative URLs
• Query: schoolname+College+Football, e.g. Ohio+State+College+Football
• Aggregate the top 8 representative URLs (n = 1 .. 8)
• With the temporal aspect in mind: repeat the query and renew the aggregation weekly
Ordinal Ranking of URLs from SE Queries
We are not interested in computing a search engine’s absolute ranking for a particular URL (PageRank values),
BUT
in determining that a search engine ranks URLs in order.
Ordinal Ranking of URLs from SE Queries
• Search engines enforce query restrictions (length, amount per day, etc.)
• Build unbiased and overlapping queries
  • site: and OR operators
  • variation of strand sort
USC Georgia Ohio State Oklahoma Florida
site:http://usctrojans.cstv.com/sports/m-footbl/usc-m-footbl-body.html OR site:http://uga.rivals.com/ OR site:http://sportsillustrated.cnn.com/football/ncaa/teams/ohiost/ OR site:http://www.soonersports.com/ OR site:http://www.gatorzone.com/
Weighting Ranked URLs
• If real-world resources are mapped to more than one URL (n > 1):
  • need to accumulate a ranking score
  • determine one final overall school score
  • assign weights per URL depending on their rank
P - position of the URL in the result set
T - total number of URLs in the list (n * number of teams)
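The slide's weighting equation is not reproduced in this transcript, so the linear weight below is only an assumption that matches the stated intent (higher-ranked URLs weigh more); school_score is a hypothetical helper, not the paper's formula.

```python
def weight(P, T):
    """Assumed linear weight: P=1 -> 1.0, P=T -> 1/T."""
    return (T - P + 1) / T

def school_score(positions, T):
    """Accumulate one overall score from the ranks of a school's n URLs."""
    return sum(weight(P, T) for P in positions)
```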
Correlation Results
Kendall Tau used to test for statistically significant (p<0.05) correlation
Top 10 AP Poll Top 10 USA Poll
Correlation Results
Top 25 AP Poll Top 25 USA Poll
“Inertia”
n-Values for Correlation
Top 10 AP Poll Top 10 USA Poll
n-Values for Correlation
Top 25 AP Poll / Top 25 USA Poll (n=2..6)
Correlation of Overlapping URLs Over Time
USC, Georgia, Ohio State, Oklahoma, Florida, Missouri, Texas, Texas Tech, Alabama, BYU, Penn State, Utah
• 12 schools occur in all AP polls throughout the season
• Given the “inertia”, by how much does the web trail?
• Can we measure a “delayed correlation”?
• Declare the AP ranking for each week as a separate “truth value”
• Compute the correlation between truth values and the search engine ranking
• Expect to see an increased correlation in the weeks following the truth value
Correlation of Overlapping URLs Over Time
n=8
Correlation between Attendance and SE Rankings and Polls
AP / USA Today
Google, n=6
Google, n=1
Concluding Remarks
• Inspired by “Does Authority mean Quality?”, we asked “Does Quality mean Authority?”
• High correlations for the last season’s final rankings and rankings early in the season
• Correlation decreases because of “inertia”
• No correlation between attendance and search engine rankings
Although authority means quality, quality does not necessarily mean authority - at least not immediately.