An Evaluation of Caching Policies for Memento TimeMaps
Justin F. Brunelle and Michael L. Nelson
Old Dominion University
{jbrunelle, mln}@cs.odu.edu
JCDL 2013
Indianapolis, Indiana
07/2013
Discovering Archived nasa.gov Pages
Archived Pages => mementos
Mementos identified by URI-M
Live Pages => resources
Resources identified by URI-R
TimeMaps: Lists of mementos
<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate",
<http://www.nasa.gov/>;rel="original",
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
<http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/>;rel="memento";datetime="Thu, 05 Jun 1997 23:05:59 GMT",
<http://api.wayback.archive.org/memento/19970711094601/http://www.nasa.gov/>;rel="memento";datetime="Fri, 11 Jul 1997 09:46:01 GMT",
<http://api.wayback.archive.org/memento/19981202170636/http://www.nasa.gov/>;rel="memento";datetime="Wed, 02 Dec 1998 17:06:36 GMT",
<http://api.wayback.archive.org/memento/19981212031235/http://www.nasa.gov/>;rel="memento";datetime="Sat, 12 Dec 1998 03:12:35 GMT",
<http://api.wayback.archive.org/memento/19990116233500/http://nasa.gov/>;rel="memento";datetime="Sat, 16 Jan 1999 23:35:00 GMT",
<http://api.wayback.archive.org/memento/19990117063022/http://nasa.gov/>;rel="memento";datetime="Sun, 17 Jan 1999 06:30:22 GMT",
<http://api.wayback.archive.org/memento/19990125091025/http://nasa.gov/>;rel="memento";datetime="Mon, 25 Jan 1999 09:10:25 GMT",
<http://api.wayback.archive.org/memento/19990203005545/http://nasa.gov/>;rel="memento";datetime="Wed, 03 Feb 1999 00:55:45 GMT",
<http://api.wayback.archive.org/memento/20080903053412/http://www.nasa.gov/>;rel="memento";datetime="Wed, 03 Sep 2008 05:34:12 GMT",
<http://webarchive.nationalarchives.gov.uk/20080904014810/http://www.nasa.gov/>;rel="memento";datetime="Thu, 04 Sep 2008 00:00:00 GMT",
<http://api.wayback.archive.org/memento/20080904055742/http://www.nasa.gov/>;rel="memento";datetime="Thu, 04 Sep 2008 05:57:42 GMT",
<http://webarchive.nationalarchives.gov.uk/20080906134025/http://www.nasa.gov/>;rel="memento";datetime="Sat, 06 Sep 2008 00:00:00 GMT",
<http://api.wayback.archive.org/memento/20080906143204/http://www.nasa.gov/>;rel="memento";datetime="Sat, 06 Sep 2008 14:32:04 GMT",
<http://webarchive.nationalarchives.gov.uk/20080907124040/http://www.nasa.gov/>;rel="memento";datetime="Sun, 07 Sep 2008 00:00:00 GMT",
<http://api.wayback.archive.org/memento/20080907160232/http://www.nasa.gov/>;rel="memento";datetime="Sun, 07 Sep 2008 16:02:32 GMT",
<http://webarchive.nationalarchives.gov.uk/20120809003120/http://www.nasa.gov/>;rel="memento";datetime="Thu, 09 Aug 2012 00:00:00 GMT",
<http://webarchive.nationalarchives.gov.uk/20120814175606/http://www.nasa.gov/>;rel="memento";datetime="Tue, 14 Aug 2012 00:00:00 GMT",
<http://webarchive.nationalarchives.gov.uk/20120819212348/http://www.nasa.gov/>;rel="memento";datetime="Sun, 19 Aug 2012 00:00:00 GMT",
<http://webarchive.nationalarchives.gov.uk/20120826185010/http://www.nasa.gov/>;rel="memento";datetime="Sun, 26 Aug 2012 00:00:00 GMT",
<http://webarchive.nationalarchives.gov.uk/20120909230516/http://www.nasa.gov/>;rel="last memento";datetime="Sun, 09 Sep 2012 00:00:00 GMT"
Aggregating TimeMaps
• Multiple archives
• Expensive
• Caching reduces load on archives
• Write-through
[Diagram labels: Cache, Aggregator, Sort, IA TM, AIT TM, HTTP Cache, …]
Aggregator Cache
• TimeMaps change
• Only want to cache better TimeMaps
– Bigger is better
• Ideally monotonically increasing
• Two extremes:
– Never cache (TTL=0)
– Never update in cache (TTL=92)
Agenda
Cache content measures
• |a| => # of archives<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/
>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT”,
• |m| => # of mementos<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/
>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT”,
8
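The two measures can be computed from a link-format TimeMap with a short script. The following is a minimal Python sketch, not the authors' implementation: the function names are hypothetical, and it assumes that an archive can be identified by the host name of the URI-M and that every entry whose rel value contains the token "memento" counts toward |m|.

```python
import re
from urllib.parse import urlsplit

# One link-format entry: <URI>;rel="..." with an optional ;datetime="..."
ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"(?:\s*;\s*datetime="([^"]+)")?')


def parse_timemap(text):
    """Return a list of (uri, rel, datetime) tuples found in the TimeMap text."""
    return [match.groups() for match in ENTRY.finditer(text)]


def count_archives_and_mementos(text):
    """Return (|a|, |m|) under the assumptions stated above."""
    # "first memento" and "last memento" also carry the "memento" token.
    mementos = [uri for uri, rel, _ in parse_timemap(text) if "memento" in rel.split()]
    # Assumption: one archive per distinct URI-M host name.
    archives = {urlsplit(uri).netloc for uri in mementos}
    return len(archives), len(mementos)


# Entries copied from the nasa.gov TimeMap shown in the slides.
sample = (
    '<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate", '
    '<http://www.nasa.gov/>;rel="original", '
    '<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>'
    ';rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT", '
    '<http://webarchive.nationalarchives.gov.uk/20120909230516/http://www.nasa.gov/>'
    ';rel="last memento";datetime="Sun, 09 Sep 2012 00:00:00 GMT"'
)

print(count_archives_and_mementos(sample))
```

The timegate and original entries are skipped because they are not mementos, so only the two archived copies contribute to the counts.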
Same TimeMap
• |a| == |a'|
• |m| == |m'|
All archives have reported the same mementos.
TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 2; |m| = 3
Gained Archives, Gained Mementos
• |a| < |a'|
• |m| < |m'|
A new archive (WebCite) has just indexed and reported a memento for the first time.
TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 3; |m| = 4
Same Archives, Gained Mementos
• |a| == |a'|
• |m| < |m'|
The Internet Archive has released a set of new mementos.
TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 2; |m| = 4
Lost Archives, Same Mementos
• |a| > |a'|
• |m| == |m'|
A redaction of 1 memento took place in the Internet Archive, which now does not report mementos for this resource. The UK Web Archive has released 1 new memento for this resource.
TimeMap T: |a| = 3; |m| = 3
TimeMap T': |a| = 2; |m| = 3
Lost Archives, Gained Mementos
• |a| > |a'|
• |m| < |m'|
A redaction of 2 mementos took place in the Internet Archive, which now does not report mementos for this resource. The UK Government Web Archive has released 3 new mementos for this resource.
TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 1; |m| = 4
Lost Archives, Lost Mementos
• |a| > |a'|
• |m| > |m'|
Archive-It has removed a collection, and no longer reports those mementos. No other archives have new mementos of those resources.
TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 1; |m| = 1
Gained Archives, Lost Mementos
• |a| < |a'|
• |m| > |m'|
A new archive (WebCite) has just indexed and reported 1 memento for the first time. A server error at the Internet Archive caused an omission of 2 mementos.
TimeMap T: |a| = 2; |m| = 4
TimeMap T': |a| = 3; |m| = 3
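The change categories above depend only on how |a| and |m| compare between T and T'. A minimal Python sketch of that comparison follows; the function names are hypothetical, and the labels are built from the slide titles. Combinations the slides do not show (for example, gained archives with the same mementos) are still labeled mechanically.

```python
def compare(old, new):
    """Describe how a count changed between TimeMap T and TimeMap T'."""
    if new > old:
        return "gained"
    if new < old:
        return "lost"
    return "same"


def classify_change(a, m, a_new, m_new):
    """Label the change from T (|a|, |m|) to T' (|a'|, |m'|)."""
    archives = compare(a, a_new)
    mementos = compare(m, m_new)
    if archives == "same" and mementos == "same":
        return "Same TimeMap"
    return f"{archives.capitalize()} archives, {mementos} mementos"


# Cardinalities taken from the example figures in the slides.
print(classify_change(2, 3, 3, 4))  # Gained archives, gained mementos
print(classify_change(3, 3, 2, 3))  # Lost archives, same mementos
print(classify_change(2, 4, 3, 3))  # Gained archives, lost mementos
```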
Experiment Design
• Eliminate caching from local Memento proxies
• Daily observations of 4,000 TimeMaps for 92 days in 2013
• TimeMaps analyzed for changes & cardinality
• Investigated caching policies
• Outages observed from Memento/archives/department
Observations
Occurrence   Description                         Action
77.4%        Unchanged TimeMap                   Do not update cache
19.7%        Lost archives, lost mementos        Do not update cache
2.4%         Gained archives, gained mementos    Update cache
0.4%         Same archives, gained mementos      Update cache
0.1%         Gained archives, lost mementos      Do not update cache
0.01%        Lost archives, same mementos        Update cache
0.01%        Lost archives, gained mementos      Update cache
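The table can be encoded directly as a lookup from observed change to cache action. The sketch below is illustrative only: the data are the rows of the table, while the names OBSERVATIONS, should_update_cache, and update_share are hypothetical.

```python
# (occurrence in percent, description, update the cache?) copied from the table.
OBSERVATIONS = [
    (77.4, "Unchanged TimeMap", False),
    (19.7, "Lost archives, lost mementos", False),
    (2.4, "Gained archives, gained mementos", True),
    (0.4, "Same archives, gained mementos", True),
    (0.1, "Gained archives, lost mementos", False),
    (0.01, "Lost archives, same mementos", True),
    (0.01, "Lost archives, gained mementos", True),
]

ACTIONS = {description: update for _, description, update in OBSERVATIONS}


def should_update_cache(description):
    """Look up the action the table assigns to a change category."""
    return ACTIONS[description]


def update_share(observations):
    """Percentage of observations whose action is 'Update cache'."""
    return sum(pct for pct, _, update in observations if update)


print(should_update_cache("Lost archives, same mementos"))
print(round(update_share(OBSERVATIONS), 2))
```

In every row marked "Update cache" the TimeMap changed without losing mementos, so under this table only a small share of daily observations would replace a cached TimeMap.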
Impact of Change in TimeMaps
• Caching transient errors
– Not returned or not archived?
Cardinality of TimeMaps
<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate",
<http://www.nasa.gov/>;rel="original",
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
<http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/>;rel="memento";datetime="Thu, 05 Jun 1997 23:05:59 GMT",
<http://api.wayback.archive.org/memento/19970711094601/http://www.nasa.gov/>;rel="memento";datetime="Fri, 11 Jul 1997 09:46:01 GMT",
<http://api.wayback.archive.org/memento/19981202170636/http://www.nasa.gov/>;rel="memento";datetime="Wed, 02 Dec 1998 17:06:36 GMT",
<http://api.wayback.archive.org/memento/19981212031235/http://www.nasa.gov/>;rel="memento";datetime="Sat, 12 Dec 1998 03:12:35 GMT",
<http://api.wayback.archive.org/memento/19990116233500/http://nasa.gov/>;rel="memento";datetime="Sat, 16 Jan 1999 23:35:00 GMT",
<http://api.wayback.archive.org/memento/19990117063022/http://nasa.gov/>;rel="memento";datetime="Sun, 17 Jan 1999 06:30:22 GMT",
<http://api.wayback.archive.org/memento/19990125091025/http://nasa.gov/>;rel="memento";datetime="Mon, 25 Jan 1999 09:10:25 GMT",
<http://api.wayback.archive.org/memento/19990203005545/http://nasa.gov/>;rel="memento";datetime="Wed, 03 Feb 1999 00:55:45 GMT",
…
|TM| ?
Strict vs. Loose Matching
• Different archive, URI-M, datetime
– Strict: 2, Loose: 2
<http://api.wayback.archive.org/memento/20080509125659/http://flare.prefuse.org/>;rel="memento";datetime="Fri, 09 May 2008 12:56:59 GMT",
<http://webarchive.nationalarchives.gov.uk/20080908074106/http://flare.prefuse.org/>;rel="memento";datetime="Mon, 08 Sep 2008 00:00:00 GMT",
• Same archive, datetime, different URI-M
– Strict: 3, Loose: 1
<http://web.archive.org/web/20101101060204/http://aarp.org:80/Health/>;rel="memento";datetime="Mon, 01 Nov 2010 06:02:04 GMT",
<http://web.archive.org/web/20101101060204/http://www.aarp.org:80/Health/>;rel="memento";datetime="Mon, 01 Nov 2010 06:02:04 GMT",
<http://web.archive.org/web/20101101060204/http://www.aarp.org:80/health/>;rel="memento";datetime="Mon, 01 Nov 2010 06:02:04 GMT",
• Same archive, different URI-M, bad datetime
– Strict: 2, Loose: 2
<http://wayback.archive-it.org/2342/20110321192906/http://www.apple.com/iphone/find-my-iphone-setup/>...datetime="Mon, 21 Mar 2011 00:00:00 GMT"
<http://wayback.archive-it.org/2354/20110321035356/http://www.apple.com/iphone/find-my-iphone-setup/>...datetime="Mon, 21 Mar 2011 00:00:00 GMT"
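The slides do not spell out the two matching rules, so the following Python sketch rests on an assumption inferred only from the three examples: strict matching counts distinct URI-Ms, and loose matching counts distinct pairs of archive host and the 14-digit capture timestamp embedded in the URI-M. The function names are hypothetical, and the authors' actual definitions may differ.

```python
import re
from urllib.parse import urlsplit


def strict_key(uri_m):
    """Strict (assumed): two mementos match only if their URI-Ms are identical."""
    return uri_m


def loose_key(uri_m):
    """Loose (assumed): match on archive host plus the 14-digit capture timestamp."""
    host = urlsplit(uri_m).netloc
    match = re.search(r"/(\d{14})/", uri_m)
    # Fall back to the whole URI-M when no timestamp is found.
    return (host, match.group(1) if match else uri_m)


def cardinality(uri_ms, key):
    """Number of distinct mementos under the given matching rule."""
    return len({key(uri) for uri in uri_ms})


# URI-Ms from the "Same archive, datetime, different URI-M" example.
aarp = [
    "http://web.archive.org/web/20101101060204/http://aarp.org:80/Health/",
    "http://web.archive.org/web/20101101060204/http://www.aarp.org:80/Health/",
    "http://web.archive.org/web/20101101060204/http://www.aarp.org:80/health/",
]
print(cardinality(aarp, strict_key), cardinality(aarp, loose_key))
```

Keying on the timestamp inside the URI-M rather than on the datetime attribute is what keeps the two Archive-It entries distinct in the "bad datetime" example, where both attributes are truncated to midnight.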
Strict vs. Loose: translate.google.com
Testing
• TTLs [0, 92]
– 0: Thrashed cache, best freshness
– 92: First TimeMap cached, no replacement
• Policies
– Unconditional: cardinality ignored
– Conditional: replacements occur when cardinality is better
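The two policies can be compared by replaying a sequence of daily observations through a simulated cache. This is a minimal Python sketch under stated assumptions, not the authors' experiment code: simulate_cache is a hypothetical name, TTLs are whole days, a cache entry is refreshed once its age reaches the TTL, and "better" is taken to mean a strictly larger cardinality.

```python
def simulate_cache(daily_cardinality, ttl, conditional):
    """Replay daily TimeMap cardinalities through a cache with a TTL in days.

    Returns (queries, served): the number of queries sent to the archives and
    the cardinality served from the cache on each day.
    """
    queries = 0
    cached = None
    age = 0
    served = []
    for observed in daily_cardinality:
        if cached is None or age >= ttl:
            queries += 1  # the entry expired, so ask the archives again
            # Unconditional: always replace. Conditional: replace only if larger.
            if cached is None or not conditional or observed > cached:
                cached = observed
            age = 0
        served.append(cached)
        age += 1
    return queries, served


# Made-up daily cardinalities, including a transient drop on the third day.
daily = [3, 4, 1, 4, 5, 2]
print(simulate_cache(daily, ttl=2, conditional=False))
print(simulate_cache(daily, ttl=2, conditional=True))
```

With daily observations, a TTL of 0 queries the archives every day, while a TTL at least as long as the observation period keeps the first TimeMap for the whole run. The conditional policy sends the same number of queries as the unconditional one but does not overwrite the cache with the transient drop.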
Evaluation
• Minimize cost values:
– Q – Queries to the archives
– MemDays – number of missed mementos/day
• Calculated MemDays: mementos missed/day
[Figure labels: TTL: ∞, TTL: 0, MemDays, Q]
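One way to read the MemDays measure is as the number of mementos a client misses on each day, summed over the observation period. The Python sketch below assumes exactly that reading; mem_days is a hypothetical name, the numbers are made up for illustration, and the authors' precise formula is not given in the slides.

```python
def mem_days(observed, served):
    """Sum, over all days, the mementos present in the archives but absent
    from the cached TimeMap (assumed reading of MemDays)."""
    return sum(max(0, actual - cached) for actual, cached in zip(observed, served))


# Made-up example: what the archives held each day vs. what the cache served.
observed = [3, 4, 1, 4, 5, 2]
served_unconditional = [3, 3, 1, 1, 5, 5]
served_conditional = [3, 3, 3, 3, 5, 5]

print(mem_days(observed, served_unconditional))
print(mem_days(observed, served_conditional))
```

A day on which the cache holds more mementos than the archives currently report contributes zero rather than a negative value, since nothing is missed on that day.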
MemDays
[Figure labels: 6, |TM|=10, MemDay=8]
Optimal TTL
Unconditional: Optimal TTL = 9
Conditional: Optimal TTL = 15
Conclusion & Future Work
• 3-month observation of 4,000 TimeMaps
• Change patterns studied
– 80.2% of TimeMaps monotonically increase
– Others decrease
• Optimal TTL = 15 days
• Cache Improvements:
– Saves requests to the archives
• Worth reinvestigating
– Changed Memento landscape
Backups
www.nasa.gov 1996 - 2012
Memento
Integrates the past and present web
[Figure labels: Now, Always Current, 2008, 2006, 2001, 2008, 2010]
Cardinality
• Size of a TimeMap
– # Archives?
– # Date times?
• TimeMaps:
• Cardinality:
• Monotonic Increase: