WADL 2013 July 25-26th Indianapolis, IN
Martin Klein @mart1nkle1n martinklein0815@gmail.com
SiteStory Archiving Done Differently
http://mementoweb.github.io/SiteStory/
Justin F. [email protected]
LANL SiteStory Team
lead developer
Archiving - the traditional way
• Actively crawl the web
• For example, using Heritrix
Archiving - the traditional way
• Issues with crawler-based archiving:
  • Request can be rejected (robots.txt, user-agent, IP)
  • Can be deceived (geo-location, user-agent)
  • Can be trapped (crawl my calendar!)
  • Requires constant and massive bandwidth
  • Implied timing problem, when to crawl?
Archiving - the traditional way
Timing problem:
• Update 1 viewed but not archived
Timeline:
• t1: R created
• t2: browser visit 1
• t3: crawler visit 1
• t4: R update 1
• t5: browser visit 2
• t6: R update 2
Archiving - the SiteStory way
• Transactional Web archiving
• Archive accepts HTTP transaction between browser and server
Archiving - the traditional way
Timing problem:
• Update 1 viewed and archived
Timeline:
• t1: R created
• t2: browser visit 1
• t3: crawler visit 1
• t4: R update 1
• t5: browser visit 2
• t6: R update 2
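The two timing slides can be replayed as a small simulation. This is only a sketch: the event list mirrors the t1 to t6 timeline above, while the function name `archived_versions` and the rule "capture the current version on every visit of a given kind" are illustrative assumptions, not SiteStory code.

```python
# Hypothetical replay of the t1..t6 timeline from the slides.
# Each event is (time, kind, version); version is set only for changes to R.
events = [
    ("t1", "change", "R created"),
    ("t2", "browser", None),
    ("t3", "crawler", None),
    ("t4", "change", "R update 1"),
    ("t5", "browser", None),
    ("t6", "change", "R update 2"),
]


def archived_versions(events, capture_on):
    """Return the versions of R an archive holds if it captures on `capture_on` visits."""
    current = None
    captured = []
    for _time, kind, version in events:
        if kind == "change":
            current = version
        elif kind == capture_on and current not in captured:
            captured.append(current)
    return captured


# Crawler-based archive: captures only when the crawler visits.
crawler_archive = archived_versions(events, "crawler")
# Transactional archive: captures whenever a browser is served.
transactional_archive = archived_versions(events, "browser")

print(crawler_archive)
print(transactional_archive)
```

Under these assumptions the crawler holds only the version it happened to see at t3, while the transactional archive holds every version that was actually served to a browser. Update 2 is in neither archive because nobody has requested it yet.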
Archiving - the SiteStory way
• Challenges with transactional archiving:
  • To be archived, the server has to cooperate
  • Transfer data to archive, batch mode or real-time
  • Archive must trust transmission to be authentic
  • Resources from external servers have to be archived out-of-band
  • Deduplication challenges
    o Alias: different URI, same response
    o Conneg: same URI, different response
    o Determine “significant” content change
SiteStory Status Quo
• mod_sitestory sends HTTP PUT to SiteStory Web Archive upon client’s GET request
  o not for POST, DELETE, etc.
  o for HTTP response codes 200, 302, 303
• Client IP can be included in stored headers, configurable
• Header info stored in BerkeleyDB, response body in FS
• Dedup via hash(body)
• Offloading content as WARC files possible (read: recommended)
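The filtering and deduplication rules on this slide can be sketched in a few lines. Everything here is an illustrative stand-in: `should_archive` and `ToyArchive` are hypothetical names, the in-memory list and dict replace BerkeleyDB and the file system, SHA-256 is an assumed hash choice, and keeping one header record per transaction is also an assumption.

```python
import hashlib

# Rules taken from the slide: only GET requests, only these response codes.
ARCHIVED_METHODS = {"GET"}
ARCHIVED_STATUS_CODES = {200, 302, 303}


def should_archive(method, status):
    """Decide whether a transaction would be forwarded to the archive."""
    return method.upper() in ARCHIVED_METHODS and status in ARCHIVED_STATUS_CODES


class ToyArchive:
    """In-memory stand-in for the archive's two stores (hypothetical)."""

    def __init__(self):
        self.header_records = []  # stands in for the BerkeleyDB header store
        self.bodies = {}          # stands in for the file system, keyed by digest

    def put(self, uri, status, headers, body):
        """Store a transaction; return True if the body was not seen before."""
        digest = hashlib.sha256(body).hexdigest()  # hash function is an assumption
        is_new_body = digest not in self.bodies
        if is_new_body:
            self.bodies[digest] = body
        self.header_records.append(
            {"uri": uri, "status": status, "headers": headers, "body_digest": digest}
        )
        return is_new_body


archive = ToyArchive()
first = archive.put("http://example.org/", 200, {"Content-Type": "text/html"}, b"<p>v1</p>")
second = archive.put("http://example.org/", 200, {"Content-Type": "text/html"}, b"<p>v1</p>")
print(first, second, len(archive.bodies))
```

Keying the body store on the digest means an unchanged response costs only a header record, which is one plausible reading of "Dedup via hash(body)".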
SiteStory Use Case
• http://www.dans.knaw.nl
• LANL has been archiving the DANS website (forever)
• ~32 GB since mid April 2013
• >200k resources
To Appear: TPDL 2013
• SiteStory benchmark with ab & wget
  o ApacheBench (ab): server stress test tool
  o wget: Web page download
    - All content: -p
• Local network
• Negligible difference between SiteStory and No SiteStory
Re-executed on testbed
[Figure: testbed diagram with labels ws-dl-03.cs.odu.edu, megalodon.lanl.gov, and "x99" @AWS]
Testing with ab
Testing with wget
Round Trip Time -- Distributed
Results
• Distributed: higher variance
• Increased delay due to network
• On vs. off comparison: results still comparable
• Viable solution without crippling service
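An on vs. off comparison of this kind boils down to summarizing two sets of timing samples. The sketch below shows one way to do that; `compare_runs` is a hypothetical helper, and the sample values are made-up placeholders chosen for easy arithmetic, not measurements from the benchmark.

```python
from statistics import mean, stdev


def compare_runs(off_times, on_times):
    """Summarize response times (same unit) with SiteStory off and on."""
    off_mean = mean(off_times)
    on_mean = mean(on_times)
    return {
        "off_mean": off_mean,
        "on_mean": on_mean,
        "off_stdev": stdev(off_times),
        "on_stdev": stdev(on_times),
        # Fraction by which the mean grows when archiving is switched on.
        "relative_overhead": (on_mean - off_mean) / off_mean,
    }


# Placeholder samples in milliseconds, NOT results from the talk.
off_samples = [10.0, 12.0, 14.0]
on_samples = [11.0, 13.0, 15.0]

summary = compare_runs(off_samples, on_samples)
print(summary)
```

Reporting the standard deviation next to the mean matters here because the distributed runs are described as having higher variance, so a difference in means alone could be misleading.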
SiteStory Installation
• Apache module mod_sitestory
  o Option to exclude a list of directories
• SiteStory Web Archive
  o Trivial for existing Tomcat environments
  o Tanuki Java wrapper (stand-alone) available
• Configure, open ports, go!
Or…
SiteStory Testbed
We have a SiteStory Web Archive installed for you!
1. Install and configure mod_sitestory
2. Send an email containing:
   1. Your contact info
   2. Web server IP address
   3. Server domain name used
3. Happy Sitestory’ing!
mailto: [email protected]
http://mementoweb.github.io/SiteStory/
Martin Klein @mart1nkle1n martinklein0815@gmail.com
SiteStory Archiving Done Differently
http://mementoweb.github.io/SiteStory/
Justin F. [email protected]