Reconstructing the past with media wiki

Reconstructing the past with MediaWiki: Programmatic Issues and Solutions Shawn M. Jones [email protected] Old Dominion University


The Internet Archive attempts to reconstruct web pages from snapshots (Mementos) taken at various points in time. Many pages change more frequently than the Internet Archive can capture them, so some revisions of a given web page are lost forever. MediaWiki, however, retains all past revisions of a given page, along with its associated external resources. This inspired the development of the Memento MediaWiki Extension as an improvement over the Internet Archive's "drive by" method of digital preservation where MediaWiki sites are involved. While working on the Memento MediaWiki Extension, effort was put into reconstructing past revisions of each wiki page. The existing software reconstructs the page text as per RFC 7089, but does not try to pull in past versions of images, JavaScript, CSS, and other external resources, because MediaWiki, as it exists, makes it difficult or impossible to load these resources at page generation time. This curated talk will explore the problems of page reconstruction on the main web and detail the issues within the MediaWiki code that currently prevent or complicate reconstructing the page in its entirety as it looked at a given revision.

Transcript of Reconstructing the past with media wiki

Page 1: Reconstructing the past with media wiki

Reconstructing the past with MediaWiki:

Programmatic Issues and Solutions

Shawn M. Jones [email protected]

Old Dominion University

Page 2: Reconstructing the past with media wiki

Reconstructing the Past with the Internet Archive

HTML

Images

JavaScript

CSS

Our goal: Temporal Coherence

Make the page look as it looked at the time it was archived.

Page 3: Reconstructing the past with media wiki

Some Results from the Internet Archive Are Lacking

Images change between the time the Archive crawls the main page and the time it gets to the images

Sometimes embedded images are missing when the Archive gets to them

Sometimes the page is designed with a specific browser in mind

Image from “A Framework for Evaluation of Composite Memento Temporal Coherence” by S. Ainsworth, M. L. Nelson, H. Van de Sompel. http://arxiv.org/abs/1402.0928

Page 4: Reconstructing the past with media wiki

MediaWiki Shouldn’t Have This Problem

HTML

Images

JavaScript

CSS

Page 5: Reconstructing the past with media wiki

What we’re not doing

Page 6: Reconstructing the past with media wiki

Interest in Reconstructing the Past With MediaWiki

Page 7: Reconstructing the past with media wiki

Simplified Memento Overview

Page 8: Reconstructing the past with media wiki

Rules for Reconstructing the Past With MediaWiki

Do not modify any existing MediaWiki code!

Conform to MediaWiki coding standards

And…

Page 9: Reconstructing the past with media wiki

Reconstructing the Past

Articles

Templates

Embedded Images

Embedded JavaScript

Embedded CSS

Page 10: Reconstructing the past with media wiki

Accessing Old Article Text

The oldid argument references a revision of a page within MediaWiki's database

Merely visiting the URI with the oldid will give you the text content of the page as it existed at that revision
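As a minimal sketch of the point above, the following Python snippet builds a URI that carries the oldid argument. The function name `old_revision_uri` is hypothetical; the base URI, title, and revision number are taken from the sample TimeGate response in the backup slides.

```python
from urllib.parse import urlencode


def old_revision_uri(index_php: str, title: str, oldid: int) -> str:
    """Build a MediaWiki URI that requests one specific revision of a page.

    index_php is the wiki's index.php entry point (an assumption about
    how the wiki is deployed); oldid is the revision ID in the database.
    """
    query = urlencode({"title": title, "oldid": oldid})
    return f"{index_php}?{query}"


if __name__ == "__main__":
    print(old_revision_uri(
        "http://ws-dl-05.cs.odu.edu/demo/index.php",
        "Daenerys_Targaryen",
        1499,
    ))
```

Fetching that URI with any HTTP client should return the article text as it existed at revision 1499, assuming the demo wiki is still reachable.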

Page 11: Reconstructing the past with media wiki

Reconstructing the Past

Articles: Handled by Memento MediaWiki Extension

Templates

Embedded Images

Embedded JavaScript

Embedded CSS

Page 12: Reconstructing the past with media wiki

Including the Right Template

This gives us:

$title - the Title object for the given page

$parser - the Parser object for the given page

$id - the revision ID (oldid) for the Template page

Using $parser and $title, we can change the $id and fetch an old revision of the Template
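Choosing which old revision to fetch comes down to finding the Template revision that was current when the article revision was saved. The sketch below illustrates that selection in Python rather than in MediaWiki's PHP; it is not the extension's code, and the function name `revision_as_of` and the sample revision data are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def revision_as_of(revisions, when):
    """Return the ID of the newest revision saved at or before `when`.

    `revisions` is a list of (timestamp, revision_id) pairs sorted by
    timestamp in ascending order. Returns None when the page did not
    exist yet at `when`.
    """
    timestamps = [ts for ts, _ in revisions]
    index = bisect_right(timestamps, when)
    if index == 0:
        return None
    return revisions[index - 1][1]


if __name__ == "__main__":
    # Hypothetical Template history, for illustration only.
    template_history = [
        (datetime(2012, 1, 10, tzinfo=timezone.utc), 101),
        (datetime(2013, 3, 2, tzinfo=timezone.utc), 245),
        (datetime(2014, 2, 20, tzinfo=timezone.utc), 390),
    ]
    article_saved = datetime(2013, 6, 6, tzinfo=timezone.utc)
    print(revision_as_of(template_history, article_saved))
```

The returned ID is the value that would replace $id, so the Template is rendered as it looked alongside that article revision.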

Page 13: Reconstructing the past with media wiki

Reconstructing the Past

Articles: Handled by Memento MediaWiki Extension

Templates: Handled by Memento MediaWiki Extension

Embedded Images

Embedded JavaScript

Embedded CSS

Page 14: Reconstructing the past with media wiki

But What About Images?

This map is important to understanding the content of this article

This image is changed as the article is changed, to reflect its content

Page 15: Reconstructing the past with media wiki

It’s the same map if we look at the June 6, 2013 revision now

Users can't view this embedded resource as it looked in June 2013 while reading the article from that time period

Page 16: Reconstructing the past with media wiki

What should have happened

This is the map from June 2013 that should have been displayed

This is the current map

The content of the article won't match the data in this visual aid, possibly confusing a user who wanted historical information on this topic

Page 17: Reconstructing the past with media wiki

We Tried To Solve This

Upon further inspection of the code in MediaWiki, the $time argument of this function is never used, as detailed here

Page 18: Reconstructing the past with media wiki

We Just Solved This

Upon further inspection of the code in MediaWiki, the $file argument’s getHistory() function can be used to acquire previous revisions of images

Page 19: Reconstructing the past with media wiki

Reconstructing the Past

Articles: Handled by Memento MediaWiki Extension

Templates: Handled by Memento MediaWiki Extension

Embedded Images: Prototyped for a future version of the Memento MediaWiki Extension

Embedded JavaScript

Embedded CSS

Page 20: Reconstructing the past with media wiki

What about CSS/JavaScript?

The present CSS of this page conflicts with the past Template.

Page 21: Reconstructing the past with media wiki

We Couldn’t Solve This

The data is present, but we could not find any way for an extension to access or render it.

Page 22: Reconstructing the past with media wiki

Recap on Reconstructing the Past

Articles: Handled by Memento MediaWiki Extension

Templates: Handled by Memento MediaWiki Extension

Embedded Images: Prototyped for a future version of the Memento MediaWiki Extension

Embedded JavaScript: Requires changes to MediaWiki

Embedded CSS: Requires changes to MediaWiki

Page 23: Reconstructing the past with media wiki

Uniform solution

• RFC 7089, Memento, was designed to provide uniform access to past versions of all resources on the Web

• Memento provides a web standard to access these resources
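In RFC 7089, a client asks for a past version by sending an Accept-Datetime request header to a TimeGate. The Python sketch below only formats that header; it makes no network request, and the function name `accept_datetime_header` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Return the request header for Memento datetime negotiation.

    `when` must be timezone-aware. It is converted to UTC and written
    in the HTTP date format with a literal GMT zone.
    """
    if when.tzinfo is None:
        raise ValueError("when must be a timezone-aware datetime")
    utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc, usegmt=True)}


if __name__ == "__main__":
    wanted = datetime(2007, 4, 22, 15, 1, 20, tzinfo=timezone.utc)
    print(accept_datetime_header(wanted))
```

Sending this header to a TimeGate URI should produce a redirect to the Memento closest to the requested datetime, as in Step 2 of the backup slides.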

Page 24: Reconstructing the past with media wiki

Resources

• Memento Protocol: http://tools.ietf.org/html/rfc7089

• Memento Website: http://www.mementoweb.org/

• Memento MediaWiki Extension: http://www.mediawiki.org/wiki/Extension:Memento

• Memento Chrome Extension: http://bit.ly/memento-for-chrome

• More details: http://ws-dl.blogspot.com/2014/04/2014-04-01-yesterdays-wiki-page-todays.html

• Contact me: [email protected]

Page 25: Reconstructing the past with media wiki

Backup Slides

Page 26: Reconstructing the past with media wiki

Sample URI-R (Step 1) HTTP Response

HTTP/1.1 200 OK
Date: Sun, 25 May 2014 21:39:02 GMT
Server: Apache
X-Content-Type-Options: nosniff
Link: <http://ws-dl-05.cs.odu.edu/demo/index.php/Daenerys_Targaryen>; rel="original latest-version",
  <http://ws-dl-05.cs.odu.edu/demo/index.php/Special:TimeGate/Daenerys_Targaryen>; rel="timegate",
  <http://ws-dl-05.cs.odu.edu/demo/index.php/Special:TimeMap/Daenerys_Targaryen>; rel="timemap"; type="application/link-format"
Content-language: en
Vary: Accept-Encoding,Cookie
Cache-Control: s-maxage=18000, must-revalidate, max-age=0
Last-Modified: Sat, 17 May 2014 16:48:28 GMT
Connection: close
Content-Type: text/html; charset=UTF-8
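A client finds the TimeGate and TimeMap by reading the Link header of a response like this one. The following is a minimal Python sketch of that step, assuming each target URI is wrapped in angle brackets as RFC 7089 requires. The function name `parse_link_header` is hypothetical, and the parser handles only the simple parameter forms seen in these samples.

```python
import re

# One link-value: <URI> followed by zero or more ;name=value parameters.
LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)'
)
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value: str) -> dict:
    """Map each relation type in a Link header value to its target URI."""
    relations = {}
    for uri, params in LINK_RE.findall(value):
        attrs = {
            name.lower(): quoted or bare
            for name, quoted, bare in PARAM_RE.findall(params)
        }
        # A single rel value may list several space-separated types.
        for rel in attrs.get("rel", "").split():
            relations[rel] = uri
    return relations


if __name__ == "__main__":
    base = "http://ws-dl-05.cs.odu.edu/demo/index.php"
    header = (
        f'<{base}/Daenerys_Targaryen>; rel="original latest-version", '
        f'<{base}/Special:TimeGate/Daenerys_Targaryen>; rel="timegate", '
        f'<{base}/Special:TimeMap/Daenerys_Targaryen>; rel="timemap"; '
        'type="application/link-format"'
    )
    print(parse_link_header(header)["timegate"])
```

If the same relation type appears on several links, this sketch keeps only the last one; a fuller client would collect them in a list.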

Page 27: Reconstructing the past with media wiki

Sample URI-G (Step 2) HTTP Response

HTTP/1.1 302 Found
Date: Sun, 25 May 2014 21:43:08 GMT
Server: Apache
X-Content-Type-Options: nosniff
Vary: Accept-Encoding, Accept-Datetime
Location: http://ws-dl-05.cs.odu.edu/demo/index.php?title=Daenerys_Targaryen&oldid=1499
Link: <http://ws-dl-05.cs.odu.edu/demo/index.php/Special:TimeMap/Daenerys_Targaryen>; rel="timemap"; type="application/link-format",
  <http://ws-dl-05.cs.odu.edu/demo/index.php/Daenerys_Targaryen>; rel="original latest-version"
Connection: close
Content-Type: text/html; charset=UTF-8

Page 28: Reconstructing the past with media wiki

Sample URI-M (Step 3) HTTP Response

HTTP/1.1 200 OK
Date: Sun, 25 May 2014 21:46:12 GMT
Server: Apache
X-Content-Type-Options: nosniff
Memento-Datetime: Sun, 22 Apr 2007 15:01:20 GMT
Link: <http://ws-dl-05.cs.odu.edu/demo/index.php/Daenerys_Targaryen>; rel="original latest-version",
  <http://ws-dl-05.cs.odu.edu/demo/index.php/Special:TimeGate/Daenerys_Targaryen>; rel="timegate",
  <http://ws-dl-05.cs.odu.edu/demo/index.php/Special:TimeMap/Daenerys_Targaryen>; rel="timemap"; type="application/link-format"
Content-language: en
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Connection: close
Content-Type: text/html; charset=UTF-8
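The Memento-Datetime header is what marks this response as a Memento and says which moment it represents. As a small Python sketch, a client could read it as follows; the function name `memento_datetime` and the plain dictionary of headers are assumptions made for illustration.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def memento_datetime(headers: dict) -> Optional[datetime]:
    """Return the archival datetime of a response, or None if absent.

    Header names are matched case-insensitively. A response without
    Memento-Datetime is not a Memento.
    """
    for name, value in headers.items():
        if name.lower() == "memento-datetime":
            return parsedate_to_datetime(value)
    return None


if __name__ == "__main__":
    response_headers = {
        "Memento-Datetime": "Sun, 22 Apr 2007 15:01:20 GMT",
        "Content-Type": "text/html; charset=UTF-8",
    }
    print(memento_datetime(response_headers).isoformat())
```

Comparing this value with the datetime sent in Accept-Datetime shows how far the returned revision is from the one that was requested.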