Formats Over Time: Exploring UK Web History
![Page 1: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/1.jpg)
Formats over Time: Exploring UK Web History
Andrew Jackson, UK Web Archive, The British Library
iPres 2012 | 04-10-2012 | Toronto
![Page 2: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/2.jpg)
DEBATING OBSOLESCENCE Formats over Time
![Page 3: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/3.jpg)
Rothenberg & Rosenthal On Format Obsolescence
Jeff Rothenberg: “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997); “…still apt…” (2012)
David Rosenthal: “when challenged, proponents of [format migration strategies] have failed to identify even one format in wide use when Rothenberg [made that assertion] that has gone obsolete in the intervening decade and a half.” (2010)
That network effects inhibit obsolescence
Where is the evidence?
![Page 4: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/4.jpg)
AN EXPERIMENT Formats over Time
![Page 5: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/5.jpg)
UK Web Domain Dataset (1994-2010)
From the Internet Archive
Millions of websites
> 2.5 billion resources
> 400,000 ARC/WARC files
> 35TB
Execution at scale:
Stored on HDFS
Map-Reduce
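As a rough illustration of the Map-Reduce pattern named above, the sketch below counts resources per (format, year) key in plain Python. It is not the project's actual Hadoop job: the record shape and the function names `map_record`, `reduce_counts` and `run_job` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical record shape: (identified MIME type, crawl year).
# The real job would read ARC/WARC records from HDFS instead.

def map_record(record):
    """Map step: emit one ((type, year), 1) pair per archived resource."""
    mime_type, year = record
    yield (mime_type, year), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_job(records):
    """Run map then reduce in-process, standing in for the cluster."""
    pairs = (pair for record in records for pair in map_record(record))
    return reduce_counts(pairs)

sample = [("image/png", 2004), ("image/png", 2004), ("application/pdf", 2004)]
counts = run_job(sample)
```

Because each map call looks at one record only, the work can be split across as many machines as there are input files, which is what makes the pattern suit a corpus of this size.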
![Page 6: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/6.jpg)
Identification Tools
DROID:
Well-known in the digital preservation community
Format version level identification
Minor problem concerning file handles
Only binary signature part (DROID-B) could be embedded
Apache Tika:
Widely used identification and data extraction tool
Identifies many formats at the MIME type level
Easy to embed and extend
Added ability to extract e.g. software identifiers
Minor bug concerning identification buffer size
![Page 7: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/7.jpg)
A Common Language For Format Identifiers
Comparison and combination requires a common model
Map PRONOM IDs to extended MIME types:
fmt/18 becomes application/pdf; version=1.4
Allows easy comparison at sub-type level
Can easily extend to cover other properties:
text/plain; charset=UTF-8
application/pdf; software="Adobe Acrobat 6.0"
Also extended Tika to output details from PDFs
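The mapping idea can be sketched as follows. Only the `fmt/18` entry comes from the slide. The lookup table, the function names and the simple parser are assumptions for illustration, not the code in the project's tools. The parser also assumes that parameter values never contain a semicolon.

```python
# Only this entry is taken from the slide; a real table would cover
# every PRONOM identifier of interest.
PRONOM_TO_MIME = {"fmt/18": "application/pdf; version=1.4"}

def to_extended_mime(puid):
    """Return the extended MIME type for a PRONOM ID, or None if unmapped."""
    return PRONOM_TO_MIME.get(puid)

def parse_extended_mime(value):
    """Split 'type/subtype; key=value; ...' into (base type, parameters).

    Assumes parameter values contain no ';' characters.
    """
    parts = [part.strip() for part in value.split(";")]
    base = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue
        name, _, raw = part.partition("=")
        params[name.strip().lower()] = raw.strip().strip('"')
    return base, params

def same_base_type(a, b):
    """Compare two identifications at the type/subtype level only."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]

base, params = parse_extended_mime(to_extended_mime("fmt/18"))
```

Here `base` is `application/pdf` and `params` holds the version, so a version-level DROID result and a type-level Tika result can be compared on the part they share.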
![Page 8: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/8.jpg)
Format Profile Dataset
Server, Tika & DROID-B format profiles, over time (columns: server type, Tika type, DROID-B type, year, count):
image/png	image/png	image/png; version=1.0	2004	102
application/pdf	application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0"	application/pdf; version=1.2	2004	1
CC0 – free to download and reuse
http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/
Please cite us and/or let us know if you use it
Source code of all tools and modifications also available:
https://github.com/openplanets/nanite
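A minimal sketch of how such profile rows might be turned into a yearly share for one format. It assumes each row is tab-separated with the fields in the order server type, Tika type, DROID-B type, year, count. That layout is inferred from the two sample rows on the slide and should be checked against the published files. The function names are hypothetical.

```python
from collections import defaultdict

# The two sample rows from the slide, in the assumed tab-separated layout.
SAMPLE = (
    "image/png\timage/png\timage/png; version=1.0\t2004\t102\n"
    "application/pdf\t"
    'application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; '
    'source="Adobe PageMaker 6.0"\t'
    "application/pdf; version=1.2\t2004\t1\n"
)

def read_profiles(text):
    """Yield (server, tika, droid, year, count) tuples from profile text."""
    for line in text.splitlines():
        if not line.strip():
            continue
        server, tika, droid, year, count = line.split("\t")
        yield server, tika, droid, int(year), int(count)

def yearly_share(rows, base_type):
    """Fraction of each year's resources whose Tika base type matches."""
    totals = defaultdict(int)
    matched = defaultdict(int)
    for _server, tika, _droid, year, count in rows:
        totals[year] += count
        if tika.split(";")[0].strip() == base_type:
            matched[year] += count
    return {year: matched[year] / totals[year] for year in totals if totals[year]}

shares = yearly_share(read_profiles(SAMPLE), "image/png")
```

With only the two sample rows, PNG accounts for 102 of the 103 resources recorded for 2004.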
![Page 9: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/9.jpg)
COMPARING TOOLS Results
![Page 10: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/10.jpg)
Coverage & Depth
[Figure: percentage of resources unidentified (0% to 100%) by year, 1996-2010; series: DROID-B v.59, Apache Tika 1.1]
No format-version-level information from Apache Tika.
![Page 11: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/11.jpg)
Inconsistencies
Gaps:
37 formats spotted by DROID-B but not Tika
Notably includes earlier Office formats
129 formats spotted by Tika but not DROID-B
But at least 20 are due to not using the full DROID
Conflicts:
Failed MIME type mapping, e.g. PDF 1.7 (since fixed)
‘Soft’ signatures – e.g. PICT matching 3M JPG (gone)
DROID strictness – 9M GIF, 4M JPG, 1.3M PDF…
Both tools bad at non-HTML/XML text formats:
CSS, scripting languages like JS, CSV, TSV, etc.
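The gap and conflict counts above could be derived along these lines. This is an illustrative sketch, not the analysis code used for the talk: the input shape (one pair of Tika and DROID-B results per resource, with `None` for "unidentified") and the sample values are assumptions.

```python
def base_type(extended_type):
    """Strip parameters, keeping only the lower-cased type/subtype."""
    return extended_type.split(";")[0].strip().lower()

def compare_tools(results):
    """Summarise gaps and conflicts between two identification tools.

    results: iterable of (tika_type, droid_type) pairs, one per resource,
    where either value may be None if that tool gave no identification.
    """
    tika_seen, droid_seen = set(), set()
    conflicts = 0
    for tika, droid in results:
        if tika:
            tika_seen.add(base_type(tika))
        if droid:
            droid_seen.add(base_type(droid))
        # A conflict: both tools answered, but with different base types.
        if tika and droid and base_type(tika) != base_type(droid):
            conflicts += 1
    return {
        "droid_only": droid_seen - tika_seen,  # gap: formats only DROID-B saw
        "tika_only": tika_seen - droid_seen,   # gap: formats only Tika saw
        "conflicts": conflicts,
    }

# Made-up sample data, loosely echoing the cases on the slide.
sample = [
    ("image/jpeg", "image/x-pict"),
    ("text/css", None),
    ("application/pdf", "application/pdf; version=1.4"),
]
summary = compare_tools(sample)
```

Comparing on the base type means a version-level DROID-B result does not count as a conflict with a type-level Tika result for the same format.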
![Page 12: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/12.jpg)
FORMATS OVER TIME Results
![Page 13: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/13.jpg)
Image Formats Over Time
[Figure: percentage of crawl (log scale, 0.00001% to 100%) by year, 1996-2010; series: JPEG, GIF, PNG, ICON, XBM, TIFF]
![Page 14: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/14.jpg)
HTML Versions Over Time
[Figure: percentage of HTML resources (0% to 100%) by year, 1996-2010; series: HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01, XHTML 1.0]
![Page 15: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/15.jpg)
PDF Versions Over Time
[Figure: percentage of PDF resources (0% to 100%) by year, 1996-2010; series: PDF versions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]
![Page 16: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/16.jpg)
Format Usage Versus Time
[Figure: number of resources in archive (log scale, 1 to 10,000,000,000) against timespan in years (0 to 18)]
![Page 17: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/17.jpg)
IMPLEMENTATIONS Results
![Page 18: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/18.jpg)
PDF Software Over Time
[Figure: percentage of PDF resources (0% to 100%) by year, 1996-2010; series: Acrobat Distiller, Acrobat PDFWriter, Acrobat]
Over 2100 Distinct PDF Software IDs
![Page 19: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/19.jpg)
JPEG Hardware Over Time
[Figure: percentage of hardware IDs (0% to 100%) by year, 1994-2010; labelled series: DS5, CYBERSHOT, E990, MX1700, NIKON D40]
Over 2100 Distinct JPEG Hardware IDs
![Page 20: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/20.jpg)
CONCLUSIONS Formats over Time
![Page 21: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/21.jpg)
Summary
Format obsolescence is complex
Network effects do appear to stabilize formats
But once-popular formats are fading nevertheless
More sophisticated approach required
Please re-use our data, or ask for more
Firmer conclusions need:
Richer, more detailed results
From a wider range of corpora
This approach only gives creator information
A different approach will be needed to understand resource consumption (e.g. PPT 4, RealAudio 1)
![Page 22: Formats Over Time: Exploring UK Web History](https://reader033.fdocuments.in/reader033/viewer/2022052618/54928680ac7959182e8b4643/html5/thumbnails/22.jpg)
webarchive.org.uk
Questions?