Archiving and Preserving the Web

Kristine Hanna

Internet Archive

April 2006

Internet Archive: Universal Access to Human Knowledge

• A 501(c)(3) non-profit

• Located in the Presidio, San Francisco, California

• Founded in 1996 to build an ‘Internet library’

• Provides permanent access for researchers, historians, and scholars to historical collections that exist in digital format

• Built on open source principles

• Open Source software developed by Internet Archive and the IIPC

Internet Archive Stats

• Largest public web archive

• 60 billion pages, 55 million sites

• Have expanded to include texts, audio, moving images, and software: 2.6 million downloads a day

• 60,000 unique users a day

What do we collect? Web Archive

• Take a broad snapshot of the web every 2 months

• 2 billion pages a month

• Websites from every domain (.org, .com, .edu, etc.)

• Content in 21 languages

• Entire archive accessible for free to the public via the website at www.archive.org

Why try to collect and preserve it all?

• Web has no boundaries, no limits

• What will be important to future generations?

• What is there today may be gone tomorrow

– “Capture now, ask why later”

– “Grab it while you can, work it out later”

– “Lose as little as possible”

How do we collect it?

Open Source Technology primarily developed by Internet Archive and IIPC

• Heritrix: web crawler

• Wayback Machine: access tool for rendering and viewing files

• Nutch and NutchWAX: search engine

• ARC file: archival record format (ISO work item)

Wayback Machine
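
As a rough illustration of how archived pages are addressed for access, a Wayback Machine capture is commonly reached through an address that combines a capture timestamp with the original URL. The sketch below assumes the public `web.archive.org/web/<timestamp>/<url>` pattern; the helper name `wayback_url` and the sample date are hypothetical and do not come from the presentation.

```python
from datetime import datetime

# Assumed public prefix for Wayback Machine captures (not stated in the slides).
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a Wayback-style address asking for the capture nearest to `when`.

    Hypothetical helper for illustration only.
    """
    # Captures are addressed with a 14-digit YYYYMMDDhhmmss timestamp.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(wayback_url("http://www.archive.org/", datetime(2006, 4, 1)))
```

If no capture exists at exactly the requested moment, the service is generally expected to redirect to the nearest available one, so the timestamp acts as a hint rather than an exact key.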

Preservation

• Store multiple copies of each Archive

• 1300 machines/servers

• Multiple copies at different geographical locations (U.S., Alexandria, Amsterdam)

• Standard storage boxes, open source design

Archiving Next Steps

Institutions:

• Need to create collections around web material

• Want to dig deeper in crawls for their specific websites

• Want more control and access

• Want a technology partner that could harvest, index, access, store and preserve their collections for them

1. Partner Contract Crawls

• In 2002, began to form partnerships with the Library of Congress, NARA and other national libraries, including those of Australia, France and Italy

– Dedicated crawl engineer

– Customized crawling

• Library of Congress collections (sample):

– Iraq War: 450 million documents and growing

– 2004 U.S. National Elections: 88 million documents

– Supreme Court Nomination 2005: 100 million documents

2. Archive-It

• Last year, in early 2005, we had requests from state archivists, university librarians and other memory institutions:

– Develop an application for smaller institutions that have some resource constraints

– A web-based service that allows partners to create, manage, search and store their web archives

– User-friendly web interface

– Does not require technical expertise or infrastructure

• Pilot launched in September 2005

Pilot Partners

• Center for Research Libraries

• Research Libraries Group (U of Toronto, U of Indiana, Haverford and Swarthmore Colleges, IISH)

• University of Texas

• Library of Virginia

• State Archives South Dakota

• State Archives North Carolina

• State Archives Alabama

• Minnesota Historical Society

• Institut d'Etude Politique de Grenoble

Archive-It Access

• All collections are accessible for free to the general public, with text search, at:

– www.archive-it.org

– Partners' websites with links

• Plus, member web application with login

Screen shot here

• Public site

Test Drive the Application

Screen shots here

• Monitor page

• Reports page

• XML feed

• Search

– Your archived web pages are searchable by text or URL

• Stored Online

• We provide copies of the files on a hard drive that we can ship to your institution up to twice a year

Archive-It Releases

• 1.0 (February 8)

• 1.5 (April 19)

• 2.0 (July 29)

Challenges we face

• Making the collections useful for a variety of end users (i.e. general public, researchers)

• Making sure we capture the best and most relevant content

• Continuing to develop our tools for access and harvesting (crawler.archive.org)

Internet Archive’s priorities

• Collaboration and Partnerships

– Continue to act as a technology partner in providing web archiving services to government and memory institutions

– Continue to develop Open Source software

– Develop common tools, storage formats and standards through the IIPC (International Internet Preservation Consortium)

– Open Content Alliance (OCA) digital books project

• Multiple copies across the world

– Within IA’s own facilities and with partners such as LC, BnF, and the Library of Alexandria