Web Archiving and Access

Mike Smorul, Joseph JaJa
ADAPT Group, University of Maryland, College Park

Web Archive Storage and Search

• Management tools

• Storage infrastructure

• Indexing, searching, compression experiments

7/21/2010 NDIIPP Partners Meeting

Webarc Manager

• Develop a tool to help manage webarc collections

• Show statistics of a series of crawls

• Open API to easily query collection
– List all copies of a page, etc.
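
A query such as "list all copies of a page" can be pictured with a minimal in-memory sketch. This is not the Webarc Manager API: the `Capture` record, its field names, the timestamp format, and the `CaptureIndex` class are all hypothetical, chosen only to illustrate the idea of looking up every crawled copy of one URL.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """One crawled copy of a page (hypothetical record layout)."""
    url: str
    timestamp: str  # crawl time as a sortable string, e.g. "20100115120000" (assumed format)
    archive: str    # name of the archive file holding the record (assumed)
    offset: int     # byte offset of the record inside that file (assumed)


class CaptureIndex:
    """In-memory stand-in for an index of archive entries."""

    def __init__(self):
        self._by_url = defaultdict(list)

    def add(self, capture):
        self._by_url[capture.url].append(capture)

    def list_copies(self, url):
        """Return every capture of `url`, oldest first."""
        return sorted(self._by_url.get(url, []), key=lambda c: c.timestamp)
```

Sorting by timestamp on read keeps `add` cheap; an index backed by a database would more likely keep the ordering in the index itself.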

Manager Components

• WarcManager (server)
– REST-based access to index
– Index of DAT/ARC entries
– URL searching, ARC browsing
• Javascript Client
• Simple Web-Accessible Preservation (SWAP)
– Web-accessible distributed storage
– ARC page retrieval
– 1 Gbps, 2200 requests/s

Manager Design

[Diagram: Javascript Client, Webarc Manager, MySQL, and two Storage Servers; connections labeled REST/JSON and Page Request]

Manager Screenshots

[Screenshots: Search Results, URL Details, WARC File Details]

Storage Design

• SWAP: Simple, Web-Accessible Preservation
• Intelligent placement of files across multiple servers and disk partitions
• Simple HTTP access: PUT, GET, DELETE
• Use redirects to provide a uniform namespace
• Files organized into file groups
– Each group resides on multiple partitions (slices)
– Hash(file_path) % slices = partition
• No centralized catalog
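
The placement rule Hash(file_path) % slices = partition can be sketched as below. The slide does not name the hash function, so MD5 is used here purely as an assumed stand-in, and `partition_for` is a hypothetical helper name. A stable hash is used because Python's built-in `hash()` for strings is salted per process and would give different placements on different servers.

```python
import hashlib


def partition_for(file_path: str, slices: int) -> int:
    """Map a file path to a partition number in range(slices).

    MD5 is an assumption; any hash that every server computes
    identically would serve the same purpose.
    """
    if slices <= 0:
        raise ValueError("slices must be positive")
    digest = hashlib.md5(file_path.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce.
    return int.from_bytes(digest[:8], "big") % slices
```

Because the partition is computed from the path alone, any server can locate a file without consulting a central catalog.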

How it works

[Diagram: a Client, Server 1 and Server 2, and partitions P1 to P4]

• Client: GET ‘bag1/warc55.gz’
• Server: calculate hash(bag1/warc55.gz) % 4 = 3
• Server: return 302: Server 2
• Client: GET ‘bag1/warc55.gz’ from Server 2
• Server 2: return warc55.gz
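
The request flow on this slide can be simulated in memory as a rough sketch. Nothing here is SWAP code: the `Server` class, the `put` and `fetch` helpers, the MD5 hash, and the 0-based partition numbers are assumptions made for illustration, and real servers would speak HTTP rather than return tuples.

```python
import hashlib

SLICES = 4  # the slide's example uses 4 partitions


def partition_for(path, slices=SLICES):
    # MD5 is an assumed stand-in for the unspecified hash function.
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % slices


class Server:
    """A storage server that owns some partitions (hypothetical model)."""

    def __init__(self, name, partitions):
        self.name = name
        self.store = {p: {} for p in partitions}  # partition -> {path: bytes}

    def get(self, path, owners):
        p = partition_for(path)
        if p not in self.store:
            # Not ours: redirect the client to the owning server.
            return 302, owners[p].name
        data = self.store[p].get(path)
        return (200, data) if data is not None else (404, None)


def owners_of(servers):
    """Build partition -> Server from each server's own partition list."""
    return {p: s for s in servers.values() for p in s.store}


def put(servers, path, data):
    p = partition_for(path)
    owners_of(servers)[p].store[p][path] = data


def fetch(servers, first, path):
    """GET `path` from server `first`, following at most one redirect."""
    owners = owners_of(servers)
    status, payload = servers[first].get(path, owners)
    trace = [(first, status)]
    if status == 302:
        target = payload
        status, payload = servers[target].get(path, owners)
        trace.append((target, status))
    return status, payload, trace
```

A client may contact any server first; the returned `trace` records whether the request was answered directly or after one 302 hop.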

Performance

• Good small file and large file performance
– Over 2000 requests/s and 3000 redirects/

Time Machine for the Web

• Fast parallel indexer to handle large scale crawled web contents, coupled with a new compression scheme.

• Fast search of contents based on unstructured queries involving temporal specifications.

• Presentation of pertinent summary information in ranked order according to the temporal context.

Parallel Index Construction

• A parallel and hybrid strategy of multi-core CPU and multi-GPU

• Processing speed on a single node: 80 MB/s

[Diagram: Disk (ARC File) → Parallel CPU Parser (six CPU Parsers) → Parallel CPU/GPU Indexer (two CPU Indexers, two GPU Indexers) → Disk (Dictionary & Postings Lists)]
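
The parse-then-index pipeline in the diagram can be sketched on the CPU alone. This is only an illustration of the data flow: the GPU indexers are not reproduced, a thread pool stands in for the parallel parsers, and the tokenizer, the `parse` and `build_index` names, and the `(doc_id, text)` input format are assumptions.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

TOKEN = re.compile(r"[a-z0-9]+")


def parse(doc):
    """Parser stage: turn one (doc_id, text) pair into its distinct terms."""
    doc_id, text = doc
    return doc_id, sorted(set(TOKEN.findall(text.lower())))


def build_index(docs, workers=4):
    """Run parsers in parallel, then merge their output into postings lists.

    Returns a dictionary mapping each term to a sorted list of doc ids.
    """
    postings = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order.
        for doc_id, terms in pool.map(parse, docs):
            for term in terms:
                postings[term].append(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}
```

In CPython a thread pool gives little speedup for pure-Python tokenizing; it is used here only to keep the sketch short and portable.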

Index Compression

• A combination of
– integer compression scheme and
– multi-versioned document posting w/ temporal information included
• Potential space savings of temporal information for frequent terms compared to full temporal information: 50%.
• Potential space saving from integer compression: 4B vs 1B for small integers
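
The "4B vs 1B for small integers" saving can be illustrated with variable-byte encoding of document-id gaps. The slides do not say which integer compression scheme was used, so this is an assumed, commonly used technique rather than the authors' method; the function names are hypothetical.

```python
def to_gaps(sorted_ids):
    """Replace sorted doc ids by differences, which tend to be small."""
    gaps, prev = [], 0
    for doc_id in sorted_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 payload bits per byte.

    The high bit is set on the last byte of each number, so values
    below 128 take a single byte instead of a fixed 4 bytes.
    """
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("only non-negative integers are supported")
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # last byte of this number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers
```

Frequent terms have dense postings lists and therefore small gaps, which is where a scheme like this saves the most space.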

Temporally Anchored Information Retrieval

• Given a temporally anchored query, return a ranked set of pages within a temporal context.

• A new approach that:
– Substantially limits the search space
– Efficiently generates temporally ranked pages
• Extensive experimental results.
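
To make "temporally anchored query" concrete, the sketch below filters multi-versioned postings by a time window before ranking. It is not the approach presented in the talk: the `Posting` layout, the integer date format, and the term-frequency score are all hypothetical and chosen only to show how a temporal context narrows the candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One version of a document in a term's postings list (assumed layout)."""
    doc_id: str
    start: int  # first date this version was seen, e.g. 20100101 (assumed format)
    end: int    # last date this version was seen
    tf: int     # term frequency in this version


def search(index, terms, t_from, t_to):
    """Rank document versions whose validity interval overlaps [t_from, t_to].

    `index` maps a term to a list of Posting objects. The score is the
    summed term frequency, a deliberately simple placeholder.
    """
    scores = {}
    for term in terms:
        for p in index.get(term, []):
            if p.start <= t_to and p.end >= t_from:  # interval overlap test
                key = (p.doc_id, p.start)
                scores[key] = scores.get(key, 0) + p.tf
    # Highest score first; ties broken by (doc_id, start) for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Versions outside the window never enter the scoring loop, which is one simple way a temporal constraint can limit the search space.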

Additional Information

• http://adapt.umiacs.umd.edu
– Papers, results, etc.

• E-mail: [email protected]
