1 WebBase : A repository of web pages Jun Hirai Sriram Raghavan Hector Garcia-Molina Andreas Paepcke...

1

WebBase: A repository of web pages

Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, Andreas Paepcke

Computer Science Department, Stanford University

By: Maria Fragouli, Athens 2002

2

Web repository: stores and manages large collections of web pages, and is used by applications that access, mine, or index up-to-date web content.

Basic implementation goals:
- Scalability: use network disks to hold the repository, so that it can scale with the growth of the web
- Streams: support a streaming (ordered) access mode, as opposed to random access, for bulk page requests, as opposed to requests for individual pages
- Large updates: new, updated versions of pages must efficiently replace older ones
- Expunging pages: obsolete pages need to be detected and removed

3

We study:
- Repository architecture that provides the required functionality and performance
- Distribution policies of web pages across network disks
- Interaction between the crawler and the repository
- Organization strategies for web pages on system nodes
- Experimental results of simulations on the prototype

WebBase: prototype repository – Stanford University

Design assumptions for the web repository:
- Incremental crawler: only new or changed web pages are visited at each run
- Retain only the latest version of each page
- Crawl and store only HTML pages
- Snapshot index construction

4

WebBase Architecture I: Functional modules and their interaction

5

WebBase Architecture II: Functional modules and their interaction

- Crawler module: retrieves new or updated copies of web pages
- Storage module: assigns pages to storage devices, handles updates of pages, schedules and services requests, etc.
- Metadata-Indexing module: indexes pages and the metadata extracted from them
- Query engine and Multicast module: handle web content according to the access mode used on pages

6

Access modes:
- Random access: pages are retrieved using their URL
- Query-based access: pages are retrieved as responses to queries on page metadata or textual content (handled by the query engine)
- Streaming access: pages are retrieved and delivered as a data stream to requesting applications (handled by the multicast module)

Streams are available not only locally but also to remote applications. Streams are restartable: they can be paused and resumed at will.

Page identifier

The page URL is first normalized:
- Removal of the protocol prefix
- Removal of the port number specification
- Conversion of the server name to lower case
- Removal of all trailing slashes ("/")

The resulting text string is hashed using a signature computation to yield a 64-bit page identifier (signature collisions are unlikely to occur).
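The normalization and hashing steps can be sketched as follows. This is a minimal illustration, not the WebBase implementation: the slides do not name the signature function, so SHA-256 truncated to 64 bits is used here purely as a stand-in, and the function names `normalize_url` and `page_id` are hypothetical. The sketch assumes absolute URLs that carry a protocol prefix.

```python
import hashlib
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Apply the four normalization steps to an absolute URL."""
    parts = urlsplit(url)
    # urlsplit().hostname drops the port and lower-cases the server name.
    host = parts.hostname or ""
    query = f"?{parts.query}" if parts.query else ""
    # Leaving out parts.scheme removes the protocol prefix.
    text = f"{host}{parts.path}{query}"
    return text.rstrip("/")  # remove all trailing slashes


def page_id(url: str) -> int:
    """Hash the normalized URL to a 64-bit identifier.

    Assumption: SHA-256 truncated to 8 bytes stands in for the
    unspecified signature computation.
    """
    digest = hashlib.sha256(normalize_url(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


print(normalize_url("http://WWW.Stanford.EDU:80/dir/"))  # www.stanford.edu/dir
```

Two URLs that differ only in protocol, port, host-name case, or trailing slashes map to the same identifier under this sketch.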

7

Storage Manager (SM)
- Stores only the latest versions of web pages and provides facilities for their access and update
- Consistency of indexes must be dealt with
- Expunging of obsolete pages is assisted by the allowed lifetime and lifetime count values associated with each page

For scalability:
- SM is distributed across a collection of storage nodes
- Storage nodes are coordinated by a central node management server
- The node management server keeps a table of parameters describing the current state of each storage node (node capacity, extent of node fragmentation, state, number of requests)

[Figure labels: Crawler; Stream requests; Random access requests; Node mgmt server; LAN]

8

Design issues for SM – I. Page Distribution across nodes

Two distribution policies:
- Uniform distribution: all nodes are treated identically
- Hash distribution: a page is stored on the node whose range of identifiers includes the page identifier

Uniform vs Hash distribution

Uniform distribution:
- Requires a global index (mapping of pageID -> nodeID)
- Simple node addition
- More robust to failures

Hash distribution:
- Sparse global index (fixed pageID-nodeID relationship)
- Need for “extensible hashing”
- Special recovery measures required
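The difference between the two policies can be illustrated with a small sketch. The class names are hypothetical, the equal-width identifier ranges are an assumption, and round-robin placement is only one possible way to treat all nodes identically; the slides do not specify how the uniform policy chooses a node.

```python
from bisect import bisect_right


class HashDistribution:
    """Each node owns a fixed, contiguous range of page identifiers."""

    def __init__(self, num_nodes: int, id_bits: int = 64):
        span = 2 ** id_bits
        # Exclusive upper bound of each node's range (equal widths assumed).
        self.bounds = [span * (i + 1) // num_nodes for i in range(num_nodes)]

    def node_for(self, page_id: int) -> int:
        # No per-page index is needed: the range boundaries decide the node.
        return bisect_right(self.bounds, page_id)


class UniformDistribution:
    """Any node may hold any page, so a global pageID -> nodeID index is kept."""

    def __init__(self, num_nodes: int):
        self.num_nodes = num_nodes
        self.global_index = {}
        self._next = 0  # round-robin cursor (an assumption of this sketch)

    def store(self, page_id: int) -> int:
        if page_id not in self.global_index:
            self.global_index[page_id] = self._next
            self._next = (self._next + 1) % self.num_nodes
        return self.global_index[page_id]

    def node_for(self, page_id: int) -> int:
        return self.global_index[page_id]
```

Adding a node under `UniformDistribution` only changes `num_nodes`, while `HashDistribution` would have to recompute its ranges and move pages, which is where the need for extensible hashing comes from.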

9

Design issues for SM – II. Organization of pages on disk

i. Hash-based organization
- Each disk is treated as a collection of hash buckets
- Each bucket covers a range of pageIDs, and a page is stored in the bucket whose range contains its pageID
- Bucket overflows are handled by allocating extra overflow buckets

We assume that:
-buckets with successive ranges of pageIDs are physically contiguous on disk
-pages are stored in the buckets in increasing order of their IDs

How the fundamental operations are performed:
- Random page access: identify the containing bucket -> read it into memory -> search main memory to locate the page
- Streaming: sequentially read buckets into memory -> transmit pages to the client
- Page addition: add pages to buckets in memory (in pageID order or not) -> write the modified buckets to disk
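The three operations on a hash-organized disk can be sketched in memory as follows. This is a simplified model under stated assumptions: buckets are Python lists rather than disk blocks, all buckets cover equal-width ranges, overflow buckets are left out, and the class name `HashBucketDisk` is hypothetical.

```python
from bisect import bisect_left


class HashBucketDisk:
    """In-memory model of one disk split into range-partitioned buckets."""

    def __init__(self, lo: int, hi: int, num_buckets: int):
        self.lo = lo
        # Width of each bucket's pageID range, rounded up.
        self.width = (hi - lo + num_buckets - 1) // num_buckets
        # Each bucket is a list of (page_id, content) kept sorted by page_id.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, page_id: int) -> list:
        return self.buckets[(page_id - self.lo) // self.width]

    def add(self, page_id: int, content: str) -> None:
        bucket = self._bucket(page_id)
        i = bisect_left(bucket, (page_id,))
        if i < len(bucket) and bucket[i][0] == page_id:
            bucket[i] = (page_id, content)  # newer version replaces the old one
        else:
            bucket.insert(i, (page_id, content))

    def get(self, page_id: int):
        # Random access: pick the bucket, then search it in memory.
        bucket = self._bucket(page_id)
        i = bisect_left(bucket, (page_id,))
        if i < len(bucket) and bucket[i][0] == page_id:
            return bucket[i][1]
        return None

    def stream(self):
        # Streaming: read buckets in order, so pages come out sorted by ID.
        for bucket in self.buckets:
            yield from bucket
```

Because successive buckets cover successive ID ranges and each bucket is sorted, `stream()` delivers pages in increasing pageID order without any extra sorting.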

10

ii. Log-based organization

New pages received are appended at the end of the log.

Basic objects on disk:
- Log: holds the pages stored on the disk
- Catalog: contains one entry per page in the log with useful information (pageID, pointer to the physical location of the page in the log, page size, page status, timestamp of page addition)
- B-tree index: present when the random access mode is supported

How the fundamental operations are performed:
- Random page access: requires two disk accesses
- Streaming: read the log sequentially for valid pages
- Page addition: pages are appended to the log; catalog and B-tree modifications are periodically flushed to disk

[Figure: pages appended to the Log; Log and Catalog on disk]
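A log-structured node with its catalog can be modeled with the sketch below. Several simplifications are assumptions of this example rather than details from the slides: the log is an in-memory `bytearray`, a plain dictionary stands in for the B-tree index, and an older version of a page is marked invalid as soon as a newer one is appended. The names `LogStore` and `CatalogEntry` are hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    page_id: int
    offset: int      # physical location of the page in the log
    size: int
    valid: bool      # page status
    added_at: float  # timestamp of page addition


class LogStore:
    def __init__(self):
        self.log = bytearray()
        self.catalog = []  # one entry per page written to the log
        self.index = {}    # stand-in for the B-tree: page_id -> catalog position

    def append(self, page_id: int, content: bytes) -> None:
        old = self.index.get(page_id)
        if old is not None:
            self.catalog[old].valid = False  # assumption: invalidate on append
        entry = CatalogEntry(page_id, len(self.log), len(content), True, time.time())
        self.log += content
        self.index[page_id] = len(self.catalog)
        self.catalog.append(entry)

    def get(self, page_id: int):
        # Random access: one index lookup, then one read from the log.
        pos = self.index.get(page_id)
        if pos is None:
            return None
        e = self.catalog[pos]
        return bytes(self.log[e.offset:e.offset + e.size])

    def stream(self):
        # Streaming: scan in log order and skip invalid pages.
        for e in self.catalog:
            if e.valid:
                yield e.page_id, bytes(self.log[e.offset:e.offset + e.size])
```

Appending never moves existing data, which is consistent with the high page addition rate reported for log-structured nodes, while `stream()` returns pages in arrival order rather than sorted by pageID.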

11

Design issues for SM – III. Update Schemes

Classification of pages in the repository:
- Class A: old versions of pages that will be replaced
- Class B: unchanged pages
- Class C: unseen pages, or new versions of pages that will replace class A pages

General update process:
1. Receive class C pages from the crawler and add them to the repository
2. Rebuild all the indexes using the class B and C pages
3. Delete the class A pages

Suggested update strategies:
i. Batch update
ii. Incremental update
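The page classes and the general update process can be illustrated with a small sketch. It models the repository as a dictionary from pageID to content and the index as a sorted list of pageIDs; both are simplifications chosen for this example, and the function name `apply_update` is hypothetical.

```python
def apply_update(repository: dict, crawled: dict):
    """Apply one crawl to the repository.

    repository: page_id -> content currently stored
    crawled:    page_id -> content delivered by the crawler (class C)
    Returns (updated repository, rebuilt index, set of class A page IDs).
    """
    # Class A: stored pages for which the crawler delivered a newer version.
    class_a = {pid for pid in repository if pid in crawled}
    # Class B: stored pages that the crawl left unchanged.
    class_b = {pid for pid in repository if pid not in crawled}

    # Step 1: add the class C pages alongside the unchanged class B pages.
    updated = {pid: repository[pid] for pid in class_b}
    updated.update(crawled)

    # Step 2: rebuild the index from class B and class C pages only.
    index = sorted(updated)

    # Step 3: class A pages are deleted simply by not being carried over.
    return updated, index, class_a
```

For example, with `{1: "a1", 2: "b1"}` stored and `{2: "b2", 3: "c1"}` crawled, page 2 is the only class A page, page 1 is class B, and the new repository holds pages 1, 2, and 3.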

12

i. Batch update scheme

Two sets of storage nodes: update nodes (hold class C pages) and read nodes (hold class A and B pages).

Steps followed:
1. System isolation
2. Page transfer
3. System restart

13

Examples of page transfer in the batch update scheme

1. Log-structured page organization and Hash distribution policy on both sets of nodes

Deletion of class A pages requires a separate step

[Figure: Crawler, 4 update nodes, 12 read nodes; class C pages are transmitted as streams and distributed by their page ID]

14

2. Hash-based page organization and Hash distribution policy on both sets
- Deletion of class A pages occurs while class C pages are added
- This addition is performed using merge sort

Advantages: no conflicts occur, and the physical location of pages is not changed (the compaction operation is part of the update).
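Because both the read nodes and the incoming class C stream are ordered by pageID, adding new pages and deleting class A pages can happen in a single merge pass. The sketch below shows that merge step on plain sorted lists; the function name `merge_update` and the in-memory representation are assumptions of this example.

```python
def merge_update(existing: list, incoming: list) -> list:
    """Merge two lists of (page_id, content), both sorted by page_id.

    existing: pages on a read node (class A and class B)
    incoming: class C pages from an update node
    When both lists contain a page ID, the incoming version wins.
    """
    merged = []
    i = j = 0
    while i < len(existing) and j < len(incoming):
        if existing[i][0] < incoming[j][0]:
            merged.append(existing[i])  # class B: unchanged page is kept
            i += 1
        elif existing[i][0] > incoming[j][0]:
            merged.append(incoming[j])  # class C: previously unseen page
            j += 1
        else:
            merged.append(incoming[j])  # class C replaces the class A page
            i += 1
            j += 1
    merged.extend(existing[i:])
    merged.extend(incoming[j:])
    return merged
```

The output is again sorted by pageID, so it can be written back sequentially, which is why compaction comes as part of the update rather than as a separate step.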

15

ii. Incremental update scheme

All nodes are equally responsible for supporting both page update and page access at the same time, which allows continuous service provision.

Drawbacks of continuous service:
- Performance penalty due to conflicts between various operations
- The local index must be maintained dynamically
- Restartable streams are more complicated:
  -in batch update systems, the pair (Node-id, Page-id) provides sufficient information about the stream state
  -in incremental update systems, where physical locations of pages may change, additional stream state information is required
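For the batch update case, a restartable stream that checkpoints only the pair (Node-id, Page-id) can be sketched as below. The sketch assumes that nodes are visited in node-id order and that each node delivers its pages in increasing pageID order, which holds for the hash-based organization but is an assumption here; the function name `stream_pages` is hypothetical.

```python
def stream_pages(nodes: list, resume_after=None):
    """Yield ((node_id, page_id), content) for every page in the repository.

    nodes:        list indexed by node id; each item is a list of
                  (page_id, content) sorted by page_id
    resume_after: (node_id, page_id) of the last page already delivered,
                  or None to start from the beginning
    """
    start_node, last_id = resume_after if resume_after else (0, None)
    for node_id in range(start_node, len(nodes)):
        for page_id, content in nodes[node_id]:
            # On the node where the stream was paused, skip delivered pages.
            if node_id == start_node and last_id is not None and page_id <= last_id:
                continue
            yield (node_id, page_id), content
```

A client pauses by remembering the last key it received and resumes by passing it back:

```python
nodes = [[(1, "a"), (5, "b")], [(7, "c"), (9, "d")]]
stream = stream_pages(nodes)
next(stream)
state, _ = next(stream)          # state is (0, 5)
rest = [content for _, content in stream_pages(nodes, resume_after=state)]
print(rest)                      # ['c', 'd']
```

This works only while pages stay where they are between pause and resume, which is why incremental update systems need additional stream state.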

16

Experiments

WebBase prototype SM’s configuration: Batch[U(hash, log), R(hash, log)]
- Batch update strategy
- Hash page distribution for both update and read nodes
- Log-structured page organization in both sets of nodes

Setup:
- Implemented on top of a standard Linux FS
- SM is fed 50-100 pages/sec from an incremental crawler
- Uses a cluster of PCs connected by a 100 Mbps Ethernet LAN
- A client module that requests access to the repository and a crawler emulator that retrieves and transmits pages accordingly are also implemented

Performance metrics:
- Page addition rate (pages/sec/node)
- Streaming rate (pages/sec/node)
- Random access rate (pages/sec/node)
- Batch update time (in the case of batch update systems)

17

Choosing a hash bucket size

-If on average 16 pages are kept per bucket, a hash bucket size of 64 KB must be chosen, and the average random page access time is then 20.7 ms (optimal point, plot A)
-As buckets grow, space utilization and streaming performance improve, but random access suffers

[Plots: Optimal hash bucket size; Space-performance tradeoff]

18

Comparing different systems

Hashed-log hybrid node organization: the disk contains a number of large logs (8-10 MB), each one associated with a range of hash values.

| Performance metric | Log-structured (pages/sec) | Hash-based (pages/sec) | Hashed-log (pages/sec) |
|---|---|---|---|
| Streaming rate and ordering | 6300, unsorted | 3900, sorted | 6300, sorted |
| Random page access rate | 35 | 51 | 35 |
| Page addition rate (random order, no buffering) | 6100 | 23 | 53 |
| Page addition rate (random order, 10 MB buffer) | 6100 | 35 | 660 |
| Page addition rate (sorted order, 10 MB buffer) | 6100 | 1300 | 1300 |

19

Comparing different configurations

| System configuration | Page addition rate [pages/sec/node] | Batch update time (update ratio = 0.25) |
|---|---|---|
| Batch[U(hash, log), R(hash, hash)] | 6100 | 11700 secs |
| Batch[U(hash, hash), R(hash, hash)] | 35 | 1260 secs |
| Batch[U(hash, hashed-log), R(hash, hash)] | 660 | 1260 secs |

Assumption: 25% of the pages on read nodes are replaced by newer versions during the update process (update ratio = 0.25).

20

Experiments on overall system performance of prototype

Performance and batch update time of prototype

| Performance metric | Observed value |
|---|---|
| Streaming rate | 2800 pages/sec (per read node) |
| Page addition rate | 3200 pages/sec (per update node) |
| Batch update time | 2451 seconds (for update ratio = 0.25) |
| Random page access rate | 33 pages/sec (per read node) |

21

Summary - Relative performance of different system configurations

| System configuration | Stream | Random access | Page addition | Update time |
|---|---|---|---|---|
| Incr [hash, log] | + | - | -- | inapplicable |
| Incr [uniform, log] | + | -- | + | inapplicable |
| Incr [hash, hash] | + | + | - | inapplicable |
| Batch [U(hash, log), R(hash, log)] | ++ | - | ++ | +- |
| Batch [U(hash, log), R(hash, hash)] | + | + | ++ | -- |
| Batch [U(hash, hash), R(hash, hash)] | + | + | - | + |
| Batch [U(hash, hashed-log), R(hash, hash)] | + | + | +- | + |

Ordering of symbols, from the most to the least favorable: ++, +, +-, -, --

22

Conclusions

We provided an overview of:
- The WebBase prototype architecture
- Performance metrics based on simulation experiments
- WebBase as a research test-bed for various system configurations

Future enhancements to WebBase include:
- Implementation of advanced system configurations
- Development of advanced streaming facilities (e.g. delivering streams for subsets of the web pages in the repository)
- Integration of a history-maintenance service for old, replaced web pages
