Searching the Clouds
Presented by Kajal Miyan
Slides courtesy:
UC Berkeley RAD Lab
http://cis.poly.edu/westlab/odissea/
Above the Clouds
A Berkeley View of Cloud Computing
Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia
Seminar "Peer-to-peer Information Systems" 3
Outline
What is it? Why now?
Cloud killer apps
Economics
Challenges and opportunities
Implications
ODISSEA
What is Cloud Computing?
Old idea: Software as a Service (SaaS)
Def: delivering applications over the Internet
Recently: “[Hardware, Infrastructure, Platform] as a service”
Poorly defined, so we avoid all “X as a service”
Utility Computing: pay-as-you-go computing
Illusion of infinite resources
No up-front cost
Fine-grained billing (e.g. hourly)
Why Now?
Experience with very large datacenters
Advent of technologies like Web 2.0 and Google AdSense
Other factors:
Pervasive broadband Internet
Pay-as-you-go billing model
Standard software stack (Amazon VM)
Gray's analysis: keep the data near the application
Spectrum of Clouds
Instruction Set VM (Amazon EC2, 3Tera)
Bytecode VM (Microsoft Azure)
Framework VM (Google AppEngine, Force.com)
The spectrum runs from lower-level, less management (EC2, Azure) to higher-level, more management (AppEngine, Force.com)
Cloud Killer Apps
Interactive real-time applications
Rise of analytics
Extensions of desktop software (Matlab, Mathematica)
Parallel batch processing (Oracle at Harvard, Hadoop at NY Times)
Economics of Cloud Users
• Pay by use instead of provisioning for peak
Figure: demand and provisioned capacity over time, for a static datacenter (capacity fixed at peak demand) and a datacenter in the cloud (capacity tracks demand)
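The pay-by-use argument can be made concrete with a toy cost model. All numbers below are hypothetical, chosen only to illustrate bursty demand; they are not from the paper.

```python
# Hypothetical hourly demand (in servers) over one day; the burst around
# midday is what makes peak provisioning expensive.
demand = [10, 8, 6, 5, 5, 7, 12, 30, 55, 70, 80, 85,
          82, 78, 75, 70, 60, 50, 45, 40, 35, 25, 18, 12]

static_cost_per_server_hour = 1.0   # assumed amortized cost of owned capacity
cloud_cost_per_server_hour = 1.4    # assumed pay-as-you-go premium

# Static datacenter: provision (and pay for) peak demand every hour.
static_cost = max(demand) * len(demand) * static_cost_per_server_hour

# Cloud: pay only for the capacity actually used in each hour.
cloud_cost = sum(demand) * cloud_cost_per_server_hour

print(static_cost, cloud_cost)  # the cloud wins despite a 1.4x unit premium
```

Even with a 40% per-hour price premium, paying for the area under the demand curve beats paying for the peak rectangle whenever demand is bursty.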
Adoption Challenges
Challenge → Opportunity
Availability (DDoS) → multiple providers & datacenters
Data lock-in → standardization
Data confidentiality and auditability → encryption, VLANs, firewalls; geographical data storage
Growth Challenges
Challenge → Opportunity
Data transfer bottlenecks → FedEx-ing disks; data backup/archival
Performance unpredictability → improved VM support, flash memory, VM scheduling
Scalable storage → invent a scalable store
Bugs in large distributed systems → invent a debugger that relies on distributed VMs
Scaling quickly → invent an auto-scaler that relies on ML; snapshots
Policy and Business Challenges
Challenge → Opportunity
Reputation fate sharing → offer reputation-guarding services like those for email
Software licensing → pay-for-use licenses; bulk-use sales
Implications
Startups and prototyping
Cost associativity for scientific applications
Research at scale
ODISSEA: Open DIStributed Search Engine Architecture
A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval
Torsten Suel, Chandan Mathur, Jo-Wen Wu, Jiangong Zhang, Alex Delis, Mehdi Kharrazi, Xiaohui Long, Kulesh Shanmugasundaram
“If distributed systems can be used to search for aliens, then why not for words?” :D
Search Engine Revision
Figure: a classic crawl-based search engine. A crawler fetches pages (URL1–URL4) from the Web, an indexer builds the database, and the search engine answers a browser query (“Eggs?”) with ranked results: Eggs 90%, Eggo 81%, Ego 40%, Huh? 10% (e.g. the page “All About Eggs” by S. I. Am).
Motivation
Today, the main part of the web search infrastructure is supplied by only a few large crawl-based search engines
There has been strong research in the field of P2P systems over the last few years
This raises two issues:
The vast data in P2P networks requires the ability to search within these networks
The significant computing resources provided by a P2P system could be used to search content residing inside or outside the system
ODISSEA: a distributed global indexing and query execution service
Design Overview
ODISSEA differs from many other approaches to P2P search:
It assumes a two-layered search engine architecture
It has a global index structure distributed over the nodes of the system (in a global index, in contrast to a local index, a single node holds the entire inverted list for a particular term)
Two Layer Approach
The lower layer provides:
Maintenance of the global index structure under document insertions and updates
Handling of node joins and failures
Efficient execution of simple search queries
The upper layer interacts with the P2P-based lower layer via two classes of clients:
Update clients (e.g. crawler, web server)
Query clients (which implement an optimized query execution plan)
Figure: the WWW feeds a crawler (an update client); search servers submit queries to the ODISSEA lower layer.
Target Applications
Full-text search in large document collections located within P2P communities
Search in large intranet environments
Web search: a powerful API supports the anticipated shift towards client-based search tools that better exploit the resources of today's desktop machines
Two Layer Approach
Enables a large variety of (client-based) search tools that more fully exploit client computing resources.
Those tools could share the same lower-layer web search infrastructure.
Tools are developed against an open API that accesses the search infrastructure
When processing a query, this could in the most general case (i.e. where no pre-evaluation is done on the server side) result in large amounts of data being transferred to the query client
Global vs. Local index
posting = [DocID, position, additional information]
Local index: each node creates its own index for all documents that are locally stored
Global index: each node holds the complete global postings list for a certain group of words
Suppose a query “chair AND table”. With a global index, the node holding the complete list for “chair” and the node holding the list for “table” exchange and intersect their postings; with a local index, the query would have to be sent to every node.
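Under a global index, the two-term AND query boils down to intersecting two sorted docID lists after one of them has been shipped across the network. A minimal sketch of that merge-intersection (the docIDs below are made up):

```python
def intersect_postings(list_q0, list_q1):
    """Merge-intersect two sorted docID lists, as a node would do after
    receiving the other term's postings over the network."""
    result, i, j = [], 0, 0
    while i < len(list_q0) and j < len(list_q1):
        if list_q0[i] == list_q1[j]:
            result.append(list_q0[i])  # doc contains both terms
            i += 1
            j += 1
        elif list_q0[i] < list_q1[j]:
            i += 1
        else:
            j += 1
    return result

# Hypothetical postings for "chair" and "table"
print(intersect_postings([2, 5, 8, 13, 21], [3, 5, 13, 34]))  # → [5, 13]
```

The merge is linear in the combined list lengths, which is exactly why transmitting a long list to the other node (rather than scanning it) becomes the dominant cost in the distributed setting.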
Global vs. Local index
Local index organization is very inefficient in very large networks (e.g. the web) if result quality is the major concern, because the query has to be transmitted to all nodes and all of them have to respond (as the data is unclustered)
But in a global index organization, large amounts of data need to be transmitted between nodes when:
initially building the index (adding new nodes)
evaluating a query → bad response time
This can be overcome with smart algorithmic techniques
The choice depends on the types of queries and the frequency of document updates, as well as on how dynamic the system is
Crawling and Fault Tolerance
Crawling approach:
Non-P2P crawlers have the advantage that they can easily be altered in case some website operators have complaints about the bot
Smart crawling strategies beyond BFS are hard to implement in a P2P environment unless there is a centralized scheduler
P2P systems and fault tolerance:
The system design relies on the assumption of a fairly stable P2P environment, since otherwise administration (insert, update, replication) would be too expensive
Implementation Details
Currently implemented in Java, using Pastry as the P2P substrate (lower layer) and a DHT mapping for hashing IDs to the appropriate IP address (Pastry is an overlay and routing network for implementing a DHT; the key-value pairs are stored in a redundant P2P network of connected Internet hosts)
Each node runs an indexer that stores inverted lists in compressed form in a Berkeley DB
Using MD5, all documents and term lists are hashed to an 80-bit ID that is used for lookups in the system
Implementation Details
Parsing and Routing Postings
New or updated documents are parsed at the node where they reside, as determined by the DHT mapping
The parser generates for each term a posting that is routed via several intermediate nodes, as determined by the topology of the Pastry network, until it reaches its destination node
The index structure of a node is split into a small structure residing in main memory that is eventually merged with a bigger structure on disk, avoiding disk accesses during inserts/updates → lower amortized complexity
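The small-in-memory/large-on-disk split can be sketched as follows. This is a minimal sketch, not ODISSEA's implementation: the on-disk structure is faked with a dict, and `memory_limit` is an illustrative knob.

```python
class TwoLevelIndex:
    """Sketch of a two-level index: new postings go to a small in-memory
    structure, which is merged into the larger on-disk structure once it
    grows past a limit. The bulk merge amortizes the disk-write cost."""

    def __init__(self, memory_limit=4):
        self.memory = {}   # term -> recent postings (in-memory structure)
        self.disk = {}     # term -> sorted postings (stand-in for on-disk lists)
        self.memory_limit = memory_limit

    def insert(self, term, posting):
        self.memory.setdefault(term, []).append(posting)
        if sum(len(p) for p in self.memory.values()) >= self.memory_limit:
            self._merge()

    def _merge(self):
        # One sequential pass merges the memory structure into the disk
        # structure, instead of one random disk write per posting.
        for term, postings in self.memory.items():
            self.disk[term] = sorted(self.disk.get(term, []) + postings)
        self.memory.clear()

    def lookup(self, term):
        # A lookup must consult both levels.
        return sorted(self.disk.get(term, []) + self.memory.get(term, []))
```

Deferring the merge is what gives the lower amortized complexity the slide mentions: each posting is written to disk once per merge pass rather than once per insert.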
Implementation Details
Groups and Splits
Initially, all objects (documents, indexes) whose first w bits (here w=16) coincide are placed into a common group identified by this w-bit string
Locally, each group maintains a Berkeley DB with all objects it contains
When a group (of documents) becomes too large (here >1 GB), it is split into two groups identified by (w+1)-bit strings, leaving a stub structure pointing to the new groups, which are assigned to new nodes
If the index structures for a term become too large (here >100 MB), they are split into two lists according to the document IDs they contain
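The prefix-group scheme can be sketched directly. The deck says IDs are MD5-based and 80 bits wide; truncating the 128-bit MD5 digest is our assumption here, and the function names are illustrative.

```python
import hashlib

W = 16  # initial width of the group prefix, as in the deck

def object_id(name, bits=80):
    """Hash a document or term name to an 80-bit ID (assumption: the
    MD5 digest is truncated to the top `bits` bits)."""
    digest = int.from_bytes(hashlib.md5(name.encode("utf-8")).digest(), "big")
    return digest >> (128 - bits)

def group_label(obj_id, width=W, id_bits=80):
    """An object belongs to the group named by the first `width` bits of its ID."""
    return format(obj_id >> (id_bits - width), "0{}b".format(width))

def split_group(label):
    """A group that grows past the size limit is split into the two groups
    named by the (w+1)-bit extensions of its label."""
    return label + "0", label + "1"
```

Because the new labels extend the old one by a single bit, every object in the old group falls into exactly one of the two new groups without rehashing.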
Implementation Details
Replication
Performed at the group level by attaching “/0”, “/1”, etc. to the group label (e.g. 0100101/2)
This new label is then what is actually presented to Pastry/DHT during lookups
All replicas of a group form a “clique” that communicates periodically to update its status
If a group replica fails, the others are in charge of detecting this and, if necessary, performing repair
Each node can contain several distinct group replicas and therefore participate in several cliques
Postings are first routed to only one replica, which is then in charge of forwarding them to the others over a period of a few minutes
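The label scheme itself is a one-liner; a replication degree of 3 below is illustrative, not a figure from the paper:

```python
def replica_labels(group_label, degree=3):
    """Each replica of a group gets the group label with '/0', '/1', ...
    attached; the combined string is what is presented to the Pastry/DHT
    lookup, so the replicas hash to different nodes."""
    return ["{}/{}".format(group_label, i) for i in range(degree)]

print(replica_labels("0100101"))  # → ['0100101/0', '0100101/1', '0100101/2']
```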
Implementation Details
Faults, Unavailability and Synchronization
When a node leaves the system, its group replicas eventually have to be replaced to maintain the desired degree of replication
A node is considered failed if it has been unavailable for an extended period of time
New replicas are created for a failed node, or when a certain number of nodes are unavailable
Formerly unavailable nodes have to synchronize their index structures using logs of missed updates
Efficient Query Processing
Information-Theoretic Background
Let d be a document, q = q0 … q(m-1) a query consisting of m terms, and F a function that assigns to d (depending on q) a value F(d,q). Such a function is called a ranking function.
The top-k ranking problem for a query q is finding the k documents with the highest values F(d,q).
A common form of such a function is

F(d,q) = Σ_{i=0}^{m-1} f(d, q_i)

Since queries typically have at most 2 search terms, the following algorithms focus on the top-k ranking problem for queries with exactly 2 search terms (for one-term queries, there is in fact nothing to do)
Efficient Query Processing
Fagin's Algorithm (FA)
Intuitively, an item that is ranked at the top overall is likely to be ranked very high in at least one of the contributing subcategories
Assume a query q = q0 AND q1 and postings of the form (d, f(d,qi)) that are sorted by the second component, highest values on top
Also assume that the inverted lists for q0 and q1 are located on the same machine, so that no network communication is required
Goal: compute the top-k documents as fast as possible
Seminar "Peer-to-peer Information Systems" 32
Efficient Query Processing
Example lists (sorted by score):
A: (d1, 0.9), (d2, 0.8), (d4, 0.7), (d3, 0.69), (d5, 0.67)
B: (d8, 0.6), (d6, 0.5), (d5, 0.4), (d1, 0.3), (d3, 0.2), (d7, 0.1)
1. Scan both lists from the beginning, reading one element from each list in every step, until k documents have been encountered in both lists (here assume k=2)
2. Compute the scores of these k documents. Also, for each document encountered in only one of the lists, perform a lookup into the other list to determine the document's score.
3. Return the k documents with the highest score (here d1, d5)
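The three steps can be turned into a short sketch. The postings values below are illustrative (in the spirit of the deck's example), and a document absent from one list simply contributes 0 there:

```python
def fagin_top_k(list_a, list_b, k):
    """Fagin's Algorithm for a two-term query.
    list_a, list_b: postings [(doc_id, score), ...] sorted by score, descending.
    Returns the k (doc_id, total_score) pairs with the highest summed score."""
    a_scores, b_scores = dict(list_a), dict(list_b)  # random-access lookups
    seen_a, seen_b = set(), set()
    # Step 1: scan both lists in lockstep until k docs occur in both.
    for (da, _), (db, _) in zip(list_a, list_b):
        seen_a.add(da)
        seen_b.add(db)
        if len(seen_a & seen_b) >= k:
            break
    # Step 2: score every encountered doc, looking up the other list;
    # a doc missing from one list contributes 0 there.
    totals = {d: a_scores.get(d, 0.0) + b_scores.get(d, 0.0)
              for d in seen_a | seen_b}
    # Step 3: return the k docs with the highest total score.
    return sorted(totals.items(), key=lambda t: -t[1])[:k]

# Illustrative postings, sorted by score descending
A = [("d1", 0.9), ("d2", 0.8), ("d4", 0.7), ("d3", 0.69), ("d5", 0.67)]
B = [("d8", 0.6), ("d6", 0.5), ("d5", 0.4), ("d1", 0.3), ("d3", 0.2), ("d7", 0.1)]
print(fagin_top_k(A, B, 2))  # top-2 documents: d1 and d5
```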
Efficient Query Processing
Threshold Algorithm (TA)
Scan both lists simultaneously, reading (d, f(d,q0)) from the first and (d′, f(d′,q1)) from the second list
Compute t = f(d,q0) + f(d′,q1)
For each d in one of the lists, immediately perform a lookup into the other list to compute its complete score
The algorithm terminates when k documents have been found whose scores are at least as high as the current value of t
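A sketch of TA under the same conventions as before (illustrative postings; the standard "score at least t" stopping rule is used):

```python
def threshold_top_k(list_a, list_b, k):
    """Threshold Algorithm (TA) for a two-term query over two postings
    lists sorted by score descending. Each step reads one posting from
    each list and immediately completes that doc's total score via a
    lookup into the other list."""
    a_scores, b_scores = dict(list_a), dict(list_b)
    best = {}  # doc_id -> total score, for all docs seen so far
    for (da, sa), (db, sb) in zip(list_a, list_b):
        t = sa + sb  # threshold: best total any still-unseen doc could reach
        for d in (da, db):
            if d not in best:
                best[d] = a_scores.get(d, 0.0) + b_scores.get(d, 0.0)
        top = sorted(best.values(), reverse=True)[:k]
        # Terminate once k docs score at least as high as the threshold t.
        if len(top) == k and top[-1] >= t:
            break
    return sorted(best.items(), key=lambda x: -x[1])[:k]

A = [("d1", 0.9), ("d2", 0.8), ("d3", 0.7), ("d4", 0.69), ("d5", 0.67)]
B = [("d6", 0.6), ("d5", 0.5), ("d3", 0.4), ("d1", 0.3), ("d7", 0.2), ("d8", 0.1)]
print(threshold_top_k(A, B, 2))  # top-2: d1 and d5
```

Because t shrinks as the scan moves down the sorted lists, TA can often stop after reading only a short prefix of each list.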
Because it does not make sense to scan two lists simultaneously when they are distributed across a P2P network, the above techniques have to be adapted. This leads to the following protocol, which aims to minimize the amount of data transferred.
Efficient Query Processing
A simple distributed pruning protocol (DPP)
Node A (holding the shorter list) sends the first x postings to node B. Let rmin be the smallest value f(d,q0) transmitted.
Node B receives the postings from A and performs lookups into its own list to compute the total scores. It retains the k documents with the highest scores; let rk be the smallest score among these.
Node B then transmits to A all postings among its first x postings with f(d,q1) > rk − rmin, together with the total scores of the k documents from the previous step.
Node A performs lookups into its own list for the postings received from B and determines the overall top-k documents.
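The four steps can be sketched end to end; the network transfers are simulated with plain local variables, and the list values match the deck's worked DPP example:

```python
def dpp_top_k(list_a, list_b, k, x):
    """Distributed pruning protocol (DPP) sketch for "q0 AND q1".
    Node A holds the shorter list; the protocol's network transfers are
    simulated here as local variables."""
    a_scores, b_scores = dict(list_a), dict(list_b)

    # Step 1: A sends its first x postings to B; rmin is the smallest
    # transmitted value f(d, q0).
    sent = list_a[:x]
    r_min = min(s for _, s in sent)

    # Step 2: B completes the scores of the received postings and keeps
    # the k best; rk is the smallest of those k totals.
    totals = {d: s + b_scores.get(d, 0.0) for d, s in sent}
    top_at_b = sorted(totals.items(), key=lambda t: -t[1])[:k]
    r_k = top_at_b[-1][1]

    # Step 3: B returns every posting among its own first x postings with
    # f(d, q1) > rk - rmin, plus the k partial totals from step 2.
    returned = [(d, s) for d, s in list_b[:x] if s > r_k - r_min]

    # Step 4: A completes the returned postings against its own list and
    # picks the overall top k.
    candidates = dict(top_at_b)
    for d, s in returned:
        candidates.setdefault(d, s + a_scores.get(d, 0.0))
    return sorted(candidates.items(), key=lambda t: -t[1])[:k]

A = [("d1", 0.9), ("d2", 0.8), ("d3", 0.7), ("d4", 0.69), ("d5", 0.67)]
B = [("d6", 0.6), ("d5", 0.5), ("d3", 0.4), ("d1", 0.3), ("d7", 0.2), ("d8", 0.1)]
print(dpp_top_k(A, B, k=2, x=3))  # top-2: d1 and d5
```

Only the pruned B-postings and the k partial totals cross the network in step 3, which is the point of the pruning bound rk − rmin.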
Efficient Query Processing
DPP example for k=2 and x=3:
A, containing term q0: (d1, 0.9), (d2, 0.8), (d3, 0.7), (d4, 0.69), (d5, 0.67)
B, containing term q1: (d6, 0.6), (d5, 0.5), (d3, 0.4), (d1, 0.3), (d7, 0.2), (d8, 0.1)
A → B: (d1, 0.9), (d2, 0.8), (d3, 0.7); rmin = 0.7
B computes: (d1, 0.9 + 0.3), (d2, 0.8 + ----), (d3, 0.7 + 0.4); rk = 1.1; rk − rmin = 1.1 − 0.7 = 0.4
B → A: (d6, 0.6), (d5, 0.5), because f(d,q1) > 0.4, together with (d1, 1.2), (d3, 1.1)
A computes: (d6, 0.6 + ----), (d5, 0.5 + 0.67)
Top 2 documents: 1. (d1, 1.2), 2. (d5, 1.17)
Efficient Query Processing
Problems with the DPP
Works only for queries containing 2 search terms
Random lookups can cause disk accesses, since large index structures reside on hard disk → bad response time
How should the value of x be chosen? (x should be the number of postings transmitted by A and B such that the DPP works correctly without an extra round trip; it depends on k and on the lengths of the inverted lists)
By deriving appropriate formulae based on extensive testing
By sampling-based methods that estimate the number of documents appearing in both lists
Experimental Results
Efficient Query Processing
Evaluation of DPP 900 two-term queries selected form a set of over 1 million Testing corpora: 120 million web pages (1.8TB) that were
crawled by their own crawler Value of x determined by experiments on TA Computation within nodes are not taken into account Commmunication costs and estimated times of DPP for the
top-10 documents and standard cosine measure:shortest 20% shorter 20% middle 20% longer 20% longest 20%
Shorter lists 10.401 63.853 222.948 666.717 3.371.176 # postings A B 2.057 4.083 2.904 4.417 3.745 # postings B A 1.486 4.084 2.891 4.413 3.745 Total bytes transferred 28.344 65.336 46.360 70.640 59.920 Total com time (400Kbps) 1.052 1.477 1.216 1.550 1.405 Total com time (2Mbps) 833 1.368 1.107 1.441 1.295
Future Work
Bloom filters
New algorithmic techniques for the index synchronization problem
New strategies for load balancing and rebuilding of lost replicas
More experimental evaluation concerning different types of queries
Questions?
Can we use this architecture to solve our hardware and processing problems?
How much data and programming parallelization will be needed to make this possible?