ODISSEA: a Peer-to-Peer Architecture for Scalable Web Search and IR Torsten Suel with C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, K. Shanmugasunderam CIS Department Polytechnic University Brooklyn, NY 11201 http://cis.poly.edu/westlab/odissea/ (google: “odissea peer”)
  • date posted: 20-Dec-2015


Page 1: ODISSEA: a Peer-to-Peer Architecture for Scalable Web Search and IR

Torsten Suel

with C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, K. Shanmugasunderam

CIS Department, Polytechnic University, Brooklyn, NY 11201

http://cis.poly.edu/westlab/odissea/ (google: “odissea peer”)

Page 2: Talk Outline

• ODISSEA: architecture, motivation, ideology
  - system design
  - discussion of design choices
  - our vision: open distributed web search architecture
• Distributed query processing
  - query execution in large search engines
  - efficient distributed top-k queries
  - experimental results
• Open problems and future work

Page 3: Introduction

• huge amount of work on web search
• huge amount of activity in P2P
• so, how about P2P (full text) search?
  - to query content in P2P networks
  - to query content located outside P2P network
• current engines based on scalable PC clusters
• so are many other “giant scale services”
• we know how to do file sharing in P2P
• how about search engines and large-scale IR?

Page 4: ODISSEA: “Open DIStributed Search Engine Architecture”

• global indexing and query execution service
  - scalable to size of the web
  - scalable to large query load
  - highly robust
  - open

Page 5: Global index organization

• avoids broadcasting query to all nodes
• faces other problems: updates, long inverted lists
• our main technical focus: efficient top-k queries

[Figure: local index organization vs. global index organization]
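The difference between the two organizations can be sketched in a few lines of Python. This is only an illustration: the hash-modulo placement, the node count, and all function names are assumptions made for the example, and a DHT such as Pastry assigns keys to nodes differently.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 4  # assumed number of peers, for illustration only


def node_for_term(term, num_nodes=NUM_NODES):
    """Map a term to the single node holding its complete inverted list.

    Hash-modulo placement is a stand-in for real DHT routing (assumption).
    """
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def build_global_index(docs, num_nodes=NUM_NODES):
    """Global organization: each node stores whole inverted lists for its terms."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].lower().split())):
            nodes[node_for_term(term, num_nodes)][term].append(doc_id)
    return nodes


def build_local_index(docs, num_nodes=NUM_NODES):
    """Local organization: each node indexes only the documents it holds."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id in sorted(docs):
        node = nodes[doc_id % num_nodes]  # assumed document placement
        for term in sorted(set(docs[doc_id].lower().split())):
            node[term].append(doc_id)
    return nodes


def nodes_contacted_global(query_terms, num_nodes=NUM_NODES):
    """With a global index, a query touches at most one node per query term."""
    return {node_for_term(t, num_nodes) for t in query_terms}
```

With the local organization a query has to be broadcast to every node, since any node may hold matching documents. With the global organization only the nodes returned by `nodes_contacted_global` are involved, but each of them may hold a very long list.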

Page 6: Two-tier architecture

• scalable lower tier for indexing and query execution
• crawling outside system
• open interface supporting client-based tools

Page 7: Applications

• search of content located in P2P network
• distributed search in large organizations
• as a large-scale web search engine
• as global search middleware on top of system of local index structures

Page 8: Vision: open web search infrastructure

• beyond current web search:
  - smart desktop-based search tools
  - browsing assistants, navigational toolbars
  - access lower-level search infrastructure
• can we have a common infrastructure?
  - open
  - scalable
  - agnostic
• example: Google API (not really)
• discussion: “entry barrier to search”
• tradeoff/challenge: performance vs. flexibility

Page 9: Discussion: P2P and massive data

• P2P system spectrum:
  - unstructured (Gnutella etc.) vs. structured (DHT)
  - rapidly evolving vs. fairly static
• massive data apps = fairly static system?
  - limit to how fast we can move data around
  - exception: file sharing (download, then share)
• we are at the more stable end of spectrum
• failures vs. unavailability
• replication and synchronization challenges

Page 10: Implementation

• based on Pastry DHT
• index and objects stored in Berkeley DB
• fine-grained postings traffic via P2P links
• replication for fault-tolerance
• replication based on “object groups”
• nodes may be temporarily unavailable
• synchronization of nodes upon reentry

Page 11: Query processing in search engines

• inverted index
  - a data structure for supporting text queries
  - like index in a book

[Figure: disks with documents, indexing, inverted index]

inverted index:

aalborg     3452, 11437, ...
...
arm         4, 19, 29, 98, 143, ...
armada      145, 457, 789, ...
armadillo   678, 2134, 3970, ...
armani      90, 256, 372, 511, ...
...
zz          602, 1189, 3209, ...

Boolean queries: (zebra AND armadillo) OR armani

unions/intersections of lists
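As a minimal sketch of how such a Boolean query reduces to list operations, the Python below builds a tiny inverted index and evaluates the example query. The four documents and the helper names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return index


def intersect(a, b):
    """AND: merge-style intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: sorted union of two posting lists."""
    return sorted(set(a) | set(b))


# Hypothetical toy collection.
docs = {
    1: "zebra armadillo",
    2: "armani suit",
    3: "zebra crossing",
    4: "armadillo zebra armani",
}
index = build_inverted_index(docs)

# (zebra AND armadillo) OR armani
result = union(intersect(index["zebra"], index["armadillo"]), index["armani"])
print(result)  # → [1, 2, 4]
```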

Page 12: Ranking in search engines

• scoring function: assigns score to each document with respect to a given query
• top-k queries: return k documents with highest score
• example: cosine measure
• term-based vs. link-based ranking
• many other important factors (links, user feedback, $, markup)
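The formula shown on this slide did not survive in the transcript. The Python sketch below therefore uses one common textbook tf-idf weighting with a cosine score, which is an assumption and not necessarily the variant used in the talk. It is meant only to show what a scoring function and a top-k query look like.

```python
import heapq
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build a tf-idf weight vector per document (assumed weighting scheme)."""
    n = len(docs)
    counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    df = Counter(t for c in counts.values() for t in c)
    idf = {t: math.log(1 + n / df[t]) for t in df}
    vectors = {
        d: {t: (1 + math.log(f)) * idf[t] for t, f in c.items()}
        for d, c in counts.items()
    }
    return vectors, idf


def cosine(query_vec, doc_vec):
    """Cosine of the angle between query and document weight vectors."""
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def top_k(query, docs, k):
    """Return the k (score, doc_id) pairs with the highest cosine score."""
    vectors, idf = tf_idf_vectors(docs)
    query_vec = {t: idf[t] for t in set(query.lower().split()) if t in idf}
    scored = ((cosine(query_vec, vec), d) for d, vec in vectors.items())
    return heapq.nlargest(k, scored)
```

This version scores every document, which is exactly the cost that the pruning techniques discussed later in the talk try to avoid.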

Page 13: Using Pagerank in ranking

• how to combine/add Pagerank score and cosine? (addition)
• use PR or log(PR)?
• normalize using mean of top-100 in list (Richardson/Domingos)
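One possible reading of these bullets, sketched in Python: each component is divided by the mean of its 100 largest values and the two normalized components are added. The function names are hypothetical, and the sketch assumes Pagerank values scaled to be at least 1 so that log(PR) is non-negative. The talk's exact normalization may differ.

```python
import math


def mean_of_top(values, n=100):
    """Mean of the n largest values (all values if there are fewer than n)."""
    top = sorted(values, reverse=True)[:n]
    return sum(top) / len(top) if top else 0.0


def combined_scores(cosine, pagerank, use_log=True, n=100):
    """Add normalized cosine and normalized PR (or log(PR)) per document.

    cosine:   dict doc_id -> cosine score for the query
    pagerank: dict doc_id -> Pagerank, assumed scaled so that PR >= 1
    """
    pr = {d: math.log(pagerank[d]) if use_log else pagerank[d] for d in cosine}
    cos_norm = mean_of_top(cosine.values(), n) or 1.0  # avoid division by zero
    pr_norm = mean_of_top(pr.values(), n) or 1.0
    return {d: cosine[d] / cos_norm + pr[d] / pr_norm for d in cosine}
```

Taking the logarithm compresses the very skewed Pagerank distribution, so a single highly ranked page dominates the sum less than it would with raw PR.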

Page 14: Efficient algorithms for top-k queries

• recent work by Fagin and others
• FA (Fagin’s algorithm), TA (Threshold algorithm), others
• term-based ranking: presort each list by contribution to cosine
• Pagerank: (pre)sort by combination of cosine and Pagerank?
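For readers unfamiliar with TA, here is a minimal Python sketch of the threshold algorithm with sum as the aggregation function. It assumes non-negative scores, treats a document missing from a list as contributing 0, and uses in-memory dictionaries for random access. The variants evaluated in the talk may differ in all of these respects.

```python
import heapq


def threshold_algorithm(lists, k):
    """Top-k by summed score over several lists, with early termination.

    lists: one dict per query term, mapping doc_id -> non-negative score.
    Returns (top-k list of (doc_id, score), number of sorted-access rounds).
    """
    # Presort each list by decreasing score (sorted access).
    sorted_lists = [sorted(l.items(), key=lambda p: -p[1]) for l in lists]
    seen = {}
    depth = 0
    max_len = max((len(s) for s in sorted_lists), default=0)
    while depth < max_len:
        threshold = 0.0
        for s in sorted_lists:
            if depth < len(s):
                doc, score = s[depth]
                threshold += score
                if doc not in seen:
                    # Random access: look up the document in every list.
                    seen[doc] = sum(l.get(doc, 0.0) for l in lists)
        depth += 1
        top = heapq.nlargest(k, seen.items(), key=lambda p: p[1])
        # No unseen document can score above the threshold, so stop early.
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth
    return heapq.nlargest(k, seen.items(), key=lambda p: p[1]), depth
```

The threshold is the sum of the scores at the current depth in each list, which bounds the score of any document not yet seen. Presorting by score is what makes this bound shrink quickly.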

Page 15: Some results

• centralized setting
• 120 million crawled pages
• Excite query trace
• CA = “clairvoyant algorithm”

Page 16: More details

• most savings for long lists
• in fact, cos + log(PR) schemes get better and better

Page 17: Shortest shorter lists

• some methods increase with length of other list
• intersection pretty bad

Page 18: Medium shorter lists

• only FA with cosine increases with length of longer list
• others much better and closer to each other

Page 19: Distributed implementation

• one round-trip
• need to decide right length of prefix to send
• can be extended to more than two keywords
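The protocol itself is not spelled out in the transcript. As a rough sketch of what a one-round-trip, prefix-based exchange between two nodes could look like, the Python below has node A ship the first x postings of its score-sorted list to node B, which scores them against its own full list. The AND semantics, the additive score, the exactness check, and all names are assumptions made for this illustration.

```python
import heapq


def node_a_prefix(list_a, x):
    """Node A: the first x postings of its list, sorted by decreasing score.

    Also reports whether the prefix already covers A's whole list.
    """
    ordered = sorted(list_a.items(), key=lambda p: -p[1])
    return ordered[:x], len(ordered) <= x


def node_b_answer(prefix, covers_all, list_b, k):
    """Node B: score prefix documents that also occur in B's list (AND).

    Returns (top-k list of (score, doc_id), flag saying whether the result
    is guaranteed to be the true top-k).
    """
    if not prefix:
        return [], covers_all
    candidates = [(sa + list_b[d], d) for d, sa in prefix if d in list_b]
    top = heapq.nlargest(k, candidates)
    if covers_all:
        return top, True
    # A document outside the prefix scores at most
    # (last prefix score in A) + (largest score in B).
    bound = prefix[-1][1] + max(list_b.values(), default=0.0)
    return top, len(top) == k and top[-1][0] >= bound


def one_round_trip(list_a, list_b, k, x):
    """Simulate the exchange: A sends a prefix, B replies with the result."""
    prefix, covers_all = node_a_prefix(list_a, x)
    return node_b_answer(prefix, covers_all, list_b, k)
```

If the prefix is too short, the returned flag is false and the answer may miss documents. If it is too long, postings are shipped needlessly. This is the sense in which the prefix length has to be chosen carefully.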

Page 20: Results of distributed implementation

• top-10 queries
• cosine (top) and cos + log(PR) (bottom)
• 8 bytes per posting
• TCP performance model for congestion window
• prefix length determined by threshold algorithm (TA)
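To give a feel for how a congestion-window model turns a posting count into a transfer time, here is a deliberately crude slow-start estimate in Python. Only the 8 bytes per posting comes from the slide. The segment size, initial window, round-trip time, bandwidth, and the model itself are assumptions, and the model used for the reported results may well be different.

```python
import math

BYTES_PER_POSTING = 8  # from the slide


def slow_start_rounds(num_bytes, mss=1460, init_cwnd=2):
    """Round trips needed if the window doubles every round (assumed model)."""
    segments = math.ceil(num_bytes / mss)
    rounds, cwnd, sent = 0, init_cwnd, 0
    while sent < segments:
        sent += cwnd   # send one full window
        cwnd *= 2      # slow start: window doubles each round trip
        rounds += 1
    return rounds


def transfer_time(num_postings, rtt=0.1, bandwidth=1_000_000,
                  mss=1460, init_cwnd=2):
    """Estimated seconds to ship a prefix of num_postings postings.

    rtt (seconds) and bandwidth (bytes/second) are illustrative defaults.
    """
    num_bytes = num_postings * BYTES_PER_POSTING
    rounds = slow_start_rounds(num_bytes, mss, init_cwnd)
    return rounds * rtt + num_bytes / bandwidth
```

Under these assumed defaults a 1000-posting prefix is 8000 bytes, or 6 segments, which fits into two slow-start rounds. Latency rather than bandwidth dominates for short prefixes.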

Page 21: Related Work

• P2P search: JXTA, pSearch, FASD, PlanetP, others
• with global index structure:
  - Gnawali (Chord)
  - Reynolds/Vahdat: Bloom filters
  - Li et al: feasibility of P2P search engines, Bloom filters and other techniques (IPTPS 2003)
• Pruning techniques for top-k queries
  - DB community: Fagin et al. 1996 - now
  - IR community: since 1980s (Buckley/Lewit SIGIR 85)
  - Persin/Zobel/Sacks-Davis 1996, Anh/Kretser/Moffat 2001
  - differences: random lookups, # of terms, AND vs. OR

Page 22: Current Status and Future Work

• system still being built (very basic version done)
• working on query optimization
  - integrating Bloom filters and other heuristics
  - optimizing query plans for 2 and more keywords
  - use of statistics
• loose ends in evaluation
  - results for three and more terms
  - integrating other measures (e.g., term distance)
• replication, synchronization

more info: http://cis.poly.edu/westlab/odissea/ (google: “odissea peer”)