OverCite - pdos.csail.mit.edustrib/presentations/overcite-iptps05.pdf · OverCite, IPTPS 2005...
OverCite: A Cooperative Digital Research Library
Jeremy Stribling, Isaac G. Councill, Jinyang Li, M. Frans Kaashoek, David Karger,
Robert Morris, Scott Shenker
OverCite, IPTPS 2005
Everyone Loves CiteSeer
• Online repository of academic papers
• Crawls, indexes, links, and ranks papers
• Important resource for CS community
Peer-to-peer wireless mesh hash tables
Self-tuning multi-hop sub rings
Feasibility of peer-to-peer web indexing
Reliable web services
Everyone Hates CiteSeer
• Burden of running the system forced on one site
• New resource-heavy features difficult to support
• Scalability to large document sets uncertain
What Can We Do?
• Solution #1: All your © are belong to ACM
• Solution #2: Donate money to PSU
• Solution #3: Run your own mirror
• Solution #4: Aggregate donated resources
Solution: OverCite
• Distribute crawling, storage, queries via a DHT
• Goal: Distribute work evenly among nodes
• 100 nodes → 30x performance benefit
CiteSeer Today
[Screenshot labels: Search keywords; Rank metrics; Results meta-data]
CiteSeer Today
[Screenshot labels: Cached doc; Cited by]
CiteSeer Today: Local Resources
# documents: 715,000
Document storage: 767 GB
Meta-data storage: 44 GB
Index size: 18 GB
Total storage: 829 GB
Searches: 250,000/day
Document traffic: 21 GB/day
Total traffic: 34.4 GB/day
Challenges
• Distribute storage for parallel speedup
• Replicate storage for availability
• Parallelize query load for load-balancing
• Distribute crawling for improved throughput
OverCite Architecture
• Several hundred relatively stable nodes
• Each node runs four separate components
[Diagram labels: OverCite Server; DHT; Index; Web-based front end; Crawler]
Documents and Meta-data
• DHT stores papers for parallelism/availability
• DHT stores meta-data tables
– e.g., document IDs → {title, author, year, etc.}
• Use large-state DHT like OneHop [Gupta et al. NSDI ’04] or Accordion [Li et al. NSDI ’05]
[Diagram: four Server nodes]
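A minimal sketch of the meta-data table idea, assuming a toy in-process hash ring. This is not OneHop or Accordion; `ToyDHT`, the server names, the `doc:4` key format, and the sample record are all hypothetical. It shows a document ID mapping to a record stored on more than one node, so a lookup can still succeed when one replica is down.

```python
import hashlib
from bisect import bisect_right

def ring_id(key: str) -> int:
    """Map a string onto the identifier ring using SHA-1."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")

class ToyDHT:
    """In-process stand-in for a DHT: each key lives on `replicas` successor nodes."""

    def __init__(self, node_names, replicas=2):
        self.replicas = replicas
        self.ring = sorted((ring_id(name), name) for name in node_names)
        self.store = {name: {} for name in node_names}

    def responsible_nodes(self, key):
        """Return the nodes that hold `key`: its successor and the nodes after it."""
        ids = [node_id for node_id, _ in self.ring]
        start = bisect_right(ids, ring_id(key)) % len(self.ring)
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def put(self, key, value):
        for name in self.responsible_nodes(key):
            self.store[name][key] = value

    def get(self, key, down=()):
        """Read from the first responsible node that is not in `down`."""
        for name in self.responsible_nodes(key):
            if name not in down and key in self.store[name]:
                return self.store[name][key]
        return None

# Hypothetical nodes and record, for illustration only.
dht = ToyDHT(["server-a", "server-b", "server-c", "server-d"], replicas=2)
dht.put("doc:4", {"title": "Example title", "author": "A. Author", "year": 2004})
record = dht.get("doc:4")
primary = dht.responsible_nodes("doc:4")[0]
record_after_failure = dht.get("doc:4", down={primary})
```

Replication is what gives availability here: the second `get` skips the first responsible node and still finds the record on its successor.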
Index
• Goal: Parallelize queries
• Partition by document, not keyword
• Divide the index into k partitions
• Each query sent to only k nodes
[Diagram: four Server nodes]
Full index:
peer → {1,2,3}
DHT → {4}
mesh → {2,4,6}
hash → {1,2,5}
table → {1,2,4}
Split by document:
peer → {1}, hash → {1,5}, table → {1}
peer → {2}, mesh → {2,6}, hash → {2}, table → {2}
peer → {3}, DHT → {4}, mesh → {4}, table → {4}
Partition 1 (held by two servers): peer → {1,2,3}, mesh → {2}, hash → {1,2}, table → {1,2}
Partition 2 (held by two servers): DHT → {4}, hash → {5}, mesh → {4,6}, table → {4}
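The two-partition example on this slide can be reproduced with a short sketch. The postings are taken from the slide; the function names and the rule assigning documents 1-3 to partition 1 and 4-6 to partition 2 are assumptions made for illustration.

```python
from collections import defaultdict

# Postings from the slide: keyword -> IDs of documents containing it.
FULL_INDEX = {
    "peer": {1, 2, 3},
    "DHT": {4},
    "mesh": {2, 4, 6},
    "hash": {1, 2, 5},
    "table": {1, 2, 4},
}

def partition_by_document(index, doc_to_partition):
    """Split one inverted index into per-partition inverted indexes."""
    partitions = defaultdict(lambda: defaultdict(set))
    for keyword, doc_ids in index.items():
        for doc_id in doc_ids:
            partitions[doc_to_partition(doc_id)][keyword].add(doc_id)
    return {p: dict(postings) for p, postings in partitions.items()}

def search(partitions, keywords):
    """Ask every partition for documents matching all keywords, then union."""
    hits = set()
    for postings in partitions.values():
        matches = [postings.get(word, set()) for word in keywords]
        hits |= set.intersection(*matches) if matches else set()
    return hits

# Assumed assignment: documents 1-3 -> partition 1, documents 4-6 -> partition 2.
partitions = partition_by_document(FULL_INDEX, lambda doc_id: 1 if doc_id <= 3 else 2)
result = search(partitions, ["hash", "table"])
```

Because every posting for a given document lands in the same partition, a multi-keyword query can be intersected locally inside each partition, and the front end only has to union k small answers.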
Anatomy of a Query
[Diagram labels: Client; Query; Keywords; Partition 1, Partition 2, Partition 3, ..., Partition k; Top hits w/ rank and context; Meta-data req/resp; Results Page]
Anatomy of a Download
[Diagram labels: Document Req; Document blocks; Document]
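A hedged sketch of the query path's merge step, assuming each partition replies with a list of (score, document ID) pairs. The scores, titles, and function names below are made up for illustration, and a plain dictionary stands in for the meta-data lookups.

```python
import heapq

def merge_top_hits(per_partition_hits, limit):
    """Merge (score, doc_id) lists from each partition into one global top list."""
    all_hits = (hit for hits in per_partition_hits for hit in hits)
    return heapq.nlargest(limit, all_hits)

def results_page(per_partition_hits, metadata, limit=3):
    """Attach meta-data to the best hits, as a front end might before rendering."""
    return [
        {"doc_id": doc_id, "score": score, **metadata.get(doc_id, {})}
        for score, doc_id in merge_top_hits(per_partition_hits, limit)
    ]

# Hypothetical replies from k = 2 partitions; scores are made-up rank values.
replies = [
    [(0.9, 1), (0.4, 2)],
    [(0.7, 5), (0.2, 4)],
]
# Hypothetical meta-data records, standing in for DHT lookups.
metadata = {1: {"title": "Paper one"}, 2: {"title": "Paper two"}, 5: {"title": "Paper five"}}
page = results_page(replies, metadata)
```

Only the merged top hits need meta-data, so the number of lookups depends on the page size rather than on how many documents matched.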
Properties of OverCite
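| Operation | CiteSeer Today | OverCite (per node) |
|---|---|---|
| Crawling | 0.735 MB/doc | (3/n) x 0.735 MB/doc |
| Storage | 829 GB | (3/n) x 829 GB |
| Query bw | -- | (3.5/n GB)/day |
| Serving Documents | 21 GB/day | (2/n) x 21 GB/day |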
• Performance scales with n/3 system-wide
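As a worked example, the per-node figures can be evaluated for a given n. The function and dictionary keys are hypothetical; the constants come from this slide.

```python
def per_node_costs(n):
    """Per-node load from the slide's formulas, for a system of n nodes."""
    return {
        "crawling_mb_per_doc": (3 / n) * 0.735,
        "storage_gb": (3 / n) * 829,
        "query_bw_gb_per_day": 3.5 / n,
        "serving_gb_per_day": (2 / n) * 21,
        "speedup": n / 3,
    }

costs = per_node_costs(100)
```

With n = 100, each node stores about 24.9 GB and serves about 0.42 GB of documents per day, and n/3 is about 33, in line with the roughly 30x benefit quoted for 100 nodes.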
CiteSeer Extensions
• Document alerts (e.g., SmartSeer)
• Amazon-like recommendations
• Plagiarism checking
• Expand document collection
– Larger set of disciplines
– Preprints and public reviews
Related Work
• Search on DHTs
– Partition by keyword [Li et al. IPTPS ’03, Reynolds & Vahdat Middleware ’03, Suel et al. IWWD ’03]
– Hybrid schemes [Tang & Dwarkadas NSDI ’04, Loo et al. IPTPS ’04, Shi et al. IPTPS ’04]
• Distributed crawlers [Loo et al. TR ’04, Cho & Garcia-Molina WWW ’02, Singh et al. SIGIR ’03]
• Parallel speedup [Dean & Ghemawat OSDI ’04]
Summary
• A design for storing and coordinating a digital repository using a DHT
• Spreads load across many volunteer nodes
• Support for resource-intensive new features
• Run CiteSeer as a community