Instant Indexing
Transcript of Instant Indexing
Instant Indexing
Greg Lindahl, CTO, Blekko
October 21, 2010 - BCS Search Solutions 2010
Blekko Who?
• Founded in 2007, $24m in funding
• Whole-web search engine
• Currently in invite-only beta
– 3B page crawl
– innovative UI
• … but this talk is about indexing
What whole-web search was
• Sort by relevance only
• News and blog search done with separate engines
• Main index updated slowly with a batch process
• Months-to-weeks update cycle
What web-scale search is now
• Relevance and date sorting
• Everything in a single index
• Incremental updating
• Live-crawled pages should appear in the main index in seconds
• All data stored as tables
Instant Search Indexing
• /date screenshot
Another Example
Google’s take on the issue
• Daniel Peng and Frank Dabek, Large Scale Incremental Processing Using Distributed Transactions and Notifications
• “Databases do not meet the storage or throughput requirements of these tasks… MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.”
Percolator details
• ACID, with multi-row transactions
• Triggers ("observers"), which can be cascaded
• The crawler is a cascade of triggers:
– MapReduce writes new documents into BigTable
– a trigger parses and extracts links
– a cascaded trigger does clustering
– a cascaded trigger exports changed clusters
– 10 triggers total in the indexing system
• Max 1 observer per column, for complexity reasons
• Message collapsing when there are multiple updates to a column
Blekko’s take on this
• We want to run the same code in a mapjob or in an incremental crawler/indexer
• Our bigtable-like thingie shouldn’t need a percolator-sized addition to do it
• Needs to be more efficient than other approaches
• OK with non-ACID, relaxed eventual consistency, etc.
Combinators
• Task: gather incoming links and anchortext
• Each crawled webpage has dozens of outlinks
• The crawler wants to write into dozens of inlink lists, each in a separate cell in a table
• TopN combinator: a list of the N highest-ranked items
• If a cell is frequently written, writes can be combined before hitting disk
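The key property that makes this work is that a combinator is an associative, commutative merge, so partial results can be combined anywhere in the write path. Here is a minimal sketch of a TopN combinator in Python; the function name and list layout are illustrative, not Blekko's actual implementation:

```python
# Hypothetical sketch of a TopN combinator: merge two top-N lists of
# (rank, key, data) rows, keeping the N highest-ranked entries.
# Because the merge is associative and commutative, many writes to the
# same cell can be collapsed into one before hitting disk.

def topn_merge(a, b, n=3):
    """Merge two top-N lists, keeping the best-ranked entry per key."""
    merged = {}
    for rank, key, data in a + b:
        if key not in merged or rank > merged[key][0]:
            merged[key] = (rank, key, data)
    # Sort by rank descending and truncate to N.
    return sorted(merged.values(), reverse=True)[:n]

# Dozens of crawler writes to the same inlinks cell collapse into one value:
cell = []
cell = topn_merge(cell, [(1000, "www.disney.com", "great website")])
cell = topn_merge(cell, [(540, "britishmuseum.com/dance", "dance manuals")])
cell = topn_merge(cell, [(1, "www.ehow.com/dance", "renaissance dance")])
cell = topn_merge(cell, [(2000, "example.com", "top link")])
# cell now holds only the 3 highest-ranked inlinks; the rank-1 entry fell off
```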
Combining combinators
• Combine within the writing process
• Combine within the local write daemon
• Combine within the 3 disk daemons and the RAM daemon
– highly contended cells result in 1 disk transaction per 30 seconds
• Combinators are represented as strings and can be used without the database
• Using combinators seems to require significantly fewer RPCs than Percolator, but I have no idea what the relative performance is
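The combining-at-every-layer idea can be sketched as a write buffer that folds pending writes to the same cell into one combined value before the periodic flush. This is a hypothetical illustration (the class and method names are invented), not the actual daemon code:

```python
from collections import defaultdict

# Sketch of write combining: because a combinator's merge is associative,
# any layer (writer, write daemon, disk daemon) can fold all pending writes
# to a cell into one before flushing -- which is how a hot cell can see
# only one disk transaction per ~30 seconds.

class WriteBuffer:
    def __init__(self, combine):
        self.combine = combine           # associative merge function
        self.pending = defaultdict(list) # (row, col) -> list of values

    def write(self, row, col, value):
        self.pending[(row, col)].append(value)

    def flush(self):
        """Collapse all pending writes per cell into a single combined value."""
        out = {}
        for cell, values in self.pending.items():
            acc = values[0]
            for v in values[1:]:
                acc = self.combine(acc, v)
            out[cell] = acc
        self.pending.clear()
        return out  # one combined write per cell goes downstream

# A 'sum' combinator: a thousand counter increments become one disk write.
buf = WriteBuffer(combine=lambda a, b: a + b)
for _ in range(1000):
    buf.write("/crawl/stats", "pages_fetched", 1)
flushed = buf.flush()
# flushed == {("/crawl/stats", "pages_fetched"): 1000}
```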
TopN example
• table: /index/32/url  row: pbm.com/~lindahl/  column: inlinks
– a list of: rank, key, data
– 1000, www.disney.com, “great website”
– 540, britishmuseum.com/dance, “16th century dance manuals in facsimile”
– 1, www.ehow.com/dance, “renaissance dance”
MapReduce from a combinator perspective
• MapReduce is really map, shuffle, reduce
• Input: a file/table; output: a file/table
• An incremental job to do the same MapReduce looks completely different; you have to implement the shuffle+reduce yourself
• Could write into BigTable cells…
MapJobs+Combinators
• Map function runs on shards
• All output is done by writing into a table, using combinators
• The same map function can also be run incrementally on individual inputs
• The shuffle+reduce is still there; it’s just done by the database+combinators
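The point of this slide can be shown concretely: the same map function runs unchanged whether it is iterating over a whole shard in a batch mapjob or being called on one freshly crawled page, because "reduce" is just a combinator write into the table. A minimal sketch, with invented table and function names:

```python
# Hypothetical sketch: combinator writes replace shuffle+reduce, so the
# identical map function serves both batch mapjobs and the live crawler.

class Table:
    def __init__(self):
        self.cells = {}

    def combine_write(self, row, col, value, combine=lambda a, b: a + b):
        """Write through a combinator (here a 'sum'), merging with any
        existing cell value instead of overwriting it."""
        key = (row, col)
        self.cells[key] = combine(self.cells[key], value) if key in self.cells else value

def map_page(table, url, outlinks):
    """Map function: count inlinks per target page."""
    for target in outlinks:
        table.combine_write(target, "inlink_count", 1)

table = Table()

# Batch mode: a mapjob runs map_page over every page in a shard.
corpus = {"a.com": ["b.com", "c.com"], "b.com": ["c.com"]}
for url, links in corpus.items():
    map_page(table, url, links)

# Incremental mode: the live crawler calls the same function on one new page.
map_page(table, "d.com", ["c.com"])
# table.cells[("c.com", "inlink_count")] == 3
```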
Combinator types
• topN
• lastN = topN, using time as the rank
• sum, avg, eavg, min, max
• Counting things
– logcount: ±50% count of strings in 16 bytes
• set -- everything is a combinator
• Cells in our tables are native Perl/Python data structures
• Hence: atomic updates at a sub-cell level
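To make the logcount idea concrete, here is one plausible Flajolet–Martin-style construction that approximately counts distinct strings in a fixed 16-byte (128-bit) state and merges with a bitwise OR. This is an illustrative guess at how such a combinator could work, not Blekko's actual logcount:

```python
import hashlib

# Hedged sketch of a logcount-style combinator: an approximate distinct
# count of strings held in 128 bits (16 bytes). Flajolet-Martin-style
# construction, chosen for illustration; the real logcount may differ.

def lc_add(state, s):
    """Set bit r, where r = number of trailing zero bits in hash(s)."""
    h = int.from_bytes(hashlib.md5(s.encode()).digest(), "big")
    r = (h & -h).bit_length() - 1 if h else 127
    return state | (1 << min(r, 127))   # state always fits in 128 bits

def lc_merge(a, b):
    # Merging two logcounts is a bitwise OR: associative and commutative,
    # so logcount combines like any other combinator.
    return a | b

def lc_estimate(state):
    """Estimate = 2^(index of lowest unset bit); roughly factor-of-2 accuracy."""
    r = 0
    while state & (1 << r):
        r += 1
    return 2 ** r

state = 0
for i in range(1000):
    state = lc_add(state, f"url-{i}")
# lc_estimate(state) gives an order-of-magnitude estimate of the distinct count
```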
Combinators for indexing
• The basic data structure for search is the posting list:
– for each term, a list of rows: docid, rank
• Sounds like a custom topN to us
– rank = rank or date or …
– lists heavily compressed
• Each posting list has N shards
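A toy sketch of the "posting list as a sharded custom topN" idea: each term's list is split into N shards keyed by docid, each shard keeps only its top entries, and a query merges the shards. Shard count, layout, and function names here are invented for illustration (and real lists are heavily compressed):

```python
# Hypothetical sketch: a posting list as a custom topN, sharded by docid.

NSHARDS = 4

def shard_for(docid):
    return hash(docid) % NSHARDS

def post(index, term, docid, rank, n=2):
    """TopN-combinator-style insert of (rank, docid) into the term's shard,
    keeping only the n highest-ranked rows per shard."""
    shard = index.setdefault((term, shard_for(docid)), [])
    shard.append((rank, docid))
    shard.sort(reverse=True)
    del shard[n:]

def lookup(index, term):
    """Merge all shards of a term's posting list, best-ranked first."""
    rows = []
    for s in range(NSHARDS):
        rows += index.get((term, s), [])
    return sorted(rows, reverse=True)

index = {}
post(index, "obama", "doc1", 900)
post(index, "obama", "doc2", 500)
post(index, "obama", "doc3", 990)
# lookup(index, "obama")[0] == (990, "doc3")
```

With `rank` replaced by a timestamp, the same structure becomes the date-sorted posting list used for /date queries.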
Combinators for crawling
• Pick a site, crawl the most important uncrawled pages
– that’s stored as a topN
• (the “livecrawl” uses other criteria)
• Crawl, parse, and spew writes
– outlinks into inlinks cells
– page ip/geo into incoming ips, geos
– page hashes into duptext detection table
– count everything under the sun
– 100s of writes total
Instant index step
• Crawler does the indexing
• Decides which terms to index based on page contents and incoming anchortext
• Writes into posting lists
– if indexed before, use the list of previously indexed terms to delete any obsolete terms
• Heavily-contended posting lists are not a problem due to combining -- that’s how a naked [/date] query works
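The re-indexing step above amounts to a set difference between the page's previously indexed terms and its new term set. A minimal sketch (the helper name is invented):

```python
# Hypothetical sketch of the re-index step: diff the stored list of
# previously indexed terms against the new term set, so obsolete postings
# are deleted and only new terms are written.

def reindex_diff(old_terms, new_terms):
    """Return (terms to delete from posting lists, terms to newly write)."""
    to_delete = old_terms - new_terms
    to_write = new_terms - old_terms
    return to_delete, to_write

old = {"dance", "manual", "facsimile"}
new = {"dance", "manual", "renaissance"}
# reindex_diff(old, new) == ({"facsimile"}, {"renaissance"})
```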
Supporting date queries
• /date queries fetch about 3X the posting lists of a relevance query
• to support [/health /date], we keep a posting list of the most recent dated pages for each website
• Date needs some relevance; every date-sorted posting list has a companion date-sorted list of only highly-relevant articles
Example: [obama /date]
• The term posting list for ‘obama’ has overflowed -- moderately relevant dated pages are probably smushed out
• The date posting list for ‘obama’ has overflowed
• The date posting list for highly-relevant dated ‘obama’ is not full
To Sum Up
• There’s more than one way to do it– yes, we use Perl
• I don’t think Blekko’s scheme is better or worse than Google’s, but at least it’s very different
• See me if you’d like an invite to our beta-test