Instant Indexing
Transcript of Instant Indexing
Instant Indexing
Greg Lindahl, CTO, Blekko
October 21, 2010 - BCS Search Solutions 2010
Blekko Who?
• Founded in 2007, $24m in funding
• Whole-web search engine
• Currently in invite-only beta
– 3B page crawl
– innovative UI
• … but this talk is about indexing
What whole-web search was
• Sort by relevance only
• News and blog search done with separate engines
• Main index updated slowly with a batch process
• Months-to-weeks update cycle
What web-scale search is now
• Relevance and date sorting
• Everything in a single index
• Incremental updating
• Live-crawled pages should appear in the main index in seconds
• All data stored as tables
Instant Search Indexing
• /date screenshot
Another Example
Google’s take on the issue
• Daniel Peng and Frank Dabek, Large Scale Incremental Processing Using Distributed Transactions and Notifications
• “Databases do not meet the storage or throughput requirements of these tasks… MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.”
Percolator details
• ACID, with multi-row transactions
• Triggers ("observers"), which can be cascaded
• The crawler is a cascade of triggers:
– MapReduce writes new documents into BigTable
– a trigger parses and extracts links
– a cascaded trigger does clustering
– a cascaded trigger exports changed clusters
– 10 triggers total in the indexing system
• Max 1 observer per column, for complexity reasons
• Message collapsing when there are multiple updates to a column
Blekko’s take on this
• We want to run the same code in a mapjob or in an incremental crawler/indexer
• Our bigtable-like thingie shouldn’t need a percolator-sized addition to do it
• Needs to be more efficient than other approaches
• OK with non-ACID, relaxed eventual consistency, etc.
Combinators
• Task: gather incoming links and anchortext
• Each crawled webpage has dozens of outlinks
• The crawler wants to write into dozens of inlink lists, each in a separate cell in a table
• TopN combinator: a list of the N highest-ranked items
• If a cell is frequently written, writes can be combined before hitting disk
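The key property that makes this work is that a combinator is an associative, commutative merge, so partial results can be combined anywhere in the write path. Here is a minimal sketch of a TopN combinator in Python; the function name and list layout are illustrative, not Blekko's actual implementation:

```python
# Hypothetical sketch of a TopN combinator: merge two top-N lists of
# (rank, key, data) rows, keeping the N highest-ranked entries.
# Because the merge is associative and commutative, many writes to the
# same cell can be collapsed into one before hitting disk.

def topn_merge(a, b, n=3):
    """Merge two top-N lists, keeping the best-ranked entry per key."""
    merged = {}
    for rank, key, data in a + b:
        if key not in merged or rank > merged[key][0]:
            merged[key] = (rank, key, data)
    # Sort by rank descending and truncate to N.
    return sorted(merged.values(), reverse=True)[:n]

# Dozens of crawler writes to the same inlinks cell collapse into one value:
cell = []
cell = topn_merge(cell, [(1000, "www.disney.com", "great website")])
cell = topn_merge(cell, [(540, "britishmuseum.com/dance", "dance manuals")])
cell = topn_merge(cell, [(1, "www.ehow.com/dance", "renaissance dance")])
cell = topn_merge(cell, [(2000, "example.com", "top link")])
# cell now holds only the 3 highest-ranked inlinks; the rank-1 entry fell off
```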
Combining combinators
• Combine within the writing process
• Combine within the local write daemon
• Combine within the 3 disk daemons and the RAM daemon
– highly contended cells result in 1 disk transaction per 30 seconds
• Combinators are represented as strings and can be used without the database
• Using combinators seems to require significantly fewer RPCs than Percolator, but I have no idea what the relative performance is
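The combining-at-every-layer idea can be sketched as a write buffer that folds pending writes to the same cell into one combined value before the periodic flush. This is a hypothetical illustration (the class and method names are invented), not the actual daemon code:

```python
from collections import defaultdict

# Sketch of write combining: because a combinator's merge is associative,
# any layer (writer, write daemon, disk daemon) can fold all pending writes
# to a cell into one before flushing -- which is how a hot cell can see
# only one disk transaction per ~30 seconds.

class WriteBuffer:
    def __init__(self, combine):
        self.combine = combine           # associative merge function
        self.pending = defaultdict(list) # (row, col) -> list of values

    def write(self, row, col, value):
        self.pending[(row, col)].append(value)

    def flush(self):
        """Collapse all pending writes per cell into a single combined value."""
        out = {}
        for cell, values in self.pending.items():
            acc = values[0]
            for v in values[1:]:
                acc = self.combine(acc, v)
            out[cell] = acc
        self.pending.clear()
        return out  # one combined write per cell goes downstream

# A 'sum' combinator: a thousand counter increments become one disk write.
buf = WriteBuffer(combine=lambda a, b: a + b)
for _ in range(1000):
    buf.write("/crawl/stats", "pages_fetched", 1)
flushed = buf.flush()
# flushed == {("/crawl/stats", "pages_fetched"): 1000}
```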
TopN example
• table: /index/32/url  row: pbm.com/~lindahl/  column: inlinks
– a list of: rank, key, data
– 1000, www.disney.com, “great website”
– 540, britishmuseum.com/dance, “16th century dance manuals in facsimile”
– 1, www.ehow.com/dance, “renaissance dance”
MapReduce from a combinator perspective
• MapReduce is really map, shuffle, reduce
• Input: a file/table; output: a file/table
• An incremental job to do the same MapReduce looks completely different; you have to implement the shuffle+reduce yourself
• Could write into BigTable cells…
MapJobs+Combinators
• Map function runs on shards
• All output is done by writing into a table, using combinators
• The same map function can also be run incrementally on individual inputs
• The shuffle+reduce is still there; it’s just done by the database+combinators
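The point of this slide can be shown concretely: the same map function runs unchanged whether it is iterating over a whole shard in a batch mapjob or being called on one freshly crawled page, because "reduce" is just a combinator write into the table. A minimal sketch, with invented table and function names:

```python
# Hypothetical sketch: combinator writes replace shuffle+reduce, so the
# identical map function serves both batch mapjobs and the live crawler.

class Table:
    def __init__(self):
        self.cells = {}

    def combine_write(self, row, col, value, combine=lambda a, b: a + b):
        """Write through a combinator (here a 'sum'), merging with any
        existing cell value instead of overwriting it."""
        key = (row, col)
        self.cells[key] = combine(self.cells[key], value) if key in self.cells else value

def map_page(table, url, outlinks):
    """Map function: count inlinks per target page."""
    for target in outlinks:
        table.combine_write(target, "inlink_count", 1)

table = Table()

# Batch mode: a mapjob runs map_page over every page in a shard.
corpus = {"a.com": ["b.com", "c.com"], "b.com": ["c.com"]}
for url, links in corpus.items():
    map_page(table, url, links)

# Incremental mode: the live crawler calls the same function on one new page.
map_page(table, "d.com", ["c.com"])
# table.cells[("c.com", "inlink_count")] == 3
```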
Combinator types
• topN
• lastN = topN, using time as the rank
• sum, avg, eavg, min, max
• Counting things
– logcount: ±50% count of strings in 16 bytes
• set -- everything is a combinator
• Cells in our tables are native Perl/Python data structures
• Hence: atomic updates at a sub-cell level
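To make the logcount idea concrete, here is one plausible Flajolet–Martin-style construction that approximately counts distinct strings in a fixed 16-byte (128-bit) state and merges with a bitwise OR. This is an illustrative guess at how such a combinator could work, not Blekko's actual logcount:

```python
import hashlib

# Hedged sketch of a logcount-style combinator: an approximate distinct
# count of strings held in 128 bits (16 bytes). Flajolet-Martin-style
# construction, chosen for illustration; the real logcount may differ.

def lc_add(state, s):
    """Set bit r, where r = number of trailing zero bits in hash(s)."""
    h = int.from_bytes(hashlib.md5(s.encode()).digest(), "big")
    r = (h & -h).bit_length() - 1 if h else 127
    return state | (1 << min(r, 127))   # state always fits in 128 bits

def lc_merge(a, b):
    # Merging two logcounts is a bitwise OR: associative and commutative,
    # so logcount combines like any other combinator.
    return a | b

def lc_estimate(state):
    """Estimate = 2^(index of lowest unset bit); roughly factor-of-2 accuracy."""
    r = 0
    while state & (1 << r):
        r += 1
    return 2 ** r

state = 0
for i in range(1000):
    state = lc_add(state, f"url-{i}")
# lc_estimate(state) gives an order-of-magnitude estimate of the distinct count
```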
Combinators for indexing
• The basic data structure for search is the posting list:
– for each term, a list of rows: docid, rank
• Sounds like a custom topN to us
– rank = rank or date or …
– lists heavily compressed
• Each posting list has N shards
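A toy sketch of the "posting list as a sharded custom topN" idea: each term's list is split into N shards keyed by docid, each shard keeps only its top entries, and a query merges the shards. Shard count, layout, and function names here are invented for illustration (and real lists are heavily compressed):

```python
# Hypothetical sketch: a posting list as a custom topN, sharded by docid.

NSHARDS = 4

def shard_for(docid):
    return hash(docid) % NSHARDS

def post(index, term, docid, rank, n=2):
    """TopN-combinator-style insert of (rank, docid) into the term's shard,
    keeping only the n highest-ranked rows per shard."""
    shard = index.setdefault((term, shard_for(docid)), [])
    shard.append((rank, docid))
    shard.sort(reverse=True)
    del shard[n:]

def lookup(index, term):
    """Merge all shards of a term's posting list, best-ranked first."""
    rows = []
    for s in range(NSHARDS):
        rows += index.get((term, s), [])
    return sorted(rows, reverse=True)

index = {}
post(index, "obama", "doc1", 900)
post(index, "obama", "doc2", 500)
post(index, "obama", "doc3", 990)
# lookup(index, "obama")[0] == (990, "doc3")
```

With `rank` replaced by a timestamp, the same structure becomes the date-sorted posting list used for /date queries.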
Combinators for crawling
• Pick a site, crawl the most important uncrawled pages
– that’s stored as a topN
• (the “livecrawl” uses other criteria)
• Crawl, parse, and spew writes
– outlinks into inlinks cells
– page ip/geo into incoming ips, geos
– page hashes into duptext detection table
– count everything under the sun
– 100s of writes total
Instant index step
• Crawler does the indexing
• Decides which terms to index based on page contents and incoming anchortext
• Writes into posting lists
– if indexed before, use the list of previously indexed terms to delete any obsolete terms
• Heavily-contended posting lists are not a problem due to combining -- that’s how a naked [/date] query works
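The re-indexing step above amounts to a set difference between the page's previously indexed terms and its new term set. A minimal sketch (the helper name is invented):

```python
# Hypothetical sketch of the re-index step: diff the stored list of
# previously indexed terms against the new term set, so obsolete postings
# are deleted and only new terms are written.

def reindex_diff(old_terms, new_terms):
    """Return (terms to delete from posting lists, terms to newly write)."""
    to_delete = old_terms - new_terms
    to_write = new_terms - old_terms
    return to_delete, to_write

old = {"dance", "manual", "facsimile"}
new = {"dance", "manual", "renaissance"}
# reindex_diff(old, new) == ({"facsimile"}, {"renaissance"})
```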
Supporting date queries
• /date queries fetch about 3X the posting lists of a relevance query
• to support [/health /date], we keep a posting list of the most recent dated pages for each website
• Date needs some relevance; every date-sorted posting list has a companion date-sorted list of only highly-relevant articles
Example: [obama /date]
• The term posting list for ‘obama’ has overflowed -- moderately relevant dated pages are probably smushed out
• The date posting list for ‘obama’ has overflowed
• The date posting list for highly-relevant dated ‘obama’ is not full
To Sum Up
• There’s more than one way to do it– yes, we use Perl
• I don’t think Blekko’s scheme is better or worse than Google’s, but at least it’s very different
• See me if you’d like an invite to our beta-test