Being Google

being google tom dyson

description

The elements of full text search.

Transcript of Being Google

Page 1: Being Google

being google
tom dyson

Page 2: Being Google

V.

Page 3: Being Google

metadata is easy

Page 4: Being Google

language is hard

Page 5: Being Google

Our Corpus:

1. The cow says moo.
2. The sheep says baa.
3. The dogs say woof.
4. The dog-cow says moof.

Page 6: Being Google

>>> doc1 = "The cow says moo.">>> doc2 = "The sheep says baa.">>> doc3 = "The dogs say woof.">>> doc4 = "The dog-cow says moof."

Page 7: Being Google

Brute force

>>> docs = [doc1, doc2, doc3, doc4]

>>> def searcher(term):
...     for doc in docs:
...         if doc.find(term) > -1:
...             print "found '%s' in '%s'" % (term, doc)
...
>>> searcher('moo')
found 'moo' in 'The cow says moo.'

Page 8: Being Google

my first inverted index

Page 9: Being Google

Tokenising #1

>>> doc1.split()
['The', 'cow', 'says', 'moo.']

Page 10: Being Google

Tokenising #2

>>> import re
>>> word = re.compile('\W+')
>>> word.split(doc1)
['The', 'cow', 'says', 'moo', '']

>>> doc4 = "The dog-cow says moof">>> word.split(doc4)['The', 'dog', 'cow', 'says', 'moof']

Page 11: Being Google

Tokenising #3

>>> word = re.compile('\s|[^a-z-]', re.I)
>>> word.split(doc4)
['The', 'dog-cow', 'says', 'moof', '']

Page 12: Being Google

Data structures

>>> doc1 = {'name': 'doc 1', 'content': "The cow says moo."}
>>> doc2 = {'name': 'doc 2', 'content': "The sheep says baa."}
>>> doc3 = {'name': 'doc 3', 'content': "The dogs say woof."}
>>> doc4 = {'name': 'doc 4', 'content': "The dog-cow says moof."}

Page 13: Being Google

Postings

>>> docs = [doc1, doc2, doc3, doc4]
>>> postings = {}

>>> for doc in docs:
...     for token in word.split(doc['content']):
...         if len(token) == 0: continue
...         doc_name = doc['name']
...         if token.lower() not in postings:
...             postings[token.lower()] = [doc_name]
...         else:
...             postings[token.lower()].append(doc_name)
...

Page 14: Being Google

Postings

>>> postings
{'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2', 'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'], 'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say': ['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'], 'the': ['doc 1', 'doc 2', 'doc 3', 'doc 4'], 'dogs': ['doc 3']}

Page 15: Being Google

O(log n)

>>> def searcher(term):
...     if term in postings:
...         for match in postings[term]:
...             print "found '%s' in '%s'" % (term, match)
...
>>> searcher('says')
found 'says' in 'doc 1'
found 'says' in 'doc 2'
found 'says' in 'doc 4'

Page 16: Being Google

More postings

'sheep': ['doc 2', [2]]
'says': ['doc 1', [3], 'doc 2', [3], 'doc 4', [3]]
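
One possible way to build postings with positions, sketched here as a nested dict rather than the slide's flat list, assuming the word regex and the docs list of dicts from the earlier slides:

>>> positional = {}
>>> for doc in docs:
...     tokens = [t for t in word.split(doc['content']) if t]
...     for pos, token in enumerate(tokens, 1):
...         positional.setdefault(token.lower(), {}).setdefault(doc['name'], []).append(pos)
...
>>> positional['says']['doc 4']
[3]

Recording positions is what makes phrase and proximity queries possible later on.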

Page 17: Being Google

and more postings

'sheep': ['doc 2', {'field': 'body'}, [2]]
'google': ['intro', {'field': 'title'}, [2]]
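
A rough sketch of field-aware postings, using a hypothetical document with separate title and body fields (the document content and field names below are made up for illustration):

>>> intro = {'name': 'intro', 'title': 'being google', 'body': 'An overview of full text search.'}
>>> fielded = {}
>>> for field in ('title', 'body'):
...     tokens = [t for t in word.split(intro[field]) if t]
...     for pos, token in enumerate(tokens, 1):
...         fielded.setdefault(token.lower(), []).append((intro['name'], field, pos))
...
>>> fielded['google']
[('intro', 'title', 2)]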

Page 18: Being Google

tokenising #3

Punctuation
Stemming
Stop words
Parts of Speech
Entity Extraction
Markup
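
A rough sketch of how stop words and a very naive stemmer could be folded into tokenising; the stop list and suffix rules are toy examples, not a real stemmer like Porter's, and word is the regex from Tokenising #3:

>>> stop_words = set(['the', 'a', 'an', 'and', 'of', 'to'])
>>> def stem(token):
...     for suffix in ('ing', 'es', 's'):
...         if token.endswith(suffix) and len(token) > len(suffix) + 2:
...             return token[:-len(suffix)]
...     return token
...
>>> def tokenise(text):
...     return [stem(t.lower()) for t in word.split(text)
...             if t and t.lower() not in stop_words]
...
>>> tokenise("The dogs say woof.")
['dog', 'say', 'woof']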

Page 19: Being Google

Logistics

Storage (serialising, transporting, clustering)
Updates
Warming up
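
On the storage side, a minimal sketch of serialising the postings dict to disk and loading it back, here with JSON; real engines use purpose-built on-disk index formats, and this says nothing about clustering or warming up:

>>> import json
>>> with open('postings.json', 'w') as f:
...     json.dump(postings, f)       # serialise the index to disk
...
>>> with open('postings.json') as f:
...     postings = json.load(f)      # load it back, e.g. after a restart
...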

Page 20: Being Google

ranking

Density (tf-idf)
Position
Date
Relationships
Feedback
Editorial
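
A minimal sketch of one common tf-idf formulation for the density part of ranking, assuming the word regex and the docs list of dicts from earlier; the smoothing and variable names are illustrative:

>>> import math
>>> def tokens_of(doc):
...     return [t.lower() for t in word.split(doc['content']) if t]
...
>>> def tf_idf(term, doc):
...     term = term.lower()
...     tokens = tokens_of(doc)
...     tf = tokens.count(term) / float(len(tokens))
...     containing = sum(1 for d in docs if term in tokens_of(d))
...     idf = math.log(len(docs) / (1.0 + containing))
...     return tf * idf
...
>>> sorted(docs, key=lambda d: tf_idf('moo', d), reverse=True)[0]['name']
'doc 1'

Terms that appear in nearly every document get an idf at or below zero, so common words contribute little to the ranking.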

Page 21: Being Google

interesting search

Lucene (Hadoop, Solr, Nutch)
OpenFTS / MySQL
Sphinx
Hyper Estraier
Xapian
Other index types

Page 22: Being Google

being google
tom dyson