Being Google
tom-dyson
Category: Technology
Transcript of Being Google
being google
tom dyson
metadata is easy
v.
language is hard
Our Corpus:
1. The cow says moo.
2. The sheep says baa.
3. The dogs say woof.
4. The dog-cow says moof.
>>> doc1 = "The cow says moo."
>>> doc2 = "The sheep says baa."
>>> doc3 = "The dogs say woof."
>>> doc4 = "The dog-cow says moof."
Brute force
>>> docs = [doc1, doc2, doc3, doc4]
>>> def searcher(term):
...     for doc in docs:
...         if doc.find(term) > -1:
...             print "found '%s' in '%s'" % (term, doc)
...
>>> searcher('moo')
found 'moo' in 'The cow says moo.'
my first inverted index
Tokenising #1
>>> doc1.split()
['The', 'cow', 'says', 'moo.']
Tokenising #2
>>> import re
>>> word = re.compile('\W+')
>>> word.split(doc1)
['The', 'cow', 'says', 'moo', '']
>>> doc4 = "The dog-cow says moof"
>>> word.split(doc4)
['The', 'dog', 'cow', 'says', 'moof']
Tokenising #3
>>> word = re.compile('\s|[^a-z-]', re.I)
>>> word.split(doc4)
['The', 'dog-cow', 'says', 'moof']
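An equivalent tokeniser in modern Python 3 can use re.findall, which returns the tokens directly and avoids the empty strings that split leaves behind (a sketch, not from the original slides):

```python
import re

# Match runs of letters, allowing internal hyphens ('dog-cow');
# findall returns the matches themselves, so no empty strings appear.
word_re = re.compile(r"[a-zA-Z]+(?:-[a-zA-Z]+)*")

def tokenise(text):
    # Lowercase so 'The' and 'the' share one posting.
    return [t.lower() for t in word_re.findall(text)]

print(tokenise("The dog-cow says moof."))
# ['the', 'dog-cow', 'says', 'moof']
```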
Data structures
>>> doc1 = {'name': 'doc 1', 'content': "The cow says moo."}
>>> doc2 = {'name': 'doc 2', 'content': "The sheep says baa."}
>>> doc3 = {'name': 'doc 3', 'content': "The dogs say woof."}
>>> doc4 = {'name': 'doc 4', 'content': "The dog-cow says moof."}
Postings
>>> postings = {}
>>> for doc in docs:
...     for token in word.split(doc['content']):
...         token = token.lower()
...         if len(token) == 0: continue
...         doc_name = doc['name']
...         if token not in postings:
...             postings[token] = [doc_name]
...         else:
...             postings[token].append(doc_name)
...
Postings
>>> postings
{'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2', 'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'], 'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say': ['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'], 'the': ['doc 1', 'doc 2', 'doc 3', 'doc 4'], 'dogs': ['doc 3']}
O(log n)
>>> def searcher(term):
...     if term in postings:
...         for match in postings[term]:
...             print "found '%s' in '%s'" % (term, match)
...
>>> searcher('says')
found 'says' in 'doc 1'
found 'says' in 'doc 2'
found 'says' in 'doc 4'
More postings
'sheep': ['doc 2', [2]]
'says': ['doc 1', [3], 'doc 2', [3], 'doc 4', [3]]
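These positional postings can be built by recording each token's offset alongside its document; a minimal Python 3 sketch (the nested-dict layout is my own, not the slides'):

```python
import re

docs = [
    {'name': 'doc 1', 'content': "The cow says moo."},
    {'name': 'doc 2', 'content': "The sheep says baa."},
    {'name': 'doc 3', 'content': "The dogs say woof."},
    {'name': 'doc 4', 'content': "The dog-cow says moof."},
]

word = re.compile(r"[a-zA-Z-]+")

# Map each term to {doc name: [positions]}.
postings = {}
for doc in docs:
    for pos, token in enumerate(word.findall(doc['content']), start=1):
        token = token.lower()
        entry = postings.setdefault(token, {})
        entry.setdefault(doc['name'], []).append(pos)

print(postings['says'])
# {'doc 1': [3], 'doc 2': [3], 'doc 4': [3]}
```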
and more postings
'sheep': ['doc 2', {'field': 'body'}, [2]]
'google': ['intro', {'field': 'title'}, [2]]
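One way to track fields as well as positions, as in the postings above, is to key each term's entries by (document, field); the structure and names here are invented for illustration:

```python
# Fielded, positional postings: each term maps the (document, field)
# pairs it occurs in to the positions within that field.
postings = {}

def index(doc_name, field, text):
    for pos, token in enumerate(text.lower().split(), start=1):
        entry = postings.setdefault(token, {})
        entry.setdefault((doc_name, field), []).append(pos)

index('doc 2', 'body', 'the sheep says baa')
index('intro', 'title', 'being google')

print(postings['google'])
# {('intro', 'title'): [2]}
```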
Tokenising #3
Punctuation
Stemming
Stop words
Parts of Speech
Entity Extraction
Markup
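Stop words and stemming from the list above can be sketched naively (the stop list and suffix rules are illustrative only, not a real stemmer such as Porter's):

```python
STOP_WORDS = {'the', 'a', 'an', 'and', 'says', 'say'}  # illustrative only

def stem(token):
    # Extremely naive suffix stripping; a real system would use a
    # Porter-style stemmer instead.
    for suffix in ('ing', 'es', 's'):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def normalise(tokens):
    # Drop stop words, then stem what remains.
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(normalise(['the', 'dogs', 'say', 'woof']))
# ['dog', 'woof']
```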
Logistics
Storage (serialising, transporting, clustering)
Updates
Warming up
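Storage and serialising can be illustrated with the standard library: dump the postings dict to disk so it can be stored, shipped to another node, or loaded to warm a cache (an illustration, not the talk's approach):

```python
import json
import os
import tempfile

postings = {'says': ['doc 1', 'doc 2', 'doc 4'], 'moo': ['doc 1']}

# Write the index out as JSON, then read it back.
path = os.path.join(tempfile.gettempdir(), 'postings.json')
with open(path, 'w') as f:
    json.dump(postings, f)

with open(path) as f:
    loaded = json.load(f)

print(loaded == postings)
# True
```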
ranking
Density (tf-idf)
Position
Date
Relationships
Feedback
Editorial
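Density ranking with tf-idf follows directly from its definition: a term's frequency within a document, weighted by the inverse of how many documents contain it (a minimal sketch over the toy corpus):

```python
import math

docs = {
    'doc 1': ['the', 'cow', 'says', 'moo'],
    'doc 2': ['the', 'sheep', 'says', 'baa'],
    'doc 3': ['the', 'dogs', 'say', 'woof'],
    'doc 4': ['the', 'dog-cow', 'says', 'moof'],
}

def tf_idf(term, doc_name):
    tokens = docs[doc_name]
    tf = tokens.count(term) / len(tokens)          # term frequency
    df = sum(1 for t in docs.values() if term in t)  # document frequency
    idf = math.log(len(docs) / df)                 # inverse document frequency
    return tf * idf

# 'moo' is rare, so it scores higher than the ubiquitous 'the'.
print(tf_idf('moo', 'doc 1') > tf_idf('the', 'doc 1'))
# True
```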
interesting search
Lucene (Hadoop, Solr, Nutch)
OpenFTS / MySQL
Sphinx
Hyper Estraier
Xapian
Other index types
being google
tom dyson