Being Google
tom-dyson
Category: Technology
Transcript of Being Google
being google
tom dyson
metadata is easy
v.
language is hard
Our Corpus:
1. The cow says moo.
2. The sheep says baa.
3. The dogs say woof.
4. The dog-cow says moof.
>>> doc1 = "The cow says moo."
>>> doc2 = "The sheep says baa."
>>> doc3 = "The dogs say woof."
>>> doc4 = "The dog-cow says moof."
Brute force
>>> docs = [doc1, doc2, doc3, doc4]
>>> def searcher(term):
...     for doc in docs:
...         if doc.find(term) > -1:
...             print "found '%s' in '%s'" % (term, doc)
...
>>> searcher('moo')
found 'moo' in 'The cow says moo.'
my first inverted index
Tokenising #1
>>> doc1.split()
['The', 'cow', 'says', 'moo.']
Tokenising #2
>>> import re
>>> word = re.compile('\W+')
>>> word.split(doc1)
['The', 'cow', 'says', 'moo', '']
>>> doc4 = "The dog-cow says moof"
>>> word.split(doc4)
['The', 'dog', 'cow', 'says', 'moof']
Tokenising #3
>>> word = re.compile('\s|[^a-z-]', re.I)
>>> word.split(doc4)
['The', 'dog-cow', 'says', 'moof']
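An equivalent tokeniser in modern Python 3 can use re.findall, which returns the tokens directly and avoids the empty strings that split leaves behind (a sketch, not from the original slides):

```python
import re

# Match runs of letters, allowing internal hyphens ('dog-cow');
# findall returns the matches themselves, so no empty strings appear.
word_re = re.compile(r"[a-zA-Z]+(?:-[a-zA-Z]+)*")

def tokenise(text):
    # Lowercase so 'The' and 'the' share one posting.
    return [t.lower() for t in word_re.findall(text)]

print(tokenise("The dog-cow says moof."))
# ['the', 'dog-cow', 'says', 'moof']
```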
Data structures
>>> doc1 = {'name': 'doc 1', 'content': "The cow says moo."}
>>> doc2 = {'name': 'doc 2', 'content': "The sheep says baa."}
>>> doc3 = {'name': 'doc 3', 'content': "The dogs say woof."}
>>> doc4 = {'name': 'doc 4', 'content': "The dog-cow says moof."}
Postings
>>> postings = {}
>>> for doc in docs:
...     for token in word.split(doc['content']):
...         token = token.lower()
...         if len(token) == 0: continue
...         doc_name = doc['name']
...         if token not in postings:
...             postings[token] = [doc_name]
...         else:
...             postings[token].append(doc_name)
...
Postings
>>> postings
{'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2', 'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'], 'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say': ['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'], 'the': ['doc 1', 'doc 2', 'doc 3', 'doc 4'], 'dogs': ['doc 3']}
O(log n)
>>> def searcher(term):
...     if term in postings:
...         for match in postings[term]:
...             print "found '%s' in '%s'" % (term, match)
...
>>> searcher('says')
found 'says' in 'doc 1'
found 'says' in 'doc 2'
found 'says' in 'doc 4'
More postings
'sheep': ['doc 2', [2]]
'says': ['doc 1', [3], 'doc 2', [3], 'doc 4', [3]]
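These positional postings can be built by recording each token's offset alongside its document; a minimal Python 3 sketch (the nested-dict layout is my own, not the slides'):

```python
import re

docs = [
    {'name': 'doc 1', 'content': "The cow says moo."},
    {'name': 'doc 2', 'content': "The sheep says baa."},
    {'name': 'doc 3', 'content': "The dogs say woof."},
    {'name': 'doc 4', 'content': "The dog-cow says moof."},
]

word = re.compile(r"[a-zA-Z-]+")

# Map each term to {doc name: [positions]}.
postings = {}
for doc in docs:
    for pos, token in enumerate(word.findall(doc['content']), start=1):
        token = token.lower()
        entry = postings.setdefault(token, {})
        entry.setdefault(doc['name'], []).append(pos)

print(postings['says'])
# {'doc 1': [3], 'doc 2': [3], 'doc 4': [3]}
```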
and more postings
'sheep': ['doc 2', {'field': 'body'}, [2]]
'google': ['intro', {'field': 'title'}, [2]]
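One way to track fields as well as positions, as in the postings above, is to key each term's entries by (document, field); the structure and names here are invented for illustration:

```python
# Fielded, positional postings: each term maps the (document, field)
# pairs it occurs in to the positions within that field.
postings = {}

def index(doc_name, field, text):
    for pos, token in enumerate(text.lower().split(), start=1):
        entry = postings.setdefault(token, {})
        entry.setdefault((doc_name, field), []).append(pos)

index('doc 2', 'body', 'the sheep says baa')
index('intro', 'title', 'being google')

print(postings['google'])
# {('intro', 'title'): [2]}
```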
Tokenising #3
Punctuation
Stemming
Stop words
Parts of Speech
Entity Extraction
Markup
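Stop words and stemming from the list above can be sketched naively (the stop list and suffix rules are illustrative only, not a real stemmer such as Porter's):

```python
STOP_WORDS = {'the', 'a', 'an', 'and', 'says', 'say'}  # illustrative only

def stem(token):
    # Extremely naive suffix stripping; a real system would use a
    # Porter-style stemmer instead.
    for suffix in ('ing', 'es', 's'):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def normalise(tokens):
    # Drop stop words, then stem what remains.
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(normalise(['the', 'dogs', 'say', 'woof']))
# ['dog', 'woof']
```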
Logistics
Storage (serialising, transporting, clustering)
Updates
Warming up
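Storage and serialising can be illustrated with the standard library: dump the postings dict to disk so it can be stored, shipped to another node, or loaded to warm a cache (an illustration, not the talk's approach):

```python
import json
import os
import tempfile

postings = {'says': ['doc 1', 'doc 2', 'doc 4'], 'moo': ['doc 1']}

# Write the index out as JSON, then read it back.
path = os.path.join(tempfile.gettempdir(), 'postings.json')
with open(path, 'w') as f:
    json.dump(postings, f)

with open(path) as f:
    loaded = json.load(f)

print(loaded == postings)
# True
```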
ranking
Density (tf-idf)
Position
Date
Relationships
Feedback
Editorial
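Density ranking with tf-idf follows directly from its definition: a term's frequency within a document, weighted by the inverse of how many documents contain it (a minimal sketch over the toy corpus):

```python
import math

docs = {
    'doc 1': ['the', 'cow', 'says', 'moo'],
    'doc 2': ['the', 'sheep', 'says', 'baa'],
    'doc 3': ['the', 'dogs', 'say', 'woof'],
    'doc 4': ['the', 'dog-cow', 'says', 'moof'],
}

def tf_idf(term, doc_name):
    tokens = docs[doc_name]
    tf = tokens.count(term) / len(tokens)          # term frequency
    df = sum(1 for t in docs.values() if term in t)  # document frequency
    idf = math.log(len(docs) / df)                 # inverse document frequency
    return tf * idf

# 'moo' is rare, so it scores higher than the ubiquitous 'the'.
print(tf_idf('moo', 'doc 1') > tf_idf('the', 'doc 1'))
# True
```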
interesting search
Lucene (Hadoop, Solr, Nutch)
OpenFTS / MySQL
Sphinx
Hyper Estraier
Xapian
Other index types
being google
tom dyson