Building A Mini Google High Performance Computing In Ruby Presentation 1
Building Mini‐Google in Ruby @igrigorik #railsconf http://bit.ly/railsconf-pagerank
Building Mini‐Google in Ruby
Ilya Grigorik @igrigorik
postrank.com/topic/ruby
The slides… Twitter My blog
Ruby + Math Optimization
PageRank
Indexing Examples Misc Fun
PageRank PageRank + Ruby
Indexing Examples Tools + Optimization
Consume with care… everything that follows is based on released / public domain info
Search‐engine graveyard: Google did pretty well…
Search pipeline 50,000‐foot view
Query: Ruby
Results
1. Crawl 2. Index 3. Rank
Crawl: bah. Index: fun. Rank: interesting.
circa 1997‐1998
CPU speed: 333 MHz; RAM: 32‐64 MB
Index: 27,000,000 documents; index refresh: once a month-ish; PageRank computation: several days
Laptop: CPU 2.1 GHz; VM RAM: 1 GB; 1‐million‐page web: ~10 minutes
Creating & Maintaining an Inverted Index: DIY and the gotchas within
Building an Inverted Index

    require 'set'

    pages = {
      "1" => "it is what it is",
      "2" => "what is it",
      "3" => "it is a banana"
    }

    index = {}

    pages.each do |page, content|
      content.split(/\s/).each do |word|
        if index[word]
          index[word] << page
        else
          # Set.new expects an enumerable, so wrap the page id in an array
          index[word] = Set.new([page])
        end
      end
    end

    {
      "it"=>#<Set: {"1", "2", "3"}>,
      "a"=>#<Set: {"3"}>,
      "banana"=>#<Set: {"3"}>,
      "what"=>#<Set: {"1", "2"}>,
      "is"=>#<Set: {"1", "2", "3"}>
    }
Word => [Document]
Querying the index

    # query: "what is banana"
    p index["what"] & index["is"] & index["banana"]
    # > #<Set: {}>

    # query: "a banana"
    p index["a"] & index["banana"]
    # > #<Set: {"3"}>

    # query: "what is"
    p index["what"] & index["is"]
    # > #<Set: {"1", "2"}>
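The three queries share one pattern: look up the set for each term, then intersect the sets. A minimal sketch of that pattern as a helper follows. The name `search` is hypothetical (it is not from the talk), and the toy index is rebuilt here only so the snippet runs on its own.

```ruby
require 'set'

# Rebuild the toy index from the slides so this snippet is self-contained.
pages = {
  "1" => "it is what it is",
  "2" => "what is it",
  "3" => "it is a banana"
}

index = Hash.new { |hash, word| hash[word] = Set.new }
pages.each do |page, content|
  content.split.each { |word| index[word] << page }
end

# Hypothetical helper: AND-query over whitespace-separated terms.
# A term missing from the index contributes an empty set, so the
# whole query then matches nothing.
def search(index, query)
  terms = query.split
  return Set.new if terms.empty?

  terms.map { |term| index.fetch(term, Set.new) }.reduce(:&)
end

p search(index, "what is").to_a.sort        # => ["1", "2"]
p search(index, "a banana").to_a.sort       # => ["3"]
p search(index, "what is banana").to_a.sort # => []
```

Note that the result is a set, so it carries no ranking information; ordering the hits is a separate problem.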
What order?
[1, 2] or [2,1]
Hmmm?
PDF, HTML, RSS? Lowercase / Upcase?
Compact Index? Stop words? Persistence?
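Two of these gotchas (case and stop words) can be handled before a word ever reaches the index. The sketch below is one possible approach, not the talk's code: `tokenize`, `build_index`, and the three-word `STOP_WORDS` list are all assumptions made for illustration.

```ruby
require 'set'

# Assumed stop-word list for illustration only; real lists are much longer.
STOP_WORDS = Set["a", "is", "it"].freeze

# Hypothetical tokenizer: lowercase, strip punctuation, drop stop words.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/).reject { |word| STOP_WORDS.include?(word) }
end

# Same inverted-index idea as before, but fed by the tokenizer.
def build_index(pages)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  pages.each do |page, content|
    tokenize(content).each { |word| index[word] << page }
  end
  index
end

pages = {
  "1" => "It is what it is.",
  "2" => "What is it?",
  "3" => "It is a BANANA!"
}

index = build_index(pages)
p index.keys.sort      # => ["banana", "what"]
p index["what"].to_a   # => ["1", "2"]
p index["banana"].to_a # => ["3"]
```

Dropping stop words shrinks the index, at the cost of no longer being able to match queries that consist only of stop words.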
Ferret is a high‐performance, full‐featured text search engine library written for Ruby
    require 'ferret'
    include Ferret

    index = Index::Index.new()

    index << {:title => "1", :content => "it is what it is"}
    index << {:title => "2", :content => "what is it"}
    index << {:title => "3", :content => "it is a banana"}

    index.search_each('content:"banana"') do |id, score|
      puts "Score: #{score}, #{index[id][:title]} "
    end

    > Score: 1.0, 3
Hmmm?
class Ferret::Analysis::Analyzer
class Ferret::Analysis::AsciiLetterAnalyzer
class Ferret::Analysis::AsciiLetterTokenizer
class Ferret::Analysis::AsciiLowerCaseFilter
class Ferret::Analysis::AsciiStandardAnalyzer
class Ferret::Analysis::AsciiStandardTokenizer
class Ferret::Analysis::AsciiWhiteSpaceAnalyzer
class Ferret::Analysis::AsciiWhiteSpaceTokenizer
class Ferret::Analysis::HyphenFilter
class Ferret::Analysis::LetterAnalyzer
class Ferret::Analysis::LetterTokenizer
class Ferret::Analysis::LowerCaseFilter
class Ferret::Analysis::MappingFilter
class Ferret::Analysis::PerFieldAnalyzer
class Ferret::Analysis::RegExpAnalyzer
class Ferret::Analysis::RegExpTokenizer
class Ferret::Analysis::StandardAnalyzer
class Ferret::Analysis::StandardTokenizer
class Ferret::Analysis::StemFilter
class Ferret::Analysis::StopFilter
class Ferret::Analysis::Token
class Ferret::Analysis::TokenStream
class Ferret::Analysis::WhiteSpaceAnalyzer
class Ferret::Analysis::WhiteSpaceTokenizer

class Ferret::Search::BooleanQuery
class Ferret::Search::ConstantScoreQuery
class Ferret::Search::Explanation
class Ferret::Search::Filter
class Ferret::Search::FilteredQuery
class Ferret::Search::FuzzyQuery
class Ferret::Search::Hit
class Ferret::Search::MatchAllQuery
class Ferret::Search::MultiSearcher
class Ferret::Search::MultiTermQuery
class Ferret::Search::PhraseQuery
class Ferret::Search::PrefixQuery
class Ferret::Search::Query
class Ferret::Search::QueryFilter
class Ferret::Search::RangeFilter
class Ferret::Search::RangeQuery
class Ferret::Search::Searcher
class Ferret::Search::Sort
class Ferret::Search::SortField
class Ferret::Search::TermQuery
class Ferret::Search::TopDocs
class Ferret::Search::TypedRangeFilter
class Ferret::Search::TypedRangeQuery
class Ferret::Search::WildcardQuery
ferret.davebalmain.com/trac
Ranking Results 0‐60 with PageRank…
Naïve: Term Frequency

    index.search_each('content:"the brown cow"') do |id, score|
      puts "Score: #{score}, #{index[id][:title]} "
    end

    > Score: 0.827, 3
    > Score: 0.523, 5
    > Score: 0.125, 4

Relevance?

| Term | Doc 3 | Doc 5 | Doc 4 |
|---|---|---|---|
| the | 4 | 3 | 5 |
| brown | 1 | 3 | 1 |
| cow | 1 | 4 | 1 |
| Score | 6 | 10 | 7 |
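TF‐IDF: Term Frequency * Inverse Document Frequency
Skew

Occurrences per document:

| Term | Doc 3 | Doc 5 | Doc 4 |
|---|---|---|---|
| the | 4 | 3 | 5 |
| brown | 1 | 3 | 1 |
| cow | 1 | 4 | 1 |

Total # of documents: 10

| Term | # of docs containing it |
|---|---|
| the | 6 |
| brown | 3 |
| cow | 4 |

Score = TF * IDF
TF = # occurrences / # words
IDF = ln(# docs / # docs with W)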
TF‐IDF Score = 0.204 + 0.120 + 0.092 = 0.416
Total # of documents: 10; # words in document: 10
Doc # 3 score for ‘the’: 4/10 * ln(10/6) = 0.204
Doc # 3 score for ‘brown’: 1/10 * ln(10/3) = 0.120
Doc # 3 score for ‘cow’: 1/10 * ln(10/4) = 0.092
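The arithmetic above can be checked with a few lines of Ruby. This is a sketch that hard-codes the numbers from the slide; the method name `tf_idf` and the constant names are assumptions, and the natural logarithm is used because the worked example uses `ln`.

```ruby
# Numbers copied from the slide: corpus size and per-term document counts.
TOTAL_DOCS = 10
DOCS_WITH_TERM = { "the" => 6, "brown" => 3, "cow" => 4 }.freeze

# Hypothetical helper: TF * IDF with a natural-log IDF.
def tf_idf(occurrences, words_in_doc, docs_with_term, total_docs)
  tf  = occurrences.to_f / words_in_doc
  idf = Math.log(total_docs.to_f / docs_with_term)
  tf * idf
end

# Doc # 3: term occurrences and total word count, as given on the slide.
doc3_counts = { "the" => 4, "brown" => 1, "cow" => 1 }
doc3_length = 10

scores = doc3_counts.map { |term, count|
  [term, tf_idf(count, doc3_length, DOCS_WITH_TERM[term], TOTAL_DOCS)]
}.to_h

scores.each { |term, score| puts format("%-5s %.3f", term, score) }
puts format("total %.3f", scores.values.sum)
```

The printed total is 0.416, matching the slide. Although "the" appears four times, its low IDF keeps it from dominating the score the way it does under raw term frequency.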
Frequency Matrix

| | W1 | W2 | … | WN |
|---|---|---|---|---|
| Doc 1 | 15 | 23 | … | … |
| Doc 2 | 24 | 12 | … | … |
| … | … | … | … | … |
| Doc K | … | … | … | … |

Size = N * K * size of Ruby object. Ouch.
Words = N = 2,000; Pages = K = 10,000; Ruby object = 20+ bytes
Footprint = 384 MB
NArray http://narray.rubyforge.org/
NArray is a numerical N‐dimensional array class (implemented in C)

    NArray.new(typecode, size, ...)  # create new NArray. initialize with 0.
    NArray.byte(size,...)            # 1 byte unsigned integer
    NArray.sint(size,...)            # 2 byte signed integer
    NArray.int(size,...)             # 4 byte signed integer
    NArray.sfloat(size,...)          # single precision float
    NArray.float(size,...)           # double precision float
    NArray.scomplex(size,...)        # single precision complex
    NArray.complex(size,...)         # double precision complex
    NArray.object(size,...)          # Ruby object
PageRank the google juice
Links as votes
Problem: link gaming
Random Surfer: powerful abstraction
Follow link from page he/she is currently on: P = 0.85
Teleport to a random location on the web: P = 0.15
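The random surfer can be simulated directly. The sketch below is a Monte Carlo illustration, not the talk's code: the four-page `LINKS` graph is made up (page names borrowed from the slides), and `random_surf` is a hypothetical helper. The fraction of steps spent on each page approximates that page's PageRank.

```ruby
# Toy link graph (an assumption for illustration): page => pages it links to.
LINKS = {
  "N" => ["K"],
  "M" => ["K"],
  "K" => ["X"],
  "X" => ["K"]
}.freeze

FOLLOW_PROBABILITY = 0.85 # otherwise teleport (P = 0.15)

# Walk the graph for `steps` steps and return the share of visits per page.
def random_surf(links, steps, rng)
  pages = links.keys
  visits = Hash.new(0)
  current = pages.sample(random: rng)

  steps.times do
    out_links = links[current]
    current =
      if !out_links.empty? && rng.rand < FOLLOW_PROBABILITY
        out_links.sample(random: rng) # follow a link from the current page
      else
        pages.sample(random: rng)     # teleport to a random page
      end
    visits[current] += 1
  end

  visits.transform_values { |count| count.to_f / steps }
end

shares = random_surf(LINKS, 100_000, Random.new(42))
shares.sort_by { |_, share| -share }.each do |page, share|
  puts format("%s %.3f", page, share)
end
```

In this toy graph K collects the most visits, since three pages link to it, while N and M are reached only by teleportation.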
Surfin’: rinse & repeat, ad nauseam
On page P, clicks on link to K (P = 0.85)
On page K, clicks on link to M (P = 0.85)
On page M, teleports to X (P = 0.15)
…
Analyzing the Web Graph: extracting PageRank
[Figure: web graph with pages N, M, K, and X; node probabilities P = 0.6, P = 0.15, P = 0.20, P = 0.05]
What is PageRank? It’s a scalar!
What is PageRank? It’s a probability!
[Figure: web graph with pages N, M, K, and X; node probabilities P = 0.6, P = 0.15, P = 0.20, P = 0.05]
Higher Pr, Higher Importance?
Teleportation? sci‐fi fans, … ?
Reasons for teleportation: enumerating edge cases
1. No in‐links!
2. No out‐links!
3. Isolated Web
[Figure: example graph with pages N, M, K, and X illustrating each case]
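Each of the three edge cases can be spotted in a link graph with a few lines of Ruby. The sketch below uses a made-up five-page graph; the page names and the `reachable` helper are assumptions for illustration only.

```ruby
# Toy graph (an assumption for illustration): page => pages it links to.
links = {
  "N" => ["K"], # nobody links to N: no in-links
  "K" => ["X"],
  "X" => [],    # X links nowhere: no out-links
  "A" => ["B"], # A and B only link to each other: isolated web
  "B" => ["A"]
}

linked_to = links.values.flatten.uniq

no_in_links  = links.keys - linked_to
no_out_links = links.select { |_, out| out.empty? }.keys

# Pages reachable from a start page by following links only (no teleporting).
def reachable(links, start)
  seen = [start]
  queue = [start]
  until queue.empty?
    links.fetch(queue.shift, []).each do |page|
      next if seen.include?(page)

      seen << page
      queue << page
    end
  end
  seen
end

p no_in_links                         # => ["N"]
p no_out_links                        # => ["X"]
p links.keys - reachable(links, "N")  # => ["A", "B"]
```

Without teleportation, a surfer starting at N would get stuck on X and would never see A or B; the random jump is what lets every page receive a nonzero score.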
Exploring Graphs: gratr.rubyforge.com
• Breadth First Search
• Depth First Search
• A* Search
• Lexicographic Search
• Dijkstra’s Algorithm
• Floyd‐Warshall
• Triangulation and Comparability detection

    require 'gratr/import'

    dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6]

    dg.directed?   # true
    dg.vertex?(4)  # true
    dg.edge?(2,4)  # true
    dg.vertices    # [5, 6, 1, 2, 3, 4]

    Graph[1,2,1,3,1,4,2,5].bfs  # [1, 2, 3, 4, 5]
    Graph[1,2,1,3,1,4,2,5].dfs  # [1, 2, 5, 3, 4]
Teleportation probabilities
P(T) = 0.15 / # of pages
[Figure: five-page graph (N, M, K, X, M), each page labeled P(T) = 0.03, i.e. 0.15 / 5]
PageRank: Simplified Mathematical Def’n (cause that’s how we roll)
Assume the web is N pages big.
Assume that the probability of teleportation (t) is 0.15, and of following a link (s) is 0.85.
Assume that the teleportation probability (E) is uniform.
Assume that you start on any random page (uniform distribution L).
Then after one step, the probability you’re on page X is:
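Under these assumptions, the one-step probability can be sketched as follows. This is a reconstruction from the definitions above, not a formula quoted from the slide; it assumes G is the link matrix in column-stochastic form ($G_{ij}$ is the probability of moving from page $j$ to page $i$) and that L and E are uniform column vectors.

```latex
P_1 \;=\; s\,G\,L \;+\; t\,E,
\qquad L = E = \tfrac{1}{N}\,\mathbf{1},
\qquad s = 0.85,\quad t = 1 - s = 0.15
```

The entry of $P_1$ for page X is then the probability of being on X after one step: either a link was followed into X (first term) or the surfer teleported there (second term).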
G = The Link Graph: ginormous and sparse

| | 1 | 2 | … | N |
|---|---|---|---|---|
| 1 | 1 | 0 | … | 0 |
| 2 | 0 | 1 | … | 1 |
| … | … | … | … | … |
| N | 0 | 1 | … | 1 |

Link Graph: the 0 in row 1, column N means no link from 1 to N.
Huge!
G as a dictionary: more compact…

    {
      "1" => [25, 26],
      "2" => [1],
      "5" => [123,2],
      "6" => [67, 1]
    }

Page => Links to…
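When a matrix is needed for the math, the dictionary can be expanded on demand. The sketch below is one way to do that, under two assumptions that are not from the talk: every linked-to page also appears as a key, and each page splits its vote evenly among its out-links. `transition_matrix` is a hypothetical helper name.

```ruby
# Link graph as a dictionary: page id => ids of the pages it links to.
# Small made-up example so the matrix stays readable.
links = { 1 => [2, 3], 2 => [3], 3 => [3] }

# Hypothetical helper: expand the dictionary into a dense N x N matrix in
# which column j holds the probabilities of moving from page j to page i.
def transition_matrix(links)
  pages = links.keys.sort
  position = pages.each_with_index.to_h
  matrix = Array.new(pages.size) { Array.new(pages.size, 0.0) }

  links.each do |from, targets|
    targets.each do |to|
      matrix[position[to]][position[from]] += 1.0 / targets.size
    end
  end
  matrix
end

transition_matrix(links).each { |row| p row }
# [0.0, 0.0, 0.0]
# [0.5, 0.0, 0.0]
# [0.5, 1.0, 1.0]
```

The dictionary stores only the links that exist, so its size grows with the number of links rather than with N squared.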
Computing PageRank: the tedious way
Follow link from page he/she is currently on.
Teleport to a random location on the web.
Computing PageRank: in one swoop
Identity matrix
Don’t trust me! Verify it yourself!
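One way to verify it, sketched under the assumption that the steady-state PageRank vector $r$ is the distribution left unchanged by one more step of the random surfer (with $G$ column-stochastic, $E$ the uniform teleportation vector, and $I$ the identity matrix):

```latex
\begin{aligned}
r &= s\,G\,r + t\,E \\
(I - s\,G)\,r &= t\,E \\
r &= t\,(I - s\,G)^{-1}\,E
\end{aligned}
```

The inverse exists because $s < 1$: the entries of each column of $sG$ sum to at most $s$, so $I - sG$ cannot be singular. The last line has the same shape as the expression `t*((i-s*g).invert)*p` in the talk's Ruby implementation.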
Enough hand‐waving, dammit! show me the code
Birth of EM‐Proxy flash of the obvious
Hot, Fast, Awesome
http://rb-gsl.rubyforge.org/
Click there! … Give yourself a weekend.
http://ruby-gsl.sourceforge.net/
PageRank in Ruby: 6 lines, or less

    require "gsl"
    include GSL

    # INPUT: link structure matrix (NxN)
    # OUTPUT: pagerank scores
    def pagerank(g)
      raise if g.size1 != g.size2

      i = Matrix.I(g.size1)                       # identity matrix
      p = (1.0/g.size1) * Matrix.ones(g.size1,1)  # teleportation vector

      s = 0.85  # probability of following a link
      t = 1-s   # probability of teleportation

      t*((i-s*g).invert)*p
    end
Verify NxN
Constants…
PageRank!
Ex: Circular Web (testing intuition…)

    pagerank(Matrix[[0,0,1], [1,0,0], [0,1,0]])
    > [0.33, 0.33, 0.33]

[Figure: pages N, K, and X linked in a cycle, each with P = 0.33]
Ex: All roads lead to K (testing intuition…)

    pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
    > [0.05, 0.07, 0.87]

[Figure: pages N (P = 0.05) and X (P = 0.07) both lead to K (P = 0.87)]
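These numbers can be cross-checked without installing GSL. The sketch below swaps the matrix inversion for power iteration (repeatedly applying one random-surfer step until the scores stop changing), using plain Ruby arrays. `pagerank_iterative` and its keyword arguments are hypothetical names, and the matrix is read as column-stochastic (`g[i][j]` is the probability of moving from page j to page i).

```ruby
# Power iteration: repeat r <- s*G*r + t*p until r stops changing.
# p is the uniform teleportation vector, so each entry is 1/n.
def pagerank_iterative(g, s: 0.85, tolerance: 1e-10, max_iterations: 1_000)
  n = g.size
  t = 1.0 - s
  teleport = 1.0 / n
  rank = Array.new(n, teleport) # start from the uniform distribution

  max_iterations.times do
    updated = Array.new(n) do |i|
      s * (0...n).sum { |j| g[i][j] * rank[j] } + t * teleport
    end
    change = updated.zip(rank).sum { |a, b| (a - b).abs }
    rank = updated
    break if change < tolerance
  end
  rank
end

all_roads = [[0, 0, 0], [0.5, 0, 0], [0.5, 1, 1]]
p pagerank_iterative(all_roads).map { |x| x.round(2) } # => [0.05, 0.07, 0.88]
```

The third score converges to 0.87875, which rounds to 0.88; the 0.87 on the slide looks like the same value truncated. For large, sparse link graphs, iteration is also the cheaper route, since inverting an N x N matrix grows roughly with N cubed.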
PageRank + Ferret: awesome search, FTW!
    require 'ferret'
    include Ferret

    index = Index::Index.new()

    index << {:title => "1", :content => "it is what it is", :pr => 0.05 }
    index << {:title => "2", :content => "what is it", :pr => 0.07 }
    index << {:title => "3", :content => "it is a banana", :pr => 0.87 }

Store PageRank
[Figure: pages 1, 2, and 3 with P = 0.05, P = 0.07, and P = 0.87]
    index.search_each('content:"world"') do |id, score|
      puts "Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})"
    end

    puts "*" * 50

    sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)

    index.search_each('content:"world"', :sort => sf_pr) do |id, score|
      puts "Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})"
    end

    # Score: 0.267119228839874, 3 (PR: 0.87)
    # Score: 0.17807948589325, 1 (PR: 0.05)
    # Score: 0.17807948589325, 2 (PR: 0.07)
    # ***********************************
    # Score: 0.267119228839874, 3, (PR: 0.87)
    # Score: 0.17807948589325, 2, (PR: 0.07)
    # Score: 0.17807948589325, 1, (PR: 0.05)
TF‐IDF Search
PageRank FTW!
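The effect of the PageRank sort can be reproduced in plain Ruby, which makes the reordering easy to see. The sketch below copies the three hits from the output above into an array of hashes; the tie-breaking variant at the end is an assumption for illustration, not something shown in the talk.

```ruby
# Hits copied from the output above: Ferret score plus the stored PageRank.
hits = [
  { title: "3", score: 0.267119228839874, pr: 0.87 },
  { title: "1", score: 0.17807948589325,  pr: 0.05 },
  { title: "2", score: 0.17807948589325,  pr: 0.07 }
]

# Order by PageRank, highest first, mirroring :reverse => true on the sort field.
by_pagerank = hits.sort_by { |hit| -hit[:pr] }
p by_pagerank.map { |hit| hit[:title] } # => ["3", "2", "1"]

# Alternative (an assumption, not from the talk): keep the text score as the
# primary key and use PageRank only to break ties between equal scores.
by_score_then_pr = hits.sort_by { |hit| [-hit[:score], -hit[:pr]] }
p by_score_then_pr.map { |hit| hit[:title] } # => ["3", "2", "1"]
```

Documents 1 and 2 have identical text scores, so PageRank is what decides that 2 should come before 1.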
Others
![Page 64: Building A Mini Google High Performance Computing In Ruby Presentation 1](https://reader033.fdocuments.in/reader033/viewer/2022060116/5580bcadd8b42ac6088b507a/html5/thumbnails/64.jpg)
Search*: Graphs are ubiquitous! PageRank is a general purpose hammer
![Page 65: Building A Mini Google High Performance Computing In Ruby Presentation 1](https://reader033.fdocuments.in/reader033/viewer/2022060116/5580bcadd8b42ac6088b507a/html5/thumbnails/65.jpg)
PageRank + Social Graph GitHub
    Username        GitCred
    ==============================
    37signals       10.00
    imbriaco         9.76
    why              8.74
    rails            8.56
    defunkt          8.17
    technoweenie     7.83
    jeresig          7.60
    mojombo          7.51
    yui              7.34
    drnic            7.34
    pjhyett          6.91
    wycats           6.85
    dhh              6.84

http://bit.ly/3YQPU
![Page 66: Building A Mini Google High Performance Computing In Ruby Presentation 1](https://reader033.fdocuments.in/reader033/viewer/2022060116/5580bcadd8b42ac6088b507a/html5/thumbnails/66.jpg)
PageRank + Social Graph Twitter
Hmm…
Analyze the social graph:
- Filter messages by ‘TwitterRank’
- Suggest users by ‘TwitterRank’
- …
![Page 67: Building A Mini Google High Performance Computing In Ruby Presentation 1](https://reader033.fdocuments.in/reader033/viewer/2022060116/5580bcadd8b42ac6088b507a/html5/thumbnails/67.jpg)
PageRank + Product Graph E‐commerce
Link items purchased in same cart… Run PR on it.
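One way to read this slide, as a minimal sketch: treat every pair of items bought in the same cart as a two-way link, then run a plain power-iteration PageRank over the resulting graph. The cart contents, the `co_purchase_graph` and `pagerank` helper names, and the 0.85 damping factor are assumptions for illustration, not code from the talk.

```ruby
# Build an undirected co-purchase graph: each pair of items that shares a cart
# links to each other. Repeated pairs are kept, so frequent co-purchases
# carry more weight.
def co_purchase_graph(carts)
  graph = Hash.new { |hash, key| hash[key] = [] }
  carts.each do |cart|
    cart.uniq.combination(2) do |a, b|
      graph[a] << b
      graph[b] << a
    end
  end
  graph
end

# Plain power-iteration PageRank over an adjacency-list hash.
# Every node in a co-purchase graph has at least one link, so this sketch
# does not handle dangling nodes.
def pagerank(graph, damping: 0.85, iterations: 50)
  nodes = graph.keys
  rank = nodes.to_h { |n| [n, 1.0 / nodes.size] }
  iterations.times do
    nxt = nodes.to_h { |n| [n, (1.0 - damping) / nodes.size] }
    graph.each do |node, links|
      share = damping * rank[node] / links.size
      links.each { |target| nxt[target] += share }
    end
    rank = nxt
  end
  rank
end

# Made-up carts for illustration.
carts = [
  %w[laptop mouse bag],
  %w[laptop mouse],
  %w[mouse keyboard],
  %w[laptop keyboard]
]

pagerank(co_purchase_graph(carts))
  .sort_by { |_, score| -score }
  .each { |item, score| puts format("%-10s %.4f", item, score) }
```

Items that share carts with many other well-connected items float to the top, which is one plausible basis for a "popular together" ranking.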
![Page 68: Building A Mini Google High Performance Computing In Ruby Presentation 1](https://reader033.fdocuments.in/reader033/viewer/2022060116/5580bcadd8b42ac6088b507a/html5/thumbnails/68.jpg)
PageRank = Powerful Hammer use it!
![Page 69: Building A Mini Google High Performance Computing In Ruby Presentation 1](https://reader033.fdocuments.in/reader033/viewer/2022060116/5580bcadd8b42ac6088b507a/html5/thumbnails/69.jpg)
Personalization how would you do it?
![Page 70: Building A Mini Google High Performance Computing In Ruby Presentation 1](https://reader033.fdocuments.in/reader033/viewer/2022060116/5580bcadd8b42ac6088b507a/html5/thumbnails/70.jpg)
PageRank + Personalization customize the teleportation vector
Teleportation distribution doesn’t have to be uniform!
yahoo.com is my homepage!
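A minimal sketch of what a non-uniform teleportation vector does, assuming a tiny made-up web: when the random surfer gets bored, they jump according to a `teleport` distribution instead of picking a page uniformly. The `personalized_pagerank` helper, the placeholder domains, and the 0.85 damping factor are hypothetical; only the yahoo.com homepage comes from the slide.

```ruby
# Power-iteration PageRank with an optional teleportation distribution.
# `graph` maps each page to the pages it links to; every link target must
# also be a key. `teleport` maps page => probability and should sum to 1;
# nil means the usual uniform distribution.
def personalized_pagerank(graph, teleport: nil, damping: 0.85, iterations: 100)
  nodes = graph.keys
  teleport ||= nodes.to_h { |n| [n, 1.0 / nodes.size] }
  rank = nodes.to_h { |n| [n, 1.0 / nodes.size] }

  iterations.times do
    # Rank sitting on pages without outlinks is redistributed through the
    # teleport vector as well.
    dangling = nodes.sum { |n| graph[n].empty? ? rank[n] : 0.0 }
    jump_mass = 1.0 - damping + damping * dangling
    nxt = nodes.to_h { |n| [n, jump_mass * teleport.fetch(n, 0.0)] }

    graph.each do |node, links|
      next if links.empty?

      share = damping * rank[node] / links.size
      links.each { |target| nxt[target] += share }
    end
    rank = nxt
  end
  rank
end

# Hypothetical four-page web; nobody links to yahoo.com.
web = {
  "yahoo.com"    => ["news.example"],
  "news.example" => ["blog.example", "shop.example"],
  "blog.example" => ["news.example"],
  "shop.example" => ["news.example"]
}

uniform = personalized_pagerank(web)
mine    = personalized_pagerank(web, teleport: { "yahoo.com" => 1.0 })

web.each_key do |page|
  puts format("%-14s uniform: %.4f  personalized: %.4f", page, uniform[page], mine[page])
end
```

With all teleportation mass on the homepage, yahoo.com and the pages reachable from it gain rank for this one user, even though the link graph itself is unchanged.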
![Page 71: Building A Mini Google High Performance Computing In Ruby Presentation 1](https://reader033.fdocuments.in/reader033/viewer/2022060116/5580bcadd8b42ac6088b507a/html5/thumbnails/71.jpg)
Gaming PageRank for fun and profit (I don’t endorse it)
Make pages with links!
http://bit.ly/pagerank-spam
![Page 72: Building A Mini Google High Performance Computing In Ruby Presentation 1](https://reader033.fdocuments.in/reader033/viewer/2022060116/5580bcadd8b42ac6088b507a/html5/thumbnails/72.jpg)
Questions?
The slides… Twitter My blog
Slides: http://bit.ly/railsconf-pagerank
Ferret: http://bit.ly/ferret
RB-GSL: http://bit.ly/rb-gsl
PageRank on Wikipedia: http://bit.ly/wp-pagerank
Gaming PageRank: http://bit.ly/pagerank-spam
Michael Nielsen’s lectures on PageRank: http://michaelnielsen.org/blog