Page Rank
![Page 1: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/1.jpg)
![Page 2: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/2.jpg)
Page Rank
![Page 3: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/3.jpg)
Overview
• Two-dimensional arrays
• Monte Carlo algorithms
• Searching the world wide web
• Big data
• Page rank
Goal: we will write a program to compute the relevancy of WWW documents based on the static structure of the WWW.
![Page 4: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/4.jpg)
Two Dimensional Arrays
• Significance (a topic on the AP Computer Science A exam)
• Syntax
• Example of matrix multiplication
• Arrays of arrays
![Page 5: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/5.jpg)
Significance of Two Dimensional Arrays
• Tables; for instance, assignments for each student in a class, quarterly sales for each item in inventory, etc.
• Matrices and binary relations in mathematics. For example, is there a direct road from city1 in the USA to city2 in the USA?
• For our goal in this section, we will need the number of links from doc1 in the WWW to doc2 in the WWW.
![Page 6: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/6.jpg)
Syntax
• int[][] frequency = new int[26][26];
• Elements are accessed as frequency[4][7] and not frequency[4,7]
• Array indices in Java (as in C, C++, and C#) always begin with 0; in other words, the element with index 1 is the second element of the array.
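The syntax above can be tried out in a small program. The sketch below assumes, purely for illustration, that the 26-by-26 `frequency` table counts how often one letter is followed by another; the slide does not say what the table holds, and the class and method names are hypothetical.

```java
class LetterPairs {
    /** Counts how often each letter pair (a-z followed by a-z) occurs in text. */
    static int[][] countPairs(String text) {
        int[][] frequency = new int[26][26];      // every element starts at 0
        for (int i = 0; i + 1 < text.length(); i++) {
            char first = Character.toLowerCase(text.charAt(i));
            char second = Character.toLowerCase(text.charAt(i + 1));
            if (first >= 'a' && first <= 'z' && second >= 'a' && second <= 'z') {
                // Two pairs of brackets: [row][column], never [row, column].
                frequency[first - 'a'][second - 'a']++;
            }
        }
        return frequency;
    }

    public static void main(String[] args) {
        int[][] frequency = countPairs("the theory");
        // 'e' is the fifth letter but has index 4, because indices begin at 0.
        System.out.println("e followed by h: " + frequency[4][7]);
        System.out.println("t followed by h: " + frequency['t' - 'a']['h' - 'a']);
    }
}
```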
![Page 7: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/7.jpg)
Matrix multiplication
![Page 8: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/8.jpg)
Matrix Multiplication Exercise
• http://cs.fit.edu/~ryan/java/programs/basic_algorithms/MatrixMultiplication2.java
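For reference while working the exercise, here is a minimal sketch of matrix multiplication with two-dimensional arrays. It is not the contents of MatrixMultiplication2.java, which is not reproduced in this deck; the class name and the sample matrices are made up for illustration.

```java
class MatrixMultiply {
    /** Returns the product a*b, where a is n-by-m and b is m-by-p. */
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, m = b.length, p = b[0].length;
        if (a[0].length != m) {
            throw new IllegalArgumentException("columns of a must equal rows of b");
        }
        double[][] c = new double[n][p];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < p; j++) {
                // Entry (i, j) is the dot product of row i of a and column j of b.
                for (int k = 0; k < m; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        for (double[] row : multiply(a, b)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```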
![Page 9: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/9.jpg)
Arrays of Arrays
• Logically: arrays of arrays in the tradition of C and C++. Very simple.
• Unfortunately: introduces pointers, memory allocation, etc. Very complicated.
![Page 10: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/10.jpg)
Monte Carlo Methods
• Introduction
• The example of a Monte Carlo estimate for Pi (Java exercise). Fair shuffling (Java exercise). Random walk (important in financial analysis)
• Used in path tracing to create realistic images
• Percolation: an example of the power of a Monte Carlo algorithm
Goal: we will write a Monte Carlo algorithm to estimate the relevancy of WWW documents based on the static structure of the WWW.
![Page 11: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/11.jpg)
Monte Carlo Casino
• The name refers to the grand casino at Monte Carlo in the Principality of Monaco, which is well known around the world as an icon of gambling.
![Page 12: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/12.jpg)
Monte Carlo estimate for Pi
Java exercise: http://cs.fit.edu/~ryan/java/programs/basic_algorithms/ComputePi2.java
Since we know the value of pi, it is not really necessary to invent an algorithm to estimate its value.
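A common way to set up this estimate, sketched here as an assumption about the approach rather than as the contents of ComputePi2.java: throw random points into the unit square and count the fraction that land inside the quarter circle of radius 1. That fraction approaches pi/4. The class name, the method name, and the seed parameter are hypothetical.

```java
import java.util.Random;

class MonteCarloPi {
    /** Estimates pi from the fraction of random points that fall inside a quarter circle. */
    static double estimate(int trials, long seed) {
        Random rng = new Random(seed);           // fixed seed makes a run repeatable
        int inside = 0;
        for (int t = 0; t < trials; t++) {
            double x = rng.nextDouble();         // point in the unit square
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;                        // point lies within radius 1 of the origin
            }
        }
        // The quarter circle has area pi/4 and the square has area 1.
        return 4.0 * inside / trials;
    }

    public static void main(String[] args) {
        for (int trials = 100; trials <= 1_000_000; trials *= 100) {
            System.out.println(trials + " trials: " + estimate(trials, 2013L));
        }
    }
}
```

The error shrinks only with the square root of the number of trials, so each additional correct digit costs roughly 100 times more work.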
![Page 13: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/13.jpg)
Fair shuffling (Java exercise)
• How would you test an algorithm for shuffling, say, cards? In particular, how would you know whether all of the many possible results are equally likely?
• Main program: http://cs.fit.edu/~ryan/java/programs/basic_algorithms/Experiment.java. Nothing to write; requires the method to shuffle.
• http://cs.fit.edu/~ryan/java/programs/basic_algorithms/Shuffle.java contains two methods of shuffling cards.
• Run the experiment with multiple trials and convince yourself both methods are fair.
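One way to run such an experiment is to shuffle a very small deck many times and tally how often each ordering appears; a fair shuffle should produce all orderings about equally often. The sketch below uses the Fisher-Yates shuffle as a stand-in, since the two methods in Shuffle.java are not shown in this deck. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class ShuffleExperiment {
    /** Fisher-Yates shuffle: swaps each position with a random position at or before it. */
    static void shuffle(int[] cards, Random rng) {
        for (int i = cards.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);          // uniform in 0..i, inclusive
            int tmp = cards[i];
            cards[i] = cards[j];
            cards[j] = tmp;
        }
    }

    /** Shuffles a fresh deck of n cards `trials` times and counts each resulting order. */
    static Map<String, Integer> experiment(int n, int trials, long seed) {
        Random rng = new Random(seed);
        Map<String, Integer> counts = new TreeMap<>();
        for (int t = 0; t < trials; t++) {
            int[] cards = new int[n];
            for (int i = 0; i < n; i++) cards[i] = i;
            shuffle(cards, rng);
            counts.merge(Arrays.toString(cards), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three cards have 3! = 6 orderings; each should appear about 1/6 of the time.
        experiment(3, 60_000, 1L).forEach((order, count) ->
                System.out.println(order + " " + count));
    }
}
```

A frequent mistake is to pick `j` from the whole deck on every step; that variant is biased, and an experiment like this one exposes it.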
![Page 14: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/14.jpg)
Percolation Theory
Percolation. Pour liquid on top of some porous material. Will liquid reach the bottom? Many applications in chemistry, materials science, etc.
• Spread of forest fires.
• Natural gas through semi-porous rock.
• Flow of electricity through network of resistors.
• Permeation of gas in coal mine through a gas mask filter.
![Page 15: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/15.jpg)
Percolation Theory
Given an N-by-N system where each site is vacant with probability p, what is the probability that the system percolates?
Remark. Famous open question in statistical physics. No known mathematical solution. Computational thinking creates new science.
Recourse. Take a computational approach: Monte Carlo simulation. The simulation uses a recursive depth-first search (DFS) algorithm, which diverges from the present topic. (Recursion is a topic on the AP Computer Science A exam.)
p = 0.3 (does not percolate)
p = 0.4 (does not percolate)
p = 0.5 (does not percolate)
p = 0.6 (percolates)
p = 0.7 (percolates)
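The Monte Carlo approach can be sketched as follows: generate many random grids, test each one with a recursive depth-first search, and report the fraction that percolate. The sketch assumes liquid may move between vacant sites that share an edge (up, down, left, right); the slide does not state which moves are allowed, and the names are hypothetical.

```java
import java.util.Random;

class PercolationEstimate {
    /** Returns true if vacant sites connect the top row to the bottom row. */
    static boolean percolates(boolean[][] vacant) {
        int n = vacant.length;
        boolean[][] full = new boolean[n][n];
        for (int col = 0; col < n; col++) {
            flow(vacant, full, 0, col);          // pour liquid into every top site
        }
        for (int col = 0; col < n; col++) {
            if (full[n - 1][col]) return true;   // liquid reached the bottom row
        }
        return false;
    }

    /** Recursive depth-first search that marks every site the liquid can reach. */
    private static void flow(boolean[][] vacant, boolean[][] full, int row, int col) {
        int n = vacant.length;
        if (row < 0 || row >= n || col < 0 || col >= n) return;   // off the grid
        if (!vacant[row][col] || full[row][col]) return;          // blocked or already visited
        full[row][col] = true;
        flow(vacant, full, row + 1, col);
        flow(vacant, full, row, col + 1);
        flow(vacant, full, row, col - 1);
        flow(vacant, full, row - 1, col);
    }

    /** Fraction of `trials` random n-by-n grids that percolate when sites are vacant with probability p. */
    static double estimate(int n, double p, int trials, long seed) {
        Random rng = new Random(seed);
        int count = 0;
        for (int t = 0; t < trials; t++) {
            boolean[][] vacant = new boolean[n][n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    vacant[i][j] = rng.nextDouble() < p;
                }
            }
            if (percolates(vacant)) count++;
        }
        return (double) count / trials;
    }

    public static void main(String[] args) {
        for (double p = 0.3; p < 0.75; p += 0.1) {
            System.out.printf("p = %.1f: %.3f%n", p, estimate(20, p, 1000, 1L));
        }
    }
}
```

The recursion depth can reach the number of sites in the grid, so very large grids would need an explicit stack instead of recursion.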
![Page 16: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/16.jpg)
We will examine a Monte Carlo algorithm for estimating the relevancy of WWW documents.
![Page 17: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/17.jpg)
Random Walk
• Page rank can be computed much like a random walk
• See the Java applet (1 dim) at http://www.math.uah.edu/stat/applets/RandomWalkExperiment.html
• See the Java applet (2 dim) at http://vlab.infotech.monash.edu.au/simulations/swarms/random-walk/
![Page 18: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/18.jpg)
Searching the World Wide Web
• History of Search Engines
• Hypertext
• Crawling the World Wide Web
• Indexing
![Page 19: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/19.jpg)
History of Search Engines
• History of Search by Larry Kim of WordStream
![Page 20: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/20.jpg)
Markup and Hypertext
• Documents served up through the WWW are generally “marked up” for presentation in a structured, standard format called hypertext markup language (HTML).
• The most important feature of HTML is the referencing (via URLs) of other WWW documents, which enables easy, non-sequential, and varied paths of reading the documents.
![Page 21: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/21.jpg)
Hypertext
![Page 22: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/22.jpg)
WWW Spiders
• Google and others continually crawl the WWW, recording what they see to enable searching.
![Page 23: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/23.jpg)
44% of hits and 35% of bandwidth are attributable to bots (and other odd things).
July 2013 (up to 9:30 am 26 Jul 2013) on the WWW server cs.fit.edu
Russian search engine
![Page 24: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/24.jpg)
Indexing
• Finding a relevant document in a vast ocean of linked HTML documents requires a very large index.
• An index is a (sorted) list of keywords (terms), each with the list of values (URLs) of the documents that contain it.
![Page 25: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/25.jpg)
An example index of WWW documents
Bourgeois: .../manifesto.txt
Hero: …/lilwomen.txt, …/muchado.txt, …/war+peace.txt
His: .../manifesto.txt, …/lilwomen.txt, …/mobydick.txt, …/muchado.txt, …/war+peace.txt
Treachery: …/war+peace.txt
Whale: …/mobydick.txt
Yellowish: …/lilwomen.txt, …/war+peace.txt
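An index of this shape can be sketched with a sorted map from each term to a sorted set of documents. The class below is an illustration only: the file names echo the example above, but the one-line "documents" fed to it are made-up sample text, and the rule for splitting words is a simplifying assumption.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndex {
    // TreeMap keeps the terms sorted; TreeSet keeps each term's documents sorted.
    private final Map<String, Set<String>> index = new TreeMap<>();

    /** Records every word of `text` as occurring in the document named `url`. */
    void add(String url, String text) {
        // Assumption: a word is a run of letters, digits, or apostrophes, lowercased.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(url);
            }
        }
    }

    /** Returns the documents that contain `term`, or an empty set if none do. */
    Set<String> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add("mobydick.txt", "The whale and his yellowish teeth");   // made-up sample text
        idx.add("manifesto.txt", "The bourgeois and his property");     // made-up sample text
        System.out.println("his -> " + idx.lookup("his"));
        System.out.println("whale -> " + idx.lookup("Whale"));
    }
}
```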
![Page 26: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/26.jpg)
Several Issues
• Pick out the words from the mark-up
• What’s a word? 2nd, abc’s, CSTA
• Normalize: lowercase, stemming
• Some words are not worth indexing
  • “the”, “a”, etc.
  • A so-called stop list, e.g., words ignored in Wikipedia search
• Java exercise: http://cs.fit.edu/~ryan/java/programs/xml/URLtoText.java
First some preliminary remarks before doing the exercise.
![Page 27: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/27.jpg)
Searching and Sorting
Problem: Determine if the word is in the stop list. What is the best approach?
• Searching: linear search, binary search. (These are topics on the AP Computer Science A exam.) Binary search requires the data (the index, for example) to be sorted.
• Sorting: selection sort, insertion sort, merge sort, quick sort; external sorting. (The first three of these sorts are topics on the AP Computer Science A exam.)
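For the stop-list problem, binary search over a sorted array might look like the sketch below. The handful of stop words is a made-up sample, and the class name is hypothetical; the search is written out by hand rather than calling a library method so that each step is visible.

```java
import java.util.Arrays;

class StopList {
    /** Binary search: returns true if `word` occurs in the sorted array `sortedWords`. */
    static boolean contains(String[] sortedWords, String word) {
        int lo = 0;
        int hi = sortedWords.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;                 // middle of the remaining range
            int cmp = word.compareTo(sortedWords[mid]);
            if (cmp == 0) return true;
            if (cmp < 0) hi = mid - 1;                    // word sorts before the middle
            else lo = mid + 1;                            // word sorts after the middle
        }
        return false;                                     // range is empty: not present
    }

    public static void main(String[] args) {
        String[] stopWords = {"the", "a", "of", "and", "to", "in"};   // made-up sample list
        Arrays.sort(stopWords);                           // binary search requires sorted data
        System.out.println(contains(stopWords, "the"));
        System.out.println(contains(stopWords, "whale"));
    }
}
```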
![Page 28: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/28.jpg)
Linear versus Binary Search
Suppose each comparison takes one millisecond (0.001 s).
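Under that assumption the difference can be worked out directly. The sketch below counts worst-case comparisons: linear search may examine all n items, while binary search halves the remaining range each time and needs at most floor(log2 n) + 1 comparisons. The list sizes are arbitrary examples, and the names are hypothetical.

```java
class SearchCost {
    /** Worst case for linear search: every one of the n items is compared. */
    static long linearWorstCase(long n) {
        return n;
    }

    /** Worst case for binary search: floor(log2 n) + 1 comparisons, for n >= 1. */
    static int binaryWorstCase(long n) {
        int comparisons = 0;
        while (n > 0) {
            n /= 2;                 // each comparison halves the remaining range
            comparisons++;
        }
        return comparisons;
    }

    public static void main(String[] args) {
        long[] sizes = {1_000L, 1_000_000L, 1_000_000_000L};
        for (long n : sizes) {
            // One millisecond per comparison, so comparisons / 1000 gives seconds.
            System.out.printf("n = %,d: linear %.1f s, binary %.3f s%n",
                    n, linearWorstCase(n) / 1000.0, binaryWorstCase(n) / 1000.0);
        }
    }
}
```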
![Page 29: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/29.jpg)
Linear versus Binary Search
![Page 30: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/30.jpg)
Linear versus Binary Search
![Page 31: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/31.jpg)
Obama at Google
• https://www.youtube.com/watch?v=k4RRi_ntQc8
![Page 32: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/32.jpg)
Sorting Demo
• http://cs.fit.edu/~ryan/cse1002/sort.html
• See also sorting illustrated with folk dancers by Algo-rythmics: http://algo-rythmics.ms.sapientia.ro
![Page 33: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/33.jpg)
Now do the exercise
• Java exercise: http://cs.fit.edu/~ryan/java/programs/xml/URLtoText.java
• PS. How do students really program?
• http://xkcd.com/1185 Observe the tool tip!
![Page 34: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/34.jpg)
OK, we have a keyword index. For most terms, we are still likely to have a “gazillion” documents. (See Googlewhacks and Googlewhackblatts: search terms of one or two words that return exactly one document.)
How do we find the most relevant pages?
![Page 35: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/35.jpg)
Big Data
• The problem
• Count-Min Algorithm
![Page 36: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/36.jpg)
The problem with Big Data
Consider a popular website which wants to keep track of statistics on the queries used to search the site. One could keep track of the full log of queries, and answer exactly the frequency of any search query at the site. However, the log can quickly become very large. This problem is an instance of the count tracking problem. Even known sophisticated solutions for fast querying such as a tree-structure or hash table to count up the multiple occurrences of the same query, can prove to be slow and wasteful of resources. Notice that in this scenario, we can tolerate a little imprecision. In general, we are interested only in the queries that are asked frequently. So it is acceptable if there is some fuzziness in the counts. Thus, we can tradeoff some precision in the answers for a more efficient and lightweight solution. This tradeoff is at the heart of sketches.
Cormode and Muthukrishnan, 2011
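The Count-Min algorithm named in the outline is one such sketch. It keeps a small two-dimensional array of counters: each row hashes the query to one column with a different hash function, adding a query increments one counter per row, and the estimate is the minimum of those counters. Collisions can only inflate a counter, so the estimate never falls below the true count. The version below is a simplified illustration: the hash mixing is an ad hoc assumption, not the pairwise-independent family used in the analysis of the algorithm, and all names are hypothetical.

```java
import java.util.Random;

class CountMinSketch {
    private final int depth;          // number of rows (hash functions)
    private final int width;          // number of counters per row
    private final long[][] counts;    // the two-dimensional array of counters
    private final int[] seeds;        // one seed per row (assumed hashing scheme)

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seeds = new int[depth];
        Random rng = new Random(seed);
        for (int row = 0; row < depth; row++) seeds[row] = rng.nextInt();
    }

    /** Maps an item to a column of the given row. */
    private int bucket(String item, int row) {
        int h = item.hashCode() * 31 + seeds[row];
        h ^= (h >>> 16);              // simple bit mixing, for illustration only
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    /** Records one occurrence of item. */
    void add(String item) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(item, row)]++;
        }
    }

    /** Estimated count: never less than the true count, possibly more. */
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(item, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch sketch = new CountMinSketch(4, 1000, 1L);
        for (int i = 0; i < 5; i++) sketch.add("page rank");
        sketch.add("monte carlo");
        System.out.println("page rank: about " + sketch.estimate("page rank"));
    }
}
```

Memory use is fixed at depth times width counters no matter how many distinct queries arrive, which is the trade of precision for space that the quotation describes.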
![Page 37: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/37.jpg)
![Page 38: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/38.jpg)
![Page 39: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/39.jpg)
Page Rank
• Not finding pages, but ordering the found pages
• Makes a big difference in the user’s experience, if “good” or “relevant” pages come first.
• The upcoming algorithm gave Google a competitive advantage
• How would you rank pages?
• The approach/algorithm called “page rank” is not based on the WWW surfer as voter (popularity), but on the WWW author as voter (hence relatively static)
• Conceptually in the Page Rank algorithm a random surfer mindlessly follows the hyperlinks of the entire WWW
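The random surfer can be simulated directly as a Monte Carlo algorithm: start on some page, repeatedly follow a randomly chosen link, and record how often each page is visited. The sketch below uses the five-page S&W Tiny graph that appears later in the deck. The 10% chance of jumping to a random page is borrowed from the 90-10 rule named in the Java exercise later on; it keeps the surfer from getting trapped. Class and method names are hypothetical.

```java
import java.util.Random;

class RandomSurfer {
    /**
     * Simulates `steps` moves of a random surfer. links[i][j] is the number of
     * links from page i to page j. Returns the fraction of moves that landed on each page.
     */
    static double[] surf(int[][] links, int steps, double jumpProbability, long seed) {
        int n = links.length;
        Random rng = new Random(seed);
        int[] visits = new int[n];
        int page = 0;                                    // arbitrary starting page
        for (int s = 0; s < steps; s++) {
            int outDegree = 0;
            for (int j = 0; j < n; j++) outDegree += links[page][j];
            if (outDegree == 0 || rng.nextDouble() < jumpProbability) {
                page = rng.nextInt(n);                   // jump to a random page
            } else {
                int pick = rng.nextInt(outDegree);       // choose one link uniformly
                for (int j = 0; j < n; j++) {
                    pick -= links[page][j];
                    if (pick < 0) { page = j; break; }
                }
            }
            visits[page]++;
        }
        double[] rank = new double[n];
        for (int i = 0; i < n; i++) rank[i] = (double) visits[i] / steps;
        return rank;
    }

    public static void main(String[] args) {
        int[][] tiny = {                                 // S&W Tiny link counts
            {0, 1, 0, 0, 0},
            {0, 0, 2, 2, 1},
            {0, 0, 0, 1, 0},
            {1, 0, 0, 0, 0},
            {1, 0, 1, 0, 0}
        };
        double[] rank = surf(tiny, 1_000_000, 0.10, 1L);
        for (int i = 0; i < rank.length; i++) {
            System.out.printf("page %d: %.4f%n", i, rank[i]);
        }
    }
}
```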
![Page 40: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/40.jpg)
![Page 41: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/41.jpg)
Input/Output
• What is the input? The entire WWW modeled as a graph.
• What is the output? The ranking of every page in the WWW.
• With one number assigned to every page, a search query can order the found pages by rank and present the most relevant pages to the user first.
![Page 42: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/42.jpg)
S&W Tiny Hypertext
![Page 43: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/43.jpg)
S&W Tiny Graph
![Page 44: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/44.jpg)
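S&W Tiny: Adj list & Adj matrix
Adjacency list (number of pages, then one "from to" pair per link):
5
0 1
1 2  1 2
1 3  1 3  1 4
2 3
3 0
4 0  4 2
Adjacency matrix (dimensions, then the number of links from page i to page j in row i, column j):
5 5
0 1 0 0 0
0 0 2 2 1
0 0 0 1 0
1 0 0 0 0
1 0 1 0 0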
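Turning the adjacency list into the adjacency matrix takes one pass over the links: each "from to" pair adds one to the entry in row "from", column "to". A minimal sketch using the S&W Tiny data from this slide follows; the class name is hypothetical, and the links are written as an array literal instead of being read from a file.

```java
class LinkCounts {
    /** Builds the n-by-n matrix in which entry [i][j] counts the links from page i to page j. */
    static int[][] fromEdges(int n, int[][] edges) {
        int[][] counts = new int[n][n];
        for (int[] edge : edges) {
            counts[edge[0]][edge[1]]++;      // repeated links are counted more than once
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] tinyEdges = {
            {0, 1},
            {1, 2}, {1, 2},
            {1, 3}, {1, 3}, {1, 4},
            {2, 3},
            {3, 0},
            {4, 0}, {4, 2}
        };
        int[][] counts = fromEdges(5, tinyEdges);
        System.out.println(counts.length + " " + counts[0].length);
        for (int[] row : counts) {
            StringBuilder line = new StringBuilder();
            for (int c : row) line.append(c).append(' ');
            System.out.println(line.toString().trim());
        }
    }
}
```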
![Page 45: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/45.jpg)
[Chart: PAGE RANK TINY. Horizontal axis: pages 0 to 4. Vertical axis: page rank, scale 0 to 0.3.]
![Page 46: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/46.jpg)
Wiki2 Hypertext
![Page 47: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/47.jpg)
Wiki2 Graph
![Page 48: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/48.jpg)
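Wiki2: Adj List & Adj Matrix
Adjacency list (number of pages, then one "from to" pair per link):
7
0 1  0 2  0 3  0 4  0 6
1 0
2 0  2 1
3 1  3 2  3 4
4 0  4 2  4 3  4 5
5 0  5 4
6 4
Adjacency matrix (dimensions, then the number of links from page i to page j in row i, column j):
7 7
0 1 1 1 1 0 1
1 0 0 0 0 0 0
1 1 0 0 0 0 0
0 1 1 0 1 0 0
1 0 1 1 0 1 0
1 0 0 0 1 0 0
0 0 0 0 1 0 0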
![Page 49: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/49.jpg)
[Chart: PAGE RANK WIKI2. Horizontal axis: pages 0 to 6. Vertical axis: page rank, scale 0 to 0.35.]
![Page 50: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/50.jpg)
Wiki1 Hypertext
![Page 51: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/51.jpg)
Wiki1 Graph
![Page 52: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/52.jpg)
[Chart: PAGERANK WIKI1. Horizontal axis: pages 0 to 10. Vertical axis: page rank, scale 0 to 0.45.]
![Page 53: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/53.jpg)
Java Exercise
• Modify Adajency1.java
1. Print adjacency matrix
2. Print probability matrix
3. Print probability matrix with 90-10 rule
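Steps 2 and 3 can be sketched as follows; this is not the contents of Adajency1.java, which is not reproduced in this deck. The probability matrix divides each row of link counts by the row total. The 90-10 rule is taken here in the Sedgewick and Wayne sense: 90% of the time the surfer follows a random link on the current page, and 10% of the time the surfer goes to a page chosen at random. Treating a page without links as linking to every page equally is an added assumption, and the names are hypothetical.

```java
class TransitionMatrix {
    /** Step 2: entry [i][j] is the chance of choosing a link to page j among the links on page i. */
    static double[][] linkProbabilities(int[][] counts) {
        int n = counts.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++) {
            int outDegree = 0;
            for (int j = 0; j < n; j++) outDegree += counts[i][j];
            for (int j = 0; j < n; j++) {
                // Assumption: a page with no links is treated as linking to all pages.
                p[i][j] = (outDegree == 0) ? 1.0 / n : (double) counts[i][j] / outDegree;
            }
        }
        return p;
    }

    /** Step 3: 90% follow a link, 10% jump to any of the n pages with equal chance. */
    static double[][] ninetyTen(int[][] counts) {
        int n = counts.length;
        double[][] p = linkProbabilities(counts);
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                p[i][j] = 0.90 * p[i][j] + 0.10 / n;
            }
        }
        return p;
    }

    public static void main(String[] args) {
        int[][] tiny = {
            {0, 1, 0, 0, 0},
            {0, 0, 2, 2, 1},
            {0, 0, 0, 1, 0},
            {1, 0, 0, 0, 0},
            {1, 0, 1, 0, 0}
        };
        for (double[] row : ninetyTen(tiny)) {
            for (double x : row) System.out.printf("%.2f ", x);
            System.out.println();
        }
    }
}
```

Every row of either matrix sums to 1, which is a quick check worth adding when printing the results.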
![Page 54: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/54.jpg)
Interactive WWW Page for PageRank
• http://williamcotton.com/pagerank-explained-with-javascript
![Page 55: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/55.jpg)
Reachability, Markov Theory
Can node 2 reach node 4? Yes, using a path of length 2 through node 3.
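Both ideas on this slide reduce to matrix multiplication. Multiplying the adjacency matrix by itself k times counts the paths of exactly k links between every pair of nodes, so a nonzero entry means "reachable in k steps". Multiplying a probability vector by the transition matrix advances the random surfer by one move. The graph behind the slide's question is not given in the text, so the sketch below uses a hypothetical five-node cycle in which 2 links to 3 and 3 links to 4; the names are hypothetical as well.

```java
class MarkovSteps {
    /** Entry [i][j] of the result is the number of paths of exactly `length` links from i to j. */
    static long[][] pathCounts(int[][] adjacency, int length) {
        int n = adjacency.length;
        long[][] result = new long[n][n];
        for (int i = 0; i < n; i++) result[i][i] = 1;    // identity: paths of length 0
        for (int step = 0; step < length; step++) {
            long[][] next = new long[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    for (int k = 0; k < n; k++)
                        next[i][j] += result[i][k] * adjacency[k][j];
            result = next;
        }
        return result;
    }

    /** Probability of being on each page after `steps` moves, starting from distribution `start`. */
    static double[] distributionAfter(double[][] transition, double[] start, int steps) {
        int n = start.length;
        double[] current = start.clone();
        for (int s = 0; s < steps; s++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    next[j] += current[i] * transition[i][j];   // row vector times matrix
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        int[][] cycle = {                                // hypothetical graph: 0→1→2→3→4→0
            {0, 1, 0, 0, 0},
            {0, 0, 1, 0, 0},
            {0, 0, 0, 1, 0},
            {0, 0, 0, 0, 1},
            {1, 0, 0, 0, 0}
        };
        System.out.println("paths of length 1 from 2 to 4: " + pathCounts(cycle, 1)[2][4]);
        System.out.println("paths of length 2 from 2 to 4: " + pathCounts(cycle, 2)[2][4]);
    }
}
```

Calling `distributionAfter` with a 90-10 transition matrix and a large number of steps gives page ranks without any random numbers, which makes a useful cross-check on a Monte Carlo estimate.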
![Page 56: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/56.jpg)
![Page 57: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/57.jpg)
![Page 58: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/58.jpg)
Final Challenge
• Raise the page rank of page “23” by modifying only the links on page “23”
• Decrease the page rank of page “23” by modifying only the links on page “23”
• Can you find the maximum/minimum page rank?
![Page 59: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/59.jpg)
Search engine optimization, link schemes, link farming, Google bombs
![Page 60: Page Rank](https://reader035.fdocuments.in/reader035/viewer/2022062410/568161a6550346895dd16535/html5/thumbnails/60.jpg)
Ted Talks: Brin & Page: The Genesis of Google
• http://www.ted.com/talks/sergey_brin_and_larry_page_on_google.html