1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu &...
-
date post
19-Dec-2015 -
Category
Documents
-
view
218 -
download
1
Transcript of 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu &...
![Page 1: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/1.jpg)
1
Graphs & more on Web search
15-211 Fundamental Data Structures and Algorithms
Stefan Niculescu & James Lyons March 21, 2002
![Page 2: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/2.jpg)
Announcements
![Page 3: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/3.jpg)
3
Homework 5
Homework Assignment #5 will be out on Friday.
Must do some reading in order to complete it.
Must take a progress quiz.
Get started today and as usual, think b4 u hack!
![Page 4: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/4.jpg)
4
Reading
About graphs:
Chapter 14
About Web search:http://www.cadenza.org/search_engine_terms/
srchad.htm
A HTML tutorial:http://www.cwru.edu/help/introHTML/toc.html
![Page 5: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/5.jpg)
Introduction to Graphs
![Page 6: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/6.jpg)
6
Graphs — an overview
Vertices (aka nodes)
![Page 7: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/7.jpg)
7
Graphs — an overview
PIT
BOS
JFK
DTW
LAX
SFO
Vertices (aka nodes)
Edges
![Page 8: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/8.jpg)
8
Undirected
Graphs — an overview
PIT
BOS
JFK
DTW
LAX
SFO
Vertices (aka nodes)
Edges
![Page 9: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/9.jpg)
9
Undirected
Graphs — an overview
PIT
BOS
JFK
DTW
LAX
SFO
Vertices (aka nodes)
Edges
1987
2273
3442145
2462
618
211
318
190
Weights
![Page 10: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/10.jpg)
10
Terminology
Graph G = (V,E)
Set V of vertices (nodes)
Set E of edges Elements of E are pairs (v,w) where v,w V.
An edge (v,v) is a self-loop. (Usually assume no self-loops.)
Weighted graph
Elements of E are (v,w,x) where x is a weight.
![Page 11: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/11.jpg)
11
Terminology, cont’d
Directed graph (digraph)The edge pairs are ordered
Every edge has a specified direction
The Web is a directed graph
Undirected graphThe edge pairs are unordered
E is a symmetric relation
(v,w) E implies (w,v) E
In an undirected graph (v,w) and (w,v) are usually treated as though they are the same edge
![Page 12: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/12.jpg)
12
Directed Graph (digraph)
PIT
BOS
JFK
DTW
LAX
SFO
Vertices (aka nodes)
Edges
![Page 13: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/13.jpg)
13
Undirected Graph
PIT
BOS
JFK
DTW
LAX
SFO
Vertices (aka nodes)
Edges
![Page 14: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/14.jpg)
14
Terminology, cont’d
v and w adjacent (neighbors) if (v,w)E or (w,v)E
d(v) (degree of v) = # neighbors of v (for undirected graphs)
d+(v) (out-degree of v)= # of edges (v,w)E
d-(v) (in-degree of v)= # of edges (w,v)E
![Page 15: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/15.jpg)
15
Terminology, cont’d
Path
a list of nodes (v[1], v[2],...,v[n]) s.t. (v[i],v[i+1]) E for all 0 < i < n
The length of the above path is n-1
Cycle
a path that begins and ends with the same node
Cyclic graph – contains at least one cycle
Acyclic graph - no cycles
![Page 16: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/16.jpg)
16
Elements of a Graph
PIT
BOS
JFK
DTW
LAX
SFO
![Page 17: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/17.jpg)
17
Terminology, cont’d
Subgraph of a graph G
a subset of V with the corresponding edges from E.
Connected graph
a graph where for every pair of nodes there exists a sequence of edges starting at one node and ending at the other.
Connected component of a graph G a connected subgraph of G.
![Page 18: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/18.jpg)
18
Elements of a Graph, cont’d
PIT
BOS
JFK
DTW
LAX
SFO
![Page 19: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/19.jpg)
19
Terminology, cont’d
Unrooted (undirected) tree
a acyclic connected undirected graph
Theorem: in any unrooted tree T=(V,E), |V|=|E|+1.
Proof: by induction on |V| Base case |V|=1 (|E|=0)
Show there exists a node of degree one
Remove that node and apply induction hypothesis
![Page 20: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/20.jpg)
20
Example of a unrooted tree
PIT
BOS
JFK
DTW
LAX
SFO
![Page 21: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/21.jpg)
21
Quiz Break
![Page 22: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/22.jpg)
22
So, is this a connected graph?
Cyclic or Acyclic?
Directed or Undirected?
6
4
5
3
1
2
![Page 23: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/23.jpg)
23
Directed graph (unconnected)
Cyclic or Acyclic?
6
5
32
1
4
![Page 24: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/24.jpg)
Representing Graphs
![Page 25: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/25.jpg)
25
Representing graphs
Adjacency matrix 2-dimensional array
For each edge (u,v), set A[u][v] to true; otherwise false
1
4
76
3 5
2
1 2 3 4 5 6 7
1 x x
2 x x
3 x
4 x x x
5 x x
6
7
3 41
4 52
63
4
3 67
5
4 7
6
7
Adjacency lists For each vertex, keep a list of adjacent vertices
![Page 26: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/26.jpg)
26
Choosing a representation
Size of V relative to size of E is a primary factor.
Dense: |E|/|V| is large
Sparse: |E|/|V| is small
Adjacency matrix is expensive in terms of space if the graph is sparse (O(|V|2 >
O(|E|+|V|)).
Adjacency list is expensive in terms of checking edges if the graph is dense.
![Page 27: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/27.jpg)
27
Size of a Graph
How many undirected graphs for a set of n given vertices?
Answer:
2
n
22n
How many edges in a undirected graph with n vertices?
Minimum: 0
Maximum:
![Page 28: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/28.jpg)
Graphs are Everywhere
![Page 29: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/29.jpg)
29
Graphs as models
The Internet
Communication pathways
DNS hierarchy
The WWW
The physical world
Road topology and maps
Airline routes and fares
Electrical circuits
Job and manufacturing scheduling
![Page 30: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/30.jpg)
30
Graphs as models
Physical objects are often modeled by meshes, which are a particular kind of graph structure.
By Jonathan Shewchuk
![Page 31: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/31.jpg)
31
More graph models
See alsohttp://java.sun.com/applets/jdk/1.1/demo/WireFrame/index.htmlandhttp://www.mapquest.com
NASA CFD labs
By Paul Heckbert andDavid Garland
![Page 32: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/32.jpg)
32
Structure of the Internet
Europe
Japan
Backbone 1
Backbone 2
Backbone 3
Backbone 4, 5, N
Australia
Regional A
Regional B
NAP
NAP
NAP
NAP
SOURCE: CISCO SYSTEMS
MAPS UUNET MAP
![Page 33: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/33.jpg)
33
Relationship graphs
Graphs are also used to model relationships among entities.
Scheduling and resource constraints.
Inheritance hierarchies.15-113 15-151
15-211
15-212 15-213
15-312
15-411 15-412
15-251
15-45115-462
![Page 34: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/34.jpg)
34
Where are we right now?
![Page 35: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/35.jpg)
The Web Graph
![Page 36: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/36.jpg)
36
Web Graph
Documents written in HTML
HTML (HyperText Markup Language)
TAGS:
<head>, <body>, <title>
<H1>, <p>
<a> (anchor, link)
![Page 37: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/37.jpg)
37
A simple HTML example
<html>
<head>
<TITLE>A Simple HTML Example</TITLE>
</head>
<body>
<a href="http://www.cmu.edu">
Carnegie Mellon University </a>
</body>
</html>
![Page 38: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/38.jpg)
38
Web Graph
A directed graph where :
V = (all web pages)
E = (all HTML-defined links from one
web page to another)
![Page 39: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/39.jpg)
39
Web Graph
<a href …>
<a href …>
<a href …>
<a href …>
<a href …>
<a href …>
<a href …>
Web Pages are nodes (vertices)HTML references are links (edges)
![Page 40: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/40.jpg)
40
Is the Web Graph connected?
Sparse, unconnected graph
AUTHORITIES
web pages containing a “reasonable” amount of relevant information about a specific topic
HUBS
web pages that point (link) to many pages containing relevant information about a given topic
![Page 41: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/41.jpg)
41
Finding Hubs & Authorities
Nice iterative algorithm by Jon Kleinberg
www.cs.cornell.edu/home/kleinber/auth.ps
HUB: Avrim’s Machine Learning page
www.cs.cmu.edu/~avrim/ML
AUTHORITY: www.java.sun.com
Extra credit opportunity for homework 5
![Page 42: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/42.jpg)
Graphs : ApplicationSearch Engines
![Page 43: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/43.jpg)
43
Search Engines
![Page 44: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/44.jpg)
44
What are they?
Tools for finding information on the Web
Problem: “hidden” databases, e.g. New York Times (ie, databases of keywords hosted by the web site itself. These cannot be accessed by Yahoo, Google etc.)
Search engine
A machine-constructed index (usually by keyword)
So many search engines, we need search engines to find them. Searchenginecollosus.com
![Page 45: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/45.jpg)
45
Did you know?
Vivisimo was developed here at CMU
Developed by Prof. Raul Valdes-Perez
Developed in 2000
![Page 46: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/46.jpg)
46
SE Architecture
Spider Crawls the web to find pages. Follows hyperlinks.
Never stops
Indexer Produces data structures for fast searching of all
words in the pages (ie, it updates the lexicon)
Retriever Query interface
Database lookup to find hits 1 billion documents
1 TB RAM, many terabytes of disk
Ranking
![Page 47: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/47.jpg)
47
A look at
10,000 servers (WOW!)
Web site traffic grows over 20% per month
Spiders over 2 Billion URLs
Supports 28 language searches
Over 100 million searches per day
“Even CMU uses it!”
![Page 48: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/48.jpg)
48
Google’s server farm
![Page 49: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/49.jpg)
49
Web Crawlers
Start with an initial page P0. Find URLs on P0 and add them to a queue
When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat
Can be specialized (e.g. only look for email addresses)
Issues Which page to look at next? (Special subjects, recency)
How deep within a site do you go (depth search)?
How frequently to visit pages?
![Page 50: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/50.jpg)
50
So, why Spider the Web?
Refresh Collection by deleting dead links
OK if index is slightly smaller
Done every 1-2 weeks in best engines
Finding new sites
Respider the entire web
Done every 2-4 weeks in best engines
![Page 51: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/51.jpg)
51
Cost of Spidering
Spider can (and does) run in parallel on
hundreds of severs
Very high network connectivity (e.g. T3 line)
Servers can migrate from spidering to query
processing depending on time-of-day load
Running a full web spider takes days even with
hundreds of dedicated servers
![Page 52: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/52.jpg)
52
Indexing
Arrangement of data (data structure) to permit fast searching
Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog ant cat dog eel fox hen hog pig sow yak
Sorting helps. Why?
Permits binary search. About log2n probes into list
log2(1 billion) ~ 30
Permits interpolation search. About log2(log2n) probes
log2 log2(1 billion) ~ 5
![Page 53: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/53.jpg)
53
Inverted Files
A file is a list of words by position
- First entry is the word in position 1 (first word)
- Entry 4562 is the word in position 4562 (4562nd word)
- Last entry is the last word
An inverted file is a list of positions by word!
POS1
10
20
30
36
FILE
a (1, 4, 40)entry (11, 20, 31)file (2, 38)list (5, 41)position (9, 16, 26)positions (44)word (14, 19, 24, 29, 35, 45)words (7)4562 (21, 27)
INVERTED FILE
![Page 54: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/54.jpg)
54
Inverted Files for Multiple Documents
107 4 322 354 381 405232 6 15 195 248 1897 1951 2192677 1 481713 3 42 312 802
WORD NDOCS PTR
jezebel 20
jezer 3
jezerit 1
jeziah 1
jeziel 1
jezliah 1
jezoar 1
jezrahliah 1
jezreel 39jezoar
34 6 1 118 2087 3922 3981 500244 3 215 2291 301056 4 5 22 134 992
DOCID OCCUR POS 1 POS 2 . . .
566 3 203 245 287
67 1 132. . .
“jezebel” occurs6 times in document 34,3 times in document 44,4 times in document 56 . . .
LEXICON
WORD INDEX
![Page 55: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/55.jpg)
55
Ranking (Scoring) Hits
Hits must be presented in some order
What order? Relevance, recency, popularity, reliability, alphabetic?
Some ranking methods Presence of keywords in title of document
Closeness of keywords to start of document
Frequency of keyword in document
Link popularity (how many pages point to this one)
![Page 56: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/56.jpg)
56
Spamdexing & Link Popularity
Spamdexing means influencing retrieval ranking by altering a web page. (Puts “spam” in the index)
www.linkpopularity.com
Link popularity is used for ranking
Many measures
Number of links in (In-links)
Weighted number of links in (by weight of referring page)
![Page 57: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/57.jpg)
57
Search Engine Sizes
AV AltavistaEX ExciteFAST FASTGG GoogleINK InktomiNL Northern Light
SOURCE: SEARCHENGINEWATCH.COM
![Page 58: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/58.jpg)
58
Historical Notes
WebCrawler: first documented spider
Lycos: first large-scale spider
Top-honors for most web pages spidered: First Lycos, then AltaVista, then Google...
![Page 59: 1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.](https://reader036.fdocuments.in/reader036/viewer/2022062320/56649d395503460f94a12a02/html5/thumbnails/59.jpg)
59
Overview
Engines are a critical Web resource
Very sophisticated, high technology, but secret
Most spidering re-traverses stable web graph
They don’t spider the Web completely
Spamdexing is a problem
New paradigms needed as Web grows
What about images, music, video?
Google’s image search engine
Napster