The web as a graph: structure and interpretation. Sridhar Rajagopalan IBM Almaden Ravi Kumar, Prabhakar Raghavan, Andrew Tomkins (IBM, Almaden) Andrei Broder, Farzin Maghoul (AltaVista Corp.) Raymie Stata, Janet Wiener (Compaq SRC) Eli Upfal (Brown University)
  • date post

    20-Dec-2015

The web as a graph: structure and interpretation.

Sridhar Rajagopalan, IBM Almaden

Ravi Kumar, Prabhakar Raghavan, Andrew Tomkins (IBM, Almaden)

Andrei Broder, Farzin Maghoul (AltaVista Corp.)

Raymie Stata, Janet Wiener (Compaq SRC)

Eli Upfal (Brown University)

A Picture of (~200M) pages.

Part A: Structure.

• The graph.
• The Questions …
• … and some Answers.
• The picture.

The web graph

• Nodes (or vertices) = web pages. Edges = (non-nepotistic) links.
• The graph = all web pages and links.
• Many nodes; estimates range from 500M to over 1B.
• Very sparse: average links/page is between 5 and 10.
• Average (links/page | more than 6 links) > 30.

Concentrate on graph structure, ignore content.

Questions about the web graph

• How big is the graph? How many links on a page (outdegree)? How many links to a page (indegree)?

• Can one browse from any web page to any other? How many clicks?

• Can we pick a random page on the web?

• How different is browsing from a “random walk”?

• Can we exploit the structure of the web graph for searching and mining?

• What does the web graph reveal about social processes which result in its creation and dynamics?
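The outdegree and indegree questions above can be made concrete with a small sketch. It assumes the graph is given as a list of (source, target) link pairs; the page names and the `degree_counts` helper are hypothetical, chosen only for illustration.

```python
from collections import Counter

def degree_counts(edges):
    """Return (outdegree, indegree) Counters for a list of (source, target) links."""
    outdeg = Counter(src for src, _ in edges)   # links on a page
    indeg = Counter(dst for _, dst in edges)    # links to a page
    return outdeg, indeg

# Hypothetical toy web of four pages.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
outdeg, indeg = degree_counts(edges)
avg_links_per_page = len(edges) / len({n for e in edges for n in e})
```

A `Counter` returns 0 for pages that never appear as a source or target, so pages with no inlinks need no special handling.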

Power laws: How many pages point to a random page on the web?

$I(u) = |\{v \to u\}|$, the number of pages $v$ that link to $u$.

$\Pr_u(I(u) = k) \propto k^{-2.1}$

• Indegrees.

Slope = 2.1

How many links on a page?

Slope = 2.7
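One rough way to read a slope such as 2.1 or 2.7 off a degree histogram is a least-squares line through the log-log plot. The sketch below assumes a histogram mapping degree to count; the `loglog_slope` helper and the synthetic data are illustrative, and nothing here is claimed to be the procedure used for the slides. Log-log regression is known to be a crude estimator on noisy data.

```python
import math

def loglog_slope(hist):
    """Least-squares slope of log(count) against log(degree)."""
    pts = [(math.log(k), math.log(c)) for k, c in hist.items() if k > 0 and c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    return sxy / sxx

# Synthetic histogram that follows count(k) = C * k^(-2.1) exactly.
hist = {k: 1000.0 * k ** -2.1 for k in range(1, 51)}
slope = loglog_slope(hist)   # close to -2.1 for this noise-free input
```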

Yule/Pareto/Zipf and power laws.
• Inverse polynomial tail.
• Word frequency in text. Yule (later Mandelbrot), Statistical study of the literary vocabulary [Yule, 1944].
• Citation analysis [Lotka, 1926].
• Zipf, Human behavior and the principle of least effort [Zipf, 1949].
• Pareto, Cours d’économie politique [Pareto, 1897].
• Network graph [Faloutsos-Faloutsos-Faloutsos, 1999].
• Oligonucleotide sequences [Martindale-Konopka, 1996].
• Many other instances.

More Germane

• Access statistics for web pages. (From server logs) [Glassman-97]

• User behavior (by instrumenting browsers and proxies) [Lukose-Huberman-98, Crovella and others, 97-99]

• Earliest analytical model, [Herb Simon, 1955].

Co-citation and Bibliographic coupling: Signature of a community.

• Bipartite cores: small “complete” bipartite subgraphs.

• Bibliographic coupling, Co-citation analysis.
• Hubs and Authorities.

$K_{3,3}$

Uses:
• Web searching (HITS/Clever).
• Mining communities (Campfire project).
• Backlink browsing, “find similar.”
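The hubs-and-authorities idea can be sketched as a mutual-reinforcement iteration in the style of HITS: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to. This is a minimal illustration under assumed inputs (an edge list with made-up page names), not the Clever implementation.

```python
def _normalize(scores):
    """Scale scores to unit Euclidean length (all zeros stay zero)."""
    norm = sum(v * v for v in scores.values()) ** 0.5
    return {n: (v / norm if norm else 0.0) for n, v in scores.items()}

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of (source, target) links."""
    nodes = {n for edge in edges for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages linking in.
        auth = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            auth[dst] += hub[src]
        auth = _normalize(auth)
        # Hub update: sum the authority scores of the pages linked to.
        hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            hub[src] += auth[dst]
        hub = _normalize(hub)
    return hub, auth

# Hypothetical example: a complete 3-by-3 bipartite core plus one stray link.
edges = [(h, a) for h in ("h1", "h2", "h3") for a in ("a1", "a2", "a3")]
edges.append(("x", "a1"))
hub, auth = hits(edges)
```

In this toy graph the three core hubs outrank the stray page `x`, and `a1` edges out the other authorities because of its extra inlink.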

Small world.

• (Small World Prediction) [Barabasi and Albert 99, Albert-Jeong-Barabasi 99]. Based on a simple model, they predict that most pages are within 19 links of each other. The model is justified by a crawl of nd.edu.

Facts (about the crawl).

• Most of the time (75%) a random page u is not reachable from another random page v.

• Indegree and Outdegree distributions satisfy the power law. Consistent over time and scale.

Component sizes.

• Component sizes are distributed by the power law.
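Component sizes can be tabulated with a union-find pass over the links. The sketch assumes weakly connected components (link direction ignored), which the slide does not specify; the node names and the `weak_component_sizes` helper are illustrative.

```python
from collections import Counter

def weak_component_sizes(nodes, edges):
    """Sizes of weakly connected components, largest first."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv          # merge the two components
    sizes = Counter(find(n) for n in nodes)
    return sorted(sizes.values(), reverse=True)

# Hypothetical toy graph with seven pages.
sizes = weak_component_sizes("abcdefg", [("a", "b"), ("b", "c"), ("d", "e")])
```

Feeding the resulting size list into a histogram gives the distribution that the slide describes as a power law.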

Reachability

• How many vertices are reachable from a random vertex?
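The reachability question is a breadth-first search over out-links. The sketch below assumes an adjacency-list dictionary; the graph and the `reachable_from` helper are made up for illustration.

```python
from collections import deque

def reachable_from(adj, start):
    """Set of vertices reachable from start by following out-links (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

# Hypothetical graph: a -> b -> c -> a is a cycle, d links into it, e is isolated.
adj = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"], "e": []}
from_d = reachable_from(adj, "d")
from_a = reachable_from(adj, "a")
```

Here `d` reaches the cycle but nothing on the cycle reaches `d`, a small-scale version of the asymmetry behind the 75% figure above.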

A Picture of (~200M) pages.

Part B: Interpretation

• Random graph theory.
• Application 1: The Campfire Project.
• Application 2: Classical IR/Learning.

Random Graphs
• Erdos and Renyi’s model $G_{n,p}$ [Bollobas].
• Graph with n vertices.
• Each of the n(n-1) arcs appears with probability p.
• Graphical evolution [Palmer]: study properties of the resulting random graph as p is increased from 0 to 1.
• [Shelah and Spencer] 0-1 law: Most properties exhibit threshold, “phase change”-like behavior.

Facts about the Erdos-Renyi model

• A random graph with average degree 4 has a giant connected component containing almost all (90%) of the vertices.

• Indegrees and outdegrees are concentrated around the mean and have exponentially declining tails.

• Most vertices in the graph are close to most others (small world).
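The degree concentration in this model is easy to see by sampling. The sketch draws a directed random graph in which each of the n(n-1) arcs appears independently with probability p; the values n = 300 and p = 4/(n-1), the fixed seed, and the `directed_gnp` helper are assumptions made for illustration.

```python
import random

def directed_gnp(n, p, seed=0):
    """Sample a directed graph: each ordered pair (u, v), u != v, is an arc with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u in range(n) for v in range(n)
            if u != v and rng.random() < p]

n = 300
p = 4 / (n - 1)                 # expected indegree (and outdegree) of 4
edges = directed_gnp(n, p)

indeg = [0] * n
for _, v in edges:
    indeg[v] += 1
mean_indeg = len(edges) / n
max_indeg = max(indeg)
```

In runs of this kind the maximum indegree stays within a small multiple of the mean, in contrast with the heavy tails reported for the web crawl.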

A new random graph model.

Content creation hypothesis

• Some page creators create content without regard to what exists on the web.

• Many create pages which are inspired by pre-existing content.
• Effectively, some links are random, others are copied from pre-existing pages.

Probabilistic analysis: Evolving graphs.

• Creation and Deletion processes for nodes and edges.
– e.g. at each time step, a new node is created with a fixed probability $p_v$
– at each time step, a new edge is created with probability $p_e$; the new edge
  • links two random nodes with probability […]
  • links to a node in proportion to its indegree with probability […] (copy a random link).
– At each time step a node (resp. edge) is deleted with probability […] (resp. […])
• Simple model: creation probabilities are 1 and deletion probabilities are 0.

[Symbols remaining on the slide: $p_u$, $p_d$, and a term beginning with 1; which blank each one fills is not recoverable from the transcript.]
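The simple model (creation probabilities 1, deletion probabilities 0) can be simulated in a few lines. The sketch assumes one new node and one new edge per time step, and introduces a hypothetical mixing parameter `alpha` for the probability that an edge links two random nodes; with probability `1 - alpha` the destination is copied from a random existing link, which picks a node in proportion to its indegree. The name `alpha`, its default value, and the seed graph are illustrative assumptions, not symbols from the slides.

```python
import random

def evolve(steps, alpha=0.5, seed=0):
    """Grow a graph by one node and one edge per step; return (node_count, edges)."""
    rng = random.Random(seed)
    n_nodes = 2
    edges = [(0, 1)]                        # assumed seed graph: a single link
    for _ in range(steps):
        n_nodes += 1                        # node creation with probability 1
        src = rng.randrange(n_nodes)        # source is a uniformly random node
        if rng.random() < alpha:
            dst = rng.randrange(n_nodes)    # random link (self-loops are possible)
        else:
            dst = rng.choice(edges)[1]      # copy the destination of a random link
        edges.append((src, dst))            # no deletions in the simple model
    return n_nodes, edges

n_nodes, edges = evolve(5000)
indeg = [0] * n_nodes
for _, dst in edges:
    indeg[dst] += 1
max_indeg = max(indeg)
```

Copying a random link's destination is what makes popular pages more popular: a node with k inlinks is k times as likely to be copied as a node with one.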

Theory

1. The indegree distribution has an inverse polynomial tail; moreover, the exponent of the tail is dependent only on […] and […], and can be expressed in the formula: $\Pr_u(I(u) = k) \sim k^{-[\ldots]}$.

2. Almost surely, the distribution converges to this in the limit.

3. Almost surely, the graph has a linear sized SCC.

Why study models?

• Good predictors of macroscopic behavior.
– Degree distributions. Existence and number of cores. [WWW8]
• Algorithmic advantages (speed and accuracy).
– Better and analyzable algorithmic methods. Inclusion-Exclusion pruning. [VLDB]
– Applications to Data Mining.
• Better understanding of the data/corpus.
– What is “surprising” depends on what is typical. To find interesting stuff, you must know what is expected.

Be careful about...

• Predicting and analyzing microscopic properties.
– Microscopic: properties which can be changed by the addition/deletion of a few nodes/edges/features.
  • Examples: Diameter and girth, rare terms and features.
  • Very susceptible to noise and systematic but small inconsistencies in the model.
– Macroscopic: major dataset surgery required to significantly alter the property.
  • Examples: Degree distributions. Connectivity.
  • Law of large numbers or equivalent applies.

Application 1: The campfire project.

Co-citation: Signature of a community.

• Bipartite cores: small “complete” bipartite subgraphs.

Fact: Let $K_{i,j}$ denote a core with $i$ left-hand vertices and $j$ right-hand vertices. Then, if $ij > i + j$, a sparse random graph $G_{n,p}$ almost surely contains no $K_{i,j}$.

$K_{3,3}$
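A short first-moment argument shows why a fact of this kind holds. It assumes that "sparse" means $p = c/n$ for a constant $c$ and that the condition on the core is $ij > i + j$; here $X$ counts the copies of $K_{i,j}$ in $G_{n,p}$. There are at most $n^{i+j}$ ways to choose the left and right vertices, and each choice needs all $ij$ arcs to be present.

```latex
\begin{align*}
\mathbb{E}[X] &\le n^{i+j}\, p^{ij}
   = n^{i+j} \left(\frac{c}{n}\right)^{ij}
   = c^{ij}\, n^{\,i+j-ij}
   \;\longrightarrow\; 0
   \quad \text{as } n \to \infty \text{ when } ij > i + j, \\
\Pr[X \ge 1] &\le \mathbb{E}[X]
   \quad \text{(Markov's inequality)}.
\end{align*}
```

For $K_{3,3}$ the exponent is $3 + 3 - 9 = -3$, so cores of this size found in a real crawl are not explained by random sparse linking.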

Campfire project

• Automatically find and organize communities on the web.
• Approach:
– Find all cores.
– Grow cores into the full community.
– Do IR/Categorization/Clustering etc. to organize the community space.

[KRRT] WWW8, and [KRRT] VLDB’99.
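The "find all cores" step can be illustrated by brute force on a toy graph: pick i candidate left-hand pages, intersect their out-link sets, and report every j-subset of the intersection. This is only a sketch under assumed inputs; it is exponential in i and j and is not the pruning-based method of the cited papers. The `find_cores` helper and the page names are hypothetical.

```python
from itertools import combinations

def find_cores(adj, i, j):
    """Enumerate (left, right) vertex tuples forming complete i-by-j bipartite cores."""
    cores = []
    # Only pages with at least j out-links can sit on the left side of a core.
    fans = sorted(u for u, outs in adj.items() if len(outs) >= j)
    for left in combinations(fans, i):
        common = set.intersection(*(set(adj[u]) for u in left))
        common -= set(left)          # keep the two sides disjoint
        for right in combinations(sorted(common), j):
            cores.append((left, right))
    return cores

# Hypothetical toy graph: three fan pages co-cite three centers; f4 cites one.
adj = {
    "f1": ["c1", "c2", "c3"],
    "f2": ["c1", "c2", "c3"],
    "f3": ["c1", "c2", "c3"],
    "f4": ["c1"],
}
cores_3_3 = find_cores(adj, 3, 3)
cores_2_3 = find_cores(adj, 2, 3)
```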

The cores are interesting.

Explicit communities.
• Yahoo!, Excite, Infoseek
• webrings
• news groups
• mailing lists

Implicit communities
• hotels in costa rica
• clipart
• japanese elementary schools
• turkish student associations
• oil spills off the coast of japan
• australian fire brigades
• aviation/aircraft vendors
• guitar manufacturers

(1) Implicit communities are defined by cores.
(2) There are an order of magnitude more of these. There are efficient heuristics to compute all cores.
(3) Can grow the core to the community using Clever.

Costa Rican Hotels.
• The Costa Rica Inte...ion on arts, busi...
• Informatica Interna...rvices in Costa Rica
• Cocos Island Research Center
• Aero Costa Rica
• Hotel Tilawa - Home Page
• COSTA RICA BY INTER@MERICA
• tamarindo.com
• Costa Rica
• New Page 5
• The Costa Rica Internet Directory.
• Costa Rica, Zarpe Travel and Casa Maria
• Si Como No Resort Hotels & Villas
• Apartotel El Sesteo... de San José, Cos...
• Spanish Abroad, Inc. Home Page
• Costa Rica's Pura V...ry - Reservation ...
• YELLOW\RESPALDO\HOTELES\Orquide1
• Costa Rica - Summary Profile
• COST RICA, MANUEL A...EPOS: VILLA

• Hotels and Travel in Costa Rica
• Nosara Hotels & Res...els & Restaurants...
• Costa Rica Travel, Tourism & Resorts
• Association Civica de Nosara
• http://www...ca/hotels/mimos.html
• Costa Rica, Healthy...t Pura Vida
• Domestic & International Airline
• HOTELES / HOTELS - COSTA RICA
• tourgems
• Hotel Tilawa - Links
• Costa Rica Hotels T...On line
• Reservations
• Yellow pages Costa ...Rica Export
• INFOHUB Costa Rica Travel Guide
• Hotel Parador, Manuel Antonio, Costa Rica
• Destinations

Elementary Schools in Japan
• The American School in Japan
• The Link Page
• [Japanese-language page title, mis-encoded]
• Kids' Space
• [Japanese-language page title, mis-encoded]
• [Japanese-language page title, mis-encoded]
• KEIMEI GAKUEN Home Page ( Japanese )
• Shiranuma Home Page
• fuzoku-es.fukui-u.ac.jp
• welcome to Miasa E&J school
• [Japanese-language page title, mis-encoded]
• http://www...p/~m_maru/index.html
• fukui haruyama-es HomePage
• Torisu primary school
• goo
• Yakumo Elementary,Hokkaido,Japan
• FUZOKU Home Page
• Kamishibun Elementary School...

• schools
• LINK Page-13
• [Japanese-language page title, mis-encoded]
• [Japanese-language page title, mis-encoded]
• 100 Schools Home Pages (English)
• K-12 from Japan 10/...rnet and Education )
• http://www...iglobe.ne.jp/~IKESAN
• [Japanese-language page title, mis-encoded]
• [Japanese-language page title, mis-encoded]
• Koulutus ja oppilaitokset
• TOYODA HOMEPAGE
• Education
• Cay's Homepage(Japanese)
• [Japanese-language page title, mis-encoded]
• UNIVERSITY
• [Japanese-language page title, mis-encoded] DRAGON97-TOP
• [Japanese-language page title, mis-encoded]
• [Japanese-language page title, mis-encoded]

Application 2: Classical Learning/IR.

Vector space and other classical models.

• Document is a vector in a real-valued space with dimensions identified with “features.” [Salton]

$x \in \mathbb{R}^F$, $F = \{\text{features}\}$

Some notion of similarity, usually cosine or dot-product.

$D(x, y) = \cos(x, y)$
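Cosine similarity between two documents can be sketched with sparse feature vectors. The example assumes each document is a dictionary mapping a feature (here a term) to a weight; the `cosine` helper and the sample documents are illustrative.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two sparse vectors given as feature -> weight dicts."""
    dot = sum(w * y.get(f, 0.0) for f, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0                  # convention for an empty document
    return dot / (norm_x * norm_y)

# Hypothetical term-count vectors.
doc_a = {"web": 1.0, "graph": 1.0}
doc_b = {"web": 1.0}
doc_c = {"hotel": 2.0}
sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
```

Documents that share no features score 0, and a document compared with itself scores 1, regardless of its length.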

Built-in assumption: Features are independent.

Uses of the Vector Space model.
• Search, Clustering, Classification.
• Term weighting. [Salton, Dumais, Sparck-Jones]
• SVD (for instance, LSI [Deerwester et al.]).
• Gaussian assumption and classification (for instance, [Koller-Sahami], [Chakrabarti et al.]).
• Many ad-hoc methods and heuristics, some of which work remarkably well. [Modha et al.]
• Clustering. [Drineas et al.]
• Dimensionality reduction. Feature selection. [Johnson-Lindenstrauss, Koller-Sahami, Chakrabarti et al. and others]

Two (new ?) ingredients.

• Hypertext -- the graph.
• Zipfian distributions on term occurrences.

Hypertext Classification/Clustering.

• Class of a page is a function of text + class of neighbor set.
– Classification problem -- Markov Random fields. [Chakrabarti-Dom-Indyk]
– Clustering problem -- [Modha]

Research Issue

• Rework applications in this new (rather old) context. OR
• Explain why the standard algorithms continue to work despite the sometimes questionable assumptions behind their derivation.