
Ch. 13 Structure of the Web Padmini Srinivasan Computer Science Department Department of Management Sciences http://cs.uiowa.edu/~psriniva padmini-srinivasan@uiowa.edu


Page 1: Ch. 13 Structure of the Web Padmini Srinivasan Computer Science Department Department of Management Sciences psriniva padmini-srinivasan@uiowa.edu.

Ch. 13 Structure of the Web

Padmini Srinivasan

Computer Science Department Department of Management Sciences

http://cs.uiowa.edu/~psriniva

padmini-srinivasan@uiowa.edu


Origins

• Origins of WWW (1989/1990: HTTP)
– Sir Tim Berners-Lee & Robert Cailliau

• First prototype of browser: WorldWideWeb
• 1st popular graphical browser: Mosaic (NCSA), Marc Andreessen and others
– Mozilla -> Netscape -> Firefox
• Lynx
• 2000 Windows explorer
• WAIS, Gopher, Veronica
• 1994: W3C
• 1994: 1st World Wide Web conference
• 1995: Yahoo! 1998: Google 2006: Live Search -> Bing


Network Metaphor

• Information network:
– Different from social network
• Notion of a logical document: different
– Decentralized, over many computers
– annotation
• Network metaphor: “inspired and non-obvious”
• Origins in hypertext – origins in citation nets
• Citation nets: distinctly temporal, web?
– Citation maps (popular) co-citation; bibliographic coupling;
• H-index (Hirsch); g-index; f-index
– Patents; legal cases (precedents); medical literature
• Indexes: cross-linkages; see also; Wikipedia
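The h-index mentioned above is a simple measure computed from citation counts: an author has index h if h of their papers have at least h citations each. A minimal sketch follows, assuming the citation counts are already available as a list; the function name `h_index` and the sample counts are illustrative and do not come from the slides.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)  # most-cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break         # counts only get smaller from here on
    return h


# Hypothetical citation counts for five papers.
example_counts = [10, 8, 5, 4, 3]
print(h_index(example_counts))  # 4
```

Four papers have at least 4 citations each, but there are not five papers with at least 5, so the index is 4.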


Links/Associations

• Directed edges,
– Friendship nets, name-recognition, business colleagues, collaboration [Erdős number, Bacon number], IM nets, email graphs etc.
– paths, shortest paths…
• Associative memory
• Semantic nets aka Conceptual networks (free-association studies)
• Vannevar Bush “As We May Think” (1945) Atlantic Monthly. WW2. MEMEX (on web)
– Associative connections between all of knowledge
– Acknowledged by most
– A way to rechannel human resources


Paths and Connectivity

• Connected graphs
• Path: sequence of nodes beginning at node X and ending at node Y, with each consecutive pair of nodes joined by an edge.
• A directed graph is strongly connected if there is a directed path from every node to every other node.
• If it is not strongly connected, need to examine its ‘reachability’ properties.
– Easier in an undirected graph: split it into connected components
– Directed? Find strongly connected components
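To make directed paths and reachability concrete, here is a minimal breadth-first search sketch. It assumes the graph is stored as a dictionary mapping each page to the pages it links to; the name `shortest_path` and the four-page `toy_web` graph are made up for illustration.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Return one shortest directed path from source to target, or None."""
    parent = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None                      # target is not reachable from source


# Hypothetical link graph: A -> B -> C -> A, and C -> D.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
print(shortest_path(toy_web, "A", "D"))  # ['A', 'B', 'C', 'D']
print(shortest_path(toy_web, "D", "A"))  # None
```

D is reachable from A but A is not reachable from D, so this toy graph is not strongly connected even though it is connected when link direction is ignored.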


Strongly Connected Component

• SCC in a directed graph is a subset of nodes such that
– (1) every node in it has a path to every other node in it
– (2) the subset is not a part of a larger set of nodes that has the same property. [So it is a maximal such subset]

• Why is it interesting to know about such components in the Web?
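The two-part SCC definition can be turned into a computation. The sketch below uses Kosaraju's two-pass algorithm, which is one standard choice and is not named in the slides; the graph representation (dictionary of successor lists), the function name, and the sample graph are assumptions made for illustration.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each node to a list of successors."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Pass 1: iterative DFS on the original graph, recording finish order.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: node is finished.
                order.append(node)
                stack.pop()

    # Reverse every edge.
    reverse = {node: [] for node in nodes}
    for node, targets in graph.items():
        for nxt in targets:
            reverse[nxt].append(node)

    # Pass 2: search the reversed graph in decreasing finish order;
    # each search collects exactly one strongly connected component.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


# Hypothetical graph with three components: {A, B, C}, {D, E}, {F}.
sample = {"A": ["B"], "B": ["C"], "C": ["A", "D"],
          "D": ["E"], "E": ["D"], "F": ["A"]}
components = strongly_connected_components(sample)
print(sorted(sorted(c) for c in components))
# [['A', 'B', 'C'], ['D', 'E'], ['F']]
```

Page F links into the first component but nothing links back to it, so it forms a component of its own, which is condition (2) of the definition at work.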


Bow-Tie Structure of the Web

• 1999 Andrei Broder (now Yahoo!), then Alta Vista
• SCC; IN; OUT; Tendrils; Tubes; Disconnected
• Macro-model
– Properties of a reasonable model:
• Should have a succinct and fairly natural description
• Rooted in plausible macro-level process for creation of Web content
• Not require some prior static set of topics
• Should reflect many of the structural phenomena observed in the Web
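The bow-tie regions follow directly from reachability relative to the giant SCC: IN pages can reach the core but are not in it, and OUT pages can be reached from the core but cannot get back. The sketch below is an illustration under stated assumptions, not the procedure used in the original study: it finds the largest SCC by intersecting forward and backward reachability (simple, but quadratic in the worst case), and it lumps tendrils, tubes and disconnected pages together as "OTHER". All names and the sample graph are hypothetical.

```python
from collections import deque


def reach(adjacency, sources):
    """Set of nodes reachable from `sources` by BFS (sources included)."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(graph):
    """Split the nodes of a directed graph into SCC / IN / OUT / OTHER."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    reverse = {}
    for node, targets in graph.items():
        for nxt in targets:
            reverse.setdefault(nxt, []).append(node)

    # Largest SCC: a node's component is the set of nodes it can both
    # reach and be reached from.
    core, assigned = set(), set()
    for node in nodes:
        if node in assigned:
            continue
        component = reach(graph, [node]) & reach(reverse, [node])
        assigned |= component
        if len(component) > len(core):
            core = component

    downstream = reach(graph, core)    # core plus everything it can reach
    upstream = reach(reverse, core)    # core plus everything that reaches it
    return {
        "SCC": core,
        "IN": upstream - core,
        "OUT": downstream - core,
        "OTHER": nodes - upstream - downstream,
    }


# Hypothetical graph: core {A, B, C}, one IN page, one OUT page,
# a tendril (T1) hanging off IN, and a disconnected pair (X, Y).
sample = {"I1": ["A", "T1"], "A": ["B"], "B": ["C"],
          "C": ["A", "O1"], "O1": [], "X": ["Y"]}
parts = bow_tie(sample)
for name in ("SCC", "IN", "OUT", "OTHER"):
    print(name, sorted(parts[name]))
```

On a crawl of Web scale the largest SCC would be found with a linear-time algorithm instead; the reachability step that defines IN and OUT stays the same.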


Similar Studies

• Donato et al. ACM TOIT, 2007. The Web as a Graph: How Far We Are

• Webbase, 200 Million Stanford crawl
– 39% OUT; 11% IN; 13% Tendrils; 33% SCC (48 million); next SCC: 10 thousand!


Similar Studies

• Buriol et al. (includes Donato): Temporal analysis of Wikigraph.


Bow-Tie

• Why a single SCC? Why not two large ones?
• Any other explanations?
– Interlinked world?
– Hard to be disconnected?
– What about a new page?
• Is the SCC static/fixed? How does it change?
– Are links permanent? (2004: 25% remain after 1 year and 50% of pages stay the same; Ntoulas et al., 2004)
• Many naturally occurring graphs have a giant SCC
– IM (nodes people, link message) almost all are in the SCC; median path length is 7, mean 6.6.


Bow-Tie: points to note

• Incomplete picture
– Doesn’t tell you how this is generated, just that it is.
– Macro model:
• Thematic collections; differences?
• Organization specific collections
• Regional: economic incentives/disincentives?
• Community based: education levels?
• Bipartite cliques (small sized – many in number)
– Fans pointing to centers
– Will it always be observed? How about now?


Web 2.0

• “an attitude not a technology”
– Collaboration/collective maintenance
• Annotation, tags, links, editing, revisions

– Data generated by individuals for individual and group sharing; Flickr, Gmail.

– Connections between entities beyond “documents”.

• Social feedback key; ‘wisdom of crowds’; long tail;


Web Links

• Navigational – static pages – passive services
• Transactional – dynamic / computational services. Deep web
• Search engines – heuristics
– What kinds of rules would you use?
– Implications for crawlers


Summary

• Web: origins, network metaphor
– Citations, MEMEX
• Paths
• Structures (macro)
– SCC
– Bow-Tie model
• Next
– Ch 14: Hubs and Authorities; PageRank