Ch. 13 Structure of the Web
Padmini Srinivasan
Computer Science Department, Department of Management Sciences
http://cs.uiowa.edu/[email protected]
Origins
• Origins of WWW (1989/1990: http)
  – Sir Tim Berners-Lee & Robert Cailliau
• First prototype of browser: WorldWideWeb
• 1st popular graphical browser: Mosaic (NCSA), Marc Andreessen and others
  – Mozilla -> Netscape -> Firefox
• Lynx
• 2000 Windows explorer
• WAIS, Gopher, Veronica
• 1994: W3C
• 1993: 1st World wide web conference
• 1995: Yahoo! 1998: Google 2006: Live Search -> Bing
Network Metaphor
• Information network:
  – Different from social network
• Notion of a logical document: different
  – Decentralized, over many computers
  – annotation
• Network metaphor: “inspired and non-obvious”
• Origins in hypertext – origins in citation nets
• Citation nets: distinctly temporal, web?
  – Citation maps (popular) co-citation; bibliographic coupling;
    • H-index (Hirsch); g-index; f-index
  – Patents; legal cases (precedents); medical literature
• Indexes: cross-linkages; see also; wikipedia
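The citation measures named on this slide can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical toy citation net stored as a dict from each paper to the set of papers it cites; the paper names and function names are invented for the example.

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def cocitation(cites, a, b):
    """Number of papers that cite both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)

def bibliographic_coupling(cites, a, b):
    """Number of references that a and b have in common."""
    return len(cites.get(a, set()) & cites.get(b, set()))

# Hypothetical toy citation net: paper -> set of papers it cites.
cites = {
    "p4": {"p1", "p2"},
    "p5": {"p1", "p2", "p3"},
    "p6": {"p2", "p3"},
}

print(h_index([10, 8, 5, 4, 3]))                 # 4 papers with >= 4 citations
print(cocitation(cites, "p1", "p2"))             # cited together by p4 and p5
print(bibliographic_coupling(cites, "p4", "p5")) # share references p1 and p2
```

Co-citation looks at who points to a pair of papers, while bibliographic coupling looks at what the pair points to. The same two views apply to in-links and out-links of Web pages.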
Links/Associations
• Directed edges,
  – Friendship nets, name-recognition, business colleagues, collaboration [Erdos number, Bacon number], IM nets, email graphs etc.
  – paths, shortest paths…
• Associative memory
• Semantic nets aka Conceptual networks (free-association studies)
• Vannevar Bush “As We May Think” (1945) Atlantic Monthly. WW2. MEMEX (on web)
  – Associative connections between all of knowledge
  – Acknowledged by most
  – A way to rechannel human resources
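An Erdos number or Bacon number is a shortest-path distance in a collaboration graph. A minimal sketch using breadth-first search follows, assuming an unweighted graph stored as an adjacency dict; the collaborator names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:          # first visit is along a shortest path
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Hypothetical collaboration graph (undirected, so edges appear both ways).
collab = {
    "Erdos": ["A", "B"],
    "A": ["Erdos", "C"],
    "B": ["Erdos"],
    "C": ["A", "D"],
    "D": ["C"],
    "E": [],                           # never collaborated: no finite number
}

numbers = bfs_distances(collab, "Erdos")
print(numbers)
```

Nodes missing from the result, such as "E" here, are unreachable from the source.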
Paths and Connectivity
• Connected graphs
• Path: sequence of nodes beginning at node X and ending at node Y, in which each consecutive pair is joined by an edge.
• A directed graph is strongly connected if there is a directed path from every node to every other node.
• If it is not strongly connected, need to examine its ‘reachability’ properties.
  – Easier in an undirected graph: disconnected components
  – Directed? Find strongly connected components
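One way to test strong connectivity, sketched here under the assumption that the graph fits in memory as an adjacency dict: pick any node, then check that it reaches every node and that every node reaches it (the second check is a search on the reversed graph). The function names are illustrative.

```python
def reachable(adj, start):
    """Set of nodes reachable from start by following directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def reverse(adj):
    """Same graph with every edge direction flipped."""
    rev = {u: [] for u in adj}
    for u, nbrs in adj.items():
        for v in nbrs:
            rev.setdefault(v, []).append(u)
    return rev

def is_strongly_connected(adj):
    nodes = set(adj) | {v for nbrs in adj.values() for v in nbrs}
    if not nodes:
        return True
    start = next(iter(nodes))
    # start reaches everyone, and everyone reaches start
    return (reachable(adj, start) == nodes
            and reachable(reverse(adj), start) == nodes)

cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
chain = {"a": ["b"], "b": ["c"], "c": []}
print(is_strongly_connected(cycle), is_strongly_connected(chain))
```

The cycle is strongly connected; the chain is not, because nothing leads back from "c" to "a".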
Strongly Connected Component
• SCC in a directed graph is a subset of nodes such that
  – (1) every node in it has a path to every other node in it
  – (2) the subset is not a part of a larger set of nodes that has the same property. [So it is a maximal such subset]
• Why is it interesting to know about such components in the Web?
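The two-part definition above can be turned into code. The sketch below uses Kosaraju's two-pass algorithm, which is one standard choice and not something the slides prescribe; it is written iteratively so that deep graphs do not hit Python's recursion limit. The example graph is hypothetical.

```python
def strongly_connected_components(adj):
    """Return a list of node sets, one per SCC (Kosaraju's algorithm)."""
    nodes = set(adj) | {v for nbrs in adj.values() for v in nbrs}
    rev = {u: [] for u in nodes}
    for u, nbrs in adj.items():
        for v in nbrs:
            rev[v].append(u)

    # Pass 1: DFS on the original graph, recording nodes in finish order.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj.get(root, ())))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj.get(v, ()))))
                    break
            else:                      # all neighbours done: u is finished
                order.append(u)
                stack.pop()

    # Pass 2: DFS on the reversed graph, latest-finished nodes first.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        comp, stack = set(), [root]
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in rev[u]:
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        components.append(comp)
    return components

# Hypothetical graph: a 3-cycle feeding a 2-cycle, plus an isolated node.
g = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"], "f": []}
sccs = strongly_connected_components(g)
print(sorted(sorted(c) for c in sccs))
```

Each returned set satisfies (1) by construction and (2) because no node outside the set can both reach it and be reached from it.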
Bow-Tie Structure of the Web
• 1999 Andrei Broder (now Yahoo!), then Alta Vista
• SCC; IN; OUT; Tendrils; Tubes, Disconnected
• Macro-model
  – Properties of a reasonable model:
    • Should have a succinct and fairly natural description
    • Rooted in plausible macro-level process for creation of Web content
    • Not require some prior static set of topics
    • Should reflect many of the structural phenomena observed in the Web
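The bow-tie regions can be recovered from two reachability searches. The sketch below assumes a seed page already known to lie in the giant SCC; that assumption, the toy graph, and the region label "OTHER" are illustrative choices, not part of the original study. Forward search from the seed gives SCC plus OUT, backward search gives SCC plus IN, and the intersection is the SCC itself.

```python
def reachable(adj, start):
    """Set of nodes reachable from start by following directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bow_tie(adj, seed):
    """Split nodes into SCC / IN / OUT / OTHER relative to the seed's SCC."""
    nodes = set(adj) | {v for nbrs in adj.values() for v in nbrs}
    rev = {u: [] for u in nodes}
    for u, nbrs in adj.items():
        for v in nbrs:
            rev[v].append(u)
    forward = reachable(adj, seed)     # SCC + OUT
    backward = reachable(rev, seed)    # SCC + IN
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        # tendrils, tubes and disconnected pages are not separated here
        "OTHER": nodes - forward - backward,
    }

# Hypothetical toy web: i1 is IN, c1/c2 form the core, o1 is OUT,
# t1 and t2 are tendrils, d1 -> d2 is a disconnected piece.
web = {
    "i1": ["c1", "t1"],
    "c1": ["c2"],
    "c2": ["c1", "o1"],
    "t2": ["o1"],
    "d1": ["d2"],
}
parts = bow_tie(web, "c1")
print({name: sorted(members) for name, members in parts.items()})
```

Telling tendrils, tubes, and disconnected components apart would need further searches from IN and into OUT, which this sketch leaves out.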
Similar Studies
• Donato et al. ACM TOIT, 2007. The Web as a Graph: How Far We Are
• Webbase, 200 Million Stanford crawl
  – 39% OUT; 11% IN; 13% Tendrils; 33% SCC (48 million) next SCC: 10 thousand!
Similar Studies
• Buriol et al. (includes Donato): Temporal analysis of Wikigraph.
Bow-Tie
• Why a single SCC? Why not two large ones?
• Any other explanations?
  – Interlinked world?
  – Hard to be disconnected?
  – What about a new page?
• Is the SCC static/fixed? How does it change?
  – Are links permanent? (2004: 25% remain after 1 year and 50% of pages stay the same; Ntoulas et al., 2004)
• Many naturally occurring graphs have a giant SCC
  – IM (nodes people, link message) almost all are in the SCC; median path length is 7, mean 6.6.
Bow-Tie: points to note
• Incomplete picture
  – Doesn’t tell you how this is generated, just that it is.
  – Macro model:
    • Thematic collections; differences?
    • Organization specific collections
    • Regional: economic incentives/disincentives?
    • Community based: education levels?
    • Bipartite cliques (small sized – many in number)
      – Fans pointing to centers
  – Will it always be observed? How about now?
Web 2.0
• “an attitude not a technology”
  – Collaboration/collective maintenance
    • Annotation, tags, links, editing, revisions
  – Data generated by individuals for individual and group sharing; Flickr, Gmail.
  – Connections between entities beyond “documents”.
• Social feedback key; ‘wisdom of crowds’; long tail
Web Links
• Navigational – static pages – passive services
• Transactional – dynamic / computational services. Deep web
• Search engines – heuristics
  – What kinds of rules would you use?
  – Implications for crawlers
Summary
• Web: origins, network metaphor
  – Citations, MEMEX
• Paths
• Structures (macro)
  – SCC
  – Bow-Tie model
• Next
  – Ch 14: Hubs and Authorities; PageRank