Introduction to Network Analysis
Marko Grobelnik, Dunja Mladenic JSI
Parts of the presentation taken from the tutorial “Structure and function of real-world graphs and networks” by Jure Leskovec, CMU/JSI
Outline
What are networks?
◦ …few examples
Network properties
◦ Small worlds
◦ Power law
◦ Long tail
◦ Network resilience
◦ Structure of networks
Applications
◦ Mining e-mail server logs
◦ Mining MSN Messenger data
Networks & Computer Science
Diagram labels: (complex) networks; Statistics; Computer systems; Theory and algorithms; Machine learning / Data mining
Networks & Science
Diagram labels: (complex) networks; Computer Science (Statistics, Computer systems, Theory and algorithms, Machine learning / Data mining); Social Sciences; Biology; Physics; Industry & Applications
Networks (graphs)
◦ Vertex / Node
◦ Edge / Link
◦ Direction
◦ Probabilities (edge labels in the figure: 0.3, 0.6, 0.1)
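The elements named on this slide can be written down as a small data structure. The sketch below is illustrative only: the node names `A` to `D` are hypothetical, and reading the figure's labels 0.3, 0.6 and 0.1 as probabilities on the edges leaving a single node is an assumption.

```python
import math

# Directed, weighted graph as a dict of dicts: graph[u][v] = weight of edge u -> v.
# Node names are hypothetical; the weights reuse the 0.3 / 0.6 / 0.1 labels from
# the figure, assumed here to be probabilities on the edges leaving one node.
graph = {
    "A": {"B": 0.3, "C": 0.6, "D": 0.1},
    "B": {},
    "C": {},
    "D": {},
}

def out_edges(g, u):
    """Return the directed edges leaving u as (source, target, weight) tuples."""
    return [(u, v, w) for v, w in g[u].items()]

def is_probability_row(g, u):
    """True if the weights on the edges leaving u sum to 1 (within rounding)."""
    return math.isclose(sum(g[u].values()), 1.0)

edges_from_a = out_edges(graph, "A")
```

A dict of dicts stores only the edges that exist, which suits the sparse networks discussed in the rest of the talk.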
Dynamic Networks (graphs)
◦ Vertex / Node
◦ Edge / Link
◦ Direction
◦ Probabilities (edge labels in the figure: 0.3, 0.6, 0.1)
…in dynamic networks all the elements of the graph are changing
…dealing with dynamic networks is an active research topic
Query
Active topic during limited time period
Example of Dynamic Graph (1/3)
On 1996-08-30 Clinton and Chicago are connected
Example of Dynamic Graph (2/3)
On 1996-10-02 Clinton and Chicago are NOT connected
Example of Dynamic Graph (3/3)
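One way to model the example is to time-stamp every edge observation and build a snapshot of the graph for a given day. This is a minimal sketch, not the method used for the slides: the 7-day window and the function names are assumptions, and the single record only reuses the date and names shown above.

```python
from datetime import date, timedelta

# Time-stamped edge observations: (day, node, node). The single record reuses
# the date and names from the slide; a real log would hold many such records.
observations = [(date(1996, 8, 30), "Clinton", "Chicago")]

def snapshot(obs, day, window_days=7):
    """Undirected edges observed in the window_days ending on `day` (inclusive)."""
    start = day - timedelta(days=window_days - 1)
    return {frozenset((u, v)) for d, u, v in obs if start <= d <= day}

def connected_on(obs, u, v, day, window_days=7):
    """True if the edge {u, v} is present in the snapshot for `day`."""
    return frozenset((u, v)) in snapshot(obs, day, window_days)
```

With this data, `connected_on(observations, "Clinton", "Chicago", date(1996, 8, 30))` is true, while the same query for `date(1996, 10, 2)` is false because the observation has dropped out of the window.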
Networks of the real-world (1)
Information networks:
◦ World Wide Web: hyperlinks
◦ Citation networks
◦ Blog networks
Social networks: people + interactions
◦ Organizational networks
◦ Communication networks
◦ Collaboration networks
◦ Sexual networks
Technological networks:
◦ Power grid
◦ Airline, road, river networks
◦ Telephone networks
◦ Internet
◦ Autonomous systems
Figures: Florence families; Karate club network; Collaboration network; Friendship network
Networks of the real-world (2)
Biological networks
◦ Metabolic networks
◦ Food web
◦ Neural networks
◦ Gene regulatory networks
Language networks
◦ Semantic networks
Software networks
…
Figures: Yeast protein interactions; Semantic network; Language network; XFree86 network
Types of networks
Directed/undirected
Multigraphs (multiple edges between nodes)
Hypergraphs (edges connecting multiple nodes)
Bipartite graphs (e.g., papers to authors)
Weighted networks
Different types of nodes and edges
Evolving networks:
◦ Nodes and edges only added
◦ Nodes, edges added and removed
Traditional approach
Sociologists were the first to study networks:
◦ Study of patterns of connections between people to understand the functioning of society
◦ People are nodes, interactions are edges
◦ Questionnaires are used to collect link data (hard to obtain, inaccurate, subjective)
◦ Typical questions: centrality and connectivity
Limited to small graphs (~10 nodes) and properties of individual nodes and edges
New approach (1)
Large networks (e.g., web, internet, on-line social networks) with millions of nodes
Many traditional questions are not useful anymore:
◦ Traditional: What happens if a node U is removed?
◦ Now: What percentage of nodes needs to be removed to affect network connectivity?
Focus moves from a single node to the study of statistical properties of the network as a whole
We cannot draw (plot) the network and examine it
New approach (2)
What does the network “look like” even if I can’t look at it?
Need for statistical methods and tools to quantify large networks
3 parts/goals:
◦ Statistical properties of large networks
◦ Models that help understand these properties
◦ Predict behavior of networked systems based on measured structural properties and local rules governing individual nodes
Statistical properties of networks
Features common to networks of different types:
◦ Properties of static networks: small-world effect, transitivity or clustering, degree distributions (scale free networks), network resilience, community structure, subgraphs or motifs
◦ Temporal properties: densification, shrinking diameter
Small-world effect
Six degrees of separation (Milgram, 1960s)
◦ Random people in Nebraska were asked to send letters to stockbrokers in Boston
◦ Letters could only be passed to first-name acquaintances
◦ Only 25% of the letters reached the goal
◦ But they reached it in about 6 steps
Measuring path lengths:
◦ Diameter (longest shortest path): max d_ij
◦ Effective diameter: distance at which 90% of all connected pairs of nodes can be reached
◦ Mean geodesic (shortest) distance l
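The three path-length measures above can be computed with breadth-first search on a small unweighted graph. The sketch below is a toy illustration under stated assumptions: the function names and the five-node path graph are made up, the effective diameter is taken as the smallest whole number of hops covering 90% of connected pairs, and an all-pairs BFS like this does not scale to the large networks discussed later.

```python
import math
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_length_stats(adj, quantile=0.9):
    """Diameter, effective diameter and mean geodesic over connected ordered pairs."""
    lengths = sorted(
        d for s in adj for t, d in bfs_distances(adj, s).items() if t != s
    )
    diameter = lengths[-1]
    # Smallest distance that covers at least `quantile` of the connected pairs.
    effective = lengths[math.ceil(quantile * len(lengths)) - 1]
    mean = sum(lengths) / len(lengths)
    return diameter, effective, mean

# Hypothetical toy graph: a path a - b - c - d - e.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
stats = path_length_stats(path)
```

For the path graph this gives a diameter of 4, an effective diameter of 3 and a mean geodesic distance of 2.0.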
Small World Networks on Web
An empirical observation for the Web graph is that its diameter is small relative to the size of the network
◦ …this property is called “Small World”
◦ …formally, small-world networks have a diameter exponentially smaller than the size
By simulation it was shown that for a Web size of 1B pages the diameter is approx. 19 steps
◦ …empirical studies confirmed the findings
Small World on FP5-IST (collaboration network)
The network represents collaboration between institutions on FP5-IST projects funded by the European Union
◦ …there are 7886 organizations collaborating on 2786 projects
◦ …in the network, each node is an organization; two organizations are connected if they collaborate on at least one project
Small world properties of the collaboration network:
◦ Main connected part of the network contains 94% of the nodes
◦ Max distance between any two organizations is 7 steps … meaning that any organization can be reached in up to 7 steps from any other organization
◦ Average distance between any two organizations is 3.15 steps (with standard deviation 0.38)
◦ 38% (2770) of organizations have avg. distance 3 or less
Connectedness of the most connected institution
• 1856 collaborations
• avg. distance is 1.95
• max. distance is 4
Connectedness of semi connected institution
• 179 collaborations
• avg. distance is 2.42
• max. distance is 4
Connectedness of min. connected institution
• 8 collaborations
• max. distance is 7
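The FP5-IST network is built by linking two organizations whenever they share a project, which is a projection of a bipartite project-to-organization relation. The following sketch shows that construction on made-up data; the project and organization names are hypothetical and are not taken from the FP5-IST dataset.

```python
from itertools import combinations

def project_organizations(project_members):
    """Link two organizations if they share at least one project.

    project_members maps a project id to the organizations taking part in it.
    Returns an undirected adjacency dict: organization -> set of collaborators.
    """
    adj = {}
    for members in project_members.values():
        for org in members:
            adj.setdefault(org, set())
        for a, b in combinations(sorted(set(members)), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

# Hypothetical toy data, not taken from the FP5-IST dataset.
projects = {
    "P1": ["OrgA", "OrgB", "OrgC"],
    "P2": ["OrgC", "OrgD"],
}
collab = project_organizations(projects)
```

Here `OrgC` ends up with three collaborators, while `OrgA` and `OrgD` are two steps apart because they only meet through `OrgC`.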
Small World effect on MSN Messenger Network
Microsoft Messenger network
◦ 180 million people
◦ 1.3 billion edges
◦ Edge if two people exchanged at least one message in a one-month period
Figure: Distribution of shortest path lengths. Pick a random node, count how many nodes are at distance 1, 2, 3, ... hops. Axes: Distance (Hops), 0 to 30, vs. Number of nodes, 10^0 to 10^8. Annotation in the figure: 7.
What is Power Law?
A power law describes relations between the objects in the network
◦ …it is very characteristic of networks generated within some kind of social process
◦ …it describes scale invariance found in many natural phenomena (including physics, biology, sociology, economy and linguistics)
Power-Law on the Web
In the context of the Web, the power law appears in many cases:
◦ Web page sizes
◦ Web page connectivity
◦ Web connected components’ size
◦ Web page access statistics
◦ Web browsing behavior
Formally, the power law describing web page degrees has the form P(degree = k) ∝ k^(-γ), with a constant exponent γ
(This property has been preserved as the Web has grown)
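A power-law degree distribution has the general form p(k) ∝ k^(-γ). As a hedged illustration of how such an exponent can be estimated from data, the sketch below uses the continuous maximum-likelihood estimator, γ ≈ 1 + n / Σ ln(x_i / x_min), on synthetic samples. The function names, the exponent 2.5 and the sample size are assumptions for the demo, not values from the talk, and real degrees are integers, so the continuous formula is only an approximation there.

```python
import math
import random

def powerlaw_mle(values, x_min):
    """Continuous maximum-likelihood estimate of alpha for p(x) ~ x^(-alpha), x >= x_min."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

def sample_powerlaw(alpha, x_min, n, rng):
    """Draw n values from a continuous power law by inverse-transform sampling."""
    # 1 - rng.random() lies in (0, 1], which avoids raising zero to a negative power.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

rng = random.Random(0)
synthetic = sample_powerlaw(alpha=2.5, x_min=1.0, n=100_000, rng=rng)
estimate = powerlaw_mle(synthetic, x_min=1.0)
```

The estimator is generally preferred over fitting a straight line to a log-log histogram, because the sparse tail of the histogram is very noisy.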
Degree distribution
Figure: number of people a person talks to on Microsoft Messenger. Axes: Node degree vs. Count. The highest degree is marked with an X.
Detour: how long is the long tail?
This is not directly related to graphs, but it nicely explains the “long tail” effect. It shows that there is a big market for niche products.
Network resilience
We observe how the connectivity (length of the paths) of the network changes as vertices are removed
It is important for epidemiology
◦ Removal of vertices corresponds to vaccination
Real-world networks are resilient to random attacks
◦ One has to remove all web pages of degree > 5 to disconnect the web
◦ …but this is a very small percentage of web pages
A random network has better resilience to targeted attacks
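The difference between random failures and targeted attacks can be illustrated by removing nodes and measuring the largest connected component that remains. This is a minimal sketch on a made-up hub-and-leaves graph; the function names and the graph are assumptions, and component size is used here as a simple stand-in for the path-length measure mentioned above.

```python
import random

def largest_component(adj, removed):
    """Size of the largest connected component after deleting the nodes in `removed`."""
    seen, best = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

def targeted_attack(adj, k):
    """Pick the k highest-degree nodes for removal."""
    return set(sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:k])

def random_failure(adj, k, rng):
    """Pick k nodes uniformly at random for removal."""
    return set(rng.sample(sorted(adj), k))

# Hypothetical toy graph: one hub connected to ten leaves.
star = {"hub": {f"leaf{i}" for i in range(10)}}
for i in range(10):
    star[f"leaf{i}"] = {"hub"}
```

Removing a single leaf leaves a component of 10 nodes, whereas removing the hub (the targeted choice) shatters the graph into isolated nodes. Hub-dominated networks show the same contrast at scale.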
Network motifs (1)
What are the building blocks (motifs) of networks?
Do motifs have specific roles in networks?
Network motifs detection process:
◦ Count how many times each subgraph appears
◦ Compute statistical significance for each subgraph: the probability of it appearing in a random network as often as in the real network
Figure: 3 node motifs
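The two detection steps can be sketched for one 3-node motif, the feed-forward loop (edges a→b, b→c and a→c). The code below is a simplified illustration, not the procedure used in the motif literature: it counts non-induced occurrences by brute force, and its null model draws edges uniformly at random, whereas published studies typically use degree-preserving randomization. All function names are hypothetical.

```python
import random
import statistics
from itertools import permutations

def count_feed_forward_loops(edges):
    """Count ordered node triples (a, b, c) with edges a->b, b->c and a->c."""
    edge_set = set(edges)
    nodes = {u for e in edge_set for u in e}
    return sum(
        1
        for a, b, c in permutations(nodes, 3)
        if (a, b) in edge_set and (b, c) in edge_set and (a, c) in edge_set
    )

def random_digraph(nodes, m, rng):
    """m distinct directed edges chosen uniformly at random, without self-loops."""
    candidates = [(u, v) for u in nodes for v in nodes if u != v]
    return rng.sample(candidates, m)

def motif_z_score(edges, trials=200, seed=0):
    """Z-score of the real motif count against uniformly random graphs of equal size."""
    rng = random.Random(seed)
    edge_set = set(edges)
    nodes = sorted({u for e in edge_set for u in e})
    real = count_feed_forward_loops(edge_set)
    counts = [
        count_feed_forward_loops(random_digraph(nodes, len(edge_set), rng))
        for _ in range(trials)
    ]
    spread = statistics.pstdev(counts)
    if spread == 0:
        return 0.0  # no variation in the random ensemble; the score is undefined
    return (real - statistics.mean(counts)) / spread
```

A large positive score suggests the subgraph is over-represented relative to the chosen null model; the brute-force count is cubic in the number of nodes and only suitable for toy graphs.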
Network motifs (2)
Biological networks
◦ Feed-forward loop
◦ Bi-fan motif
Web graph:
◦ Feedback with two mutual dyads
◦ Mutual dyad
◦ Fully connected triad
Shrinking diameters
Intuition says that distances between the nodes slowly grow as the network grows (like log n)
But, in practice, as the network grows the distances between nodes slowly decrease
Figures: Internet; Citations
Structure of the Web – “Bow Tie” model
In November 1999 a large-scale study using AltaVista crawls of over 200M nodes and 1.5B links reported a “bow tie” structure of web links
◦ …we suspect that, because of the scale-free nature of the Web, this structure is still preserved
SCC: Strongly Connected Component, where pages can reach each other via directed paths
IN: pages that can reach the core via a directed path, but cannot be reached from the core
OUT: pages that can be reached from the core via a directed path, but cannot reach the core in a similar way
TENDRILS: disconnected components reachable only via a directed path from IN and OUT, but not from and to the core
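Given one page known to lie in the core, the SCC, IN and OUT sets follow from two reachability searches, one along the links and one against them. The sketch below assumes the caller supplies such a core page; the function names and the toy link list are hypothetical, and tendrils are lumped together with fully disconnected pages under `OTHER`.

```python
def reachable(adj, sources):
    """All nodes reachable from `sources` by following the edges in adj."""
    seen, stack = set(sources), list(sources)
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bow_tie(edges, core_node):
    """Split nodes into SCC, IN, OUT and OTHER relative to the SCC containing core_node."""
    fwd, bwd, nodes = {}, {}, set()
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        bwd.setdefault(v, []).append(u)
        nodes.update((u, v))
    descendants = reachable(fwd, [core_node])  # pages the core can reach
    ancestors = reachable(bwd, [core_node])    # pages that can reach the core
    scc = descendants & ancestors
    return {
        "SCC": scc,
        "IN": ancestors - scc,
        "OUT": descendants - scc,
        "OTHER": nodes - descendants - ancestors,  # tendrils and disconnected parts
    }

# Hypothetical toy web: i links into the core {a, b}, the core links out to o,
# t hangs off IN as a tendril, and x -> y is disconnected from everything else.
links = [("i", "a"), ("a", "b"), ("b", "a"), ("b", "o"), ("i", "t"), ("x", "y")]
parts = bow_tie(links, core_node="a")
```

Two linear-time searches suffice here because a node is in the same strongly connected component as the core page exactly when it is both reachable from it and able to reach it.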
Mining email server logs
Ontology generation from social networks data
We address the problem of how to construct a taxonomy from social network data.
◦ …we adapt the approach used when dealing with text
As an example we use the e-mail graph of a mid-size research institution
◦ ...communication records of 770 people at JSI
The experiments and evaluation show our approach to be useful and applicable in real-life situations
◦ …the approach could be easily reused in case studies (and elsewhere)
Architecture
The main contribution of the deliverable is the architecture & software consisting of 5 major steps:
1. We start with log files from the institutional e-mail server, where the data include information about e-mail transactions with three fields: time, sender and the list of receivers.
2. After cleaning we get the data in the form of e-mail transactions which include the e-mail addresses of sender and receiver.
3. From the set of e-mail transactions we construct a graph where vertices are e-mail addresses, connected if there is a transaction between them.
4. The e-mail graph is transformed into a sparse matrix, which allows data manipulation and analysis operations to be performed.
5. The sparse matrix representation of the graph is analyzed with ontology learning tools, producing an ontological structure corresponding to the organizational structure of the institution where the e-mails came from.
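Steps 3 and 4 can be sketched with plain Python dictionaries. This is an illustrative assumption about the representation, not the project's actual software: the function names and addresses are hypothetical, the graph is treated as undirected, and entries count transactions although the text only requires that a link exists.

```python
def bump(matrix, i, j):
    """Add one to the sparse entry matrix[i][j], creating it if needed."""
    row = matrix.setdefault(i, {})
    row[j] = row.get(j, 0) + 1

def build_sparse_graph(transactions):
    """Turn (sender, receivers) transactions into a sparse, symmetric count matrix.

    Returns (index, matrix): index maps an address to its row/column number and
    matrix[i][j] counts the transactions between addresses i and j. Only the
    non-zero entries are stored.
    """
    index, matrix = {}, {}
    for sender, receivers in transactions:
        for address in (sender, *receivers):
            index.setdefault(address, len(index))
        i = index[sender]
        for receiver in receivers:
            j = index[receiver]
            if i != j:  # ignore mail sent to oneself
                bump(matrix, i, j)
                bump(matrix, j, i)
    return index, matrix

# Hypothetical transactions with placeholder addresses.
transactions = [
    ("a@example.org", ["b@example.org", "c@example.org"]),
    ("b@example.org", ["a@example.org"]),
]
index, matrix = build_sparse_graph(transactions)
```

Each row of the matrix can then be treated like a sparse document vector, which is what makes text-oriented ontology learning tools applicable to the graph.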
Data used for Experimentation
The data is a collection of log files with e-mail transactions from the local e-mail spam filter software Amavis (http://www.amavis.org/):
◦ Each line of the log files denotes one event at the spam filter software
◦ We were interested in the events on successful e-mail transactions
…having information on time, sender, and list of receivers
◦ An example of a successful e-mail transaction is the following line:
2005 Mar 28 13:59:05 patsy amavis[33972]: (33972-01-3) Passed CLEAN, [217.32.164.151] [193.113.30.29] <[email protected]> -> <[email protected]>, Message-ID: <21DA6754A9238B48B92F39637EF307FD0D4781C8@i2km41-ukdy.domain1.systemhost.net>, Hits: -1.668, 6389 ms
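Extracting the three fields (time, sender, receivers) from such a line could look like the sketch below. The regular expression is written against the single example line only, so other Amavis versions or configurations may need a different pattern. The comma-separated layout of multiple receivers and the placeholder addresses in `sample` are assumptions.

```python
import re
from datetime import datetime

# Pattern written against the single example line in the text; the layout of
# other Amavis versions or configurations may differ (assumption).
PASSED_RE = re.compile(
    r"^(?P<time>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) \S+ amavis\[\d+\]: "
    r"\(\S+\) Passed CLEAN,.*? <(?P<sender>[^>]*)> -> (?P<receivers>.*?), Message-ID:"
)

def parse_transaction(line):
    """Return (time, sender, receivers) for a successful transaction, else None."""
    match = PASSED_RE.search(line)
    if match is None:
        return None
    time = datetime.strptime(match["time"], "%Y %b %d %H:%M:%S")
    receivers = re.findall(r"<([^>]*)>", match["receivers"])
    return time, match["sender"], receivers

# Hypothetical line: same layout as the example, with placeholder addresses.
sample = (
    "2005 Mar 28 13:59:05 patsy amavis[33972]: (33972-01-3) Passed CLEAN, "
    "[217.32.164.151] [193.113.30.29] <alice@example.org> -> "
    "<bob@example.net>,<carol@example.net>, Message-ID: <id-1@example.org>, "
    "Hits: -1.668, 6389 ms"
)
parsed = parse_transaction(sample)
```

Lines that do not describe a successful transaction simply yield `None`, which makes the function usable as a filter over the whole log.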
Some statistics about the data
The log files include e-mail data from Sep 5th 2003 to Mar 28th 2005:
◦ …this sums up to 12.8 GB of data.
◦ After filtering out the successful e-mail transactions, 564 MB remain
…which contain approx. 2.7 million successful e-mail transactions used for further processing
◦ The whole dataset contains references to approx. 45000 e-mail addresses
…after the data cleaning phase the number is reduced to approx. 17000 e-mail addresses
…out of which 770 e-mail addresses are internal to the home institution (with the “ijs.si” domain name)
Organizational structure of JSI produced from cleaned e-mail transactions with OntoGen in <5 minutes
Organizational structure of JSI visualized from e-mail transactions with Document-Atlas
Evaluation
Part of the clustering results for “Jozef Stefan Institute” e-mail data into 10 clusters (C-0, C-1, …C-9), showing the distribution of the clustered e-mails over the Institute departments.
Analysis of MSN Messenger Communication Network
By Jure Leskovec
Data that we have: Communication
For every conversation (session) we have a list of users who participated in the conversation
There can be multiple people per conversation
For each conversation and each user:
◦ User Id
◦ Time Joined
◦ Time Left
◦ Number of Messages Sent
◦ Number of Messages Received
Data that we have: Demographics
For every user (self reported):
◦ Age
◦ Gender
◦ Location (Country, ZIP)
◦ Language
◦ IP address (we can do reverse GeoIP lookup)
Facts about the data
150 GB of compressed logs per day
◦ Just copying over the network takes 8 to 10 hours
◦ Parsing and processing takes another 4 to 6 hours
After parsing, collapsing, saving as binary and compressing: ~40 GB per day
Collected data for all of June 2006: 1.3 TB of data
User age distribution (self reported)
Figure axes: Age vs. Count
Number of participants in the conversation
Figure axes: Conversation size vs. Count
Limit of 20 users per session
Activity per day
Data for June 1:
◦ 982,005,323 sessions (conversations)
◦ 980,219,231 2-user conversations
◦ 471,837,591 conversations with 0 exchanged messages
◦ 508,315,719 “good” sessions
◦ 63,949,711 different users talking
◦ 65,921 unknown users talking (users who never logged in)
Data statistics
Over June 2006:
◦ 242,720,596 users logged in
◦ 179,792,538 users engaged in conversations
◦ 17,510,905 new users (never logged in before)
◦ More than 30 billion conversations
Age: Number of conversations
Age: Conversation duration
Age: Sent messages per session
Age: Messages per second
(Each figure plots Age against Age, with a scale running from Low to High)
Where are the users coming from?
Communication network
Using only 2-user conversations from June 2006 we build a graph:
◦ 179,792,538 nodes
◦ 1,342,246,427 edges
◦ 15,010,572,090 2-user conversations
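A few derived quantities follow directly from these three numbers. The arithmetic below assumes the graph is undirected and has at most one edge per pair of users; the variable names are illustrative.

```python
nodes = 179_792_538
edges = 1_342_246_427
conversations = 15_010_572_090

# Each undirected edge contributes to the degree of two nodes.
avg_degree = 2 * edges / nodes

# Average number of 2-user conversations carried by one edge.
conversations_per_edge = conversations / edges

# Fraction of all possible user pairs that are actually linked.
density = edges / (nodes * (nodes - 1) / 2)
```

Under these assumptions a user talks to about 14.9 other people on average, each pair holds about 11.2 conversations during the month, and fewer than one in ten million possible pairs is connected, so the graph is extremely sparse.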
7-degrees of separation
Figure: Pick a random node, count how many nodes are at distance 1, 2, 3, ... hops. Axes: Distance (Hops), 0 to 30, vs. Number of nodes, 10^0 to 10^8.
Hops  Nodes
1     10
2     78
3     396
4     8648
5     3299252
6     28395849
7     79059497
8     52995778
9     10321008
10    1955007
11    518410
12    149945
13    44616
14    13740
15    4476
16    1542
17    536
18    167
19    71
20    29
21    16
22    10
23    3
24    2
25    3
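The table can be summarized with the path-length measures introduced earlier. The sketch below computes the most common distance, the smallest hop count that covers 90% of the reached nodes, and the mean distance. The table reports distances from a single randomly chosen node, so these are estimates for that start node rather than all-pairs values; the function name is hypothetical.

```python
# Number of nodes found at each hop distance, copied from the table above.
hop_counts = {
    1: 10, 2: 78, 3: 396, 4: 8648, 5: 3299252, 6: 28395849, 7: 79059497,
    8: 52995778, 9: 10321008, 10: 1955007, 11: 518410, 12: 149945, 13: 44616,
    14: 13740, 15: 4476, 16: 1542, 17: 536, 18: 167, 19: 71, 20: 29, 21: 16,
    22: 10, 23: 3, 24: 2, 25: 3,
}

def hop_summary(counts, quantile=0.9):
    """Return (mode, effective distance, mean distance) of a hop histogram."""
    total = sum(counts.values())
    mode = max(counts, key=counts.get)
    mean = sum(hops * n for hops, n in counts.items()) / total
    running = 0
    for hops in sorted(counts):
        running += counts[hops]
        if running >= quantile * total:
            return mode, hops, mean

mode, effective, mean = hop_summary(hop_counts)
```

For this table the distribution peaks at 7 hops and 90% of the reached nodes lie within 8 hops, which matches the "7 degrees of separation" heading.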
…relation to ACTIVE project
In ACTIVE we will perform analytics along three main dimensions:
◦ content (text, tags, semi-structured data)
◦ social network (graph of social linkages)
◦ time
The content dimension is well studied and covered by many text-mining methods
…the static social network analysis aspect will be covered well by the existing methods
…core research will happen on “dynamic social networks”
Conclusion
Network analysis is a very active research topic at the intersection of several areas
◦ …the area deals primarily with graph representation, fundamental to many problems in nature and society
◦ …a currently hot research topic in network analysis is dealing with “dynamic networks”
◦ …in ACTIVE we will perform research and provide solutions for large dynamic social networks extracted from enterprise data