Graph Mining and Social Network...
Transcript of Graph Mining and Social Network...
Daniele Loiacono
Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Daniele Loiacono
References
q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management Systems (Second Edition) " Chapter 9
Graph Mining
Daniele Loiacono
Graph Mining Overview
q Graphs are becoming increasingly important to model many phenomena in a large class of domains (e.g., bioinformatics, computer vision, social analysis)
q To deal with these needs, many data mining approaches have been extended also to graphs and trees
q Major approaches " Mining frequent subgraphs " Indexing " Similarity search " Classification " Clustering
Daniele Loiacono
Mining frequent subgraphs
q Give n a labeled graph data set
D = {G1,G2,…,Gn}
q We define support(g) as the percentage of graphs in D where g is a subgraph
q A frequent subgraph in D is a subgraph with a support greater than min_sup
q How to find frequent subgraph? " Apriori-based approach " Pattern-growth approach
Daniele Loiacono
AprioriGraph
q Apply a level-wise iterative algorithm 1. Choose two similar size-k frequent subgraphs in S 2. Merge two similar subgraphs in a size-(k+1) subgraph 3. If the new subgraph is frequent add to S 4. Restart from 2. until all similar subgraphs have been
considered. Otherwise restart from 1. and move to k+1. q What is subgraph size?
" Number of vertex " Number of edges " Number of edge-disjoint paths
q Two subgraphs of size-k are similar if they have the same size-(k-1) subgraph
q AprioriGraph has a big computational cost (due to the merging step)
Daniele Loiacono
PatternGrowthGraph
q Incrementally extend frequent subgraphs 1. Add to S each frequent subgraphs gE obtained by
extending subgraph g 2. Until S is not empty, select a new subgraph g in S to
extend and start from 1. q How to extend a subgraph?
" Add a vertex " Add an edge
q The same graph can be discovered many times! " Get rid of duplicates once discovered " Reduce the generation of duplicates
Daniele Loiacono
Mining closed, unlabeled, and constrained subgraphs
q Closed subgraphs " G is closed iff there is no proper supergraph G’ with the
same support of G " Reduce the growth of subgraphs discovered " Is a more compact representation of knowledge
q Unlabeled (or partially labeled) graphs " Introduce a special label Φ " Φ can match any label or only itself
q Constrained subgraphs " Containment constraint (edges, vertex, subgraphs) " Geometric constraint " Value constraint
Daniele Loiacono
Graph Indexing
q Indexing is basilar for effective search and query processing q How to index graphs? q Path-based approach takes the path as indexing unit
" All the path up to maxL length are indexed " Does not scale very well
q gIndex approach takes frequent and discriminative subgraphs as indexing unit " A subgraph is frequent if it has a support greater than a
threshold " A subgraph is discriminative if its support cannot be well
approximated by the intersection of the graph sets that contain one of its subgraphs
Daniele Loiacono
Graph Classification and Clustering
q Mining of frequent subgraphs can be effectively used for classification and clustering purposes
q Classification " Frequent and discriminative subgraphs are used as features to
perform the classification task " A subgraph is discriminative if it is frequent only in one class of
graphs and infrequent in the others " The threshold on frequency and discriminativeness should be
tuned to obtain the desired classification results q Clustering
" The mined frequent subgraphs are used to define similarity between graphs
" Two graphs that share a large set of patterns should be considered similar and grouped in the same cluster
" The threshold on frequency can be tuned to find the desired number of clusters
q As the mining step affects heavily the final outcome, this is an intertwined process rather tan a two-steps process
Social Network Analysis
Daniele Loiacono
Social Network
q A social network is an heterogeneous and multirelational dataset represented by a graph " Vertexes represent the objects (entities) " Edges represent the links (relationships or interaction) " Both objects and links may have attributes " Social networks are usually very large
q Social network can be used to represents many real-world phenomena (not necessarily social) " Electrical power grids " Phone calls " Spread of computer virus " WWW
Daniele Loiacono
Small World Networks (1)
q Are social networks random graphs? q NO!
Internet Map Science
Coauthorship Protein Network
Few degrees of separation
High degree of local clustering
Daniele Loiacono
Peter Jane
Sarah
Ralph Society: Six degrees S. Milgram 1967 F. Karinthy 1929 WWW: 19 degrees Albert et al. 1999
Small World Networks (2)
Daniele Loiacono
Small World Networks (3)
q Definitions " Node’s degree is the number of incident edges " Network effective diameter is the max distance within 90% of
the network
q Properties " Densification power law
" Shrinking diameter " Heavy-tailed degrees distribution
n: number of nodes e: number of eges 1<α<2
Node rank
Nod
e deg
rees
Preferential attachment model
Daniele Loiacono
Mining social networks (1)
q Several Link mining tasks can be identified in the analysis of social networks
q Link based object classification " Classification of objects on the basis of its attributes, its
links and attributes of objects linked to it " E.g., predict topic of a paper on the basis of
• Keywords occurrence • Citations and cocitations
q Link type prediction " Prediction of link type on the basis of objects attributes " E.g., predict if a link between two Web pages is an
advertising link or not q Predicting link existence
" Predict the presence of a link between two objects
Daniele Loiacono
Mining social networks (2)
q Link cardinality estimation " Prediction of the number of links to an object " Prediction of the number of objects reachable from a
specific object q Object reconciliation
" Discover if two objects are the same on the basis of their attributes and links
" E.g., predict if two websites are mirrors of each other q Group detection
" Clustering of objects on the basis both of their attributes and their links
q Subgraph detection " Discover characteristic subgraphs within network
Daniele Loiacono
Challenges
q Feature construction " Not only the objects attributes need to be considered but
also attributes of linked objects " Feature selection and aggregation techniques must be
applied to reduce the size of search space q Collective classification and consolidation
" Unlabeled data cannot be classified independently " New objects can be correlated and need to be considered
collectively to consolidate the current model q Link prediction
" The prior probability of link between two objects may be very low
q Community mining from multirelational networks " Many approaches assume an homogenous relationship
while social networks usually represent different communities and functionalities
Daniele Loiacono
Applications
q Link Prediction q Viral Marketing
Daniele Loiacono
Link prediction
q What edges will be added to the network? q Given a snapshot of a network at time t, link prediction aims
to predict the edges that will be added before a given future time t’
q Link prediction is generally solved assigning to each pair of nodes a weight score(X,Y)
q The higher the score the more likely that link will be added in the near future
q The score(X,Y) can be computed in several way " Shortest path: the shortest he path between X and Y the
highest is their score " Common neighbors: the greater the number of neighbors
X and Y have in common, the highest is their score " Ensemble of all paths: weighted sum of paths that
connects X and Y (shorter paths have usually larger weights)
Daniele Loiacono
Viral Marketing
q Several marketing approaches " Mass marketing is targeted on specific segment of
customers " Direct marketing is target on specific customers solely on
the basis of their characteristics " Viral marketing tries to exploit the social connections to
maximize the output of marketing actions q Each customer has a specific network value based on
" The number of connections " Its role in the network (e.g., opinion leader, listener) " Role of its connections
q Viral marketing aims to exploit the network value of customers to predict their influence and to maximize the outcome of marketing actions
Daniele Loiacono
Viral Marketing: Random Spreading
q 500 randomly chosen customers are given a product (from 5000).
Daniele Loiacono
Viral Marketing: Directed Spreading
q The 500 most connected consumers are given a product.