Data Mining-Graph Mining

  • GRAPH MINING

    Graph Mining

    Graphs model sophisticated structures and their interactions.
    Application areas: chemical informatics, bioinformatics, computer vision, video indexing, text retrieval, Web analysis, social networks.
    Mining frequent subgraph patterns supports characterization, discrimination, classification, and cluster analysis, as well as building graph indices and similarity search.

    Mining Frequent Subgraphs

    A graph g has a vertex set V(g) and an edge set E(g).
    A label function maps each vertex or edge to a label.
    A graph g is a subgraph of another graph g' if there exists a subgraph isomorphism from g to g'.
    Support(g), or frequency(g), is the number of graphs in D = {G1, G2, ..., Gn} in which g is a subgraph.
    A frequent graph is one whose support is no less than min_sup.
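
    The support definition above can be sketched in a few lines. This is a minimal brute-force illustration, not an efficient miner: the graph representation, the helper names (make_graph, is_subgraph, support) and the tiny three-graph database are all assumptions made for the example. The containment test tries every injective vertex mapping, which is only practical for very small graphs.

```python
from itertools import permutations

def make_graph(vertex_labels, labeled_edges):
    """Graph = (vertex -> label, frozenset({u, v}) -> edge label)."""
    return vertex_labels, {frozenset((u, v)): lab for u, v, lab in labeled_edges}

def is_subgraph(pattern, graph):
    """Brute force: try every injective mapping of pattern vertices into graph."""
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    p_vertices = list(p_labels)
    for image in permutations(g_labels, len(p_vertices)):
        m = dict(zip(p_vertices, image))
        labels_ok = all(p_labels[v] == g_labels[m[v]] for v in p_vertices)
        # Every pattern edge must map onto a graph edge with the same label.
        edges_ok = all(
            g_edges.get(frozenset(m[v] for v in edge)) == lab
            for edge, lab in p_edges.items()
        )
        if labels_ok and edges_ok:
            return True
    return False

def support(pattern, database):
    """Number of database graphs that contain the pattern."""
    return sum(is_subgraph(pattern, g) for g in database)

# Hypothetical database of three small labeled graphs.
D = [
    make_graph({1: "C", 2: "C", 3: "O"}, [(1, 2, "s"), (2, 3, "s"), (1, 3, "d")]),
    make_graph({1: "C", 2: "C", 3: "N"}, [(1, 2, "s"), (2, 3, "s")]),
    make_graph({1: "O", 2: "C"}, [(1, 2, "s")]),
]
cc = make_graph({"a": "C", "b": "C"}, [("a", "b", "s")])
co = make_graph({"a": "C", "b": "O"}, [("a", "b", "s")])
min_sup = 2
frequent = [name for name, p in (("C-C", cc), ("C-O", co)) if support(p, D) >= min_sup]
print(frequent)
```

    The exhaustive mapping loop is what makes the frequency check expensive; real miners avoid most of these tests through candidate pruning.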

    Discovery of Frequent Substructures

    Step 1: Generate frequent substructure candidates.
    Step 2: Check the frequency of each candidate. This involves a subgraph isomorphism test, which is computationally expensive.

    Approaches

    Apriori-based approach
    Pattern-growth approach

    Apriori-based Approach

    Start with graphs of small size and generate candidates with an extra vertex, edge, or path.

    AGM (Apriori-based Graph Mining)
    Vertex-based candidate generation increases the substructure size by one vertex at each step (size = number of vertices).
    Two frequent size-k graphs are joined only if they share the same size-(k-1) subgraph.
    The new candidate contains the common size-(k-1) component and the two additional vertices. Because the two additional vertices may or may not be connected by an edge, two different substructures can be formed.

    FSG (Frequent Subgraph mining)
    Edge-based candidate generation increases the substructure size by one edge at a time.
    Two size-k patterns are merged if and only if they share the same subgraph having k-1 edges (the core).
    The new candidate contains the core and the two additional edges.

    Edge-disjoint path method
    Graphs are classified by the number of disjoint paths they have.
    Two paths are edge-disjoint if they do not share any common edge.
    A substructure pattern with k+1 disjoint paths is generated by joining substructures with k disjoint paths.
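
    As a small illustration of the edge-disjointness definition, the sketch below tests whether two paths share an edge. It assumes undirected paths given as vertex sequences; the function names are hypothetical and the candidate-joining step itself is not shown.

```python
def path_edges(path):
    """Undirected edges of a path given as a sequence of vertices."""
    return {frozenset(pair) for pair in zip(path, path[1:])}

def are_edge_disjoint(path_a, path_b):
    """True if the two paths have no edge in common (shared vertices are allowed)."""
    return path_edges(path_a).isdisjoint(path_edges(path_b))

print(are_edge_disjoint([1, 2, 3], [3, 4, 1]))  # shares vertices 1 and 3 only
print(are_edge_disjoint([1, 2, 3], [3, 2, 5]))  # shares the edge between 2 and 3
```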

    Disadvantages of Apriori Approaches

    Overhead when joining two substructures.
    Uses a BFS strategy: level-wise candidate generation.
    To check whether a size-(k+1) graph is frequent, all of its size-k subgraphs must be checked.
    May consume more memory.

    Pattern-Growth Approach

    Can use BFS as well as DFS.
    A graph g can be extended by adding a new edge e. The newly formed graph is denoted by g ◇x e.
    Edge e may or may not introduce a new vertex to g. If e introduces a new vertex, the new graph is denoted by g ◇xf e; otherwise, g ◇xb e, where f or b indicates that the extension is in a forward or backward direction.
    For each discovered graph g, extensions are performed recursively until all frequent graphs containing g are found.
    Simple but inefficient: the same graph is discovered multiple times (duplicate graphs).

    Pattern Growth in the gSpan Algorithm

    Reduces the generation of duplicate graphs and does not extend duplicate graphs.
    Uses depth-first order. A graph may have several DFS trees.
    The visiting order of the vertices forms a linear order, recorded as a subscript (v0, v1, ..., vn).
    In a DFS tree, the starting vertex is the root and the last visited vertex is the right-most vertex.
    The path from v0 to vn is the right-most path.

    gSpan Algorithm

    gSpan restricts the extension method. A new edge e can be added between the right-most vertex and another vertex on the right-most path (backward extension), or it can introduce a new vertex and connect it to a vertex on the right-most path (forward extension).
    This right-most extension is denoted by G ◇r e.
    gSpan chooses any one DFS tree as the base subscripting and extends it.
    Each subscripted graph is transformed into an edge sequence, its DFS code. The subscripting that generates the minimum sequence is selected.
    Edge order maps the edges in a subscripted graph into a sequence.
    Sequence order builds an order among edge sequences.

    The DFS codes are organized in a search tree:
    The root is the empty code.
    Each node is a DFS code encoding a graph.
    Each edge is a right-most extension from a (k-1)-length DFS code to a k-length DFS code.
    If codes s and s' encode the same graph and s' is not the minimum DFS code, the search space under s' can be safely pruned.
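
    The right-most extension rule can be illustrated on a DFS code. The sketch below assumes a DFS code is a list of 5-tuples (i, j, l_i, l_ij, l_j) for a simple undirected graph, where i < j marks a forward edge; the function names and the example code are hypothetical. It only lists which vertex pairs an extension may connect. Label choices, frequency counting, and the minimum DFS code test are left out.

```python
def rightmost_path(dfs_code):
    """Vertex subscripts from the root v0 to the right-most vertex."""
    if not dfs_code:
        return []
    current = max(max(i, j) for i, j, *_ in dfs_code)  # right-most vertex
    path = [current]
    for i, j, *_ in reversed(dfs_code):
        if i < j and j == current:  # forward edge that discovered `current`
            path.append(i)
            current = i
    path.reverse()
    return path

def extension_slots(dfs_code):
    """Vertex pairs allowed by right-most extension: (backward, forward)."""
    path = rightmost_path(dfs_code)
    rightmost = path[-1]
    existing = {frozenset((i, j)) for i, j, *_ in dfs_code}
    # Backward: from the right-most vertex to an earlier vertex on the path,
    # skipping pairs that are already connected.
    backward = [(rightmost, v) for v in path[:-1]
                if frozenset((rightmost, v)) not in existing]
    # Forward: a new vertex attached to any vertex on the right-most path.
    new_vertex = rightmost + 1
    forward = [(v, new_vertex) for v in reversed(path)]
    return backward, forward

# Hypothetical DFS code: triangle 0-1-2 plus a forward edge 1-3.
code = [
    (0, 1, "X", "a", "Y"),
    (1, 2, "Y", "b", "X"),
    (2, 0, "X", "a", "X"),
    (1, 3, "Y", "c", "Z"),
]
print(rightmost_path(code))
print(extension_slots(code))
```

    Restricting growth to these slots is what keeps the number of duplicate candidates down; vertex 2 in the example is off the right-most path and receives no new edges.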

    Mining Closed Frequent Substructures

    Helps to overcome the problem of pattern explosion.
    A frequent graph G is closed if and only if there is no proper supergraph G' that has the same support as G (CloseGraph algorithm).
    A frequent pattern G is maximal if and only if there is no frequent super-pattern of G.
    The maximal pattern set is a subset of the closed pattern set, but it cannot be used to reconstruct the entire set of frequent patterns, because the support of a proper sub-pattern of a maximal pattern is lost.
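
    The two definitions can be compared on a toy example. The sketch below makes a strong simplifying assumption: every pattern is written as a set of uniquely named edges, so "supergraph" reduces to "proper superset" and no isomorphism test is needed. The support values and the function name are invented for illustration.

```python
def closed_and_maximal(supports):
    """supports: frozenset of edges -> support, frequent patterns only."""
    closed, maximal = set(), set()
    for pattern, sup in supports.items():
        supers = [q for q in supports if pattern < q]  # frequent proper super-patterns
        if not supers:
            maximal.add(pattern)
        if all(supports[q] != sup for q in supers):
            closed.add(pattern)
    return closed, maximal

# Hypothetical frequent patterns with min_sup = 2.
supports = {
    frozenset({"AB"}): 3,
    frozenset({"BC"}): 4,
    frozenset({"CD"}): 2,
    frozenset({"AB", "BC"}): 3,
    frozenset({"BC", "CD"}): 2,
}
closed, maximal = closed_and_maximal(supports)
print(sorted(sorted(p) for p in closed))
print(sorted(sorted(p) for p in maximal))
```

    In this example {BC} is closed but not maximal: its support of 4 cannot be recovered from the two maximal patterns alone.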

    Mining Alternative Substructure Patterns

    Mining unlabeled or partially labeled graphs
    A new empty label, φ, is assigned to vertices and edges that do not have labels.

    Mining non-simple graphs
    A non-simple graph may have self-loops and multiple edges.
    The growing order becomes: backward edges, self-loops, then forward edges.
    To handle multiple edges, allow two neighboring edges in a DFS code to share the same vertices.

    Mining directed graphs
    Each edge is represented by a 6-tuple (i, j, d, l_i, l_(i,j), l_j), where d = +1 or -1 indicates the direction of the edge.

    Mining disconnected graphs
    A graph or a pattern may be disconnected.
    Disconnected graph: add a virtual vertex to connect the components.
    Disconnected graph pattern: treat it as a set of connected graphs.

    Mining frequent subtrees
    A tree is a degenerate graph.

    Constraint-based Mining of Substructure Patterns

    Element, set, or subgraph containment constraint
    The user requires that the mined patterns contain a particular set of subgraphs. This is a succinct constraint.

    Geometric constraint
    For example, the angle between each pair of connected edges must be within a given range. This is an anti-monotonic constraint.

    Value-sum constraint
    The sum of the (positive) weights on the edges must be within a range from low to high. The lower bound (sum > low) is monotonic; the upper bound (sum < high) is anti-monotonic.

    Multiple categories of constraints may also be enforced.
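
    A short sketch of how the two halves of a value-sum constraint behave during pattern growth. It assumes strictly positive edge weights and a pattern represented only by its list of weights; the function names and the numbers are hypothetical.

```python
def satisfies(edge_weights, low, high):
    """Value-sum constraint: low < sum of edge weights < high."""
    return low < sum(edge_weights) < high

def can_prune(edge_weights, high):
    """Anti-monotonic upper bound: with positive weights, once the sum reaches
    `high`, every super-pattern violates it too, so growth can stop."""
    return sum(edge_weights) >= high

low, high = 4, 10
small = [1, 2]       # below `low`: not yet valid, but extensions may become valid
valid = [2, 3]       # inside the range
too_big = [2, 3, 6]  # at or above `high`: prune this branch

for weights in (small, valid, too_big):
    print(weights, satisfies(weights, low, high), can_prune(weights, high))
```

    The lower bound cannot be used for pruning in the same way: a pattern that fails it may still grow into one that passes, which is why it is monotonic.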

    Mining Approximate Frequent Substructures

    Approximate frequent substructures allow slight structural variations.
    Several slightly different frequent substructures can be represented by one approximate substructure.

    SUBDUE
    A substructure discovery system based on the Minimum Description Length (MDL) principle.
    Adopts a constrained beam search.
    Performs approximate matching.

    Mining Coherent and Dense Substructures

    A frequent substructure G is a coherent subgraph if the mutual information between G and each of its own subgraphs is above some threshold.
    This reduces the number of patterns mined.
    Application: coherent substructure mining selects a small subset of features that have high distinguishing power between protein classes.

    Relational graph: each label is used only once.
    Frequent highly connected, or dense, subgraph mining. Examples:
    People with strong associations in online social networks (OSNs).
    Sets of genes within the same functional module.

    Density cannot be judged from the average degree or the minimum degree alone; connectedness must be ensured.
    Example: average degree 3.25, minimum degree 3.

    Mining Dense Substructures

    Dense graphs are defined in terms of edge connectivity.
    Given a graph G, an edge cut is a set of edges Ec such that E(G) - Ec is disconnected.
    A minimum cut is the smallest set among all edge cuts.
    The edge connectivity of G is the size of a minimum cut.
    A graph is dense if its edge connectivity is no less than a specified minimum cut threshold.
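
    To see why degree alone is not a reliable density measure, the sketch below computes edge connectivity by brute force on a hypothetical graph: two 4-vertex cliques joined by a single edge. The graph is an assumption chosen for illustration; it has average degree 3.25 and minimum degree 3, yet one edge removal disconnects it. Trying all cuts is exponential and is shown only to mirror the definition.

```python
from itertools import combinations

def is_connected(vertices, edges):
    """Depth-first search connectivity test on a non-empty undirected graph."""
    vertices = set(vertices)
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] - seen)
    return seen == vertices

def edge_connectivity(vertices, edges):
    """Size of a minimum edge cut, found by trying cuts of increasing size."""
    edges = list(edges)
    for size in range(len(edges) + 1):
        for cut in combinations(edges, size):
            remaining = [e for e in edges if e not in cut]
            if not is_connected(vertices, remaining):
                return size
    return len(edges)  # only reached when the graph has a single vertex

# Two 4-cliques (vertices 0-3 and 4-7) joined by the single edge (3, 4).
vertices = range(8)
edges = list(combinations(range(0, 4), 2)) + list(combinations(range(4, 8), 2)) + [(3, 4)]
degree = {v: sum(v in e for e in edges) for v in vertices}
average_degree = sum(degree.values()) / len(degree)
min_degree = min(degree.values())
k = edge_connectivity(vertices, edges)
print(average_degree, min_degree, k)
```

    With a minimum cut threshold of 2 or more, this graph would not count as dense even though both degree statistics look high.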

    Pattern-growth approach: CloseCut (scalable)
    Starts with a small frequent candidate graph and extends it until it finds the largest supergraph with the same support.

    Pattern-reduction approach: Splat (high performance)
    Directly intersects relational graphs to obtain highly connected graphs.
    A pattern g discovered in a set is progressively intersected with subsequent components to give g'. Some edges in g may be removed.
    The size of candidate graphs is reduced by intersection and decomposition operations.
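
    Because each label appears only once in a relational graph, a graph can be written as a plain set of label pairs, and intersecting two graphs is set intersection. The sketch below shows only that intersection step followed by decomposition into connected components; it is not the Splat algorithm itself, and the helper names and example graphs are hypothetical.

```python
def edge_set(*pairs):
    """Relational graph as a set of edges; 'ab' means an edge between a and b."""
    return {frozenset(p) for p in pairs}

def intersect(graph_a, graph_b):
    """Edges present in both relational graphs."""
    return graph_a & graph_b

def components(edges):
    """Decompose an edge set into the edge sets of its connected components."""
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, result = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        result.append({e for e in edges if e <= comp})
    return result

g1 = edge_set("ab", "bc", "ac", "cd", "de", "ef")
g2 = edge_set("ab", "bc", "ac", "de", "ef", "fd")
common = intersect(g1, g2)
parts = components(common)
print(len(common), [len(p) for p in parts])
```

    Here the intersection drops the edges cd and fd, and the remainder splits into a triangle and a two-edge path; each piece could then be checked against a connectivity threshold.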

    Applications

    Graph Indexing
    Indexing is essential for efficient search and query processing.
    Traditional approaches are not feasible for graphs.
    Indexing can be based on nodes, edges, or subgraphs.

    Path-based indexing approach
    Enumerate all the paths in a database up to length maxL and index them.
    The index is used to identify all graphs that contain the paths in the query.
    Not suitable for complex graph queries: structural information is lost when a query graph is broken apart, so many false positives may be returned.
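
    A minimal sketch of path-based indexing and filtering, assuming vertex-labeled undirected graphs and an index keyed by the label sequence of each simple path. All names (graph, label_paths, build_index, candidates) and the two database graphs are hypothetical. The result is only a candidate set; a real system would still verify each candidate with a subgraph isomorphism test.

```python
from collections import defaultdict

def graph(labels, edges):
    """Graph = (vertex -> label, vertex -> list of neighbours)."""
    adj = {v: [] for v in labels}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return labels, adj

def label_paths(labels, adj, max_len):
    """Label sequences of all simple paths with at most max_len edges."""
    found = set()
    def walk(path):
        found.add(tuple(labels[v] for v in path))
        if len(path) - 1 < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])
    for v in labels:
        walk([v])
    return found

def build_index(database, max_len):
    """Map each label path to the ids of the graphs that contain it."""
    index = defaultdict(set)
    for gid, (labels, adj) in database.items():
        for p in label_paths(labels, adj, max_len):
            index[p].add(gid)
    return index

def candidates(index, query, max_len, all_ids):
    """Graphs that contain every indexed path of the query."""
    labels, adj = query
    result = set(all_ids)
    for p in label_paths(labels, adj, max_len):
        result &= index.get(p, set())
    return result

db = {
    "G1": graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3), (3, 1)]),          # triangle
    "G2": graph({1: "A", 2: "B", 3: "C", 4: "A"}, [(1, 2), (2, 3), (3, 4)]),  # chain
}
query = graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3), (3, 1)])
for max_len in (1, 2):
    print(max_len, sorted(candidates(build_index(db, max_len), query, max_len, db)))
```

    With maxL = 1 the chain G2 survives filtering although it contains no triangle, which is the kind of false positive that arises when the query is broken into paths; raising maxL to 2 removes it here at the price of a larger index.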

    gIndex
    Considers frequent and discriminative substructures as index features.
    A frequent substructure is discriminative if its support cannot be approximated by the intersection of the graph sets that contain one of its subgraphs.
    Achieves good performance at less cost.

    Substructure Similarity Search

    Bioinformatics and cheminformatics applications involve query-based search in massive, complex structural data.

    Grafil (Graph Similarity Filtering)
    Feature-based structural filtering.
    Models each query graph as a set of features.
    Edge deletions in the query are translated into feature misses.
    Too many features reduce performance, so a multi-filter composition strategy is used.
    Feature set: a group of similar features.

    Classification and Cluster Analysis using Graph Patterns

    Graph classification
    Mine frequent graph patterns.
    Features that are frequent in one class but less frequent in another are discriminative features.
    Model construction: frequency and connectivity thresholds can be adjusted; SVM, NBM, etc. are used.

    Cluster analysis
    Cluster similar graphs based on graph connectivity (minimal cuts).
    Hierarchical clusters can be formed based on the support threshold.
    Outliers can also be detected.

    Inter-related process
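
    One simple way to look for discriminative features, sketched under stated assumptions: pattern containment has already been computed (each pattern maps to the set of graph ids that contain it), and a pattern is scored by the difference between its relative supports in two classes. The scoring rule, the names, and the data are illustrative choices; the source does not prescribe a particular measure.

```python
def class_support(hits, labels, cls):
    """Fraction of the graphs in class `cls` that contain the pattern."""
    members = [g for g, c in labels.items() if c == cls]
    return sum(g in hits for g in members) / len(members)

def rank_features(pattern_hits, labels, cls_a, cls_b):
    """Patterns sorted by how differently they occur in the two classes."""
    scores = {
        p: class_support(hits, labels, cls_a) - class_support(hits, labels, cls_b)
        for p, hits in pattern_hits.items()
    }
    return sorted(scores.items(), key=lambda item: -abs(item[1]))

# Hypothetical class labels and precomputed pattern containment.
labels = {"g1": "active", "g2": "active", "g3": "inactive", "g4": "inactive"}
pattern_hits = {
    "P1": {"g1", "g2"},              # only in active graphs
    "P2": {"g1", "g2", "g3", "g4"},  # everywhere: frequent but not discriminative
    "P3": {"g3"},                    # only in one inactive graph
}
ranked = rank_features(pattern_hits, labels, "active", "inactive")
print(ranked)
```

    The top-ranked patterns could then serve as binary features for a general-purpose classifier.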
