University of Crete CS483

The use of Minimum Spanning Trees in microarray expression data

Gkirtzou Ekaterini


Introduction

Classic clustering algorithms, such as K-means and self-organizing maps, have certain drawbacks:

  • No guarantee of globally optimal results
  • Dependence on the geometric shape of cluster boundaries (K-means)


Introduction

MST clustering algorithms:

  • Expression data clustering analysis (Xu et al., 2001)
  • Iterative clustering algorithm (Varma et al., 2004)
  • Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)


Definitions

A minimum spanning tree (MST) of a weighted, undirected graph $G = (V, E)$ with weights $w(e)$ is an acyclic subset $T \subseteq G$ that connects all of the vertices and whose total weight

$$w(T) = \sum_{e \in T} w(e)$$

is minimum.
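The definition above can be made concrete with a minimal sketch of Kruskal's algorithm, one standard way to build an MST. The slides do not say which MST algorithm the cited papers use, so the choice of Kruskal, the function name `kruskal_mst`, and the `(weight, u, v)` edge format are assumptions made for illustration only.

```python
def kruskal_mst(num_vertices, edges):
    """Return (total_weight, tree_edges) for a connected, weighted, undirected graph.

    edges is a list of (weight, u, v) tuples; vertices are numbered
    0..num_vertices-1. tree_edges holds (u, v, weight) tuples.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent links to the component representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree, total = [], 0.0
    for w, u, v in sorted(edges):          # consider the lightest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:               # the edge joins two components: no cycle
            parent[root_u] = root_v
            tree.append((u, v, w))
            total += w
    return total, tree
```

For a connected graph the returned tree has |V| - 1 edges, and the sum of their weights is w(T) from the definition.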


Definitions

DNA microarray technology enables the massively parallel measurement of the expression of thousands of genes simultaneously. Its uses include:

  • comparing the activity of genes in diseased and healthy cells
  • categorizing a disease into subgroups
  • drug discovery and toxicology studies


Definitions

Clustering is a common technique for data analysis. Clustering partitions the data set into subsets (clusters), so that the data in each subset share some common trait.


Expression data clustering analysis

Let $D = \{d_i\}$ be a set of expression data, with each $d_i = (e_i^1, \ldots, e_i^t)$ representing the expression levels at time 1 through time t of gene i. We define a weighted, undirected graph $G = (V, E)$ as follows. The vertex set is

$$V = \{ d_i \mid d_i \in D \}$$

and the edge set is

$$E = \{ (d_i, d_j) \mid d_i, d_j \in D \text{ and } i \neq j \}$$


Expression data clustering analysis

G is a complete graph.

The weight of each edge is the distance between its two vertices, e.g. the Euclidean distance or a distance derived from the correlation coefficient.

Each cluster corresponds to one subtree of the MST.

No essential information is lost for clustering.
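As an illustration of how the complete graph G might be built from expression profiles, the sketch below computes every pairwise edge weight. The helper names (`euclidean`, `correlation_distance`, `complete_graph_edges`) are hypothetical, and using 1 minus the Pearson correlation as the correlation-based distance is an assumption; the slides only name the correlation coefficient.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 - Pearson correlation; assumes neither profile is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)

def complete_graph_edges(profiles, dist=euclidean):
    """All (weight, i, j) edges with i < j, one per pair of profiles d_i, d_j."""
    return [(dist(profiles[i], profiles[j]), i, j)
            for i in range(len(profiles))
            for j in range(i + 1, len(profiles))]
```

With n profiles this yields n(n-1)/2 edges, which is why the MST is attractive as a much sparser summary of the same data.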


Clustering through removing long MST-edges

Based on the intuitive notion of a cluster: removing the K-1 longest MST edges leaves K subtrees, one per cluster

Works very well when inter-cluster edges are longer than intra-cluster ones
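A minimal sketch of this idea: drop the k-1 heaviest edges of an already computed MST and read the clusters off as connected components. The function name and the `(u, v, weight)` edge format are assumptions for illustration, not taken from the cited paper.

```python
def clusters_by_cutting_long_edges(num_vertices, mst_edges, k):
    """Cut the k-1 heaviest edges of a spanning tree; return the k components.

    mst_edges is a list of (u, v, weight) tuples forming a spanning tree
    over vertices 0..num_vertices-1.
    """
    keep = len(mst_edges) - (k - 1)
    kept = sorted(mst_edges, key=lambda e: e[2])[:keep]

    adjacency = {v: [] for v in range(num_vertices)}
    for u, v, _ in kept:
        adjacency[u].append(v)
        adjacency[v].append(u)

    clusters, seen = [], set()
    for start in range(num_vertices):
        if start in seen:
            continue
        # A depth-first search from an unseen vertex collects one subtree.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters
```

When one long edge separates two tight groups, cutting it recovers the groups; when the longest edges attach isolated outliers, this simple rule tends to split off singletons instead, which is one motivation for the objective-based variants.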


An iterative clustering

Minimizes the total distance between the center of each cluster and its data

Starts with K arbitrary clusters of the MST; for each pair of adjacent clusters, finds the edge to cut that optimizes

$$\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, \mathrm{center}(T_i)) \qquad (1)$$

where $T_i$ is the i-th subtree (cluster).
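To make objective (1) concrete, the sketch below evaluates it for a given partition. It assumes Euclidean distance and takes the center of a cluster to be the component-wise mean of its points; both are illustrative assumptions, and the function names are hypothetical.

```python
import math

def center(cluster):
    """Component-wise mean of the points in one cluster."""
    n = len(cluster)
    return [sum(p[c] for p in cluster) / n for c in range(len(cluster[0]))]

def objective_1(clusters):
    """Sum over all clusters of the distances from each point to its cluster center."""
    total = 0.0
    for cluster in clusters:
        c = center(cluster)
        total += sum(math.dist(p, c) for p in cluster)
    return total
```

An iterative scheme would compare this value across the candidate cuts between two adjacent clusters and keep the cut with the smallest total.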


A globally optimal clustering

Tries to partition the tree into K subtrees $T_1, \ldots, T_K$

Selects K representatives $d_1, \ldots, d_K$ to optimize

$$\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, d_i) \qquad (2)$$
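The sketch below evaluates objective (2) for a fixed partition by picking, in each cluster, the member that minimizes the summed distance to the others. It is not the tree-based search over all partitions that a globally optimal algorithm needs; it only shows what is being optimized. Euclidean distance and the function names are assumptions.

```python
import math

def best_representative(cluster):
    """Return (cost, point) for the member with the smallest summed distance to the cluster."""
    return min(
        (sum(math.dist(rep, p) for p in cluster), rep) for rep in cluster
    )

def objective_2(clusters):
    """Objective (2) for a given partition, using the best representative of each cluster."""
    return sum(best_representative(cluster)[0] for cluster in clusters)
```

Unlike the cluster center in (1), the representative is always one of the data points.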


Iterative clustering algorithm

The clustering measure used here is the Fukuyama-Sugeno measure

$$FS(S) = \sum_{k=1}^{2} \sum_{j=1}^{N_k} \left( \| x_j^k - \mu_k \|^2 - \| \mu_k - \mu \|^2 \right)$$

where $S_1$, $S_2$ are the two partitions of the set $S$, each $S_k$ containing $N_k$ samples; $\mu_k$ denotes the mean of the samples in $S_k$, $\mu$ the global mean of all samples, and $x_j^k$ the j-th sample in the cluster $S_k$.
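A minimal sketch of the Fukuyama-Sugeno measure for a binary partition, written as within-cluster scatter minus the separation of each cluster mean from the global mean. The helper names are hypothetical and samples are assumed to be plain lists of numbers.

```python
def mean(points):
    """Component-wise mean of a list of samples."""
    n = len(points)
    return [sum(p[c] for p in points) / n for c in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fukuyama_sugeno(s1, s2):
    """FS measure of the binary partition (s1, s2); lower values are better."""
    mu = mean(s1 + s2)                  # global mean of all samples
    total = 0.0
    for cluster in (s1, s2):
        mu_k = mean(cluster)            # mean of the samples in this cluster
        for x in cluster:
            total += sq_dist(x, mu_k) - sq_dist(mu_k, mu)
    return total
```

Compact, well separated partitions give strongly negative values, so the partition with the minimum FS value is preferred.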


Iterative clustering algorithm

Feature selection measures each gene's support for a partition

The measure used here is the t-statistic with pooled variance, applied as a heuristic

Genes whose absolute t-statistic exceeds a threshold are selected
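The selection rule can be sketched as follows, using the standard two-sample t-statistic with pooled variance. The function names, the data layout (one list of values per gene per cluster), and the threshold value in any call are illustrative assumptions.

```python
import math

def pooled_t_statistic(a, b):
    """Two-sample t-statistic with pooled variance for one gene.

    a and b hold the gene's expression values in the two clusters.
    Assumes len(a) + len(b) > 2 and a nonzero pooled variance.
    """
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def select_genes(expr_a, expr_b, threshold):
    """Indices of genes whose absolute t-statistic exceeds the threshold.

    expr_a[g] and expr_b[g] are gene g's values in the two clusters.
    """
    return [g for g in range(len(expr_a))
            if abs(pooled_t_statistic(expr_a[g], expr_b[g])) > threshold]
```

A gene with a large absolute t-statistic has clearly different mean expression in the two clusters, so it supports the partition.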


Iterative clustering algorithm

Create an MST from all genes

Delete edges from the MST to obtain binary partitions, and select the partition with the minimum F-S clustering measure

Feature selection is then used to select a subset of genes that discriminate between the clusters
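The edge-deletion step can be sketched as below: removing any single edge of a spanning tree splits it into exactly two components, so a tree on n vertices yields n-1 candidate binary partitions. The scoring function is passed in as a parameter (`score`, a hypothetical name); on the slides it would be the F-S measure. Edges are assumed to be `(u, v)` pairs.

```python
def binary_partitions(num_vertices, tree_edges):
    """Yield (removed_edge, side_a, side_b) for every edge of a spanning tree."""
    for cut, (a, _b) in enumerate(tree_edges):
        adjacency = {v: [] for v in range(num_vertices)}
        for idx, (u, v) in enumerate(tree_edges):
            if idx != cut:                      # leave out the removed edge
                adjacency[u].append(v)
                adjacency[v].append(u)
        # Collect the component containing one endpoint of the removed edge.
        side_a, stack = {a}, [a]
        while stack:
            for nxt in adjacency[stack.pop()]:
                if nxt not in side_a:
                    side_a.add(nxt)
                    stack.append(nxt)
        side_b = sorted(set(range(num_vertices)) - side_a)
        yield tree_edges[cut], sorted(side_a), side_b

def best_binary_partition(num_vertices, tree_edges, score):
    """Return the (edge, side_a, side_b) triple with the lowest score(side_a, side_b)."""
    return min(binary_partitions(num_vertices, tree_edges),
               key=lambda part: score(part[1], part[2]))
```

Rebuilding the adjacency lists for every cut costs quadratic time in the number of vertices, which is acceptable for a sketch but not for large trees.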


Iterative clustering algorithm

In the next iteration, the clustering is done on this selected set of genes

Repeat until the selected gene subset converges

Then remove the converged genes from the pool and continue.


Dynamically growing self-organizing tree (DGSOT)

In the previous algorithms the MST is constructed on the original set of data and used to test the intra-cluster quantity, while here the MST is used as a criterion to test the inter-cluster property.


DGSOT algorithm

Tree-structured self-organizing neural network

  • Grows vertically and horizontally
  • Starts with a root-leaf node
  • In every vertical growth step, each leaf node whose heterogeneity exceeds the threshold ($H_{et} > T_R$) gets two descendants, and the learning process takes place


DGSOT algorithm: heterogeneity

  • Variability (maximum distance between input data and node)
  • Average distortion $d$ of a leaf:

$$d = \frac{\sum_{j=1}^{D} d(x_j, n_i)}{D}$$

where

  • $D$: total number of input data of leaf i
  • $d(x_j, n_i)$: distance between data j and leaf i
  • $n_i$: reference vector of leaf i
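Both heterogeneity measures are short to compute. The sketch below assumes Euclidean distance and adds a hypothetical helper, `needs_vertical_growth`, that compares the chosen measure with a threshold; the threshold value and the helper name are illustrative and not taken from the DGSOT paper.

```python
import math

def average_distortion(data, reference):
    """Mean distance between a leaf's input data and its reference vector."""
    return sum(math.dist(x, reference) for x in data) / len(data)

def variability(data, reference):
    """Maximum distance between a leaf's input data and its reference vector."""
    return max(math.dist(x, reference) for x in data)

def needs_vertical_growth(data, reference, threshold, measure=average_distortion):
    """True when the leaf's heterogeneity exceeds the given threshold."""
    return measure(data, reference) > threshold
```

Variability reacts to a single distant point, while average distortion reflects the leaf as a whole.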


DGSOT algorithm

In every horizontal growth step, a child is added to every lowest non-leaf node until the validation criterion is satisfied, and the learning process takes place

The learning process distributes the data among the leaves in the best way: each input is assigned to its best matching node, the one with the minimum distance to the input data


The validation criterion of DGSOT

Calculated without human intervention

Based on geometric characteristics of the clusters

Create the Voronoi diagram for the input data. The Voronoi diagram divides the data set D into n regions V(p):

$$V(p) = \{ x \in D \mid \mathrm{dist}(x, p) \le \mathrm{dist}(x, q) \;\; \forall q \}$$
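For a finite data set, the Voronoi regions reduce to a nearest-site assignment. The sketch below assumes Euclidean distance; the names `voronoi_cells` and `sites` are illustrative, and a point that is equally close to two sites is placed in the cell of the lower-indexed one.

```python
import math

def voronoi_cells(data, sites):
    """Group data points by nearest site; ties go to the lowest-indexed site."""
    cells = [[] for _ in sites]
    for x in data:
        nearest = min(range(len(sites)), key=lambda i: math.dist(x, sites[i]))
        cells[nearest].append(x)
    return cells
```

Cell i then holds exactly the points x for which dist(x, p_i) is no larger than the distance to any other site.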


The validation criterion of DGSOT

Let's define a weighted, undirected graph $G = (V, E)$. The vertex set $V(p)$ is the set of the centroids of the Voronoi cells, and the edge set is defined as

$$E = \{ (p_i, p_j) \mid p_i, p_j \in V(p) \text{ and } i \neq j \}$$

Create the MST for the graph $G = (V, E)$


Voronoi diagram of 2D dataset

In A, the dataset is partitioned into three Voronoi cells. The MST of the centroids is 'even'.

In B, the dataset is partitioned into four Voronoi cells. The MST of the centroids is not 'even'.


The validation criterion of DGSOT

Cluster separation

$$CS = \frac{E_{\min}}{E_{\max}}$$

where $E_{\min}$ is the minimum-length edge and $E_{\max}$ is the maximum-length edge of the MST

A low value of CS means that two centroids are too close to each other and the Voronoi partition is not valid, while a high CS value means that the Voronoi partition is valid.
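A minimal sketch of the CS computation: build the MST of the centroids and divide its shortest edge length by its longest. Prim's algorithm and Euclidean distance are assumptions made for the example, as is the function name; the sketch needs at least two distinct centroids.

```python
import math

def cluster_separation(centroids):
    """CS = E_min / E_max over the edges of the centroids' MST (Prim's algorithm)."""
    n = len(centroids)
    in_tree = [False] * n
    best = [math.inf] * n        # cheapest known link from each vertex to the tree
    best[0] = 0.0                # start growing the tree from centroid 0
    edge_lengths = []
    for step in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if step > 0:             # the starting vertex brings no edge with it
            edge_lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(centroids[u], centroids[v]))
    return min(edge_lengths) / max(edge_lengths)
```

CS lies between 0 and 1: values near 1 correspond to an 'even' MST, while values near 0 indicate that two centroids sit much closer together than the rest.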


Example of DGSOT


Conclusions

The tree algorithms presented in this report provide results comparable to those obtained by classic clustering algorithms, without their drawbacks, and superior to those obtained by standard hierarchical clustering.


Questions