

The use of Minimum Spanning Trees in microarray expression data

Gkirtzou Ekaterini


Introduction

Classic clustering algorithms, like K-means and self-organizing maps, have certain drawbacks:

- No guarantee of globally optimal results
- Results depend on the geometric shape of cluster boundaries (K-means)


Introduction

MST clustering algorithms:

- Expression data clustering analysis (Xu et al., 2001)
- Iterative clustering algorithm (Varma et al., 2004)
- Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)


Definitions

A minimum spanning tree (MST) T of a weighted, undirected graph G = (V, E) with edge weights w(e) is an acyclic subgraph that contains all of the vertices and whose total weight

w(T) = \sum_{e \in T} w(e)

is minimum.
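As an illustration of the definition, here is a minimal sketch of Prim's algorithm for a complete graph given as a dense symmetric weight matrix (the function name and representation are choices made for this sketch, not part of the original slides):

```python
import numpy as np

def prim_mst(weights):
    """Prim's algorithm for a complete graph given as an n x n symmetric
    weight matrix. Returns the MST as a list of edges (u, v) whose total
    weight w(T) = sum of w(e) over e in T is minimum."""
    n = len(weights)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = weights[0].astype(float)   # cheapest known edge linking each v to the tree
    parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        # pick the cheapest vertex still outside the tree
        v = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[v]), v))
        in_tree[v] = True
        # relax: vertices that are now closer to the tree via v
        closer = weights[v] < best
        best[closer] = weights[v][closer]
        parent[closer] = v
    return edges
```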


Definitions

The DNA microarray technology enables the massively parallel measurement of the expression of thousands of genes simultaneously. It is useful to:

- compare the activity of genes in diseased and healthy cells
- categorize a disease into subgroups
- support drug discovery and toxicology studies


Definitions

Clustering is a common technique for data analysis. It partitions the data set into subsets (clusters) so that the data in each subset share some common trait.


MST clustering algorithms:

- Expression data clustering analysis (Xu et al., 2001)
- Iterative clustering algorithm (Varma et al., 2004)
- Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)


Expression data clustering analysis

Let D = \{d_i\} be a set of expression data, with each d_i = (e_i^1, \ldots, e_i^t) representing the expression levels at time 1 through time t of gene i. We define a weighted, undirected graph G = (V, E) as follows. The vertex set is V = \{d_i \mid d_i \in D\} and the edge set is E = \{(d_i, d_j) \mid d_i, d_j \in D \text{ and } i \neq j\}.


G is a complete graph. The weight of each edge is the distance between its two vertices, e.g. Euclidean distance or correlation coefficient.

Each cluster corresponds to one subtree of the MST, so no essential information is lost for clustering.
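A sketch of this construction, assuming the expression profiles sit in a NumPy array with one row per gene; SciPy's minimum_spanning_tree handles the MST step (the toy data and variable names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
expr = rng.random((50, 8))   # toy data: 50 genes, t = 8 time points

# Complete graph G: the weight of edge (d_i, d_j) is the distance
# between the two expression profiles (Euclidean distance here).
weights = squareform(pdist(expr, metric="euclidean"))

# MST of G as a sparse matrix: mst[i, j] > 0 marks a tree edge.
# (Note: scipy treats zero entries of a dense matrix as missing
# edges, so identical profiles would need a small offset.)
mst = minimum_spanning_tree(weights)
```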


Clustering through removing long MST-edges

- Based on the intuitive notion of a cluster
- Works very well when inter-cluster edges are longer than intra-cluster ones
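A minimal sketch of this edge-removal step, reusing the sparse MST from the previous snippet (the helper name is illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def cut_longest_edges(mst, k):
    """Remove the k-1 longest MST edges; the connected components
    that remain are the k clusters."""
    m = mst.tocoo()
    keep = np.argsort(m.data)[: len(m.data) - (k - 1)]  # drop the k-1 heaviest
    pruned = coo_matrix((m.data[keep], (m.row[keep], m.col[keep])),
                        shape=m.shape)
    _, labels = connected_components(pruned, directed=False)
    return labels  # labels[i] = cluster index of gene i
```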


An iterative clustering

Minimizes the distance between the center of a cluster and its data. Starts with K arbitrary clusters of the MST; for each pair of adjacent clusters, it finds the edge to cut that optimizes

\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, \mathrm{center}(T_i)) \qquad (1)
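A sketch of objective (1) as a scoring function; each candidate cut between a pair of adjacent clusters would be scored with it and the best-scoring cut kept (assumes the same array layout as the earlier snippets, with the cluster center taken to be the mean):

```python
import numpy as np

def objective_1(expr, labels):
    """Sum over clusters T_i of the distances from each profile d
    in T_i to center(T_i), taken here as the cluster mean."""
    total = 0.0
    for c in np.unique(labels):
        members = expr[labels == c]
        center = members.mean(axis=0)
        total += np.linalg.norm(members - center, axis=1).sum()
    return total
```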


A globally optimal clustering

Tries to partition the tree into K subtrees, selecting K representatives d_1, \ldots, d_K to optimize

\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, d_i) \qquad (2)
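Xu et al. optimize (2) exactly with dynamic programming on the tree; purely as an illustration of the objective, a brute-force version for small inputs can enumerate every choice of K-1 MST edges to remove (all names here are illustrative):

```python
import numpy as np
from itertools import combinations
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def best_k_partition(mst, dist, k):
    """Try every set of k-1 MST edges to remove; within each resulting
    subtree T_i the best representative d_i is the member minimizing
    the total distance to the other members (dist is the full pairwise
    distance matrix). Exponential cost: illustration only."""
    m = mst.tocoo()
    all_edges = np.arange(len(m.data))
    best_cost, best_labels = np.inf, None
    for cut in combinations(all_edges, k - 1):
        keep = np.setdiff1d(all_edges, cut)
        pruned = coo_matrix((m.data[keep], (m.row[keep], m.col[keep])),
                            shape=m.shape)
        _, labels = connected_components(pruned, directed=False)
        # for each subtree, the cheapest column sum picks the representative
        cost = sum(dist[np.ix_(idx, idx)].sum(axis=0).min()
                   for c in range(k)
                   for idx in [np.flatnonzero(labels == c)])
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels, best_cost
```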


MST clustering algorithms:

- Expression data clustering analysis (Xu et al., 2001)
- Iterative clustering algorithm (Varma et al., 2004)
- Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)


Iterative clustering algorithm

The clustering measure used here is the Fukuyama-Sugeno measure

FS(S) = \sum_{k=1}^{2} \sum_{j=1}^{N_k} \left( \lVert x_{kj} - \mu_k \rVert^2 - \lVert \mu_k - \mu \rVert^2 \right)

where S_1 and S_2 are the two partitions of the set S, each S_k contains N_k samples, \mu_k denotes the mean of the samples in S_k, \mu the global mean of all samples, and x_{kj} the j-th sample in cluster S_k.
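A direct transcription of the measure, assuming one row per sample and a hard labeling of the samples (names are illustrative):

```python
import numpy as np

def fukuyama_sugeno(x, labels):
    """FS(S) = sum_k sum_j ( ||x_kj - mu_k||^2 - ||mu_k - mu||^2 )
    for a hard partition; lower values indicate a tighter and
    better-separated partition."""
    mu = x.mean(axis=0)                     # global mean of all samples
    fs = 0.0
    for k in np.unique(labels):
        xk = x[labels == k]                 # samples of partition S_k
        mu_k = xk.mean(axis=0)              # partition mean
        fs += ((xk - mu_k) ** 2).sum()      # within-cluster scatter
        fs -= len(xk) * ((mu_k - mu) ** 2).sum()
    return fs
```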


Feature selection scores each gene's support for a partition. The feature selection used here is the t-statistic with pooled variance; the t-statistic is a heuristic measure. Genes whose absolute t-statistic exceeds a threshold are selected.
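A sketch of the pooled-variance t-statistic and the thresholding step, assuming expr holds one row per gene and one column per sample, with labels marking the two partitions of the samples (names are illustrative):

```python
import numpy as np

def pooled_t(x1, x2):
    """Two-sample t-statistic with pooled variance for one gene;
    x1, x2 hold the gene's values in the two partitions."""
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

def select_genes(expr, labels, threshold):
    """Indices of genes whose |t| exceeds the threshold."""
    t = np.array([pooled_t(g[labels == 0], g[labels == 1]) for g in expr])
    return np.flatnonzero(np.abs(t) > threshold)
```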


Create an MST from all genes. Delete edges from the MST to obtain binary partitions, and select the partition with the minimum F-S clustering measure. Feature selection is then used to pick out the subset of genes that discriminate between the clusters.


In the next iteration the clustering is done on this selected set of genes, and the process repeats until the selected gene subset converges. The converged genes are then removed from the pool and the search continues. A sketch of this outer loop follows below.
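A sketch of that outer loop under the same assumptions; cluster_fn is a hypothetical stand-in for the MST-plus-minimum-FS binary split, and select_genes comes from the earlier snippet:

```python
import numpy as np

def iterate_until_convergence(expr, threshold, cluster_fn):
    """Cluster the samples on the current gene subset, reselect genes
    by |t| > threshold, and repeat until the gene subset stops changing.
    Relies on select_genes() from the previous snippet."""
    genes = np.arange(expr.shape[0])          # start from all genes
    while True:
        labels = cluster_fn(expr[genes])      # binary sample partition
        new_genes = genes[select_genes(expr[genes], labels, threshold)]
        if np.array_equal(new_genes, genes):  # converged gene subset
            return genes, labels
        genes = new_genes
```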


MST clustering algorithms:

- Expression data clustering analysis (Xu et al., 2001)
- Iterative clustering algorithm (Varma et al., 2004)
- Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)


Dynamically growing self-organizing tree (DGSOT)

In the previous algorithms the MST is constructed on the original data and used to capture the intra-cluster property; here the MST is instead used as a criterion to test the inter-cluster property.


DGSOT algorithm

- A tree-structured self-organizing neural network
- Grows vertically and horizontally
- Starts with a single root-leaf node
- In every vertical growing phase, each leaf node whose heterogeneity exceeds a threshold T_R is given two descendants, and the learning process takes place


DGSOT algorithm: heterogeneity

Heterogeneity can be measured as:

- Variability: the maximum distance between a node's input data and the node
- Average distortion d_i of a leaf:

d_i = \frac{1}{D} \sum_{j=1}^{D} d(x_j, n_i)

where D is the total number of input data of leaf i, d(x_j, n_i) is the distance between data x_j and leaf i, and n_i is the reference vector of leaf i.
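Both heterogeneity measures reduce to one-liners, assuming x_leaf holds the D input vectors assigned to leaf i and n_i is its reference vector (names are illustrative):

```python
import numpy as np

def variability(x_leaf, n_i):
    """Maximum distance between the leaf's input data and the leaf."""
    return np.linalg.norm(x_leaf - n_i, axis=1).max()

def average_distortion(x_leaf, n_i):
    """d_i = (1/D) * sum_j d(x_j, n_i) over the leaf's D inputs."""
    return np.linalg.norm(x_leaf - n_i, axis=1).mean()
```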


DGSOT algorithm

In every horizontal growing phase, a child is added to every lowest non-leaf node until the validation criterion is satisfied, and the learning process takes place.

The learning process distributes the data to the leaves in the best way: the best matching node is the one with the minimum distance to the input data.


The validation criterion of DGSOT

- Calculated without human intervention
- Based on geometric characteristics of the clusters

Create the Voronoi diagram for the input data. The Voronoi diagram divides the data set D into n regions V(p):

V(p) = \{ x \in D \mid \mathrm{dist}(x, p) \le \mathrm{dist}(x, q) \ \forall q \}
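For point data the Voronoi regions reduce to nearest-centroid assignment; a small sketch (names are illustrative):

```python
import numpy as np

def voronoi_labels(data, centroids):
    """Assign each x in D to the region V(p) of its nearest centroid p,
    i.e. the p with dist(x, p) <= dist(x, q) for every centroid q."""
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)   # index of the owning centroid per point
```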


Let’s define a weighted, undirected graph G = (V, E). The vertex set V is the set of centroids of the Voronoi cells, and the edge set is defined as

E = \{ (p_i, p_j) \mid p_i, p_j \in V \text{ and } i \neq j \}

Create the MST for the graph G = (V, E).


Voronoi diagram of 2D dataset

In A, the dataset is partitioned into three Voronoi cells; the MST of the centroids is ‘even’.

In B, the dataset is partitioned into four Voronoi cells; the MST of the centroids is not ‘even’.


The validation criterion of DGSOT

Cluster separation:

CS = \frac{E_{\min}}{E_{\max}}

where E_{\min} is the minimum-length edge and E_{\max} the maximum-length edge of the MST.

A low CS value means that two centroids are too close to each other and the Voronoi partition is not valid, while a high CS value means that the Voronoi partition is valid.
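Combining the centroid MST with this ratio, a sketch of the criterion (the function name is illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def cluster_separation(centroids):
    """CS = E_min / E_max over the edges of the centroids' MST.
    CS near 1 means an 'even' MST (valid Voronoi partition); a low
    CS flags two centroids that are too close together."""
    mst = minimum_spanning_tree(squareform(pdist(centroids)))
    edges = mst.tocoo().data
    return edges.min() / edges.max()
```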


Example of DGSOT


Conclusions

The three algorithms presented in this report have provided results comparable to those obtained by classic clustering algorithms, without their drawbacks, and superior to those obtained by standard hierarchical clustering.


Questions