

The use of Minimum Spanning Trees in microarray expression data

Gkirtzou Ekaterini


Introduction

Classic clustering algorithms, like K-means and self-organizing maps, have certain drawbacks:

- No guarantee of globally optimal results
- Results depend on the geometric shape of cluster boundaries (K-means)


Introduction

MST clustering algorithms:

- Expression data clustering analysis (Xu et al., 2001)
- Iterative clustering algorithm (Varma et al., 2004)
- Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)


Definitions

A minimum spanning tree (MST) T of a weighted, undirected graph G = (V, E) with edge weights w(e) is an acyclic subgraph that contains all of the vertices and whose total weight

w(T) = \sum_{e \in T} w(e)

is minimum.
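As an illustration of the definition, here is a minimal sketch of Prim's algorithm for a complete graph given as a dense symmetric weight matrix (the function name and representation are choices made for this sketch, not part of the original slides):

```python
import numpy as np

def prim_mst(weights):
    """Prim's algorithm for a complete graph given as an n x n symmetric
    weight matrix. Returns the MST as a list of edges (u, v) whose total
    weight w(T) = sum of w(e) over e in T is minimum."""
    n = len(weights)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = weights[0].astype(float)   # cheapest known edge linking each v to the tree
    parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        # pick the cheapest vertex still outside the tree
        v = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[v]), v))
        in_tree[v] = True
        # relax: vertices that are now closer to the tree via v
        closer = weights[v] < best
        best[closer] = weights[v][closer]
        parent[closer] = v
    return edges
```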


Definitions

The DNA microarray technology enables the massively parallel measurement of the expression of thousands of genes simultaneously. It is useful to:

- compare the activity of genes in diseased and healthy cells
- categorize a disease into subgroups
- support drug discovery and toxicology studies


Definitions

Clustering is a common technique for data analysis. It partitions the data set into subsets (clusters) so that the data in each subset share some common trait.


MST clustering algorithms:

- Expression data clustering analysis (Xu et al., 2001)
- Iterative clustering algorithm (Varma et al., 2004)
- Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)


Expression data clustering analysis

Let D = \{d_i\} be a set of expression data, with each d_i = (e_i^1, \ldots, e_i^t) representing the expression levels at time 1 through time t of gene i. We define a weighted, undirected graph G = (V, E) as follows. The vertex set is V = \{d_i \mid d_i \in D\} and the edge set is E = \{(d_i, d_j) \mid d_i, d_j \in D \text{ and } i \neq j\}.


G is a complete graph. The weight of each edge is the distance between its two vertices, e.g. Euclidean distance or correlation coefficient.

Each cluster corresponds to one subtree of the MST, so no essential information is lost for clustering.
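A sketch of this construction, assuming the expression profiles sit in a NumPy array with one row per gene; SciPy's minimum_spanning_tree handles the MST step (the toy data and variable names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
expr = rng.random((50, 8))   # toy data: 50 genes, t = 8 time points

# Complete graph G: the weight of edge (d_i, d_j) is the distance
# between the two expression profiles (Euclidean distance here).
weights = squareform(pdist(expr, metric="euclidean"))

# MST of G as a sparse matrix: mst[i, j] > 0 marks a tree edge.
# (Note: scipy treats zero entries of a dense matrix as missing
# edges, so identical profiles would need a small offset.)
mst = minimum_spanning_tree(weights)
```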


Clustering through removing long MST-edges

- Based on the intuitive notion of a cluster
- Works very well when inter-cluster edges are longer than intra-cluster ones
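A minimal sketch of this edge-removal step, reusing the sparse MST from the previous snippet (the helper name is illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def cut_longest_edges(mst, k):
    """Remove the k-1 longest MST edges; the connected components
    that remain are the k clusters."""
    m = mst.tocoo()
    keep = np.argsort(m.data)[: len(m.data) - (k - 1)]  # drop the k-1 heaviest
    pruned = coo_matrix((m.data[keep], (m.row[keep], m.col[keep])),
                        shape=m.shape)
    _, labels = connected_components(pruned, directed=False)
    return labels  # labels[i] = cluster index of gene i
```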


An iterative clustering

Minimizes the distance between the center of a cluster and its data. Starts with K arbitrary clusters of the MST; for each pair of adjacent clusters, it finds the edge to cut that optimizes

\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, \mathrm{center}(T_i)) \qquad (1)
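A sketch of objective (1) as a scoring function; each candidate cut between a pair of adjacent clusters would be scored with it and the best-scoring cut kept (assumes the same array layout as the earlier snippets, with the cluster center taken to be the mean):

```python
import numpy as np

def objective_1(expr, labels):
    """Sum over clusters T_i of the distances from each profile d
    in T_i to center(T_i), taken here as the cluster mean."""
    total = 0.0
    for c in np.unique(labels):
        members = expr[labels == c]
        center = members.mean(axis=0)
        total += np.linalg.norm(members - center, axis=1).sum()
    return total
```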


A globally optimal clustering

Tries to partition the tree into K subtrees, selecting K representatives d_1, \ldots, d_K to optimize

\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, d_i) \qquad (2)
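Xu et al. optimize (2) exactly with dynamic programming on the tree; purely as an illustration of the objective, a brute-force version for small inputs can enumerate every choice of K-1 MST edges to remove (all names here are illustrative):

```python
import numpy as np
from itertools import combinations
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def best_k_partition(mst, dist, k):
    """Try every set of k-1 MST edges to remove; within each resulting
    subtree T_i the best representative d_i is the member minimizing
    the total distance to the other members (dist is the full pairwise
    distance matrix). Exponential cost: illustration only."""
    m = mst.tocoo()
    all_edges = np.arange(len(m.data))
    best_cost, best_labels = np.inf, None
    for cut in combinations(all_edges, k - 1):
        keep = np.setdiff1d(all_edges, cut)
        pruned = coo_matrix((m.data[keep], (m.row[keep], m.col[keep])),
                            shape=m.shape)
        _, labels = connected_components(pruned, directed=False)
        # for each subtree, the cheapest column sum picks the representative
        cost = sum(dist[np.ix_(idx, idx)].sum(axis=0).min()
                   for c in range(k)
                   for idx in [np.flatnonzero(labels == c)])
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels, best_cost
```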


MST clustering algorithms:

- Expression data clustering analysis (Xu et al., 2001)
- Iterative clustering algorithm (Varma et al., 2004)
- Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)


Iterative clustering algorithm

The clustering measure used here is the Fukuyama-Sugeno measure

FS(S) = \sum_{k=1}^{2} \sum_{j=1}^{N_k} \left( \lVert x_{kj} - \mu_k \rVert^2 - \lVert \mu_k - \mu \rVert^2 \right)

where S_1 and S_2 are the two partitions of the set S, each S_k contains N_k samples, \mu_k denotes the mean of the samples in S_k, \mu the global mean of all samples, and x_{kj} the j-th sample in cluster S_k.
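A direct transcription of the measure, assuming one row per sample and a hard labeling of the samples (names are illustrative):

```python
import numpy as np

def fukuyama_sugeno(x, labels):
    """FS(S) = sum_k sum_j ( ||x_kj - mu_k||^2 - ||mu_k - mu||^2 )
    for a hard partition; lower values indicate a tighter and
    better-separated partition."""
    mu = x.mean(axis=0)                     # global mean of all samples
    fs = 0.0
    for k in np.unique(labels):
        xk = x[labels == k]                 # samples of partition S_k
        mu_k = xk.mean(axis=0)              # partition mean
        fs += ((xk - mu_k) ** 2).sum()      # within-cluster scatter
        fs -= len(xk) * ((mu_k - mu) ** 2).sum()
    return fs
```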


Feature selection scores each gene's support for a partition. The feature selection used here is the t-statistic with pooled variance; the t-statistic is a heuristic measure. Genes whose absolute t-statistic exceeds a threshold are selected.
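A sketch of the pooled-variance t-statistic and the thresholding step, assuming expr holds one row per gene and one column per sample, with labels marking the two partitions of the samples (names are illustrative):

```python
import numpy as np

def pooled_t(x1, x2):
    """Two-sample t-statistic with pooled variance for one gene;
    x1, x2 hold the gene's values in the two partitions."""
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

def select_genes(expr, labels, threshold):
    """Indices of genes whose |t| exceeds the threshold."""
    t = np.array([pooled_t(g[labels == 0], g[labels == 1]) for g in expr])
    return np.flatnonzero(np.abs(t) > threshold)
```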


Create an MST from all genes. Delete edges from the MST to obtain binary partitions, and select the partition with the minimum F-S clustering measure. Feature selection is then used to pick out the subset of genes that discriminate between the clusters.


In the next iteration the clustering is done on this selected set of genes, and the process repeats until the selected gene subset converges. The converged genes are then removed from the pool and the search continues. A sketch of this outer loop follows below.
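A sketch of that outer loop under the same assumptions; cluster_fn is a hypothetical stand-in for the MST-plus-minimum-FS binary split, and select_genes comes from the earlier snippet:

```python
import numpy as np

def iterate_until_convergence(expr, threshold, cluster_fn):
    """Cluster the samples on the current gene subset, reselect genes
    by |t| > threshold, and repeat until the gene subset stops changing.
    Relies on select_genes() from the previous snippet."""
    genes = np.arange(expr.shape[0])          # start from all genes
    while True:
        labels = cluster_fn(expr[genes])      # binary sample partition
        new_genes = genes[select_genes(expr[genes], labels, threshold)]
        if np.array_equal(new_genes, genes):  # converged gene subset
            return genes, labels
        genes = new_genes
```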


MST clustering algorithms:

- Expression data clustering analysis (Xu et al., 2001)
- Iterative clustering algorithm (Varma et al., 2004)
- Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)


Dynamically growing self-organizing tree (DGSOT)

In the previous algorithms the MST is constructed on the original data and used to capture the intra-cluster property; here the MST is instead used as a criterion to test the inter-cluster property.


DGSOT algorithm

- A tree-structured self-organizing neural network
- Grows vertically and horizontally
- Starts with a single root-leaf node
- In every vertical growing phase, each leaf node whose heterogeneity exceeds a threshold T_R is given two descendants, and the learning process takes place


DGSOT algorithm: heterogeneity

Heterogeneity can be measured as:

- Variability: the maximum distance between a node's input data and the node
- Average distortion d_i of a leaf:

d_i = \frac{1}{D} \sum_{j=1}^{D} d(x_j, n_i)

where D is the total number of input data of leaf i, d(x_j, n_i) is the distance between data x_j and leaf i, and n_i is the reference vector of leaf i.
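Both heterogeneity measures reduce to one-liners, assuming x_leaf holds the D input vectors assigned to leaf i and n_i is its reference vector (names are illustrative):

```python
import numpy as np

def variability(x_leaf, n_i):
    """Maximum distance between the leaf's input data and the leaf."""
    return np.linalg.norm(x_leaf - n_i, axis=1).max()

def average_distortion(x_leaf, n_i):
    """d_i = (1/D) * sum_j d(x_j, n_i) over the leaf's D inputs."""
    return np.linalg.norm(x_leaf - n_i, axis=1).mean()
```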


DGSOT algorithm

In every horizontal growing phase, a child is added to every lowest non-leaf node until the validation criterion is satisfied, and the learning process takes place.

The learning process distributes the data to the leaves in the best way: the best matching node is the one with the minimum distance to the input data.


The validation criterion of DGSOT

- Calculated without human intervention
- Based on geometric characteristics of the clusters

Create the Voronoi diagram for the input data. The Voronoi diagram divides the data set D into n regions V(p):

V(p) = \{ x \in D \mid \mathrm{dist}(x, p) \le \mathrm{dist}(x, q) \ \forall q \}
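For point data the Voronoi regions reduce to nearest-centroid assignment; a small sketch (names are illustrative):

```python
import numpy as np

def voronoi_labels(data, centroids):
    """Assign each x in D to the region V(p) of its nearest centroid p,
    i.e. the p with dist(x, p) <= dist(x, q) for every centroid q."""
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)   # index of the owning centroid per point
```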


Let’s define a weighted, undirected graph G = (V, E). The vertex set V is the set of centroids of the Voronoi cells, and the edge set is defined as

E = \{ (p_i, p_j) \mid p_i, p_j \in V \text{ and } i \neq j \}

Create the MST for the graph G = (V, E).


Voronoi diagram of 2D dataset

In A, the dataset is partitioned into three Voronoi cells; the MST of the centroids is ‘even’.

In B, the dataset is partitioned into four Voronoi cells; the MST of the centroids is not ‘even’.


The validation criterion of DGSOT

Cluster separation:

CS = \frac{E_{\min}}{E_{\max}}

where E_{\min} is the minimum-length edge and E_{\max} the maximum-length edge of the MST.

A low CS value means that two centroids are too close to each other and the Voronoi partition is not valid, while a high CS value means that the Voronoi partition is valid.
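Combining the centroid MST with this ratio, a sketch of the criterion (the function name is illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def cluster_separation(centroids):
    """CS = E_min / E_max over the edges of the centroids' MST.
    CS near 1 means an 'even' MST (valid Voronoi partition); a low
    CS flags two centroids that are too close together."""
    mst = minimum_spanning_tree(squareform(pdist(centroids)))
    edges = mst.tocoo().data
    return edges.min() / edges.max()
```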


Example of DGSOT


Conclusions

The three algorithms presented in this report have provided results comparable to those obtained by classic clustering algorithms, without their drawbacks, and superior to those obtained by standard hierarchical clustering.


Questions