
Page 1:

Improved Bayesian segmentation with a novel application in genome biology

Petri Pehkonen, Kuopio University
Garry Wong, Kuopio University
Petri Törönen, HY, Institute of Biotechnology

Page 2:

Outline

• A little motivation
• Some heuristics used
• The proposed Bayes model
  – also presents a modified Dirichlet prior
• Proposed testing with artificial data
  – discusses the use of prior information in the evaluation
• A little analysis of real datasets

Page 3:

Biological problem setup

Input
• Genes and their associations with biological features like regulation, expression clusters, functions etc.

Assumption
• Neighbouring genes in the genome may share the same features

Aim
• Find chromosomal regions "over-related" to some biological feature or combination of features, i.e. look for non-random localization of features

I will mostly discuss the gene expression data application

Page 4:

A comparison with some earlier work with expression data

• Our aim is to analyze gene expression along the genome from a new perspective
  – standard: consider very local areas of ~constant expression levels
  – our view: how about looking at larger regions that have clearly more active genes (under certain conditions)?

Our perspective is related to the idea of active and passive regions of the genome

Page 5:

Further comparison with earlier work

• Standard: Up/Down/No-regulation classification, or a real value from each experiment, as the input vector for a gene
• Our idea: one can also associate genes to clusters in varying clustering solutions
  – a multinomial variable/vector for a single gene
  – by using a varying number of clusters one should obtain broader and narrower classes

This is related to the idea of combining weak coherent signals occurring in various measurements with clusters

Page 6:

Methodological problem setup

Gene participation in co-expression clusters
• Genes can be partitioned into separate clusters according to expression similarity: first 2 clusters, then 3, then 4, etc.
• The aim is to find chromosomal regions where consecutive genes fall in the same expression clusters across the different clustering results

6 gene expression clusters 0 0 1 5 2 3 6 5 0 3 3 3 4 0 0

5 gene expression clusters 0 0 5 4 5 2 1 2 0 4 4 4 4 0 0

4 gene expression clusters 0 0 3 3 4 3 3 3 0 2 2 1 2 0 0

3 gene expression clusters 0 0 3 3 3 3 3 3 0 1 1 1 1 1 0

2 gene expression clusters 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0

Gene order in chromosome 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Broader expression similarity

Specific expression similarity
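As a sketch of this setup, the multidimensional multinomial data above can be produced by clustering the same expression matrix at several granularities and stacking the labels in gene order. Everything here is illustrative: the expression values are random, and `kmeans_labels` is a minimal hand-rolled Lloyd's k-means, not the clustering actually used in the work.

```python
import numpy as np

def kmeans_labels(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each gene to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # update centers (keep the old center if a cluster goes empty)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(1)
expr = rng.normal(size=(15, 8))          # 15 genes x 8 experiments (toy data)
# one multinomial dimension per clustering granularity (k = 2..6),
# genes kept in chromosomal order, as in the slide's table
multinomial = np.stack([kmeans_labels(expr, k) for k in range(2, 7)])
print(multinomial.shape)                  # (dimensions, genes)
```

Each column of `multinomial` is then the multinomial vector of one gene across the clustering solutions.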

Page 7:

Existing segmentation algorithms

Non-heuristic:
• Dynamic programming

Heuristic:
• Hierarchical
  – top-down / bottom-up
  – recursive / iterative
• K-means-like solutions (EM methods)
• Sliding window with adaptive window size (?)
• etc.

Page 8:

Hierarchical vs. Non-hierarchical Heuristic methods

• Non-hierarchical heuristic methods usually produce only a single solution
  – compare k-means in clustering
  – these often require a parameter (the number of change-points)
  – they aim to create a (locally) optimal solution for that number of change-points
• Hierarchical heuristic methods produce a large group of solutions with a varying number of change-points
  – a large group of solutions can be created in one run
  – the solutions could usually be optimized further

Page 9:

Recursive vs. Iterative hierarchical heuristics

• Recursive hierarchical heuristics
  – split until some stopping rule (e.g. a BIC penalty) is fulfilled
  – each segment is split independently of the rest of the data
  – hard to obtain a solution for a varying number of change-points
  – designed to stop at an optimum (which can be a local optimum)
• Iterative (top-down?) hierarchical heuristics
  – split until a stopping rule or the maximum number of segments is fulfilled
  – each new change-point is placed after all segments are analyzed; the best change-point over all segments is selected
  – creates a chain of solutions with a varying number of segments
  – can be run past the (local) optimum to see if a better solution is found after a few bad results

Page 10:

Our choice for heuristic search

• Top-Down hierarchical segmentation

Page 11:

How to place a new change-point

The new change-point position is usually selected using a statistical measure:

• Optimization of the log likelihood ratio (ratio of ML-based solutions)
  – lighter to calculate
  – often referred to as the Jensen-Shannon divergence
• Optimization of our Bayes factor
  – Bayes model discussed later
  – the natural choice (as this is what we want to optimize)

The Bayes factor would seem natural, but in testing we noticed that we started splitting only the smallest segments. Why?

Page 12:

Bias in the Bayesian score

• The first figure represents random data (no preferred change-point position)
• The second figure represents the behaviour of the log likelihood ratio model (the ML method)
• The third figure represents the behaviour of the Bayes factor (BF)
• The highest point of each profile is taken as the change-point
• Notice the bias in BF that favours cutting near the ends
• Still, all BFs are negative (against splitting)
• This causes problems when we force the algorithm to go past the local optimum

=> We chose the ML score for the change-point search
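A minimal sketch of this ML change-point search for a one-dimensional class sequence: the score below is the log likelihood ratio of splitting versus not splitting, which for multinomial data equals a count-weighted Jensen-Shannon divergence between the two halves. The sequence and function names are illustrative, not from the original implementation.

```python
import numpy as np

def seg_loglik(counts):
    """Maximized multinomial log-likelihood of a segment from its class counts."""
    n = counts.sum()
    nz = counts[counts > 0]
    return float((nz * np.log(nz / n)).sum())

def best_change_point(seq, n_classes):
    """Return (position, score): the split maximizing the log likelihood ratio
    over the unsplit segment (a count-weighted JS divergence)."""
    best_pos, best_score = None, -np.inf
    total = np.bincount(seq, minlength=n_classes)
    base = seg_loglik(total)
    left = np.zeros(n_classes, dtype=int)
    for pos in range(1, len(seq)):
        left[seq[pos - 1]] += 1
        score = seg_loglik(left) + seg_loglik(total - left) - base
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos, best_score

# two clearly different regimes: mostly class 0, then mostly class 1
seq = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1])
pos, score = best_change_point(seq, n_classes=2)
print(pos, round(score, 3))
```

The chosen position falls at the boundary between the two regimes, where the two halves' class distributions differ most.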

Page 13:

What we have obtained so far…

• Top-down hierarchical heuristic segmentation
• An ML-based measure (JS divergence) used to select the next change-point

Page 14:

Selecting optimal solution from hierarchy

• A hierarchical segmentation of n-sized data contains n different nested solutions
• The solutions must be evaluated to find a proper one: not too general, not too complex
• We need model selection

Page 15:

Model selection used for segmentation models

Two ideas occurring in the most used methods:
• Evaluating "the fit" of the model (usually an ML score)
• Penalization for the parameters used in the model
  – segmentation model parameters: data classes in segments and the positioning of change-points

We used a few (ML-based) model selection methods:
• AIC
• BIC
• Modified BIC (designed for segmentation tasks)

We were not happy with their performance, therefore…

Page 16:

Our model selection criterion

• Bayesian approach => takes into account uncertainty and a priori information on parameters
• The change-point model M includes two varying parameter groups:
  A. Class proportions within segments
  B. Change-points (segment borders)
• The posterior probability of the model M fitted to data D is obtained by integrating over the A and B parameter spaces:

$P(M \mid D) \propto P(D \mid M) = \iint P(D \mid M, A, B)\, P(A)\, P(B)\, dA\, dB$

Page 17:

Our approximations/assumptions: A

• Clusters do not affect each other (independence)
• Data dimensions do not affect each other (independence)

These two allow simple multiplication.

• Segmentation does not directly affect the modelling of the data in the cluster

Only the model and the prior of A affect the likelihood of the data: P(D|M,A,B) = P(D|M,A)

Page 18:

Our model selection criterion

• Therefore a multinomial model with a Dirichlet prior can be used to calculate the integrated likelihood:

$\int P(D \mid M, A)\, P(A)\, dA = \prod_{v=1}^{V} \frac{\Gamma\!\left(\sum_{i=1}^{I} \alpha_{vi}\right)}{\Gamma\!\left(\sum_{i=1}^{I} (x_{vi} + \alpha_{vi})\right)} \prod_{i=1}^{I} \frac{\Gamma(x_{vi} + \alpha_{vi})}{\Gamma(\alpha_{vi})}$

The inner multiplication goes over the classes in one dimension; the outer multiplication goes over the dimensions.
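The integrated likelihood is most conveniently evaluated in log space with the log-gamma function. A sketch for one segment dimension, assuming the class counts and Dirichlet weights are given as plain lists (the function name is illustrative):

```python
from math import lgamma

def log_dirichlet_multinomial(counts, alphas):
    """Log of the integrated (marginal) likelihood of one segment dimension:
    multinomial counts x_i with a Dirichlet(alphas) prior, integrated over A."""
    a_sum, n = sum(alphas), sum(counts)
    out = lgamma(a_sum) - lgamma(a_sum + n)   # Gamma(sum a) / Gamma(sum (x + a))
    for x, a in zip(counts, alphas):
        out += lgamma(x + a) - lgamma(a)      # product of Gamma(x + a) / Gamma(a)
    return out

# one dimension with I = 3 classes and a flat prior (prior sum 3)
print(round(log_dirichlet_multinomial([4, 1, 0], [1.0, 1.0, 1.0]), 4))
```

Working in log space avoids the overflow that the gamma products would otherwise cause on long segments.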

Page 19:

Further assumptions/approximations: B

• We assume all the change-points exchangeable
  – the order of finding change-points does not matter for the solution
• We do not integrate over the parameter space B, but analyze only the MAP solution
  – we need a proper prior for B…

Page 20:

Our model evaluation criterion

• We select a flat prior for simplicity
  – this makes the MAP equal to the ML solution
• The prior of parameters B is 1 divided by the number of ways the current m change-point estimates can be positioned into data of size n:

$P(B) = \binom{n-1}{m}^{-1}$

Page 21:

Our model evaluation criterion

• The final form of our criterion is (without the log):

$P(D \mid M) = P(B_{MAP}) \int P(D \mid M, A)\, P(A)\, dA = \frac{1}{\binom{N-1}{m}} \prod_{c=1}^{C} \prod_{v=1}^{V} \frac{\Gamma\!\left(\sum_{i=1}^{I} \alpha_{cvi}\right)}{\Gamma\!\left(\sum_{i=1}^{I} (x_{cvi} + \alpha_{cvi})\right)} \prod_{i=1}^{I} \frac{\Gamma(x_{cvi} + \alpha_{cvi})}{\Gamma(\alpha_{cvi})}$

The first factor is the "flat" MAP estimate for parameters B; the rest is the posterior probability of parameter group A. The multiplication goes over the various clusters (c) and the various dimensions (v). Quite a simple equation.
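Putting the two parts together, a hedged sketch of the full criterion in log space: the flat change-point prior 1/C(n-1, m) times the product of Dirichlet-multinomial marginals over segments and dimensions. The data layout (`segments` as a list of segments, each holding per-dimension count vectors) is an assumption made for illustration.

```python
from math import comb, lgamma, log

def log_marginal(counts, alphas):
    """Log Dirichlet-multinomial marginal likelihood for one segment dimension."""
    a_sum, n = sum(alphas), sum(counts)
    out = lgamma(a_sum) - lgamma(a_sum + n)
    for x, a in zip(counts, alphas):
        out += lgamma(x + a) - lgamma(a)
    return out

def log_model_score(segments, alphas, n):
    """log P(D|M): flat change-point prior 1/C(n-1, m) plus the sum of
    log marginal likelihoods over segments (c) and dimensions (v)."""
    m = len(segments) - 1                 # number of change-points
    score = -log(comb(n - 1, m))          # log P(B)
    for seg in segments:
        for counts in seg:                # product over dimensions v
            score += log_marginal(counts, alphas)
    return score

# toy data: n = 10 points, one dimension, 2 classes; compare 1 vs 2 segments
alphas = [1.0, 1.0]
one_seg = [[[5, 5]]]
two_seg = [[[5, 0]], [[0, 5]]]            # change-point after point 5
print(log_model_score(one_seg, alphas, 10), log_model_score(two_seg, alphas, 10))
```

On this toy input the pure split wins even after paying the change-point prior, which is exactly the trade-off the criterion encodes.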

Page 22:

What about the Dirichlet prior weights?

A multinomial model requires prior parameters:

• Standard Dirichlet prior weights:
  I) all the prior weights the same (FLAT)
  II) prior probabilities equal to the class probabilities in the whole dataset (CSP)
• These require the definition of a prior sum (ps)
  – we used ps = 1 (FLAT1, CSP1) and ps = number of classes (FLAT, CSP) for both of the previous priors
• Empirical Bayes (?) prior (EBP): prior II with ps = sqrt(Nc) (Carlin, others; 'scales according to the std')

Page 23:

…Dirichlet prior weights

• We considered EBP reasonable, but…
• With small class proportions and small clusters EBP is problematic
  – the gamma function of the Dirichlet equation probably approaches infinity (as the prior approaches zero)
• Modified EBP (MEBP) mutes this behaviour: instead of

$\alpha_i = \sqrt{N_c}\; P(X = i)$

we use

$\alpha_i = \sqrt{N_c\, P(X = i)}$

• Now the prior weights approach 0 more slowly when the class proportion is small
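The difference between the EBP and MEBP weights is easy to see numerically; the class distribution below is made up to show the behaviour for a rare class, and both helper names are illustrative:

```python
import numpy as np

def ebp_weights(class_probs, n_c):
    """EBP: alpha_i = sqrt(N_c) * P(X = i)."""
    return np.sqrt(n_c) * np.asarray(class_probs)

def mebp_weights(class_probs, n_c):
    """MEBP: alpha_i = sqrt(N_c * P(X = i)) -- decays more slowly toward 0."""
    return np.sqrt(n_c * np.asarray(class_probs))

probs = np.array([0.90, 0.09, 0.01])   # a skewed class distribution
print(ebp_weights(probs, n_c=16))      # rare-class weight: 4 * 0.01 = 0.04
print(mebp_weights(probs, n_c=16))     # rare-class weight: sqrt(0.16) = 0.4
```

For the rare class the MEBP weight is an order of magnitude larger here, which is exactly the muting of the near-zero prior described above.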

Page 24:

…Dirichlet prior weights

• Also, the ps in MEBP now depends on the class distribution (a more even distribution => a bigger ps). Likewise, a larger number of classes => a bigger ps. Both of these sound natural…
• The prior weight can also be linked to the Chi-square test.

Page 25:

What we have obtained so far…

• Top-down hierarchical heuristic segmentation
• An ML-based measure (JS divergence) used to select the next change-point
• The results from the heuristic are analyzed using the proposed Bayes model
  – a flat prior using the number of potential solutions for a segmentation with the same m
  – the MEBP prior for the multinomial data

Page 26:

Evaluation

• Testing using artificial data
  – we can vary the number of clusters, the number of classes and the class distributions, and monitor the performance
• Do the hierarchical segmentation
• Select the best result with various methods
• Standard measure for evaluation: compare how well the obtained clusters correspond to the clusters used in the data generation
• But is good correlation/correspondence always what we want to see?

Page 27:

When correlation fails

• Many clusters/segments and few data points
• Consecutive small segments
• Similar neighboring segments
• One segment in the obtained segmentation (or in the data generation) => no correspondence

Problem: correlation does not account for Occam's razor

Page 28:

Our proposal

• Base the evaluation on the similarity of the statistical model used to generate each data point (DGM) vs. the data model obtained from the clustering for that data point (DEM)
  – resembles standard cross-validation
• Use a probability distribution distance measure to monitor how similar they are
  – one can think of this as an infinite-size test data set
• We only need to select the distance measure
• Extra plus: with hierarchical results we can look at the optimal result and see if a method overestimates or underestimates it

Page 29:

Probability distribution distance measures

• Kullback-Leibler divergence (the most natural):

$D_{KL}(X \| Y) = E_X[\log(X/Y)] = \sum_i P_X(i) \log \frac{P_X(i)}{P_Y(i)}$

• Inverse of the KL:

$D_{KL\_Inv}(X \| Y) = D_{KL}(Y \| X)$

• Jensen-Shannon divergence:

$D_{JS}(X \| Y) = D_{KL}(X \| (X+Y)/2) + D_{KL}(Y \| (X+Y)/2)$

• Other measures were also tested…

Here X is the DGM and Y is the DEM (obtained from the segments).
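These measures are straightforward to implement. A sketch using the slide's (unhalved) JS form, with 0*log(0) treated as 0; the DGM/DEM vectors below are made-up distributions:

```python
import numpy as np

def d_kl(p, q):
    """Kullback-Leibler divergence; terms with p(i) = 0 contribute 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / q[nz])).sum())

def d_js(p, q):
    """Jensen-Shannon divergence in the unhalved form used on this slide."""
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2
    return d_kl(p, m) + d_kl(q, m)

dgm = [0.7, 0.2, 0.1]        # data-generating model (toy)
dem = [0.5, 0.4, 0.1]        # model estimated from a segmentation (toy)
print(round(d_kl(dgm, dem), 4), round(d_js(dgm, dem), 4))
```

Note that `d_js` stays finite even when one distribution has zero-probability classes, since the mixture (X+Y)/2 is nonzero wherever either input is; this is the robustness discussed on the next slide.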

Page 30:

The Good, the Bad and…

• The DEM can have data points with P(X=i) = 0
  – these create an infinite score in DKL
  – this under-estimates the optimal model
• DKL_Inv was considered to correct this, but
  – P(X=i) = 0 now causes too many zero scores
    • x*log(x) when x => 0 was defined as 0
  – this over-estimates the model heavily
• DJS was selected as a compromise between these two phenomena

Page 31:

Do we want to use prior info?

• Standard cross-validation: the Bayes method uses prior information, ML does not
• Is this fair?
  – the same result with and without a prior gets a different score
  – the method with a prior usually gets better results
• Our (= my!) opinion: the evaluation should use the same amount of prior info for all the methods!
  – we would get the same score for the same result (independent of the method)
  – we would pick the model from the model group that usually performs better
• Selecting the prior for the evaluation process is now an open question!

Page 32:

Defending note

• The amount of prior only affects the results from one group of the artificial datasets analyzed (sparse signal / small clusters)
• These are the datasets where the Bayes methods behave differently
• Revelation from the results: the ML methods also perform worse in datasets where the prior has little effect
• => The use of a prior mainly matters for comparisons between our Bayes method's priors…

Page 33:

Rules for selecting the prior for model evaluation

• The obtained DEM should be as close to the DGM as possible (= more correct, smaller DJS)
• The prior used should be based on something other than our favourite MEBP
  – hoping we would not get good results with MEBP just because of the same prior
• Use as little prior as possible
  – we want the segment area to have as much effect as possible
• Better ideas?

Page 34:

Comparison of model evaluation priors

• Used small-cluster data with 10 and 30 classes (= the prior affects the results)
• Used CSP (class prior = class probability * ps), with ps = 1, 2, c/4, c/2, 3*c/4, c, 10*c (c = number of classes)
• Looked at the obtained DJS for various segmentation outcomes (from the hierarchical results) with 1 – n clusters (n = max(5, k), k = the artificial data cluster number)
• The analysis was done with artificial datasets

Page 35:

Comparison of model evaluation priors

[Figure: Jensen-Shannon divergence as a function of the prior sum. Left panel: data with 10 classes (prior sums 0, 1, 2, 2.5, 5, 7.5, 10, 100). Right panel: data with 30 classes (prior sums 0, 1, 2, 7.5, 15, 22.5, 30, 300).]

• ps = 1, 2, c/4, c/2, 3*c/4, c, 10*c
• The approximate minimum is at ps = number of classes

Page 36:

[Figure: the same Jensen-Shannon divergence vs. prior sum plots for the 10-class and 30-class datasets.]

Comparison of priors

• We did not look for the minimum, but wanted a compromise between the minimum and a weak prior effect:
• We chose ps = c/2
• The choice is quite arbitrary, but a quick analysis with neighbouring priors gave similar results

Page 37:

Proposed method + artificial data evaluation

• Top-down hierarchical heuristic segmentation with ML used to select the next change-point
• The results from the heuristic are analyzed using the proposed Bayes model
• Evaluation of the results using the artificial data
  – estimate how well the obtained model predicts future data sets
  – compare the models with DJS, which also uses prior information

Page 38:

More on evaluation

• Three data types (with a varying number of classes):
  i) several (1 – 10) large segments (each 30 – 300 data points)
    • this should be ~easy to analyze
  ii) few (1 – 4) large segments (30 – 300 data points)
    • this should have a less reliable prior class distribution
  iii) several (1 – 10) small segments (15 – 60 data points)
    • the most difficult to analyze
    • the prior affects these results
• Number of classes used in each: 2, 10, 30
  – data sparseness increases with an increasing number of classes
• The data classes were made skewed

Page 39:

…evaluation…

• Data segmented by top-down: 1 – 100 segments
• Model selection methods used to pick the optimal segmentation
  – ML methods: AIC, BIC, modified BIC
  – Bayes method with Dirichlet priors: FLAT1, FLAT, CSP1, CSP, EBP, MEBP
• Each test was replicated 100 times
• DJS was calculated between the DGM and the obtained DEM

Page 40:

…still evaluating

• As mentioned: the smaller the JS distance between the DGM and the DEM, the better the model selection method
• For simplicity we subtracted the JS distances obtained with our own Bayesian method from the distances obtained with the other methods
• We took the average of these differences over the 100 replicates
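The differencing and the Z-score used on the next slide can be sketched as follows; the replicate scores are hypothetical numbers, not the paper's results:

```python
import math

def compare_to_reference(method_djs, reference_djs):
    """Average difference and Z-score over replicates; positive values mean
    the reference method (here, our Bayes model) obtained smaller distances."""
    diffs = [a - b for a, b in zip(method_djs, reference_djs)]
    n = len(diffs)
    mean = sum(diffs) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return mean, mean / std * math.sqrt(n)   # Z = mean(diff)/std(diff)*sqrt(n)

# toy JS distances over 6 replicates (hypothetical, not the paper's data)
other = [0.50, 0.62, 0.55, 0.58, 0.61, 0.57]
ours  = [0.45, 0.55, 0.50, 0.52, 0.54, 0.51]
mean_diff, z = compare_to_reference(other, ours)
print(round(mean_diff, 3), round(z, 2))
```

In the paper n = 100 replicates are used, so the Z-score formula becomes mean(diff)/std(diff)*sqrt(100).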

Page 41:

Z-scores

Data     AIC    BIC    BIC2   CSP    EBP    CSP1   Flat   Flat1
i. 2     16.8   0.6    -0.6   -1.4   -2.2   -1.2   -1.8   -1.5
i. 10    1.6    7.0    3.8    1.6    1.0    3.5    1.6    3.2
i. 30    4.1    13.5   10.3   1.9    1.7    4.3    2.0    4.5
ii. 2    8.7    0.9    2.4    1.6    -1.3   2.2    1.6    2.5
ii. 10   0.4    7.4    2.5    4.0    0.8    3.0    0.5    2.0
ii. 30   1.5    15.4   14.7   8.2    2.2    7.4    -1.4   5.3
iii. 2   7.0    2.8    1.3    -1.3   -0.7   1.0    1.2    1.0
iii. 10  1.9    13.8   8.1    1.6    2.4    4.6    4.2    5.6
iii. 30  11.9   13.9   13.9   5.0    4.9    8.7    5.5    9.7
Average  0.60   0.84   0.63   0.24   0.10   0.37   0.15   0.36

Averages

Data     AIC    BIC    BIC2   CSP    EBP    CSP1   Flat   Flat1
i. 2     5.65   0.06   -0.05  -0.09  -0.06  -0.07  -0.12  -0.09
i. 10    0.16   4.89   0.55   0.04   0.02   0.45   0.08   0.39
i. 30    0.78   58.12  17.48  0.24   0.08   5.74   0.23   4.28
ii. 2    1.42   0.01   0.08   0.03   -0.03  0.07   0.05   0.07
ii. 10   0.03   3.01   0.25   0.47   0.05   0.32   0.02   0.18
ii. 30   0.22   12.64  12.19  1.66   0.30   4.15   -0.15  3.47
iii. 2   1.13   0.27   0.11   -0.08  -0.03  0.09   0.11   0.09
iii. 10  0.15   13.61  3.67   0.19   0.21   1.90   0.65   1.47
iii. 30  5.82   13.88  13.88  0.59   0.51   8.93   1.72   10.70
Average  1.70   11.83  5.35   0.34   0.12   2.40   0.29   2.29

The upper box shows the Z-scores (mean(diff)/std(diff)*sqrt(100)); the lower box shows the average differences.
Shaded Z-scores: x > 3, strong support in favour of our method.
Underlined Z-scores: x < 0, any result against our method.

Summary: AIC is bad on two classes (overestimates); BIC (and Mod-BIC) are bad on 10 and 30 classes (underestimate); Flat1 and CSP1 are weak on 10 and 30 classes (overestimate).

Page 42:

Large segments: detailed view

• Rows show the D results for datasets with 2, 10 and 30 classes
• D is relative to the segmentation selected by the Bayes model with MEBP
• Positive results => BM with MEBP outperforms; negative results => the method in question outperforms BM with MEBP
• Column 1: mainly worse methods; column 2: mainly better methods

These results did not depend on the DJS prior

Page 43:

Large segments in small data

• This is data where the prior information is less reliable (a smaller dataset)
• The flat prior outperforms our prior in the 30-class dataset

Page 44:

Small segments

• The hardest data to model
• This is data where the prior affects the evaluation significantly
• Without a prior, the BIC methods give the best result (= 1 segment is considered best)

Page 45:

Summary from the artificial data

• MEBP had the overall better result in 23/24 pairwise comparisons with the 30-class datasets (in 18/24 the Z-score > 3)
• MEBP had the better overall result in all pairwise comparisons with the 10-class datasets (in 12/24 the Z-score > 3)
• Our method was slightly outperformed by the other Bayes methods in dataset i with 2 classes. Also, EBP slightly outperforms it with every 2-class dataset
  – EBP might be better for smaller class numbers
  – MEBP underestimates the optimum here
• The ML methods and the priors with ps = 1 (Flat1, CSP1) had the weakest performance

Page 46:

Analysis of real biological data

• Yeast cell cycle time series gene expression data
• Genes were clustered with k-means into 3, 4, 5 and 6 groups
• The order of genes in the chromosomes and the gene associations with the expression clusters were turned into multidimensional multinomial data
• The aim was to locate regional similarities in gene expression over the yeast cell cycle

Page 47:

Anything in real data?

• Each chromosome was segmented
• The segmentation score of each chromosome was compared to the score from randomized data (100 randomizations)
• Goodness: (x - mean(rand)) / std(rand)

CHR   Rand. mean   Rand. std   log(P(M|D))   Goodness
1     -726.39      3.86        -711.47       3.87
2     -2783.24     5.17        -2759.31      4.62
3     -1134.89     6.65        -1103.91      4.66
4     -5331.72     8.80        -5160.64      19.44
5     -1899.52     3.62        -1889.82      2.68
6     -792.07      4.90        -752.02       8.17
7     -3548.24     6.34        -3523.82      3.85
8     -1982.86     2.46        -1969.82      5.31
9     -1502.43     6.71        -1492.22      1.52
10    -2589.06     3.36        -2543.79      13.48
11    -2185.09     9.37        -2167.20      1.91
12    -3693.34     4.60        -3658.42      7.58
13    -3176.61     5.06        -3166.51      2.00
14    -2641.54     6.02        -2612.29      4.86
15    -3719.47     6.80        -3693.68      3.79
16    -3157.52     3.77        -3150.92      1.75
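The goodness score is a plain standardization against the randomization distribution. A sketch; the randomized chromosome scores below are hypothetical, only the observed score -711.47 (chromosome 1) comes from the table:

```python
import statistics

def goodness(score, random_scores):
    """Standardized segmentation score: (x - mean(rand)) / std(rand)."""
    return (score - statistics.mean(random_scores)) / statistics.stdev(random_scores)

# hypothetical randomization scores for one chromosome (100 were used in the paper)
rand = [-728.1, -725.0, -730.2, -722.9, -726.6, -724.5, -727.4, -729.0]
print(round(goodness(-711.47, rand), 2))
```

A large positive goodness means the real chromosome's segmentation score lies far above what random gene orders produce.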

Page 48:

Conclusions

• Showed a Bayes model that overall outperforms the ML-based methods
• Proposed a modified prior that performs better than the other tested priors with datasets having many classes
• Proposed a way of testing the various methods
  – avoids picking too detailed models
  – the use of a prior can be considered a drawback
• Showed the preference for the ML score when segmenting data with very weak signals
• The real data has a localized signal

Page 49:

Future points

• Improve the heuristic (optimize the results)
• Use of fuzzy vs. hard cluster classifications
• Various other potential applications (no certainty of their rationality yet…)
• Should clusters be merged? (Work done in HIIT, Mannila's group)
• Consider sound ways of setting the prior for the DJS calculation
• Length of the gene, density of genes?

Page 50:

Thank you!

=Wake up!