
Improved Bayesian segmentation with a novel application in genome biology

Petri Pehkonen, Kuopio University
Garry Wong, Kuopio University
Petri Törönen, HY, Institute of Biotechnology

Outline

• a little motivation
• some heuristics used
• the proposed Bayes model
  – also presents a modified Dirichlet prior
• the proposed testing with artificial data
  – discusses the use of prior information in the evaluation
• a little analysis of real datasets

Biological problem setup

Input
• Genes and their associations with biological features like regulation, expression clusters, functions etc.

Assumption
• Neighbouring genes in the genome may share the same features

Aim
• Find the chromosomal regions "over-related" to some biological feature or combination of features, i.e. look for non-random localization of features

I will discuss the gene expression data application in more detail.

A comparison with some earlier work with expression data

• Our aim is to analyze gene expression along the genome from a new perspective
  – standard: consider very local areas of ~constant expression levels
  – our view: how about looking at larger regions that have clearly more active genes (under certain conditions)?

Our perspective is related to the idea of active and passive regions of the genome.

Further comparison with earlier work

• Standard: up/down/no-regulation classification, or a real value from each experiment, as the input vector for a gene
• Our idea: one can also associate genes to clusters in varying clustering solutions
  – a multinomial variable/vector for a single gene
  – by using a varying number of clusters one should obtain broader and narrower classes

This is related to the idea of combining weak coherent signals occurring in various measurements with clusters.

Methodological problem setup

Gene participation in co-expression clusters
• Genes can be partitioned into separate clusters according to expression similarity: first 2 clusters, then 3, then 4 etc.
• The aim is to find chromosomal regions where consecutive genes are in the same expression clusters across the different clustering results.

6 gene expression clusters 0 0 1 5 2 3 6 5 0 3 3 3 4 0 0

5 gene expression clusters 0 0 5 4 5 2 1 2 0 4 4 4 4 0 0

4 gene expression clusters 0 0 3 3 4 3 3 3 0 2 2 1 2 0 0

3 gene expression clusters 0 0 3 3 3 3 3 3 0 1 1 1 1 1 0

2 gene expression clusters 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0

Gene order in chromosome 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Broader expression similarity

Specific expression similarity
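Below is a minimal sketch (not the authors' code) of how such a matrix could be produced: cluster the same expression data with k-means at several values of k, and stack the label vectors in chromosomal gene order. The names `expr` and `cluster_labels` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_labels(expr, ks=(2, 3, 4, 5, 6), seed=0):
    """expr: (n_genes, n_experiments) expression matrix, rows in
    chromosomal order. Returns an (n_genes, len(ks)) integer matrix:
    one multinomial dimension per clustering solution, as in the
    example above (label 0 could be reserved for missing genes)."""
    labels = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(expr)
        labels.append(km.labels_ + 1)  # shift to 1-based labels
    return np.column_stack(labels)
```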

Existing segmentation algorithms

Non-heuristic:
• Dynamic programming

Heuristic:
• Hierarchical
  – top-down / bottom-up
  – recursive / iterative
• K-means-like solutions (EM methods)
• Sliding window with adaptive window size (?)
• etc.

Hierarchical vs. non-hierarchical heuristic methods

• Non-hierarchical heuristic methods usually produce only a single solution
  – compare k-means in clustering
  – they often require a parameter (the number of change-points)
  – they aim to create a (locally) optimal solution for that number of change-points
• Hierarchical heuristic methods produce a large group of solutions with a varying number of change-points
  – a large group of solutions can be created in one run
  – the solutions could usually be optimized further

Recursive vs. Iterative hierarchical heuristics

• Recursive hierarchical heuristics
  – slice until some stopping rule (e.g. a BIC penalty) is fulfilled
  – each segment is sliced independently of the rest of the data
  – hard to obtain solutions for a varying number of change-points
  – designed to stop at an optimum (which can be a local optimum)
• Iterative (top-down) hierarchical heuristics
  – slice until a stopping rule or a maximum number of segments is fulfilled
  – each new change-point is placed only after all segments are analyzed; the best change-point over all segments is selected
  – creates a chain of solutions with a varying number of segments
  – can be run past the (local) optimum to see if a better solution appears after a few bad steps

Our choice for heuristic search

• Top-Down hierarchical segmentation

How to place a new change-point

The new change-point position is usually selected using a statistical measure:

• Optimization of the log likelihood ratio (a ratio of ML-based solutions)
  – lighter to calculate
  – often referred to as the Jensen-Shannon divergence
• Optimization of our Bayes factor
  – the Bayes model is discussed later
  – the natural choice (as this is what we want to optimize)

The Bayes factor would seem natural, but in testing we noticed that we started splitting only the smallest segments. Why?

Bias in the Bayesian score
• The first figure shows random data (no preferred change-point position)
• The second figure shows the behaviour of the log likelihood ratio model (the ML method)
• The third figure shows the behaviour of the Bayes factor (BF)
• The highest point of each profile is taken as the change-point
• Notice the bias in BF that favours cutting near the ends
• Still, all the BFs are negative (against splitting)
• This causes problems when we force the algorithm to go past the local optimum

=> We chose the ML score for the change-point search
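As an illustration, here is a hedged sketch of this ML-based change-point placement: for multinomial data the log likelihood ratio of a two-segment vs. one-segment model reduces to an entropy difference (the weighted Jensen-Shannon form mentioned above), summed over dimensions. Function names are hypothetical, not from the paper.

```python
import numpy as np

def entropy(counts):
    """Entropy (in nats) of the ML class distribution given counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def best_split(segment):
    """segment: (length, n_dims) integer class labels.
    Returns (position, score) maximizing the log likelihood ratio
    n*H(whole) - t*H(left) - (n-t)*H(right), summed over dimensions."""
    n, dims = segment.shape
    best_t, best_score = None, -np.inf
    for t in range(1, n):
        score = 0.0
        for d in range(dims):
            whole = np.bincount(segment[:, d])
            left = np.bincount(segment[:t, d], minlength=len(whole))
            right = np.bincount(segment[t:, d], minlength=len(whole))
            score += (n * entropy(whole)
                      - t * entropy(left) - (n - t) * entropy(right))
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

The top-down heuristic would then repeatedly apply `best_split` to the segment whose best split scores highest.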

What we have obtained so far…

• Top-Down hierarchical heuristic segmentation

• ML-based measure (JS divergence) used to select the next change-point

Selecting the optimal solution from the hierarchy

• Hierarchical segmentation of data of size n contains n different nested solutions
• The solutions must be evaluated in order to find a proper one: not too general, not too complex
• We need model selection

Model selection used for segmentation models

Two ideas occur in the most used methods:
• Evaluating "the fit" of the model (usually the ML score)
• Penalizing for the parameters used in the model
  – segmentation model parameters: the data classes in segments and the positioning of change-points

We used a few (ML-based) model selection methods:
• AIC
• BIC
• Modified BIC (designed for segmentation tasks)

We were not happy with their performance, therefore…

Our model selection criterion
• Bayesian approach => takes into account uncertainty and a priori information on parameters
• The change-point model M includes two varying parameter groups:
  A. class proportions within segments
  B. change-points (segment borders)
• The posterior probability for the model M fitted to data D is obtained by integrating over the A and B parameter spaces:

$$P(M \mid D) \propto P(D \mid M)\,P(M) = P(M) \iint P(D \mid M, A, B)\,P(A)\,P(B)\,dA\,dB$$

Our approximations/assumptions: A

• Clusters do not affect each other (independence)
• Data dimensions do not affect each other (independence)
  – these two allow simple multiplication
• Segmentation does not directly affect the modelling of the data within a cluster

Only the model and the parameters A affect the likelihood of the data: P(D|M,A,B) = P(D|M,A)

Our model selection criterion

• Therefore a multinomial model with a Dirichlet prior can be used to calculate the integrated likelihood:

$$\int P(D \mid M, A)\,P(A)\,dA \;=\; \prod_{v=1}^{V} \frac{\Gamma\!\big(\sum_{i=1}^{I} \alpha_{vi}\big)}{\Gamma\!\big(\sum_{i=1}^{I} (x_{vi} + \alpha_{vi})\big)} \prod_{i=1}^{I} \frac{\Gamma(x_{vi} + \alpha_{vi})}{\Gamma(\alpha_{vi})}$$

(the inner product is the multiplication over the classes in one dimension; the outer product is the multiplication over the dimensions)
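A minimal sketch of this integrated likelihood in log space (an assumed implementation using the standard Dirichlet-multinomial marginal; `gammaln` avoids overflow of the gamma functions):

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(counts, alpha):
    """counts, alpha: (V, I) arrays of class counts x_vi and Dirichlet
    prior weights alpha_vi for one segment. Returns the log of the
    integral shown above: a product over dimensions v of
    Gamma(sum_i a_vi) / Gamma(sum_i (x_vi + a_vi))
    * prod_i Gamma(x_vi + a_vi) / Gamma(a_vi)."""
    a_sum = alpha.sum(axis=1)   # sum of prior weights per dimension
    n_v = counts.sum(axis=1)    # data points per dimension
    return np.sum(gammaln(a_sum) - gammaln(a_sum + n_v)
                  + np.sum(gammaln(counts + alpha) - gammaln(alpha), axis=1))
```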

• We assume all the change-points exchangeable
  – the order in which change-points are found does not matter for the solution
• We do not integrate over the parameter space B, but analyze only the MAP solution
  – we need a proper prior for B…

Further assumptions/approximations: B

• We select a flat prior for simplicity
  – this makes the MAP equal to the ML solution
• The prior of parameters B is 1 divided by the number of ways the current m change-point estimates can be positioned in data of size n:

$$P(B) = \binom{n-1}{m}^{-1}$$

Our model evaluation criterion

• The final form of our criterion is (without the log):

$$P(D \mid M) = P(B_{MAP}) \int P(D \mid M, A)\,P(A)\,dA = \binom{N-1}{m}^{-1} \prod_{c=1}^{C} \prod_{v=1}^{V} \frac{\Gamma\!\big(\sum_{i=1}^{I} \alpha_{cvi}\big)}{\Gamma\!\big(\sum_{i=1}^{I} (x_{cvi} + \alpha_{cvi})\big)} \prod_{i=1}^{I} \frac{\Gamma(x_{cvi} + \alpha_{cvi})}{\Gamma(\alpha_{cvi})}$$

The first factor is the "flat" MAP estimate for parameters B; the remaining product is the posterior probability of parameter group A. The multiplication goes over the various clusters (c) and the various dimensions (v). Quite a simple equation.
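Assembled into code, the criterion is the change-point prior times the per-segment marginals; a sketch reusing `log_marginal` from above (`log_choose` is a hypothetical helper):

```python
from scipy.special import gammaln

def log_choose(n, k):
    """log of the binomial coefficient C(n, k)."""
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def log_model_score(segment_counts, segment_alphas, n_total):
    """segment_counts: list of (V, I) count arrays, one per segment c;
    segment_alphas: matching Dirichlet weights; n_total: data size N.
    Returns log P(D|M) = log P(B_MAP) + sum_c log integral."""
    m = len(segment_counts) - 1                 # number of change-points
    log_prior_b = -log_choose(n_total - 1, m)   # flat prior over positions
    return log_prior_b + sum(log_marginal(x, a)
                             for x, a in zip(segment_counts, segment_alphas))
```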

What about the Dirichlet prior weights?

A multinomial model requires prior parameters:
• Standard Dirichlet prior weights:
  I) all the prior weights the same (FLAT)
  II) prior probabilities equal to the class probabilities in the whole dataset (CSP)
• These require the definition of a prior sum (ps)
  – we used ps = 1 (CSP1, FLAT1) and ps = number of classes (CSP, FLAT) for both of the previous priors
• Empirical Bayes (?) prior (EBP): prior II with ps = sqrt(Nc) (Carlin and others; 'scales according to the std')

• We considered EBP reasonable, but…
• With small class proportions and small clusters EBP is problematic
  – the gamma function in the Dirichlet equation probably approaches infinity (as the prior weight approaches zero)
• Modified EBP (MEBP) mutes this behaviour: instead of

$$\alpha_i = \sqrt{N_c}\,P(X = i)$$

we use

$$\alpha_i = \sqrt{N_c\,P(X = i)}$$

• Now the prior weights approach 0 more slowly when the class proportion is small
• Also, the ps in MEBP now depends on the class distribution (a more even distribution => bigger ps), and a larger number of classes => bigger ps; both of these sound natural…
• The prior weight can also be linked to the Chi-square test

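A sketch of the prior-weight variants discussed above, assuming the formulas as read from the slides (here `class_probs` is the global class distribution and `n_c` the count that the prior sum scales with):

```python
import numpy as np

def prior_weights(class_probs, n_c, kind="MEBP"):
    """EBP: alpha_i = sqrt(n_c) * P(X=i); MEBP moves the square root
    over the product, so weights for rare classes shrink more slowly."""
    if kind == "EBP":
        return np.sqrt(n_c) * class_probs
    if kind == "MEBP":
        return np.sqrt(n_c * class_probs)
    raise ValueError(f"unknown prior kind: {kind}")
```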

What we have obtained so far…

• Top-Down hierarchical heuristic segmentation
• ML-based measure (JS divergence) used to select the next change-point
• The results from the heuristic are analyzed using the proposed Bayes model
  – a flat prior using the number of potential segmentations with the same m
  – the MEBP prior for the multinomial data

Evaluation

• Testing using artificial data
  – we can vary the number of clusters, the number of classes and the class distributions, and monitor the performance
• Do the hierarchical segmentation
• Select the best result with the various methods
• Standard measure for evaluation: compare how well the obtained clusters correspond to the clusters used in the data generation
• But is a good correlation/correspondence always what we want to see?

When correlation fails

• Many clusters/segments and few data points
• Consecutive small segments
• Similar neighbouring segments
• One segment in the obtained segmentation (or in the data generation) => no correspondence

Problem: correlation does not account for Occam's razor.

Our proposal
• Base the evaluation on the similarity between the statistical model used to generate each data point (DGM) and the data model obtained from the segmentation for that data point (DEM)
  – resembles standard cross-validation
• Use a probability distribution distance measure to monitor how similar they are
  – one can think of this as an infinite-size test data set
• We only need to select the distance measure
• An extra plus: with hierarchical results we can look at the optimal result and see whether a method overestimates or underestimates it

Probability distribution distance measures

• Kullback-Leibler divergence (most natural):

$$D_{KL}(X \| Y) = E_X[\log(X/Y)] = \sum_i P(X = i)\,\log\frac{P(X = i)}{P(Y = i)}$$

• Inverse of the KL:

$$D_{KL\_Inv}(X \| Y) = D_{KL}(Y \| X)$$

• Jensen-Shannon divergence:

$$D_{JS}(X \| Y) = D_{KL}\!\big(X \,\|\, (X+Y)/2\big) + D_{KL}\!\big(Y \,\|\, (X+Y)/2\big)$$

• Other measures were also tested…

Here X is the DGM and Y is the DEM (obtained from the segments).
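The three measures, as a direct sketch of the formulas above (X is the DGM, Y the DEM; zero DEM probabilities making D_KL infinite is exactly the failure mode discussed next):

```python
import numpy as np

def d_kl(x, y):
    """Kullback-Leibler divergence between distributions x and y;
    returns inf if y has zero mass where x does not."""
    mask = x > 0
    return np.sum(x[mask] * np.log(x[mask] / y[mask]))

def d_kl_inv(x, y):
    return d_kl(y, x)

def d_js(x, y):
    """Jensen-Shannon divergence as defined on this slide
    (without the usual 1/2 factor)."""
    m = (x + y) / 2
    return d_kl(x, m) + d_kl(y, m)
```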

The Good, the Bad and…

• The DEM can have data points with P(X = i) = 0
  – these create an infinite score in D_KL
  – it under-estimates the optimal model
• D_KL_Inv was considered to correct this, but
  – P(X = i) = 0 now causes too many zero scores (x·log(x) was defined as 0 when x → 0)
  – it over-estimates the model heavily
• D_JS was selected as a compromise between these two phenomena

Do we want to use prior info?
• Standard cross-validation: the Bayes method uses prior information, ML does not
• Is this fair?
  – the same result with and without a prior gets a different score
  – a method with a prior usually gets better results
• Our (= my!) opinion: the evaluation should use the same amount of prior info for all the methods!
  – we would get the same score for the same result (independent of the method)
  – we would pick the model from the model group that usually performs better
• Selecting the prior for the evaluation process is now an open question!

Defending note

• The amount of prior only affects the results from one group of the analyzed artificial datasets (sparse signal / small clusters)
• These are the datasets where the Bayes methods behave differently
• Revelation from the results: the ML methods also perform worse on datasets where the prior has little effect
• => The use of a prior mainly matters for the comparisons among our Bayes method priors…

Rules for selecting the prior for model evaluation

• The obtained DEM should be as close to the DGM as possible (= more correct, smaller D_JS)
• The prior used should be based on something other than our favourite MEBP
  – hoping we would not get good results with MEBP just because of the same prior
• Use as little prior as possible
  – we want the segment area to have as much effect as possible
• Better ideas?

Comparison of model evaluation priors

• Used small-cluster data with 10 and 30 classes (= the prior affects the results)
• Used CSP (class prior = class probability × ps), with ps = 1, 2, c/4, c/2, 3c/4, c, 10c (c = number of classes)
• Looked at the obtained D_JS for various segmentation outcomes (from the hierarchical results) with 1 – n clusters (n = max(5, k), k = the artificial-data cluster number)
• The analysis was done with artificial datasets

[Figure: Jensen-Shannon divergence as a function of the prior sum; left panel: data with 10 classes (ps up to 100), right panel: data with 30 classes (ps up to 300).]

Comparison of model evaluation priors
• ps = 1, 2, c/4, c/2, 3c/4, c, 10c
• The approximate minimum is at ps = number of classes

0 1 2 2.5 5 7.5 10 1000

10

20

30

40

50

Jen

sen

-Sh

an

no

n d

ive

rge

nce

Prior sum

Data with 10 classes

0 1 2 7.5 15 22.5 30 3000

20

40

60

80

Prior sum

Data with 30 classes

Comparison of priors
• We did not look for the minimum, but wanted a compromise between the minimum and a weak prior effect:
• We chose ps = c/2
• The choice is quite arbitrary, but a quick analysis with neighbouring priors gave similar results

Proposed method + artificial data evaluation

• Top-Down hierarchical heuristic segmentation, with the ML score used to select the next change-point
• The results from the heuristic are analyzed using the proposed Bayes model
• Evaluation of the results using the artificial data
  – estimate how well the obtained model predicts future data sets
  – compare the models with D_JS, which also uses prior information

More on evaluation
• Three data types (with varying numbers of classes):
  i) several (1 – 10) large segments (each 30 – 300 data points)
     • this should be ~easy to analyze
  ii) few (1 – 4) large segments (30 – 300 data points)
     • this should have a less reliable prior class distribution
  iii) several (1 – 10) small segments (15 – 60 data points)
     • the most difficult to analyze
     • the prior affects these results
• Number of classes used in each: 2, 10, 30
  – data sparseness increases with an increasing number of classes
• The data classes were made skewed

…evaluation…

• Data segmented by Top-Down: 1 – 100 segments
• Model selection methods used to pick the optimal segmentation
  – ML methods: AIC, BIC, modified BIC
  – Bayes method with Dirichlet priors: FLAT1, FLAT, CSP1, CSP, EBP, MEBP
• Each test was replicated 100 times
• D_JS was calculated between the DGM and the obtained DEM

…still evaluating
• As mentioned: the smaller the JS distance between the DGM and the DEM, the better the model selection method
• For simplification we subtracted the JS distances obtained with our own Bayesian method from the distances obtained with the other methods
• We took the average of these differences over the 100 replicates
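A sketch of this comparison statistic (hypothetical helper; the z-score formula mean(diff)/std(diff)*sqrt(100) is the one reported under the table below):

```python
import numpy as np

def compare_to_ours(djs_other, djs_ours):
    """Per-replicate JS distances of another method vs. our Bayes
    method. Positive values favour our method."""
    diff = np.asarray(djs_other) - np.asarray(djs_ours)
    z = diff.mean() / diff.std(ddof=1) * np.sqrt(len(diff))
    return diff.mean(), z
```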

Data        AIC     BIC    BIC2    CSP    EBP    CSP1   Flat   Flat1

Z-scores
i.   2     16.8     0.6    -0.6   -1.4   -2.2   -1.2   -1.8   -1.5
i.  10      1.6     7.0     3.8    1.6    1.0    3.5    1.6    3.2
i.  30      4.1    13.5    10.3    1.9    1.7    4.3    2.0    4.5
ii.  2      8.7     0.9     2.4    1.6   -1.3    2.2    1.6    2.5
ii. 10      0.4     7.4     2.5    4.0    0.8    3.0    0.5    2.0
ii. 30      1.5    15.4    14.7    8.2    2.2    7.4   -1.4    5.3
iii. 2      7.0     2.8     1.3   -1.3   -0.7    1.0    1.2    1.0
iii.10      1.9    13.8     8.1    1.6    2.4    4.6    4.2    5.6
iii.30     11.9    13.9    13.9    5.0    4.9    8.7    5.5    9.7
Average     0.60    0.84    0.63   0.24   0.10   0.37   0.15   0.36

Averages
i.   2      5.65    0.06   -0.05  -0.09  -0.06  -0.07  -0.12  -0.09
i.  10      0.16    4.89    0.55   0.04   0.02   0.45   0.08   0.39
i.  30      0.78   58.12   17.48   0.24   0.08   5.74   0.23   4.28
ii.  2      1.42    0.01    0.08   0.03  -0.03   0.07   0.05   0.07
ii. 10      0.03    3.01    0.25   0.47   0.05   0.32   0.02   0.18
ii. 30      0.22   12.64   12.19   1.66   0.30   4.15  -0.15   3.47
iii. 2      1.13    0.27    0.11  -0.08  -0.03   0.09   0.11   0.09
iii.10      0.15   13.61    3.67   0.19   0.21   1.90   0.65   1.47
iii.30      5.82   13.88   13.88   0.59   0.51   8.93   1.72  10.70
Average     1.70   11.83    5.35   0.34   0.12   2.40   0.29   2.29

The upper box shows the Z-scores (mean(diff) / std(diff) × sqrt(100)).

The lower box shows the average differences.

Shaded Z-scores: x > 3, strong support in favour of our method.

Underlined Z-scores: x < 0, any result against our method.

Summary: AIC is bad with two classes (overestimates); BIC (and modified BIC) are bad with 10 and 30 classes (underestimate); FLAT1 and CSP1 are weak with 10 and 30 classes (overestimate).

Large segments: detailed view

Rows show the D results for datasets with 2, 10 and 30 classes, relative to the D from the segmentation selected by the Bayes model with MEBP. Positive results => BM with MEBP outperforms; negative results => the method in question outperforms BM with MEBP. Column 1: mainly worse methods; column 2: mainly better methods.

These results did not depend on the D_JS prior.

Large segments in small data

This is data where the prior information is less reliable (a smaller dataset).

The flat-class prior outperforms our prior on the 30-class dataset.

Small segments

The hardest data to model. This is data where the prior affects the evaluation significantly. Without a prior, the BIC methods give the best result (= 1 segment is considered best).

Summary from the artificial data
• MEBP had the better overall result in 23/24 pairwise comparisons on the 30-class datasets (in 18/24 the Z-score > 3)
• MEBP had the better overall result in all pairwise comparisons on the 10-class datasets (in 12/24 the Z-score > 3)
• Our method was slightly outperformed by the other Bayes methods on dataset i with 2 classes; EBP also slightly outperforms it on every 2-class dataset
  – EBP might be better for smaller class numbers
  – MEBP underestimates the optimum here
• The ML methods and the priors with ps = 1 (FLAT1, CSP1) had the weakest performance

Analysis of real biological data
• Yeast cell cycle time series gene expression data
• Genes were clustered with k-means into 3, 4, 5 and 6 groups
• The order of the genes in the chromosomes and the gene associations with the expression clusters were turned into multidimensional multinomial data
• The aim was to locate regional similarities in gene expression over the yeast cell cycle

CHR   Rand. mean   Rand. std   log(P(M|D))   Goodness
  1     -726.39       3.86       -711.47       3.87
  2    -2783.24       5.17      -2759.31       4.62
  3    -1134.89       6.65      -1103.91       4.66
  4    -5331.72       8.80      -5160.64      19.44
  5    -1899.52       3.62      -1889.82       2.68
  6     -792.07       4.90       -752.02       8.17
  7    -3548.24       6.34      -3523.82       3.85
  8    -1982.86       2.46      -1969.82       5.31
  9    -1502.43       6.71      -1492.22       1.52
 10    -2589.06       3.36      -2543.79      13.48
 11    -2185.09       9.37      -2167.20       1.91
 12    -3693.34       4.60      -3658.42       7.58
 13    -3176.61       5.06      -3166.51       2.00
 14    -2641.54       6.02      -2612.29       4.86
 15    -3719.47       6.80      -3693.68       3.79
 16    -3157.52       3.77      -3150.92       1.75

Anything in real data?

• Each chromosome was segmented
• The segmentation score of each chromosome was compared to the scores from randomized data (100 randomizations)
• Goodness: (x − mean(rand)) / std(rand)
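A sketch of this randomization test; `segment_and_score` stands in (hypothetically) for the whole pipeline of segmenting one chromosome and returning its log(P(M|D)):

```python
import numpy as np

def goodness(data, segment_and_score, n_rand=100, seed=0):
    """Goodness = (real score - mean of null scores) / std of null
    scores, with the null built by permuting gene order."""
    rng = np.random.default_rng(seed)
    real = segment_and_score(data)
    null = np.array([segment_and_score(rng.permutation(data))
                     for _ in range(n_rand)])
    return (real - null.mean()) / null.std(ddof=1)
```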

Conclusions

• Showed a Bayes model that overall outperforms the ML-based methods
• Proposed a modified prior that performs better than the other tested priors on datasets with many classes
• Proposed a way of testing the various methods
  – it avoids picking too detailed models
  – the use of a prior can be considered a drawback
• Showed the preference for the ML score when segmenting data with very weak signals
• The real data has a localized signal

Future points

• Improve the heuristic (optimize the results)
• Use of fuzzy vs. hard cluster classifications
• Various other potential applications (no certainty of their rationality yet…)
• Should clusters be merged? (Work done in HIIT, Mannila's group)
• Consider sound ways of setting the prior for the D_JS calculation
• Length of the gene, density of genes?

Thank you!

=Wake up!