Computational Diagnostics: A new research group at the Max Planck Institute for Molecular Genetics, Berlin

Page 1: Computational Diagnostics A new research group at the Max Planck Institute for Molecular Genetics, Berlin.

Computational Diagnostics

A new research group at the

Max Planck Institute for Molecular Genetics,

Berlin

Page 2:

Will the patient respond to this drug?

?

Page 3:

computational diagnostics

A simple solution for simple problems

Find all genes that are induced at least x-fold and use them to predict clinical outcomes
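
The "simple solution" on this slide can be sketched in a few lines. This is only an illustration under assumptions not stated in the talk: expression values are on a linear (not log) scale, fold change is the ratio of group means, and the function name `induced_genes` and the toy numbers are hypothetical.

```python
import numpy as np

def induced_genes(treated, control, x=2.0):
    """Return row indices of genes whose mean expression in `treated`
    is at least x-fold the mean expression in `control`.
    Rows are genes, columns are samples (assumed layout)."""
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    fold = treated.mean(axis=1) / control.mean(axis=1)
    return np.flatnonzero(fold >= x)

# Toy data (made up): 4 genes, 3 samples per condition.
control = np.array([[10.0, 12.0, 11.0],
                    [50.0, 55.0, 45.0],
                    [ 5.0,  6.0,  4.0],
                    [20.0, 20.0, 20.0]])
treated = np.array([[30.0, 33.0, 36.0],
                    [52.0, 49.0, 51.0],
                    [ 5.0,  5.0,  5.0],
                    [45.0, 40.0, 41.0]])

selected = induced_genes(treated, control, x=2.0)
print(selected)  # indices of genes induced at least 2-fold
```

A filter like this ignores the variability between replicates, which is one reason the following slides turn to statistical modeling.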

Page 4:

Statistical Modeling

Experimental Design, Quality Control, Scaling, Normalization, Dimension Reduction, Predictive Classification, Quantifying the Evidence, Identifying the Evidence

Page 5:

Computational Infrastructure and more Data

Databases, Automatic Uploading, Standard Analysis Protocols, Analysis Software, Query Language, Understanding the disease, Designing a small Diagnostic Chip

Page 6:

Clinical Practice

Large Patient Databases complemented by expression profiles monitoring the Epidemiology of the disease

Page 7:

Breast Cancer, Expression Profiles and

Binary Regression in 7000 Dimensions

Rainer Spang, Harry Zuzan, Carrie Blanchette, Erich Huang, Holly Dressman, Jeff Marks,

Joe Nevins, Mike West

Duke Medical Center & Duke University

Page 8:

Estrogen Receptor Status

• 7000 genes

• 49 breast tumors

• 25 ER+

• 24 ER-

Page 9:

Tumor – Chip - 7000 Numbers

Page 10:

We Assume That the Following Steps Are Done:

• Choosing the patients

• Doing the surgery

• Handling the tissues

• Preparing mRNA

• Hybridizing the chips

• Image analysis

• Excluding low quality data

• Normalization

• Scaling

Page 11:

Page 12:

How Much Evidence Is There?

The probability that the patient has xxx, given the profile, is:

• "I am 80% sure": 0.8

• "I know it": 1

• "It was a guess": 0.5

Page 13:

Given: 7000 numbers

Wanted: the probability that the tumor is ER+ (for example, 89%)
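
A minimal sketch of how a profile can be turned into such a probability, assuming the probit model named on the summary slide: a linear score of the expression values is passed through the standard normal distribution function. The function name `probit_probability` and the weights and profile below are hypothetical placeholders, not fitted values from the study.

```python
import math

def probit_probability(profile, weights, intercept=0.0):
    """P(ER+ | profile) under a probit model: Phi(intercept + sum_j w_j * x_j),
    where Phi is the standard normal cumulative distribution function."""
    score = intercept + sum(x * w for x, w in zip(profile, weights))
    return 0.5 * (1.0 + math.erf(score / math.sqrt(2.0)))

# Hypothetical three-gene example; a real profile would have 7000 entries.
profile = [1.2, -0.4, 0.7]
weights = [0.8, 0.5, 0.3]
p_er_positive = probit_probability(profile, weights)
print(round(p_er_positive, 3))
```

A score of zero gives probability 0.5, the "it was a guess" case; large positive or negative scores push the probability toward 1 or 0.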

Page 14:

7000 Numbers Are More Numbers Than We Need

Predict ER status based on the expression levels of super-genes
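
The slides do not define how super-genes are computed. One common construction, assumed here, is a singular value decomposition of the gene-by-tumor expression matrix: each super-gene is a linear combination of all genes, and with 49 tumors there are at most 49 of them. The sketch uses random numbers as a stand-in for real, normalized expression data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_tumors = 7000, 49

# Placeholder for a normalized expression matrix (genes x tumors).
X = rng.normal(size=(n_genes, n_tumors))
X = X - X.mean(axis=1, keepdims=True)  # center each gene across tumors

# Thin SVD: X = U @ diag(d) @ Vt, with U (7000 x 49), d (49,), Vt (49 x 49).
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Each row is one super-gene measured on the 49 tumors.
# Equivalent to U.T @ X: every super-gene is a weighted sum of all 7000 genes.
super_genes = d[:, None] * Vt

print(super_genes.shape)  # (49, 49)
```

Under this construction the rows of `super_genes` are mutually orthogonal, and the 7000-dimensional problem is reduced to 49 dimensions without discarding any tumor-to-tumor variation.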

Page 15:
Page 16:

Overfitting: We Cannot Identify a Model

• There are many different models that assign high probabilities for ER+ tumors and low probabilities for ER- tumors in the training set

• For a new patient we find among these models some that support that she is ER+ and others that predict she is ER-

• ???
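
The identifiability problem can be illustrated numerically. The sketch below uses random data and a plain linear score whose sign stands in for the diagnosis (a simplification, not the model from the talk). It constructs two weight vectors that reproduce the training labels exactly and still disagree on a new profile.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tumors, n_genes = 49, 7000

X = rng.normal(size=(n_tumors, n_genes))            # training profiles (random stand-in)
y = np.where(np.arange(n_tumors) < 25, 1.0, -1.0)   # 25 "ER+" (+1), 24 "ER-" (-1)
x_new = rng.normal(size=n_genes)                     # profile of a new patient

# Model 1: minimum-norm weights that fit the training labels exactly.
w1 = np.linalg.pinv(X) @ y

# Part of x_new that is orthogonal to every training profile.
# Adding any multiple of it to the weights leaves the training fit unchanged.
null_dir = x_new - X.T @ np.linalg.solve(X @ X.T, X @ x_new)

# Model 2: same training fit, but the score for the new patient is flipped.
s1 = x_new @ w1
c = 2.0 * s1 / (x_new @ null_dir)
w2 = w1 - c * null_dir

print(np.allclose(X @ w1, y), np.allclose(X @ w2, y))  # both fit the training set
print(np.sign(x_new @ w1), np.sign(x_new @ w2))        # opposite predictions
```

With 7000 weights and 49 equations the data alone cannot choose between `w1` and `w2`, which is what motivates the additional model assumptions on the following slides.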

Page 17:

Given the Few Profiles With Known Diagnosis:

• The uncertainty about the right model is high

• The variance of the model-weights is large

• The likelihood landscape is flat

• We need additional model assumptions to solve the problem

Page 18:

Informative Priors

Likelihood × Prior ∝ Posterior

Page 19:

If the Prior Is Chosen Badly:

• We cannot reproduce the diagnosis of the training profiles any more

• We still cannot identify the model

• The diagnosis is driven mostly by the additional assumptions and not by the data

Page 20:

The Prior Needs to Be Designed in 49 Dimensions

• Shape?

• Center?

• Orientation?

• Not too narrow ... not too wide

Page 21:

Shape

multidimensional normal, for simplicity

Page 22:

Center

Assumptions on the model correspond to assumptions on the diagnosis

Page 23:

Orientation

orthogonal super-genes !

Page 24:

Not Too Narrow ... Not Too Wide

Auto adjusting model

Scales are hyperparameters with their own priors
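
The prior design from the last few slides can be written out as one hierarchical model. The notation is an assumption made for illustration: $x_i$ is the vector of 49 super-gene values for tumor $i$, $y_i = 1$ means ER+, $\beta$ holds the model weights, and $\tau_j$ is the prior scale of weight $j$. The probit link is taken from the summary slide. The form of the hyperprior $p(\tau_j^2)$ is not given in the slides and is left unspecified.

```latex
\begin{align*}
P(y_i = 1 \mid x_i, \beta) &= \Phi\!\left(x_i^{\top}\beta\right),
  & i &= 1, \dots, 49 && \text{(probit likelihood)} \\
\beta_j \mid \tau_j &\sim \mathcal{N}\!\left(0, \tau_j^{2}\right),
  & j &= 1, \dots, 49 && \text{(normal prior, one independent weight per super-gene)} \\
\tau_j^{2} &\sim p\!\left(\tau_j^{2}\right)
  & & && \text{(scales with their own priors)} \\
p(\beta, \tau \mid y) &\propto
  \prod_{i} \Phi\!\left(x_i^{\top}\beta\right)^{y_i}
            \left[1 - \Phi\!\left(x_i^{\top}\beta\right)\right]^{1 - y_i}
  \prod_{j} \mathcal{N}\!\left(\beta_j; 0, \tau_j^{2}\right) p\!\left(\tau_j^{2}\right)
  & & && \text{(posterior)}
\end{align*}
```

Centering the prior at $\beta = 0$ corresponds to $\Phi(0) = 0.5$ for every tumor, so the prior itself does not favor either diagnosis.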

Page 25:

What are the additional assumptions that came in by the prior?

• The model cannot be dominated by only a few super-genes (genes!)

• The diagnosis is done based on global changes in the expression profiles influenced by many genes

• The assumptions are neutral with respect to the individual diagnosis

Page 26:
Page 27:

Which Genes Have Driven the Prediction?

Gene                           Weight
nuclear factor 3 alpha         0.853
cysteine rich heart protein    0.842
estrogen receptor              0.840
intestinal trefoil factor      0.840
x box binding protein 1        0.835
gata 3                         0.818
ps 2                           0.818
liv1                           0.812
... many many more ...         ...

Page 28:

Cysteine Rich Heart Protein

Page 29:

Summary ... so far

• We have solved a relatively simple computational diagnostics problem (ER-status in human breast cancers)

• Probit model

• Overfitting is a problem

• Additional model assumptions do the trick

Page 30:

A Common Problem With Expression Profiles

• We do not have enough samples to answer a certain question

• A possible strategy: Introduce additional model assumptions

Page 31:

Differential Expression I

Setup: Two conditions (healthy vs. sick), some repetitions, 10 000 genes

Which genes are up- or down-regulated?

The most basic question

Good because it is a hypothesis-free approach

Page 32:

Differential Expression II

10 000 degrees of freedom

A very bad multiple testing problem

It is possible in principle, but might require many replications, depending on the signal-to-noise ratio

SAM: regularized t-statistic + permutation-based false positive rates
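
The two ingredients named here can be sketched as follows. This is a simplified illustration, not the SAM procedure itself: the constant `s0` and the cutoff are fixed by hand, a symmetric cutoff on |d| is used, and the data are simulated. The function names are hypothetical.

```python
import numpy as np

def regularized_t(a, b, s0):
    """Per-gene statistic d = (mean_a - mean_b) / (s + s0), where s is the
    pooled standard error and s0 > 0 damps genes with very small variance.
    Rows are genes, columns are samples."""
    na, nb = a.shape[1], b.shape[1]
    diff = a.mean(axis=1) - b.mean(axis=1)
    pooled_var = ((na - 1) * a.var(axis=1, ddof=1) +
                  (nb - 1) * b.var(axis=1, ddof=1)) / (na + nb - 2)
    s = np.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return diff / (s + s0)

def permutation_false_positives(data, labels, s0, threshold, n_perm=100, seed=0):
    """Average number of genes with |d| >= threshold when the condition
    labels are randomly permuted (an estimate of the false positive count)."""
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(n_perm):
        perm = rng.permutation(labels)
        d = regularized_t(data[:, perm == 1], data[:, perm == 0], s0)
        counts.append(np.sum(np.abs(d) >= threshold))
    return float(np.mean(counts))

# Simulated data: 1000 genes, 5 + 5 samples, the first 50 genes truly shifted.
rng = np.random.default_rng(0)
labels = np.array([1] * 5 + [0] * 5)
data = rng.normal(size=(1000, 10))
data[:50, labels == 1] += 4.0

s0, threshold = 0.1, 3.0
d_obs = regularized_t(data[:, labels == 1], data[:, labels == 0], s0)
n_called = int(np.sum(np.abs(d_obs) >= threshold))
expected_false = permutation_false_positives(data, labels, s0, threshold)
fdr_estimate = min(1.0, expected_false / max(n_called, 1))
print(n_called, expected_false, fdr_estimate)
```

The ratio of the permutation-based count to the number of called genes gives a rough false discovery estimate for the chosen cutoff.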

Hard to improve the analysis because it is a hypothesis-free approach

Page 33:

Clustering of Genes

• Setup: many different conditions - time series - multiple knock-outs

• 100% exploratory analysis

• Essentially it is rearranging the data

• Good for finding hypotheses but not for verifying them

Page 34:

Clustering of Profiles (Patients)

• Maybe we can find new disease types or refine existing ones

• Completely different results when different sets of genes are used

• No predictive analysis

Page 35:

Think About Data Analysis Ahead of Time

Collect possible questions on the data

Which of them are easy? (Biologists and bioinformaticians might have a different take on that.)

Compare: number of samples vs. degrees of freedom

It is possible to compensate for a lack of data with model assumptions: Which assumptions make sense?

More complex questions can be the easier ones if they allow for an appropriate model