Molecular Classification of Biological Phenotypes

Esfandiar Haghverdi

School of Informatics

November 13, 2008

Outline

- Introduction
- Class Comparison
- Class Discovery
- Class Prediction
- Example
- Biological states and state modulation
- Chemical Genomics
- Toxicogenomics
- Software Tools
- Ideas

Introduction

- Welcome to the Genomics Data Mining (GDM) working group.
- Meetings will be bi-weekly, starting today.
- Extract meaningful information about the system being studied from gene expression data.
- The work is motivated by recent progress in cancer genomics.
- This overview is by no means comprehensive.
- There is no one-size-fits-all solution for the analysis and interpretation of genome-wide data.
- Many analysis options are available at all phases of analysis.
- An understanding of both biology and the computational methods is essential.
- Complex mathematical methods do not necessarily perform better than simpler ones.
- Prepackaged analysis tools are not a good substitute for collaboration with computational/statistical scientists on complex problems.

Central Ideas

- Gene expression signatures: One can rationally distill a list of genes from an unbiased global scan of gene-expression changes observed across a carefully selected sample set.
- Biological states can be characterized by gene expression signatures.
- We can use gene expression signatures as surrogates for biological states.
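
As a rough illustration of distilling a signature, the sketch below ranks genes by a signal-to-noise score, (mean_a - mean_b) / (sd_a + sd_b), between two sample classes and keeps the top-ranked genes. This is one common choice of score, not necessarily the one intended in the talk; the gene names, expression values, and the function names `signal_to_noise` and `distill_signature` are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """Signal-to-noise score for one gene: (mean_a - mean_b) / (sd_a + sd_b)."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def distill_signature(expression, labels, top_n):
    """Rank genes by |score| and return the top_n as (gene, score) pairs.

    expression: dict mapping gene name -> list of expression values, one per sample
    labels:     list of 0/1 class labels, one per sample
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == 0]
        b = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = signal_to_noise(a, b)
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return [(g, scores[g]) for g in ranked[:top_n]]


# Hypothetical toy data: 3 genes, 6 samples (first three samples are class 0).
expression = {
    "geneA": [8.0, 8.5, 9.0, 2.0, 2.5, 3.0],  # high in class 0
    "geneB": [5.0, 5.2, 4.9, 5.1, 5.0, 4.8],  # essentially unchanged
    "geneC": [1.0, 1.5, 2.0, 6.0, 6.5, 7.0],  # high in class 1
}
labels = [0, 0, 0, 1, 1, 1]

signature = distill_signature(expression, labels, top_n=2)
for gene, score in signature:
    print(gene, round(score, 3))
```

The sign of the score indicates which class the gene marks, so a signature can hold both up-regulated and down-regulated genes.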

Central Papers

- Gene expression signatures: Molecular Classification of Cancer, T.R. Golub et al., Science 286, 531, 1999.
- GSEA: Gene Set Enrichment Analysis, A. Subramanian et al., PNAS 102, 2005.
- GE-HTS: Gene Expression-based High-throughput Screening, K. Stegmaier et al., Nature Genetics 36, 257, 2004.
- Connectivity Map: The Connectivity Map, J. Lamb et al., Science 313, 1929, 2006.

Class Comparison

- Goal: Identify differentially expressed genes.
- Methods:
  - Calculate a test statistic (t-test, ANOVA F statistic, non-parametric rank-based, ...)
  - Determine the significance of the observed value of the test statistic.
  - Normality, equal variance, multiple testing, FWER vs FDR
- Issues:
  - Two or more experimental conditions
  - Conditions may be independent or related (time series)
  - Many different combinations of experimental variables
  - Replication, to estimate variability and to identify biologically reproducible changes
  - How to incorporate estimates of variation (model-based methods)
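
The two method steps and the FWER vs FDR distinction can be sketched as follows, under stated assumptions: a Welch t statistic (no equal-variance assumption), a random-permutation p-value (no normality assumption), then Bonferroni (FWER) and Benjamini-Hochberg (FDR) adjustments. The function names and the toy expression values are hypothetical, and the permutation test is only one of several ways to assess significance.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic; does not assume equal variances.

    Assumes each group has at least two values and non-zero variance.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def permutation_p_value(a, b, n_perm=1000, seed=0):
    """Two-sided p-value estimated by randomly permuting the class labels."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


def bonferroni(p_values):
    """FWER control: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR control: Benjamini-Hochberg adjusted p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical expression values for two genes under two conditions.
p_diff = permutation_p_value([8.1, 8.4, 9.0, 8.7], [2.0, 2.6, 3.1, 2.3])
p_null = permutation_p_value([5.0, 5.3, 4.8, 5.1], [5.2, 4.9, 5.5, 4.7])
print(p_diff, p_null)

# Adjusting a hypothetical list of per-gene p-values.
raw = [0.01, 0.04, 0.03, 0.20]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

With thousands of genes the Bonferroni adjustment becomes very conservative, which is why FDR-based adjustments are often preferred for this kind of screen.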

Class Comparison

Opportunities:

Time-series analysis:

- Regulatory pathway inference
- Yeast cell cycle (Fourier transform, ...)
- Model organism (e.g., Drosophila, Daphnia) development
- Analysis of samples (cells) exposed to different doses of the same drug
- Analysis of expression patterns from related bacterial strains
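
To make the Fourier-transform idea concrete, here is a small sketch that scores a gene's time course by the power of its Fourier component at an assumed period. The period of 80 time units, the sampling times, and both expression profiles are hypothetical; a real analysis would also need to handle noise, uneven sampling, and significance.

```python
import cmath
import math


def fourier_power(series, times, period):
    """Power of the Fourier component at the given period.

    The series is mean-centered first, so a constant offset contributes nothing.
    """
    m = sum(series) / len(series)
    coeff = sum(
        (x - m) * cmath.exp(-2j * math.pi * t / period)
        for x, t in zip(series, times)
    )
    return abs(coeff) ** 2 / len(series)


# Hypothetical sampling: 16 time points, 10 time units apart (two full periods).
times = [10 * k for k in range(16)]
period = 80

# One profile oscillating with the assumed period, one oscillating much faster.
periodic_gene = [math.sin(2 * math.pi * t / period) for t in times]
alternating_gene = [0.1 if k % 2 == 0 else -0.1 for k in range(16)]

print(fourier_power(periodic_gene, times, period))
print(fourier_power(alternating_gene, times, period))
```

Genes can then be ranked by this score to shortlist candidates whose expression follows the cycle.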

Class Discovery

- Goal: Identify meaningful patterns in the data (a.k.a. unsupervised learning).
- Methods:
  - Dimensionality reduction methods: Most of the variation in the data can be explained by a smaller number of transformed variables.
    - For example: SVD, PCA, MDS, ...
  - Clustering: Data can be grouped into groups of similar points based on some similarity measure.
    - Aggregation methods (e.g., hierarchical clustering, HC)
    - Partitioning or centroid methods (for example, k-means, SOM or Kohonen maps)
    - Model-based methods (e.g., fitting a mixture model)
    - Optimization techniques (within class, between class)
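
A minimal sketch of the partitioning (centroid) family named above, using k-means with squared Euclidean distance as the similarity measure. The deterministic farthest-point initialization and the toy `samples` are illustrative assumptions; randomized initializations are more common in practice.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest
    from all centroids chosen so far (an assumption made for reproducibility)."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(nxt)
    return centroids


def k_means(points, k, max_iter=100):
    """Lloyd's iterations: assign each point to its nearest centroid, then
    move each centroid to the mean of its members, until assignments settle."""
    centroids = [tuple(c) for c in farthest_point_init(points, k)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return assignment, centroids


# Hypothetical samples described by two expression features each.
samples = [
    (1.0, 1.1), (1.2, 0.9), (0.8, 1.0),
    (5.0, 5.1), (5.2, 4.9), (4.8, 5.0),
]
cluster_labels, centroids = k_means(samples, k=2)
print(cluster_labels)
print(centroids)
```

The number of clusters `k` has to be supplied up front, which is one of the practical issues with this family of methods.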

Class Discovery cont’d

Issues:

- It is unbiased: no a priori assumptions are made.
- The structure found may not be of clinical or biological interest.
- Should be viewed as a first step in a more detailed analysis.
- There is no single best way to evaluate a clustering method or a cluster.
- How to evaluate clustering methods?
- There is no single best clustering method for a data set.
- Some desirable properties may be: stability (reliability), predictive power, reduction power, ...
- How to choose the number of clusters (Gordon, repeated sampling, gap statistic, ...)
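
One simple way to put a number on stability is to compare the clusterings obtained from two runs (for example, on resampled data or with different settings) using the Rand index: the fraction of sample pairs on which the two clusterings agree. This is offered as an illustrative measure, not as the evaluation method proposed in the talk; the label vectors below are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs treated the same way by both clusterings:
    either grouped together in both, or separated in both.
    Cluster names themselves do not matter, only the groupings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical cluster labels for the same six samples from three runs.
run_1 = [0, 0, 0, 1, 1, 1]
run_2 = [1, 1, 1, 0, 0, 0]  # same grouping, cluster names swapped
run_3 = [0, 0, 1, 1, 1, 1]  # one sample moved to the other cluster

print(rand_index(run_1, run_2))  # 1.0
print(rand_index(run_1, run_3))
```

Averaging such a score over many resampled runs, for each candidate number of clusters, gives a rough picture of which choices are reproducible.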

Class Discovery cont’d

Opportunities:

- Stochastic clustering (e.g., NMF)
- Techniques from statistical physics (e.g., Deterministic Annealing)
- Spectral methods (e.g., Diffusion maps on graphs)
- Geometric methods (e.g., Diffusion maps on manifolds)
- Information theoretic methods
- Statistical theory of clustering (cf. comparing clustering methods)
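
As an illustration of the NMF item, the sketch below factors a non-negative genes-by-samples matrix V into W (genes by components) and H (components by samples) with multiplicative updates for the squared error, then assigns each sample to the component with the largest weight in H. It assumes that "stochastic" refers to the random initialization, so different seeds can give different clusterings. The matrix, rank, iteration count, and function names are hypothetical.

```python
import random


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2
        for i in range(len(v))
        for j in range(len(v[0]))
    )


def nmf(v, rank, n_iter=500, seed=0, eps=1e-9):
    """Multiplicative-update NMF from a random non-negative start."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical expression matrix: 4 genes (rows) by 6 samples (columns),
# with two groups of samples that express disjoint sets of genes.
v = [
    [5, 5, 5, 0, 0, 0],
    [4, 4, 4, 0, 0, 0],
    [0, 0, 0, 3, 3, 3],
    [0, 0, 0, 6, 6, 6],
]
w, h = nmf(v, rank=2)
error = frobenius_error(v, w, h)

# Assign each sample to the component with the largest weight in H.
sample_clusters = [
    max(range(len(h)), key=lambda a: h[a][j]) for j in range(len(v[0]))
]
print(error)
print(sample_clusters)
```

Because the start is random, a common practice is to repeat the factorization with several seeds and check how consistently samples end up together.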


Class Discovery Methodology

Expression Dataset

Scaling, Filtering and Normalization

Select Number of Classes

Cluster Data

Validation of Putative Classes

Discovered Classes


Class Prediction

I Goal: Design an accurate classifier (predictor) under the guidance of a supervisor (a.k.a. the supervised learning problem), e.g., predicting cancer (sub)types, clinical outcomes, etc.

I Methods:
  I Linear and quadratic discriminant analysis
  I Weighted voting
  I Shrunken centroids
  I k-NN
  I Neural nets
  I SVM
  I Decision tree classifiers
  I Naive Bayes
  I Bagging and boosting (combining classifiers)


Class Prediction, cont’d

Issues:

I features >> samples

I Overfitting (modeling the training data too exactly)

I High level of noise

I Which method to choose?

I Careful with comparisons

I Some trends emerge (e.g., Diagonal LD does better than Fisher’s LD, k-NN performs better after gene filtering, combined methods do better, simpler methods do better, ...)
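The first two issues can be made concrete with a small experiment. The sketch below assumes pure-noise data, where expression values and labels are unrelated by construction, and a minimum-norm least-squares linear classifier; the sizes and variable names are hypothetical. With far more genes than samples, the classifier reproduces the training labels exactly even though there is nothing to learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 20, 500   # features >> samples

# Pure noise: labels carry no information about the expression values.
X_train = rng.standard_normal((n_samples, n_genes))
y_train = np.repeat([-1.0, 1.0], n_samples // 2)
X_test = rng.standard_normal((n_samples, n_genes))
y_test = np.repeat([-1.0, 1.0], n_samples // 2)

# Minimum-norm least-squares fit of a linear decision function.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_acc = np.mean(np.sign(X_train @ w) == y_train)
test_acc = np.mean(np.sign(X_test @ w) == y_test)
print(f"training accuracy: {train_acc:.2f}")   # perfect fit to noise
print(f"test accuracy:     {test_acc:.2f}")    # no better than guessing on average
```

This is why training accuracy alone says little here, and why comparisons between methods should be based on cross-validation or independent test data.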


Class Prediction cont’d

Opportunities:

I Theory for method and classifier comparison

I Combining knowledge from different methods

I Incorporating knowledge from complementary sources

I Boosting the power and rigor of analysis

I Subpattern discovery, Califano et al. 99


Marker Selection Methodology

Gene Expression Dataset

Known Classes (Phenotype)

Scaling, Filtering and Normalization

Compute Gene-Class Correlations and sort genes accordingly (Feature Selection)

Visualization and Biological Study of Markers

Build Supervised Classifier

Predictive Model

More Experiments (Validation)

Computational Analysis


Class Prediction Methodology

Expression Data; Known or Discovered Classes

Scaling, Filtering and Normalization

Compute Gene-Class Correlations and sort genes accordingly (Feature Selection)

Build Classifier (Training)

Test Classifier by Cross-Validation

Evaluate Predictor on Independent Test Set


Example (Golub et al. 1999)

ALL vs AML (The Biology of Cancer, R. Weinberg)


Compare to this!!

Normal kidney vs Renal cell carcinoma.


Example cont’d

I Sample: 38 bone marrow samples (27 ALL, 11 AML), 6817 genes

I Test statistic: SNR = (µ1 − µ2)/(σ1 + σ2) to determine gene-class correlation

I Significance test: Permutation test (neighborhood analysis)

I Signature: 50 informative genes are selected

I Prediction: Given a new sample, each informative gene casts a weighted vote; the votes are then summed to determine the winning class and to define the prediction strength 0 < PS < 1, which needs to be above 0.3 for the vote to be decisive.

I Validity of the predictor: Cross-validation and trying on test data.


In symbols ...

Suppose we have n samples.

I Gene vector: v(g) = (e1, · · · , en), where ei is the expression level of gene g in sample i.

I Class vector: c = (c1, · · · , cn): ci = 1 if i ∈ AML, and 0 if i ∈ ALL.

I Gene-Class Correlation: P(g, c) = (µ1(g) − µ2(g))/(σ1(g) + σ2(g)) for each gene g, where µk(g) and σk(g) are the mean and standard deviation of the expression of g in class k (class 1 = AML, class 2 = ALL).

I Nhood: N1(c, r) = {g | P(g, c) ≥ r}

I Permutation test: compare with N1(c∗, r), where c∗ is a random permutation of c, for 400 permutations.

I Number of informative genes is a free parameter, chosen to be 50: 25 top-most and 25 bottom-most.

I Robustness to noise

I Ease of applicability.
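The definitions above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the original analysis code: it assumes the neighborhood counts genes with P(g, c) of at least r, uses the sample standard deviation (`ddof=1`), and runs on hypothetical synthetic data in which 10 of 200 genes are shifted upward in class 1. All function names are made up for the example.

```python
import numpy as np

def signal_to_noise(X, c):
    """P(g, c) per gene; X is genes x samples, c is a 0/1 class vector."""
    X1, X2 = X[:, c == 1], X[:, c == 0]
    num = X1.mean(axis=1) - X2.mean(axis=1)
    den = X1.std(axis=1, ddof=1) + X2.std(axis=1, ddof=1)
    return num / den

def neighborhood_size(P, r):
    """Number of genes whose correlation with the class vector is at least r."""
    return int(np.sum(P >= r))

def permuted_sizes(X, c, r, n_perm=400, seed=0):
    """Neighborhood sizes for randomly permuted class vectors c*."""
    rng = np.random.default_rng(seed)
    return np.array([
        neighborhood_size(signal_to_noise(X, rng.permutation(c)), r)
        for _ in range(n_perm)
    ])

def informative_genes(P, n_each=25):
    """Indices of the n_each top-most and n_each bottom-most genes by P."""
    order = np.argsort(P)
    return np.concatenate([order[::-1][:n_each], order[:n_each]])

# Hypothetical data: 200 genes, 38 samples (11 in class 1, 27 in class 2).
rng = np.random.default_rng(42)
c = np.array([1] * 11 + [0] * 27)
X = rng.standard_normal((200, 38))
X[:10, c == 1] += 5.0   # genes 0-9 are truly higher in class 1

P = signal_to_noise(X, c)
observed = neighborhood_size(P, r=1.0)
null = permuted_sizes(X, c, r=1.0)
p_value = (np.sum(null >= observed) + 1) / (len(null) + 1)
genes = informative_genes(P)
print(observed, p_value, genes[:10])
```

The permutation p-value compares the observed neighborhood size with the sizes obtained when the class labels are shuffled, so it does not rely on a distributional assumption for P.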


In symbols cont’d

Predictor Design:

I vg = ag(xg − bg), where ag = P(g, c) and bg = [µ1(g) + µ2(g)]/2,

I xg is the normalized log expression level of gene g in sample x

I vg ≥ 0 means g votes for class 1 (AML) and vg < 0 means g votes for class 2 (ALL).

I V1 = ∑ vg over g ∈ IG with vg ≥ 0, and

I V2 = ∑ |vg| over g ∈ IG with vg < 0, where IG is the set of informative genes.

I PS = (Vwin − Vlose)/(Vwin + Vlose)

I V1 > V2 with PS > 0.3 means x ∈ AML; V2 > V1 with PS > 0.3 means x ∈ ALL; if PS ≤ 0.3 then uncertain.
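The voting rule above translates almost directly into code. The following is a minimal sketch under the slide's definitions; the function name `weighted_vote` and the three-gene toy weights are hypothetical, and the arrays `a`, `b` and `x` are assumed to be restricted to the informative genes already.

```python
import numpy as np

def weighted_vote(x, a, b, ps_threshold=0.3):
    """Classify one sample from its informative-gene expression levels x.

    a holds the weights P(g, c) and b the class midpoints [mu1 + mu2] / 2.
    """
    v = a * (x - b)             # one vote per informative gene
    V1 = v[v >= 0].sum()        # total vote for class 1 (AML)
    V2 = -v[v < 0].sum()        # total vote for class 2 (ALL)
    v_win, v_lose = max(V1, V2), min(V1, V2)
    total = v_win + v_lose
    ps = float((v_win - v_lose) / total) if total > 0 else 0.0
    if ps <= ps_threshold:
        return "uncertain", ps
    return ("AML" if V1 > V2 else "ALL"), ps

# Hypothetical weights and midpoints for three informative genes.
a = np.array([1.0, 0.5, -1.0])
b = np.zeros(3)

print(weighted_vote(np.array([2.0, 1.0, -1.0]), a, b))   # all votes for AML
print(weighted_vote(np.array([-1.0, -1.0, 1.0]), a, b))  # all votes for ALL
print(weighted_vote(np.array([1.0, -2.0, 0.5]), a, b))   # split vote, PS = 0.2
```

In the third call the votes are 1.0 for AML against 1.5 for ALL, so PS = 0.5/2.5 = 0.2 and the sample is left unclassified rather than forced into a class.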


In symbols, cont’d

I Cross-validation: 36 were assigned classes (with 100% accuracy) and 2 were uncertain. Median PS = 0.77.

I Test data (34 samples): 24 bone marrow and 10 peripheral blood samples, 20 ALL, 14 AML. Result: 29 predicted with 100% accuracy and 5 uncertain. Median PS = 0.73.


Example cont’d, clustering

I View each sample as a 6817-dimensional vector and cluster samples.

I 2-SOM on 38 samples: A1: 24 ALL, 1 AML and A2: 10 AML, 3 ALL.

[Figure: clusters A1 and A2, with samples marked ALL or AML]
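To show what a two-node SOM does mechanically, here is a minimal sketch of an online self-organizing map on a one-dimensional grid. It is not the software used in the study: the initialization along the first principal component, the linear decay schedules, the function names and the synthetic 38-sample data set are all assumptions made for illustration.

```python
import numpy as np

def train_som(X, n_nodes=2, n_iter=2000, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1-D SOM; X is samples x genes. Returns the node weight vectors."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    # Start the nodes spread along the first principal component.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    spread = s[0] / np.sqrt(len(X))
    W = mean + np.linspace(-1, 1, n_nodes)[:, None] * spread * vt[0]
    grid = np.arange(n_nodes)
    for t in range(n_iter):
        frac = 1.0 - t / n_iter
        lr = lr0 * frac                  # learning rate decays to zero
        sigma = sigma0 * frac + 1e-3     # neighborhood width shrinks over time
        x = X[rng.integers(len(X))]
        bmu = np.argmin(np.linalg.norm(W - x, axis=1))   # best-matching node
        h = np.exp(-((grid - bmu) ** 2) / (2 * sigma ** 2))
        W += lr * h[:, None] * (x - W)   # pull the winner and its neighbors toward x
    return W

def assign(X, W):
    """Index of the nearest node for every sample."""
    dists = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Hypothetical data: 27 + 11 samples in 30 dimensions, two well-separated groups.
rng = np.random.default_rng(3)
X = np.vstack([rng.standard_normal((27, 30)),
               rng.standard_normal((11, 30)) + 4.0])
W = train_som(X)
clusters = assign(X, W)
print(clusters)
```

On real expression data the groups are far less cleanly separated than in this toy example, which is why the slide reports a few samples falling into the "wrong" cluster.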


4-SOM

I 4-SOM on the same samples: B1: 10 AML; B2: 8 T-ALL, 1 B-ALL; B3: 5 B-ALL; B4: 13 B-ALL, 1 AML.

[Figure: clusters B1 to B4, with samples marked AML, ALL T-cell or ALL B-cell]


Example cont’d, clustering

I How to evaluate clusters to see if they represent true biological structure?

I Idea: true structure implies more accurate predictor.

I So design predictors based on clustering classes: leads to merging B3 and B4.
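The idea of validating clusters through prediction can be sketched as follows. The example swaps in a nearest-centroid classifier with leave-one-out cross-validation purely for brevity; this is an assumption of the sketch, not the weighted-voting predictor used in the study, and the data and function name are hypothetical. Labels that reflect real structure should be predictable, while arbitrary labels should not.

```python
import numpy as np

def loocv_nearest_centroid(X, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier; X is samples x genes."""
    n = len(labels)
    correct = 0
    for i in range(n):
        keep = np.arange(n) != i                  # hold out sample i
        classes = np.unique(labels[keep])
        centroids = np.array([X[keep & (labels == k)].mean(axis=0) for k in classes])
        pred = classes[np.argmin(np.linalg.norm(centroids - X[i], axis=1))]
        correct += int(pred == labels[i])
    return correct / n

# Hypothetical data: two real groups of 20 samples each, 50 genes.
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((20, 50)),
               rng.standard_normal((20, 50)) + 3.0])
true_labels = np.repeat([0, 1], 20)
shuffled_labels = rng.permutation(true_labels)   # labels with no real structure

acc_true = loocv_nearest_centroid(X, true_labels)
acc_shuffled = loocv_nearest_centroid(X, shuffled_labels)
print(f"accuracy with real clusters:     {acc_true:.2f}")
print(f"accuracy with shuffled clusters: {acc_shuffled:.2f}")
```

By the same logic, two putative clusters that cannot be told apart by a cross-validated predictor are candidates for merging.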


Enrichment Analysis

I Goal: Look at (Gene Set)-Class correlation instead of Gene-Class correlation.

I Motivation:
    I Mootha 03: No single gene is significantly differentially expressed, yet sets of genes might express differentially.
    I Subramanian 05: 1. Robustness to different sites, and 2. Integrating biological knowledge.

I Methods:
    I GSEA (A. Subramanian, PNAS 2005). Lung adenocarcinoma with good/poor outcome. $|S_B \cap S_M| = 12$ and $|S_B \cap S_M \cap S_S| = 1$, whereas $S_B$ in M was NES = 1.9, p < 0.001 and $S_M$ in B was NES = 2.13, p < 0.001.
    I Tibshirani and Efron
    I R. Gentleman (Bioconductor)
    I Module maps, a refinement of GSEA, gene set minimization (Segal et al. Nature Genetics, 04, 05)
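
To make the gene-set view concrete, here is a minimal sketch of a GSEA-style running-sum enrichment score. It assumes the genes are already ranked by their correlation with the class; the function name, the toy gene names, and the toy scores are hypothetical. The NES and p-values quoted above additionally rely on permutation-based normalization, which this sketch does not attempt.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score over a ranked gene list.

    ranked_genes: gene names sorted by decreasing correlation with the class.
    scores: correlation values in the same order as ranked_genes.
    gene_set: set of gene names to test.
    p: weight exponent; p=0 gives the unweighted Kolmogorov-Smirnov-like form.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # Normalizer so that the hit contributions sum to 1.
    norm = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    hit_sum = 0.0
    miss_count = 0
    best = 0.0
    for s, h in zip(scores, hits):
        if h:
            hit_sum += abs(s) ** p
        else:
            miss_count += 1
        # Walk up on genes in the set, down on genes outside it.
        deviation = hit_sum / norm - miss_count / n_miss
        if abs(deviation) > abs(best):
            best = deviation
    return best


# Hypothetical ranked list: g1 is most correlated with the class, g6 least.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
corr = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top = enrichment_score(genes, corr, {"g1", "g2"})
es_bottom = enrichment_score(genes, corr, {"g5", "g6"})
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking gets a large positive score and a set concentrated at the bottom a large negative one, even when no single member would pass a per-gene significance threshold.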


EA cont’d

Opportunities:

I Using BLAST theory to enhance the predictive power.

I Random walks on networks.


Chemical Genomics

I Generating large collections of small molecules and using them to modulate cellular states.

I One approach is to screen different compounds that induce state modulations, using signatures for the states.

I GE-HTS (Stegmaier et al., Nature Genetics, 2004): 1,739 compounds are screened for AML/neutrophil and AML/monocyte terminal cell differentiation.

I Connectivity Map (J. Lamb et al., Science 06): disease – gene – drug.
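
A signature-based screen can be pictured as scoring each compound's expression profile against the marker genes of the target state and ranking compounds by that score. The sketch below is a deliberately simplified, hypothetical scoring rule (mean of "up" markers minus mean of "down" markers); it is not the scoring used in GE-HTS or the Connectivity Map, and all gene names, compound names, and values are made up for illustration.

```python
def signature_score(profile, up_markers, down_markers):
    """Mean expression of 'up' markers minus mean expression of 'down' markers."""
    up = sum(profile[g] for g in up_markers) / len(up_markers)
    down = sum(profile[g] for g in down_markers) / len(down_markers)
    return up - down


def rank_compounds(profiles, up_markers, down_markers):
    """Return compound names sorted from strongest to weakest signature score."""
    scored = [
        (signature_score(profile, up_markers, down_markers), name)
        for name, profile in profiles.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical signature of the target state.
up_markers = ["gene_up1", "gene_up2"]
down_markers = ["gene_down1"]

# Hypothetical expression changes (treated vs. untreated) per compound.
profiles = {
    "cmpd_A": {"gene_up1": 2.0, "gene_up2": 1.5, "gene_down1": -1.0},
    "cmpd_B": {"gene_up1": 0.2, "gene_up2": 0.1, "gene_down1": 0.0},
    "vehicle": {"gene_up1": 0.0, "gene_up2": 0.0, "gene_down1": 0.0},
}

ranking = rank_compounds(profiles, up_markers, down_markers)
print(ranking)
```

Compounds at the top of the ranking are the candidates for inducing the target state and would be followed up experimentally.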


CG, cont’d

Issues:

I Necessary modifications when dealing with more heterogeneous situations, e.g., BC vs Leukemia.

I Choice of cell type

I Measurement time

I Concentration and treatment duration

I Analytical methods for detecting relevant signals


Toxicogenomics

I Identification of potential human and environmental toxicants, and their putative mechanisms of action, through the use of genomics resources.

I Gene expression is altered during toxicity.

I Challenge: Given a set of experimental conditions, define the characteristic and specific pattern of gene expression elicited by a given toxicant. (Toxicant signatures!)

I Given a model organism: Known toxicants → signatures → database → determine the action mechanism of an unknown toxicant.

I Connectivity Map: gene – toxicant
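
The pipeline "known toxicants → signatures → database → unknown toxicant" can be sketched as a nearest-signature lookup. The choice of Pearson correlation as the similarity measure is an assumption made for this sketch; the toxicant names and expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def closest_signature(query, database):
    """Name of the database signature most correlated with the query profile."""
    return max(database, key=lambda name: pearson(query, database[name]))


# Hypothetical database: expression changes over the same four genes,
# measured for toxicants whose mechanism of action is known.
database = {
    "toxicant_X": [2.0, 1.0, -1.0, -2.0],
    "toxicant_Y": [-1.0, 2.0, -2.0, 1.0],
}

# Hypothetical profile of an unknown toxicant in the same model organism.
query = [1.8, 1.1, -0.9, -2.2]

match = closest_signature(query, database)
print(match, round(pearson(query, database[match]), 3))
```

The unknown compound would then be hypothesized to share the mechanism of its best match, subject to how signature similarity is defined and to the experimental factors (dose, time, model system) under which the signatures were measured.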


TG, cont’d

Issues:

I Proper definition of signature similarity

I Model system selection

I Dose selection

I Measurement time (Time series analysis?)

I Other factors: age, diet, etc.


Tools

I Bioconductor

I BRB ArrayTools (NCI, Richard Simon)

I GenePattern, includes GeneCluster (Broad Institute)

I Connectivity Map (Broad Institute)

I Benchmark data sets

I Local implementations with interface to the tools above


Some ideas ...

I Technical improvements at different levels

I Integrating biological knowledge into the mathematical models (after Markov Random Fields)

I New mathematical tools

I Using signatures to infer signalling pathways

I Using signatures to improve clustering algorithms

I Technology transfer from: Time-series analysis of financial data, VLDB, Theoretical Neuroscience


Ideas, cont’d

I New biologically relevant and important questions:

I Daphnia-based toxicogenomics

I Daphnia-based ecogenomics

I Signatures in developmental stages of model organisms

I Time series analysis of environmental effects

I Other genomic signatures: DNA methylation patterns, microRNA profiles, metabolite profiles, ...
