Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The...
natalie-horn -
• T-statistics are widely used to assess differential expression.
• Unstable variance estimates that arise when the sample size is small can be corrected using:
– Error fudge factors (SAM)
– Bayesian methods (Limma)
Limma
Linear model analysis of microarrays
Bayesian regularized t-test (Baldi & Long 2001)
$$t = \frac{m_C - m_T}{\sqrt{\dfrac{\sigma_C^2}{n_C} + \dfrac{\sigma_T^2}{n_T}}}$$
The method tries to decouple the mean–variance dependency by modeling the variance of a gene's expression as a function of its mean expression.
The empirical variance is modulated by $\nu_0$ ‘pseudo-observations’ associated with a background variance $\sigma_0^2$.
[Figure annotation: "My gene"]
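The shrinkage described above can be sketched in a few lines. This is a minimal illustration, not the Limma implementation: it assumes the regularized variance takes the form (ν0·σ0² + (n − 1)·s²) / (ν0 + n − 2) associated with Baldi & Long (2001), and the function names, the per-group background variances and the control-minus-treated sign are all choices made here for the example.

```python
import math
from statistics import mean, variance


def regularized_variance(values, sigma0_sq, nu0):
    """Shrink the sample variance towards a background variance.

    sigma0_sq is the background variance and nu0 the number of
    pseudo-observations (both are inputs assumed for this sketch).
    Requires nu0 + len(values) - 2 > 0.
    """
    n = len(values)
    s_sq = variance(values)  # empirical sample variance (n - 1 denominator)
    return (nu0 * sigma0_sq + (n - 1) * s_sq) / (nu0 + n - 2)


def regularized_t(control, treated, sigma0_sq_control, sigma0_sq_treated, nu0):
    """t-statistic computed with regularized variances (control minus treated)."""
    var_c = regularized_variance(control, sigma0_sq_control, nu0)
    var_t = regularized_variance(treated, sigma0_sq_treated, nu0)
    standard_error = math.sqrt(var_c / len(control) + var_t / len(treated))
    return (mean(control) - mean(treated)) / standard_error
```

With identical replicates the sample variance is zero and an ordinary t-statistic is undefined, while the regularized version stays finite because the background variance keeps the denominator positive.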
Bayesian regularized t-test
The main goal of this approach is to stabilize the variance estimates that arise when the sample size is small, making the t-test results more robust.
Bayesian regularized t-test
The regularized t-test makes the presence of significant differential expression more evident.
BH correction
• BH (Benjamini–Hochberg) is the most widely used method for correcting type I errors in microarray analysis.
• However, it has some limitations due to its initial assumptions:
– Gene expressions are independent of each other.
– The raw distribution of p-values should be uniform in the non-significant range.
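For reference, the BH step-up adjustment itself is short. The sketch below is a plain-Python illustration of the standard procedure (adjusted p-value = p × m / rank, made monotone from the largest p-value downwards); the function name is chosen here and is not part of any package mentioned in these slides.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

A gene is then called differentially expressed when its adjusted p-value falls below the chosen FDR threshold.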
Applying BH correction to these p-values will not yield any differentially expressed gene!
Let's identify differentially expressed probe sets by linear modelling.
To use linear models, Compute Linear Model Fit reorganizes the targets description and raw data according to the number of factors under analysis.
The next step is the definition of the contrasts, which represent the pairs of conditions to be compared for differential expression.
If more than two conditions are available, more contrasts can be evaluated.
The contrast parameterization is saved with a specific name.
REMEMBER: contrasts are built from the experimental groups (e.g. Treated, Control). Making Treated – Control means that the log2(expression) of the control samples is subtracted from that of the treated samples. The result is the log2(fold change).
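As a small worked example of the Treated – Control contrast, the sketch below subtracts the mean log2 expression of the control samples from that of the treated samples. The numbers are invented for illustration and the function name is hypothetical.

```python
from statistics import mean


def log2_fold_change(treated_log2, control_log2):
    """Treated - Control contrast on log2-scale expression values."""
    return mean(treated_log2) - mean(control_log2)


# Invented log2 expression values for one probe set.
log2_fc = log2_fold_change([8.0, 9.0], [7.0, 8.0])
fold_change = 2 ** log2_fc  # back-transform to a linear fold change
```

Here the contrast is 1.0 on the log2 scale, which corresponds to a two-fold increase in the treated samples.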
Before evaluating differential expression, the raw p-value distribution is checked.
If BH correction can be applied to correct type I errors, we can move on to selecting the subset of differentially expressed genes.
These results can be saved in a new topTable containing only the probe sets shown in red on the plots.
TopTable structure
• AffyID
• Gene Symbol
• Gene Description
• Log2 FC
• Average intensity
• T statistics
• P-values
• Log-odds statistics
Exercise 10 (30 minutes)
• Go to the folder estrogen.IGF1.
• Create, with Excel, a tab-delimited file named targets.txt:
– The targets file is made of three columns with the following header:
• Name
• FileName
• Target
– In column Name place a brief name (e.g. c1, c2, etc.)
– In column FileName place the name of the corresponding .CEL file
– In column Target place the experimental conditions (e.g. control, treatment, etc.)
• Create a target only for MCF7 and Sker-3 with/without estrogen (E2) treatment.
• Calculate probe set summaries with RMA
See next page
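The targets file can also be produced without Excel. The sketch below writes the three-column, tab-delimited layout described above into an in-memory buffer; the .CEL file names and the Target labels are hypothetical placeholders, not the real files in the exercise folder.

```python
import csv
import io

# Hypothetical rows: (Name, FileName, Target). Replace with the real .CEL names.
rows = [
    ("c1", "example_mcf7_ctrl.CEL", "mcf7.ctrl"),
    ("c2", "example_mcf7_e2.CEL", "mcf7.e2"),
    ("c3", "example_sker3_ctrl.CEL", "sker3.ctrl"),
    ("c4", "example_sker3_e2.CEL", "sker3.e2"),
]


def write_targets(rows, handle):
    """Write a tab-delimited targets table with the expected header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Name", "FileName", "Target"])
    writer.writerows(rows)


buffer = io.StringIO()
write_targets(rows, buffer)
targets_text = buffer.getvalue()
```

To create the actual file, the same function can be called with a handle opened on targets.txt (for example `open("targets.txt", "w", newline="")`).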
Exercise 10 (30 minutes)
• In this experiment we have a breast cancer tumor cell line (MCF7) and a tumor cell line derived from the central nervous system (SKER3).
• Question:
– Which probe sets are controlled by E2 in a tissue-independent manner?
See next page
Exercise 10
• Filter the data:
– IQR 0.25, intensity 25% >100
• Calculate the models for E2 versus untreated cells in both mcf7 and sker3.
• Contrasts:
mcf7.e2 – mcf7.ctrl
sker3.e2 – sker3.ctrl
See next page
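The two filters can be pictured with the sketch below. It rests on assumptions not stated in the slides: that "IQR 0.25" keeps probe sets whose interquartile range of log2 expression is at least 0.25, and that "intensity 25% >100" keeps probe sets with an intensity above 100 in at least 25% of the arrays. The function names are hypothetical.

```python
import math
from statistics import quantiles


def iqr(values):
    """Interquartile range using linear interpolation between data points."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def passes_filters(log2_values, iqr_min=0.25, fraction=0.25, intensity_min=100.0):
    """Apply the assumed IQR and intensity filters to one probe set."""
    log2_cutoff = math.log2(intensity_min)  # intensity threshold on the log2 scale
    n_above = sum(1 for value in log2_values if value > log2_cutoff)
    enough_signal = n_above / len(log2_values) >= fraction
    return iqr(log2_values) >= iqr_min and enough_signal
```

A probe set that barely changes across arrays, or that is never above background, is removed before the linear model is fitted, which reduces the number of tests to correct for.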
Exercise 10
• Evaluate whether the raw p-value distributions are suitable for BH correction.
• Question:
– Is the raw p-value distribution suitable for performing BH correction?
• YES NO
See next page
Exercise 10
• Use the “Table of Genes Ranked in order of Differential Expression”.
• Plot differentially expressed genes with raw p-value ≤ 0.05 and an absolute fold change ≥ 1 for the two contrasts.
• Save the subsets of the topTables in ex10.mcf7.xls, ex10.sker3.xls
• Save the project as ex10.lma
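The selection step of the exercise amounts to a two-condition filter on the topTable. In the sketch below the table is a list of dictionaries whose keys follow limma's topTable column names (`logFC`, `P.Value`); the probe set identifiers and values are invented, and reading "absolute fold change ≥ 1" as |log2 FC| ≥ 1 is an assumption.

```python
def select_probe_sets(top_table, p_max=0.05, abs_log_fc_min=1.0):
    """Keep rows with raw p-value <= p_max and |log2 FC| >= abs_log_fc_min."""
    return [
        row
        for row in top_table
        if row["P.Value"] <= p_max and abs(row["logFC"]) >= abs_log_fc_min
    ]


# Invented example rows.
example_table = [
    {"ID": "probe_a", "logFC": 1.8, "P.Value": 0.001},
    {"ID": "probe_b", "logFC": -1.2, "P.Value": 0.04},
    {"ID": "probe_c", "logFC": 0.4, "P.Value": 0.002},
    {"ID": "probe_d", "logFC": 2.5, "P.Value": 0.20},
]
selected_ids = [row["ID"] for row in select_probe_sets(example_table)]
```

Only probe_a and probe_b satisfy both thresholds; probe_c changes too little and probe_d is not significant.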
Lists of differentially expressed probe sets generated by affylmGUI or SAM can be compared using Venn diagrams.
A maximum of three files can be compared. Attention: each file consists of a single column of probe set IDs without a header. The comparison can be performed at the probe set or Entrez Gene (EG) level.
The various list subsets will be saved in your working directory.
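The subsets behind a Venn diagram are plain set operations. The sketch below groups identifiers by the exact combination of lists they appear in; the three-list limit mirrors the one stated above, while the function name and example identifiers are hypothetical.

```python
def venn_regions(lists):
    """Map each combination of list names to the IDs found in exactly those lists.

    lists: dict of list name -> iterable of identifiers (at most three lists).
    """
    if len(lists) > 3:
        raise ValueError("at most three lists can be compared")
    sets = {name: set(ids) for name, ids in lists.items()}
    regions = {}
    for identifier in set().union(*sets.values()):
        members = tuple(sorted(name for name, ids in sets.items() if identifier in ids))
        regions.setdefault(members, set()).add(identifier)
    return regions


regions = venn_regions({
    "mcf7": ["p1", "p2", "p3"],
    "sker3": ["p2", "p3", "p4"],
})
```

In this example the region keyed by `("mcf7", "sker3")` holds the identifiers common to both lists, which is the subset Exercise 11 asks for.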
Exercise 11 (15 minutes)
• Using "Venn Diagram between probe set lists", evaluate the level of overlap between the Entrez Genes differentially expressed upon E2 treatment in MCF7 and in SKER3.
• Filter the expression data by the genes in common between the two conditions and export the Normalized Expression Values (ex10.common.txt).
Time Course experiments
• maSigPro is an R package for the analysis of single and multiseries time course microarray experiments.
• maSigPro follows a two-step regression strategy to find genes with
– significant temporal expression changes
– significant differences between experimental groups.
• Time course experimental design:
– We denote as experimental groups the experimental factor (dummy variables) for which temporal profiles are defined (e.g. ”Treatment A”, ”Tissue1”, etc.)
– Conditions are each experimental group vs. time combination (e.g. ”Treatment A at Time 0”). Conditions may or may not have replicates.
– Variables are the regression variables defined by the maSigPro approach for the experiment regression model.
– maSigPro defines dummy variables to model differences between experimental groups.
– Dummy variables, Time and their interactions are the variables of the regression model.
Time Course design for maSigPro
All this information should be collapsed into the Target column of the targets file, using _ to combine the data. This can be done using the function JOIN in Excel.
IMPORTANT: each treatment at each time has its corresponding untreated control!
The targets file for maSigPro has a peculiar structure: each row of the column named Target describes the array on the basis of the experimental design.
Each element describing the time course experiment is separated from the others by an underscore.
The first three elements of the row are fixed and represent Time, Replicate and Control; all the other elements refer to the various experimental conditions.
In this case we have an 8, 24, 48 h time course, in triplicate, with two different treatments: cond1 and cond2.
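To make the underscore-separated layout concrete, the sketch below splits one Target entry into its fixed fields (Time, Replicate, Control) and the remaining condition fields. The example string and the way each value is encoded are assumptions made for illustration; the slides do not show an actual entry, and this is not the command used by oneChannelGUI.

```python
def parse_target(target):
    """Split an underscore-separated Target entry into named fields."""
    fields = target.split("_")
    if len(fields) < 4:
        raise ValueError("expected Time, Replicate, Control and at least one condition")
    time, replicate, control, *conditions = fields
    return {
        "Time": time,
        "Replicate": replicate,
        "Control": control,
        "Conditions": conditions,
    }


# Hypothetical entry: time 24 h, replicate 2, then one flag per remaining field.
parsed = parse_target("24_2_0_1_0")
```

All fields are kept as strings so that no particular encoding is imposed on the values.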
The Target column is reformatted to be used by maSigPro using the command
Large data set
• The oneChannelGUI interface has some limits (RAM) in loading/handling large sets of .CEL files.
• This is especially true for a large time course experiment like our example.
• To overcome this problem, probe set average expression intensities are calculated by Expression Console.
When a tab-delimited file is loaded, the Bioconductor annotation library is not automatically defined.
Annotation Library information can be attached using:
Do not forget!
• The multiple testing problem is also present in maSigPro analysis.
• Therefore, before running maSigPro, remember to filter the data on the basis of functional information or sample distribution.
Once the experimental design for maSigPro is ready, it is possible to run the analysis.
When maSigPro is running, check what is going on in the main R window!
Some parameters need to be set
Q: The first step is to compute a regression fit for each gene. The p-value associated with the F-statistic of the model is computed and subsequently used to select significant genes. maSigPro corrects this p-value for multiple comparisons by applying false discovery rate (FDR) procedures. The level of FDR control is given by the function parameter Q.
Alpha: maSigPro applies a variable selection procedure to find significant variables for each gene. This will ultimately be used to find the profile differences between experimental groups. At each regression step the p-value of each variable is computed, and variables enter or leave the model when this p-value is lower or higher than the given cut-off value alpha.
R-squared: The next step is to generate lists of significant genes according to the way we want to see the results. As a filter, maSigPro uses the R-squared of the regression model.
What is the R-squared coefficient?
• r.squared: the "fraction of variance explained by a linear model"
$$R^2 = 1 - \frac{\sum_i R_i^2}{\sum_i (y_i - y^*)^2}$$
where $R_i$ is the i-th residual and $y^*$ is the mean of $y_i$ if there is an intercept and zero otherwise.
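The definition translates directly into code. The sketch below computes R-squared from observed and fitted values, taking the residuals as observed minus fitted; the function name and the `intercept` flag are choices made for this example.

```python
def r_squared(y, fitted, intercept=True):
    """R^2 = 1 - sum(residual^2) / sum((y - y_star)^2)."""
    # y_star is the mean of y when the model has an intercept, zero otherwise.
    y_star = sum(y) / len(y) if intercept else 0.0
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    total_ss = sum((yi - y_star) ** 2 for yi in y)
    return 1.0 - residual_ss / total_ss
```

A perfect fit gives 1 because the residual sum of squares is zero, and a fit that only predicts the mean gives 0 because the two sums of squares coincide.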
R-squared graphical view
[Figure: two Y versus X panels, one showing $\sum_i R_i^2$ and one showing $\sum_i (y_i - y^*)^2$]
• General case: $R^2 = 1 - \sum_i R_i^2 / \sum_i (y_i - y^*)^2$
• Perfect fit, $\sum_i R_i^2 = 0$: $R^2 = 1 - 0 / \sum_i (y_i - y^*)^2 = 1$
• No variance explained, $\sum_i R_i^2 = \sum_i (y_i - y^*)^2$: $R^2 = 1 - \sum_i R_i^2 / \sum_i (y_i - y^*)^2 = 0$
Computation info is available in the main R window.
Step 1
The procedure first fits this global model by the least-squares technique to identify differentially expressed genes, and selects significant genes by applying false discovery rate control procedures.
Step 2
Secondly, stepwise regression is applied as a variable selection strategy to study differences between experimental groups and to find statistically significant different profiles.
When the computation is finished a message pops up.
The coefficients obtained in the second regression model are useful for clustering together significant genes with similar expression patterns and for visualizing the results.
Results can be visualized as Venn diagrams or by plotting the curves in a PDF file. K-means clustering is not yet implemented.
The plots relate only to the subset of genes specific to each treatment condition.
Exercise 12 (30 minutes)
• This experiment was done with HGU133A.
– This is a cell line experiment made of three time points: 8, 24, 48 h.
– Each point is made of three biological replicates.
– Two different chemotherapeutic agents have been used (Treatment 1 and 2).
– Since these data have not yet been published, the probe set IDs have been scrambled.
• In the time.course directory there are two files:
– An expression file derived from Expression Console
– A tab-delimited file describing the experimental conditions.
• Use this information to load the data, filter them by IQR (threshold of your choice), run maSigPro (e.g. Q=0.05, α=0.05, R=0.8) and view the results it generates.
Analysis pipeline
• Quality control
• Normalization
• Filtering
• Statistical analysis
• Annotation
• Biological knowledge extraction
Annotation
• An important issue in microarray data analysis is the specific association of probe identifiers with genome-annotated transcripts.
• A critical point in annotation is the way in which the association between probes and genes is produced.
Annotation in Affymetrix
• NetAffx: Affymetrix annotation repository
• Bioconductor:
– uses a specific annotation library, AnnBuilder, to create annotation libraries starting from the association between probe set identifier and GenBank accession number (i.e. the primary target for probe design).
• RESOURCERER (Tsai et al. 2001):
– the annotation tool at the TIGR center uses EST and gene sequences stored in the TGI databases (www.tigr.org/tdb/tgi.shtml).
– They provide an analysis of publicly available EST and gene sequence data for the identification of transcripts and their placement in a genomic context, and the identification of orthologs and paralogs wherever possible.
• Neither the Bioconductor nor the TIGR method operates at the probe level, nor do they consider the limited reliability of some sets due to probe cross-hybridization or erroneous probe/transcript annotation.
• Ensembl:
– Annotation with the Ensembl tool is built by direct matching of Affymetrix probes against the Ensembl sequence database.
– Its weak point is that only 50% of the probes of a specific set need to match an Ensembl gene for a "probe set identifier"/"Ensembl gene identifier" association to be considered true.
Gene Ontology
Ontologies
• An ontology is a specification of a conceptualization:
– a hierarchical mapping of concepts within a given frame of reference.
• An ontology is a restricted structured vocabulary of terms that represent domain knowledge.
• An ontology specifies a vocabulary that can be used to exchange queries and assertions.
• A commitment to the use of the ontology is an agreement to use the shared vocabulary in a consistent way.
The Gene Ontology
• The goal of the Gene Ontology (GO) Consortium is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.
– http://www.geneontology.org/
• For genes and gene products, the Gene Ontology (GO) Consortium is an initiative designed to address the problem of defining a common set of terms and descriptions for basic biological functions.
• GO provides a restricted vocabulary as well as clear indications of the relationships between terms.
The Gene Ontology
• The Gene Ontology (GO) consortium produces three independent ontologies for gene products.
• The three ontologies are:
– molecular function of a gene product, which is defined to be the biochemical activity or action of the gene product (MF 7220).
– biological process, interpreted as a biological objective to which the gene product contributes (BP 9529).
– cellular component, a component of a cell that is part of some larger object or structure (CC 1536).
The Graph Structure of GO
• The GO ontologies are structured as directed acyclic graphs (DAGs) that represent a network in which each term may be a child of one or more parents.
• GO node is interchangeable with GO term.
• Child terms are more specific than their parents:
– The term “transmembrane receptor protein-tyrosine kinase” is a child of both “transmembrane receptor” and “protein tyrosine kinase”.
The Graph Structure of GO
• The relationship between a child and a parent can be characterized by the relations:
– is a
– has a (part of)
• “mitotic chromosome” is a child of “chromosome” and the relationship is an is a relation.
• “telomere” is a child of “chromosome” with the has a relation.
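A DAG of this kind can be stored as a mapping from each term to its parents. The sketch below uses only the relationships named on these two slides and walks upward to collect every ancestor of a term. The relation label for the kinase example is not given in the slides, so it is marked as unspecified; the variable and function names are hypothetical.

```python
# Child term -> list of (parent term, relation) pairs taken from the examples above.
PARENTS = {
    "transmembrane receptor protein-tyrosine kinase": [
        ("transmembrane receptor", "unspecified"),
        ("protein tyrosine kinase", "unspecified"),
    ],
    "mitotic chromosome": [("chromosome", "is a")],
    "telomere": [("chromosome", "part of")],
}


def ancestors(term, parents=PARENTS):
    """Collect all terms reachable by following parent links upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent, _relation in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found
```

Because a term may have several parents, a gene annotated to a child term is implicitly annotated to every ancestor, which is why enrichment results for related terms tend to overlap.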
[Figure: graph of GO relationships for the term transcription factor (GO:0003700), with the top node marked]
GO structure
[Figure: induced GO graph for a set of differentially expressed genes, with the top node marked]
GO can be used to link differentially expressed genes to specific functional classes.
[Figure: the induced GO graph colored according to unadjusted hypergeometric p-value, 0.01]
Consider a population of genes representing a diverse set of GO terms, shown below as different colors.
Many methods can be used to identify a set of differentially expressed genes.
What are some of the predominant GO terms represented in the set of differentially expressed genes, and how should significance be assigned to a discovered GO term?
Example:
Population size: 40 genes
Subset of differentially expressed genes: 12 genes
10 genes, shown in light blue, have a common GO term and 8 occur within the set of differentially expressed genes.
Contingency Matrix
A 2x2 contingency matrix is typically used to capture the relationship between membership in the differentially expressed subset and membership in a GO term.

                 Subset: in   Subset: out
GO term: in           8            2
GO term: out          4           26
Hypergeometric Distribution
Contingency matrix with marginal totals:

         a      b    |  a+b
         c      d    |  c+d
        a+c    b+d   |   n

$$p = \frac{\binom{a+c}{a}\binom{b+d}{b}}{\binom{n}{a+b}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}, \qquad n = a+b+c+d$$

The probability of any particular matrix occurring by random selection, given no association between the two variables, is given by the hypergeometric rule.
Assigning Significance to the Findings
The hypergeometric test permits us to determine if there are non-random associations between the two variables: differential expression membership and membership to a particular Gene Ontology term.

                 Subset: in   Subset: out
GO term: in           8            2
GO term: out          4           26

p ≈ .0002
( 2x2 contingency matrix )
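The p-value for this example can be checked numerically. The sketch below computes the hypergeometric probability of a single 2x2 table and then sums it over all tables at least as extreme as the observed one, with the margins held fixed. Treating the test as a one-sided test for over-representation is an assumption; the function names are chosen for this illustration.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)


def enrichment_p_value(a, b, c, d):
    """Probability of observing at least `a` GO-term genes in the subset."""
    term_total = a + b      # genes annotated to the GO term
    subset_total = a + c    # differentially expressed genes
    n = a + b + c + d
    upper = min(term_total, subset_total)
    return sum(
        table_probability(k, term_total - k, subset_total - k,
                          n - term_total - subset_total + k)
        for k in range(a, upper + 1)
    )


# Observed table: 8 in both, 2 in the term only, 4 in the subset only, 26 in neither.
p_value = enrichment_p_value(8, 2, 4, 26)
```

For the table above the sum comes to about 0.00023, in line with the p ≈ .0002 reported on the slide.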
EASE (Expression Analysis Systematic Explorer)
• EASE analysis identifies prevalent biological themes within gene clusters.
• The highest-ranking themes derived by a computational method can recapitulate manually derived themes in previously published microarray, proteomics and SAGE results, and provide evidence that these themes are stable to varying methods of gene selection.
Hosack et al. Genome Biol., 4:R70-R70.8, 2003.
• Consider all of the results
EASE reports all themes represented in a cluster, and although some themes may not meet statistical significance, it may still be important to note that particular biological roles or pathways are represented in the cluster.
• Independently verify roles
Once found, biological themes should be independently verified using annotation resources.
EASE Results
GOstats packageGOstats package
• To perform an analysis using the To perform an analysis using the Hypergeometric-based test, one needs to define Hypergeometric-based test, one needs to define a a gene universegene universe and a list of and a list of selected genesselected genes from the universe.from the universe.
• To identify the set of expressed genes from a To identify the set of expressed genes from a microarray experiment, R. Gentleman (GOstats microarray experiment, R. Gentleman (GOstats developer) proposed that a non-specific filter be developer) proposed that a non-specific filter be applied and that the genes that pass the filter be applied and that the genes that pass the filter be used to form the universe for any subsequent used to form the universe for any subsequent functional analyses.functional analyses.
Bioconductor provides a library called GOstats, which allows the calculation of enriched GO classes within a set of differentially expressed probe sets.
Select the threshold of significance and the GO class of interest.
Select the list of affyIDs representing the differentially expressed probe sets. REMEMBER: the file should contain only the affy IDs!
If the names of the GO classes are too small in the plot, save it as a PDF and view it with Acrobat Reader, zooming in on the figure.
The reason for this representation is the selection of the GO terms that contain smaller subsets.
[Figure: results table with the columns GO identifier; description of GO term; significance; number of genes belonging to the GO term in the universe; number of genes in the differentially expressed set]
To learn more about the parents of a specific GO term you can use the plotGO function.
It is possible to identify the affy IDs associated with a specific GO term.
Exercise 13 (20 minutes)
• Using the GOenrichment function, check if there is any overlap between the BP GO classes found enriched (p-value 0.01) using the sets of probe sets found differentially expressed upon E2 treatment in MCF7 or SKER3.
• Question:
– Which BP or MF GO terms are in common between the two sets of differentially expressed probe sets? See next page
Exercise 13 (10 minutes)
• Using plotGO, see which are the parents of the GO term(s) in common between the probe sets differentially expressed in MCF7 and those in SKER3 upon E2 treatment.
• Using the extractAffyids function, check the number of probe sets derived by limma differential expression that are also present in the common GO terms.
• Question:
– Are the probe sets belonging to the common GO terms the same in the two differential expression analyses?
Clustering
Is there an ideal clustering procedure?
• No!
– Each clustering algorithm has its ideal data structure.
• Since we do not know the data structure:
• Various clustering methods have to be applied in order to identify the one that best fits the data under analysis.
N.B. TMeV 4.0 (www.tigr.org) was used for this presentation.
Supervised versus unsupervised clustering
• Supervised clustering tries to find the best partition for data that belong to a known set of classes.
• Unsupervised clustering tries to define the number and the size of the classes into which the transcription profiles can be fitted.
The Expression Matrix is a representation of data from multiple microarray experiments.
Rows (N): probe sets. Columns (D): experiments.
X11 X12 X13 … X1d (L)
X21 X22 X23 … X2d (L)
…
Xn1 Xn2 Xn3 … Xnd (L)
Each element is a log ratio (+, 0 or −).
Up modulation is usually represented as RED and down modulation as GREEN.
Large data sets can be loaded as tab delimited files.
To load them you need:
1) a tab delimited file with array names on the first row and probe set IDs in the first column;
2) a target file containing the clinical information. The usual Target column of the target file should have these characteristics.
It can be generated by joining the columns of the clinical parameters with an underscore “_”.
Join function in Excel
Loading data as tab delimited file
Select tab delimited files as format description.
Export expression data as tab delimited files.
Select the first numerical value and load the data.
Expression Vectors
• Gene Expression Vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.
[Figure: example expression vector over conditions 1 to 8; values shown: −0.8, 0.8, 1.5, 1.8, 0.5, −1.3, −0.4, 1.5; y-axis: log2(time_t/time_0), from −2 to 2]
Data reformatting
• Clustering can be performed using a virtual array as reference:
– A virtual array can be calculated by averaging gene expression over the experimental conditions.
• Clustering can be performed by building virtual two-dye experiments:
$\log_2(T/C)$ or $\log_2(T_i/C_j)$, where i=1…I, j=1…J
• Clustering can also be performed without the use of a common reference by:
– Gene centering: $Z_i = (X_i - \mu_{row}) / \sigma_{row}$
– Experiment centering: $Z_i = (X_i - \mu_{col}) / \sigma_{col}$
Data reformatting
Gene centering: $Z_i = (X_i - \mu_{row}) / \sigma_{row}$
Array centering: $Z_i = (X_i - \mu_{col}) / \sigma_{col}$
Centering at the gene level removes the scaling differences!
Various data reformatting options are available.
We will mainly use gene/row adjustment.
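Gene (row) and array (column) adjustment can be sketched in a few lines of Python. This is only an illustration and not how TMeV implements it: the function names are hypothetical, and it assumes the adjustment subtracts the row (or column) mean and, optionally, divides by the row standard deviation.

```python
from statistics import mean, stdev

def center_rows(matrix):
    """Subtract each gene's (row's) mean from its values."""
    return [[x - mean(row) for x in row] for row in matrix]

def standardize_rows(matrix):
    """Mean-center each row and divide by its standard deviation.

    Assumes every row has at least two values and is not constant.
    """
    result = []
    for row in matrix:
        m, s = mean(row), stdev(row)
        result.append([(x - m) / s for x in row])
    return result

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def center_columns(matrix):
    """Subtract each array's (column's) mean from its values."""
    return transpose(center_rows(transpose(matrix)))

# Two hypothetical genes with the same profile shape but a tenfold scale difference.
expression = [[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]]
centered = center_rows(expression)
standardized = standardize_rows(expression)
print(centered)
print(standardized)
```

After mean-centering the two rows still differ in amplitude; after dividing by the row standard deviation they coincide, which is the sense in which row adjustment removes scaling differences between genes.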
Distance and Similarity
• The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms.
• Distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression.
• Selection of a distance metric defines the concept of distance.
x = (5,5)
y = (9,8)
Euclidean distance: $d(x,y) = \sqrt{4^2 + 3^2} = 5$
Manhattan distance: $d(x,y) = 4 + 3 = 7$
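The two distances worked out for x = (5,5) and y = (9,8) can be checked with a short Python sketch. The function names are hypothetical; the Pearson-based distance is written here as 1 − r, which is one common convention and an assumption about how a tool may define it.

```python
from math import sqrt

def euclidean(x, y):
    """Straight-line distance between two expression vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson_distance(x, y):
    """1 - Pearson correlation: small when two profiles have the same shape.

    Assumes neither vector is constant (otherwise the correlation is undefined).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - sxy / sqrt(sxx * syy)

x, y = (5, 5), (9, 8)
print(euclidean(x, y))   # 5.0
print(manhattan(x, y))   # 7
```

Two profiles with the same shape but different amplitude, such as (1, 2, 3) and (2, 4, 6), are far apart in Euclidean terms but have a Pearson distance of about zero, which is why the choice of metric changes the clustering.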
Distance is Defined by a Metric
[Figure: pairs of expression profiles compared under the Euclidean and Pearson distance metrics; distance values (D) shown: 4.2, 1.4, 1.00, 0.90; y-axis: log2(time_t/time_0), from −2 to 2]
Many distance metrics are available. If no selection is made, the default metric for each type of clustering approach will be used.
Hierarchical Clustering (HCL)
• HCL is an agglomerative/divisive clustering method.
• The iterative process continues until all groups are connected in a hierarchical tree.
Hierarchical Clustering (agglomerative)
[Figure: the leaves g1 to g8 are reordered step by step as the hierarchical tree is built]
– Start: g1 g2 g3 g4 g5 g6 g7 g8
– g1 is most like g8: g1 g8 g2 g3 g4 g5 g6 g7
– g4 is most like {g1, g8}: g1 g8 g4 g2 g3 g5 g6 g7
– g5 is most like g7: g1 g8 g4 g2 g3 g5 g7 g6
– {g5, g7} is most like {g1, g4, g8}: g1 g8 g4 g5 g7 g2 g3 g6
Hierarchical Tree
Hierarchical Clustering
• During construction of the hierarchy, decisions must be made to determine which clusters should be joined.
• The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods.
Agglomerative Linkage Methods
• Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.
• Three linkage methods that are commonly used are:
– Single Linkage
– Average Linkage
– Complete Linkage
Single Linkage
• Cluster-to-cluster distance (D_AB) is defined as the minimum distance between members of one cluster and members of another cluster.
• Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.
Average Linkage
• Cluster-to-cluster distance (D_AB) is defined as the average distance between all members of one cluster and all members of another cluster.
• Average linkage has a slight tendency to produce clusters of similar variance.
Complete Linkage
• Cluster-to-cluster distance (D_AB) is defined as the maximum distance between members of one cluster and members of another cluster.
• Complete linkage tends to create clusters of similar size and variability.
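The three linkage rules, and the agglomerative loop that uses them, can be sketched in Python as follows. This is a naive illustration under stated assumptions (Euclidean distance, hypothetical function names, toy points), not the TMeV implementation, and it recomputes all cluster distances at every step.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def cluster_distance(cluster_a, cluster_b, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    pairwise = [dist(p, q) for p in cluster_a for q in cluster_b]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, linkage="average"):
    """Repeatedly join the two closest clusters; return the merge history."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j], linkage), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical toy clusters A and B.
a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (3.0, 1.0)]
for rule in ("single", "average", "complete"):
    print(rule, cluster_distance(a, b, rule))

history = agglomerate([(0.0, 0.0), (0.0, 1.0), (5.0, 0.0)])
```

The merge history is what a dendrogram draws: each entry records which two clusters were joined and at what distance.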
HCL
• A clustering result can be represented by many different graphical views.
[Figure: the same tree over elements 1 to 4 drawn with different leaf arrangements]
HCL
• HCL does not converge to a unique result, and each run represents one of the possible solutions.
• To obtain information on cluster stability a resampling method should be applied:
– Bootstrapping: resampling with replacement
– Jackknifing: resampling without replacement
To perform HCL click on the HCL icon
To see results click on
Visualization can be reformatted
Bootstrapping (ST)
Bootstrapping – resampling with replacement
Original expression matrix: Gene 1 to Gene 6 (rows) × Exp 1 to Exp 6 (columns)
Various bootstrapped matrices (by experiments):
[Figure: two bootstrapped matrices with the same genes (Gene 1 to Gene 6) whose columns are experiments drawn with replacement, so that some experiments (e.g. Exp 4) appear more than once and others are absent]
Jackknifing (ST)
Jackknifing – resampling without replacement
Original expression matrix: Gene 1 to Gene 6 (rows) × Exp 1 to Exp 6 (columns)
Various jackknifed matrices (by experiments), each with Gene 1 to Gene 6 as rows:
– Exp 1, Exp 3, Exp 4, Exp 5, Exp 6 (Exp 2 left out)
– Exp 1, Exp 2, Exp 3, Exp 4, Exp 6 (Exp 5 left out)
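The two resampling schemes can be sketched in Python on a small expression matrix, resampling by experiments (columns). The function names and the toy matrix are hypothetical, and the jackknife is written as leave-one-experiment-out, which matches the matrices shown above but is an assumption about the general procedure.

```python
import random

def bootstrap_columns(matrix, rng):
    """Draw as many experiments as the matrix has, WITH replacement."""
    n_experiments = len(matrix[0])
    picks = [rng.randrange(n_experiments) for _ in range(n_experiments)]
    return picks, [[row[j] for j in picks] for row in matrix]

def jackknife_columns(matrix, left_out):
    """Keep every experiment except one (resampling WITHOUT replacement)."""
    keep = [j for j in range(len(matrix[0])) if j != left_out]
    return keep, [[row[j] for j in keep] for row in matrix]

# Hypothetical 2 genes x 6 experiments matrix.
expression = [[0.1, 0.5, -0.3, 1.2, 0.0, -0.8],
              [1.0, 0.9, 1.1, -0.2, -0.1, 0.3]]

rng = random.Random(42)  # fixed seed so the run is repeatable
picks, boot = bootstrap_columns(expression, rng)
keep, jack = jackknife_columns(expression, left_out=1)
print("bootstrap columns:", picks)
print("jackknife columns:", keep)
```

Cluster stability is then assessed by re-running the clustering on many such resampled matrices (for example 1000 bootstraps) and counting how often each node of the original tree is recovered.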
Settings: 1000 bootstraps, Euclidean distance, average clustering
To run HCL with resampling
To see results click on
A subset of genes can be selected by clicking on the node of interest.
By placing the mouse over the node and clicking the right mouse button, various information about the group of genes can be saved.
Principal component analysis
• Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.
• The first principal component accounts for as much of the variability in the data as possible.
• Each succeeding component accounts for as much of the remaining variability as possible.
• The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
In general the first three components account for most of the variability. Therefore, PCA results can reasonably be represented in a 3D space.
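How the first principal component captures as much variability as possible can be illustrated with a small Python sketch. It uses power iteration on the covariance matrix, which is a stand-in chosen to keep the example dependency-free, not the method used by TMeV; the function names and toy data are hypothetical.

```python
from math import sqrt

def covariance_matrix(rows):
    """Sample covariance between variables (columns) of the data rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def first_principal_component(rows, iterations=200):
    """Return (direction, fraction of total variance explained).

    Assumes the data are not constant and that the all-ones start vector is
    not orthogonal to the leading direction (true for the toy data below).
    """
    cov = covariance_matrix(rows)
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    explained = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                    for a in range(d))
    total = sum(cov[a][a] for a in range(d))
    return v, explained / total

# Hypothetical samples measured on two perfectly correlated variables.
samples = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
direction, fraction = first_principal_component(samples)
print(direction, fraction)
```

With two perfectly correlated variables a single component along the diagonal carries all of the variability; with real expression data the fraction explained by each component is what tells you whether a 2D or 3D view is adequate.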
The 2nd PC will be orthogonal to the 1st.
[Figure: PCA plot with axes 1 and 2 separating the samples into Cluster 1 and Cluster 2; sample labels: MCF7, SKER-3, E2, IGF]
The parameters that produce the partition in PCA depend on the correlated variables.
Quaglino et al. J. Clin. Invest. 2004
We have already used PCA for quality control
To see results click on
Click the right mouse button over the 3D/2D view
Cluster Affinity Search Technique (CAST)
• CAST uses an iterative approach to segregate elements with ‘high affinity’ into a cluster.
• The process iterates through two phases:
– addition of high affinity elements to the cluster being created
– removal or clean-up of low affinity elements from the cluster being created
Clustering Affinity Search Technique (CAST) – 1
Affinity = a measure of similarity between a gene and all the genes in a cluster.
Threshold affinity = user-specified criterion for retaining a gene in a cluster, defined as a percentage of the maximum affinity at that point.
1. Create a new empty cluster C1.
2. Set the initial affinity of all genes to zero.
3. Move the two most similar genes into the new cluster.
4. Update the affinities of all the genes (new affinity of a gene = its previous affinity + its similarity to the gene(s) newly added to the cluster C1).
ADD GENES:
5. While there exists an unassigned gene whose affinity to the cluster C1 exceeds the user-specified threshold affinity, pick the unassigned gene whose affinity is the highest, and add it to cluster C1. Update the affinities of all the genes accordingly.
[Figure: empty cluster C1 and the unassigned genes G1 to G15]
CAST – 2
REMOVE GENES:
6. When there are no more unassigned high-affinity genes, check to see if cluster C1 contains any elements whose affinity is lower than the current threshold. If so, remove the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from each gene’s affinity its similarity to the removed gene.
7. Repeat step 6 while C1 contains a low-affinity gene.
8. Repeat steps 5-7 as long as changes occur to the cluster C1.
9. Form a new cluster with the genes that were not assigned to cluster C1, repeating steps 1-8.
10. Keep forming new clusters following steps 1-9, until all genes have been assigned to a cluster.
[Figure: current cluster C1 and the remaining unassigned genes (G1 to G15)]
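The add/remove cycle of CAST can be sketched in Python as follows. This is a simplified illustration, not the TMeV code. Assumptions: similarities are given as a symmetric matrix with 1.0 on the diagonal, a gene counts as high affinity when its affinity is at least threshold × (current cluster size), the cluster is never emptied, and a cap on the number of rounds guards against oscillation.

```python
def cast(sim, threshold, max_rounds=100):
    """Cluster genes 0..n-1 from a similarity matrix with a CAST-like procedure."""
    unassigned = set(range(len(sim)))
    clusters = []
    while unassigned:
        if len(unassigned) == 1:
            clusters.append(sorted(unassigned))
            break
        # Seed the new cluster with the two most similar unassigned genes.
        _, a, b = max((sim[i][j], i, j)
                      for i in unassigned for j in unassigned if i < j)
        cluster = {a, b}
        # Affinity of a gene = sum of its similarities to the cluster members.
        affinity = {g: sim[g][a] + sim[g][b] for g in unassigned}
        for _ in range(max_rounds):
            changed = False
            # ADD: bring in the highest-affinity outside gene while it qualifies.
            while True:
                outside = [g for g in unassigned if g not in cluster]
                if not outside:
                    break
                best = max(outside, key=lambda g: affinity[g])
                if affinity[best] < threshold * len(cluster):
                    break
                cluster.add(best)
                for g in unassigned:
                    affinity[g] += sim[g][best]
                changed = True
            # REMOVE: drop the lowest-affinity member while it fails the test.
            while len(cluster) > 1:
                worst = min(cluster, key=lambda g: affinity[g])
                if affinity[worst] >= threshold * len(cluster):
                    break
                cluster.remove(worst)
                for g in unassigned:
                    affinity[g] -= sim[g][worst]
                changed = True
            if not changed:
                break
        clusters.append(sorted(cluster))
        unassigned -= cluster
    return clusters

# Hypothetical similarities: genes 0-2 resemble each other, as do genes 3-4.
sim = [[1.0, 0.9, 0.9, 0.1, 0.1],
       [0.9, 1.0, 0.9, 0.1, 0.1],
       [0.9, 0.9, 1.0, 0.1, 0.1],
       [0.1, 0.1, 0.1, 1.0, 0.8],
       [0.1, 0.1, 0.1, 0.8, 1.0]]
clusters = cast(sim, threshold=0.5)
print(clusters)
```

Unlike k-means or SOM, the number of clusters is not chosen in advance; it follows from the threshold, which is the main parameter to set.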
Click on
Parameter to be set
SOMs
Self-organizing maps (SOMs) – 1
1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal.
[Figure: genes G1 to G29 (G = Genes) scattered in expression space, with nodes N1 to N6 (N = Nodes) arranged in a 2-D geometry]
SOMs – 2
2. Choose a random gene, e.g., G9.
3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The further away the node is from N2, the less it is moved.
[Figure: genes G1 to G29 and nodes N1 to N6, with the nodes shifted towards G9]
SOM Neighborhood Options
– Bubble neighborhood: only some nodes move (those within the neighborhood radius); alpha is constant.
– Gaussian neighborhood: all nodes move; alpha is scaled.
[Figure: nodes N1 to N6 and genes G7 to G11 under the two neighborhood options]
SOMs – 3
4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased.
5. Finally, each node will “nestle” among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node.
[Figure: genes G1 to G29 with nodes N1 to N6 settled among the gene clusters]
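Steps 1 to 5 can be sketched in Python as a minimal SOM. This is an illustration only, not the TMeV implementation. Assumptions: a rectangular grid, a Gaussian neighborhood, linear decay of the learning rate alpha and of the neighborhood width, nodes initialized on randomly chosen genes, and Euclidean distance; all names and default values are hypothetical.

```python
import random
from math import dist, exp

def train_som(genes, rows, cols, iterations=2000, alpha0=0.5, sigma0=1.0, seed=0):
    """Return the trained node positions for a rows x cols SOM."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]  # 2-D geometry
    nodes = [list(g) for g in rng.sample(genes, len(grid))]
    for t in range(iterations):
        remaining = 1.0 - t / iterations
        alpha = alpha0 * remaining             # nodes move less and less
        sigma = max(sigma0 * remaining, 0.1)   # neighborhood shrinks
        gene = rng.choice(genes)               # step 2: pick a random gene
        winner = min(range(len(nodes)), key=lambda k: dist(nodes[k], gene))
        for k, node in enumerate(nodes):       # step 3: move nodes towards it
            h = exp(-dist(grid[k], grid[winner]) ** 2 / (2 * sigma ** 2))
            for d in range(len(node)):
                node[d] += alpha * h * (gene[d] - node[d])
    return nodes

def assign(genes, nodes):
    """Step 5: each gene belongs to the cluster of its nearest node."""
    return [min(range(len(nodes)), key=lambda k: dist(nodes[k], g)) for g in genes]

# Hypothetical genes forming two well-separated groups, clustered with 2 nodes.
genes = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
nodes = train_som(genes, rows=1, cols=2)
labels = assign(genes, nodes)
print(labels)
```

Replacing the Gaussian factor h by 1 inside a fixed grid radius and 0 outside it would give the bubble neighborhood instead.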
Click on
Exercise 14
• This exercise is based on the breast cancer data set published by Chin in Cancer Cell, 2006 (hgu133A HT platform).
• Using the clinical data (E-TABM-158-clinical.data.txt) available in the large.data.set dir:
– Construct a target file, like the one used in the time course.
– Load the data in E-TABM-158-processed-data.txt using the created target file.
– Filter the data by IQR 0.5; in addition, 25% of samples should have a signal intensity over 100.
– Save project as ex14.lma
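The filtering step of the exercise can be sketched in Python for a single probe set. This only illustrates the idea and is not the tool's own filter. Assumptions: the IQR is computed on log2 intensities with quartiles defined as in R's default, the intensity cut-off applies to the unlogged signal, and both cut-offs are inclusive; the function and probe set names are hypothetical.

```python
from math import log2
from statistics import quantiles

def passes_filter(intensities, iqr_cutoff=0.5, min_fraction=0.25,
                  min_intensity=100.0):
    """Keep a probe set that varies enough and is expressed in enough samples."""
    logs = [log2(v) for v in intensities]
    q1, _, q3 = quantiles(logs, n=4, method="inclusive")
    fraction_bright = sum(v > min_intensity for v in intensities) / len(intensities)
    return (q3 - q1) >= iqr_cutoff and fraction_bright >= min_fraction

# Hypothetical probe sets measured on five samples.
probe_sets = {
    "variable_and_expressed": [50, 100, 200, 400, 800],
    "expressed_but_flat": [120, 121, 122, 123, 124],
    "variable_but_dim": [10, 20, 40, 80, 90],
}
kept = [name for name, values in probe_sets.items() if passes_filter(values)]
print(kept)
```

Only the first probe set passes: the second is expressed but hardly changes across samples, and the third changes but never exceeds the intensity cut-off.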
Exercise 14
– Filter the data on the basis of a list of EGs derived by Ingenuity related to cell signaling (use the advanced search at Ingenuity).
– Export the data as tab delimited files.
Exercise 14 (30 minutes)
• After row mean centering perform:
– Hierarchical clustering, and select those gene clusters that group the samples in two main groups. Label those groups.
– Apply CAST or SOM and see how the HCL groups of genes are reorganized.
– Subset and save the clusters you have identified.
– Combine them in Excel.
– Load them in TMEV and see how PCA divides the samples.
Classification
• The task of diagnosing cancer on the basis of microarray data has been termed class prediction in the literature.
• The task is to classify and predict the diagnostic category of a sample on the basis of its gene expression profile.
The example classification problem used in the PAM publication
• Data for small round blue cell tumors (SRBCT) of childhood (Khan et al. 2001), consisting of expression measurements on 2,308 genes, were obtained from glass-slide cDNA microarrays.
• The tumors are classified as:
– Burkitt lymphoma (BL),
– Ewing sarcoma (EWS),
– neuroblastoma (NB),
– rhabdomyosarcoma (RMS).
• A total of 63 training samples and 25 test samples were provided, although five of the latter were not SRBCTs.
PAM
• PAM is a modification of the nearest-centroid method, called ‘‘nearest shrunken centroid.’’
• PAM uses ‘‘de-noised’’ versions of the centroids as prototypes for each class.
[Figure: centroids (grey) and shrunken centroids (red) for the SRBCT dataset. The overall centroid has been subtracted from the centroid of each class.]
• SRBCT classification: training (tr, green), cross-validation (cv, red), and test (te, blue) errors are shown as a function of the threshold parameter Δ.
• The value 4.34 is chosen and yields a subset of 43 selected genes.
• Shrunken differences d_ik for the 43 genes having at least one nonzero difference.
• The genes with nonzero components in each class are almost mutually exclusive.
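The shrinkage behind ‘‘nearest shrunken centroid’’ can be sketched in Python. The sketch assumes the soft-thresholding rule of the PAM publication: each standardized class difference d_ik is moved towards zero by the threshold and set to zero if it would cross it. Function names, gene names and values are hypothetical.

```python
def soft_threshold(d, threshold):
    """Shrink d towards zero by `threshold`; values within the threshold become 0."""
    magnitude = max(abs(d) - threshold, 0.0)
    return magnitude if d >= 0 else -magnitude

def shrink_differences(d_by_gene, threshold):
    """Shrink every class difference and list genes that keep a nonzero one."""
    shrunken = {gene: [soft_threshold(d, threshold) for d in differences]
                for gene, differences in d_by_gene.items()}
    selected = [gene for gene, values in shrunken.items()
                if any(v != 0.0 for v in values)]
    return shrunken, selected

# Hypothetical differences d_ik for two genes across two classes.
d_by_gene = {"gene_a": [6.0, -1.0],
             "gene_b": [2.0, -3.0]}
shrunken, selected = shrink_differences(d_by_gene, threshold=4.34)
print(shrunken)
print(selected)
```

Genes whose differences are all shrunk to zero no longer contribute to the classification, which is how raising the threshold reduces the set of selected genes.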
PAM performance
• Misclassification rates for seven classifiers on six microarray datasets based on 50 random partitions into learning sets (two-thirds of the data) and test sets (one-third of the data).
• The nearest shrunken centroid classifier (PAM), as well as the simple benchmarks NNR and DLDA, do surprisingly well and can almost keep up, except on the prostate data (the largest dataset in the analysis).
• The success of such methodologically simple tools is limited to gene expression datasets with small sample size.
Reorganize clinical information
Load a large data set as a tab delimited file. Save in a file the description of the clinical parameters collapsed in the Target column of the targets file.
Run PAMR analysis
If the selected probe sets are fewer than 50
Yes
Nice separation between ER positive and negative samples can also be achieved on the test set.
Exercise 15
• Load ex14.lma.
• Attach the description of the clinical parameters.
• Divide the data into training and test sets on the basis of one of the non-continuous parameters (e.g. Yes/No; Pos/Neg).
• Use PAMR to define the minimal subset of genes, if any, discriminating between the two groups.
Revision exercise
• Use the data set HuGene, made of:
– Three breast samples:
• TisMap_Breast_01_v1_WTGene1.CEL
• TisMap_Breast_02_v1_WTGene1.CEL
• TisMap_Breast_03_v1_WTGene1.CEL
– Three brain samples:
• TisMap_Brain_01_v1_WTGene1.CEL
• TisMap_Brain_02_v1_WTGene1.CEL
• TisMap_Brain_03_v1_WTGene1.CEL
• Perform all the steps of a microarray analysis:
– QC, filtering, statistical analysis, GO analysis.