Microarray data analysis with Chipster, 3.-4.6.2008
Jarno Tuimala, Eija Korpelainen




Program – an analysis workflow

Day 1.
Basic functionality of Chipster (Eija)
Data import (Eija)
Quality control (Jarno)
Normalization (Jarno)
• Describing the experiment
Filtering and missing value considerations (Jarno)

Day 2.
Statistical testing (Jarno)
Clustering and visualization (Jarno)
Annotation (Eija)
Promoter analysis (Eija)
Experimental design (Jarno) – if time allows


Introduction to microarrays


Research using microarrays

Plan!
• Experimental design

Laboratory work
• Extract, label, hybridize

Computer work
• Scanning, image analysis
• Bioinformatics

Laboratory work
• Confirmation

Publish
• Submit data to public databases


Introduction to Chipster


Chipster
Microarray data analysis software developed at CSC
Goal: Easy access to leading analysis tools such as those developed in the R/Bioconductor project

Features
• Easy to use graphical user interface
• Comprehensive selection of tools
• Support for different array types (Affymetrix, Agilent, Illumina, cDNA)
• Compatible with Windows, Linux and Mac OS X
• Easy to install and update
• Wizards and workflows
• Interactive graphics
• Transparency (as opposed to “black box”)
• Alternative annotations for Affymetrix arrays
• Automatic tracking of performed analyses

http://www.csc.fi/english/customers/university/useraccounts/scientificservices.pdf
http://chipster.csc.fi


How does it work?

[Figure: architecture diagram. Labels: desktop, client (Java Web Start, installs and updates client automatically), internet, SSL, SOAP, security, front end, CSC, analyser, Murska cluster, international Web Services (analysis, visualisation).]


Acknowledgements

Aleksi Kallio
Jarno Tuimala
Taavi Hupponen
Mika Rissanen, Janne Käki, Mikko Koski, Petri Klemelä

All the pilot users
Department of computer science (HY)
Dario Greco (HY)
Prof. Olli Yli-Harja’s group (TUT)
GeneCruiser team (MIT Broad Institute)

Tekes/SA SYSBIO-program


[Figure: screenshot of the Chipster user interface. Panel labels: Data, Tools, Visualization.]


Phenodata – describing your experiment

Phenodata file is created during normalization
Fill in the group column with numbers describing your experimental setup
• e.g. 1 = healthy control, 2 = cancer sample
• necessary for the statistical tests to work

If you bring in previously created normalized data and phenodata:
• Choose ”import directly” in the Import tool
• Right click on normalized data, choose ”Link to” phenodata and link type ”Annotation”

If you brought in normalized data and need to create phenodata for it:
• Utilities/ Generate phenodata (fill in the chiptype parameter!)
• Right click on normalized data, choose ”Link to” phenodata and link type ”Annotation”
• Fill in the group column


Visualizing the data

Data visualization panel
• Maximize and redraw for better viewing

Two types of visualizations
1. Interactive visualizations produced by the client program
• Select the visualization method from the pulldown menu of the data visualization panel
• Save by right clicking on the image
2. Static images produced by R/Bioconductor, Weeder, etc.
• Select from Analysis tools/ Visualisation
• View by double clicking on the image file
• Save by right clicking on the file name and choosing ”Export”


Interactive visualizations by the client
• Spreadsheet
• Histogram
• Scatterplot
• 3D scatterplot
• Expression profiles
• Clustered profiles
• Hierarchical clustering
• SOM clustering
• Array pseudo-image

Available actions:
• Change titles, colors etc.
• Zoom in/out
• Select and annotate genes using the MIT GeneCruiser


Static images produced by R/Bioconductor

• Volcano plot
• Box plot
• Histogram
• Heatmap
• Venn diagram
• Idiogram
• Chromosomal position
• Correlogram
• Dendrogram
• QC stats plot
• RNA degradation plot
• K-means clustering
• SOM-clustering


Automatic tracking of analysis history


Running many analyses simultaneously

You can have at most 5 analysis jobs running at the same time
Use the Task manager to
• view parameters, status, …
• cancel jobs


Workspace – continue later/elsewhere

Saving your workspace allows you to continue later
• File/ Save workspace
• File/ Load workspace

Currently it is possible to have only one workspace saved at a time

If you would like to continue your work on another computer, you need to transfer the workspace-snapshot folder to the corresponding location

• C:\Documents and Settings\ekorpela\nami-work-files\workspace-snapshot


Workflow – reusing your analysis pipeline

Creates a ”macro” that can be applied to another normalized dataset and phenodata

Choose a dataset, and the workflow records the analysis steps starting from it

You can give the workflow a meaningful name (e.g. affymetrix.bsh), but it has to be located in the chipster-scripts folder under nami-work-files

You can run the workflow on another computer by making it visible to Chipster with ”Reload workflows from disk”

You can change parameters directly in the workflow file


Wizard – autopilot for analysis


Wizard for Affymetrix data

Ready-made workflow to find differentially expressed genes
• Normalization
• Phenodata creation
• Statistical test
• Hierarchical clustering


Importing files

Affymetrix CEL-files are imported to Chipster automatically

Other files are imported using the Import tool


Import tool, step 1

Define
• Title row
• Header
• Footer
• Delimiter


Import tool, step 2

Define columns
Modify flags


Which columns to use?

https://extras.csc.fi/biosciences/chipster-manual/import-help.html

Agilent
• Identifier (ProbeName)
• Sample (rMeanSignal)
• Sample background (rBGMedianSignal)
• Control (gMeanSignal)
• Control background (gBGMedianSignal)

Illumina BeadStudio version 1-2 file
• Identifier (TargetID)
• Sample (text “AVG”)

Illumina BeadStudio version 3 file
• Identifier (ProbeID). Do not use the column PROBE_ID.
• Sample (text “AVG”)


Quality control


Quality control tools

Quality control tools
• Affymetrix basic
  RNA degradation + Affy QC
• Affymetrix RLE & NUSE (might take a long time to run)
  Fits a model to expression values
• Agilent
  MA-plot + density plot + boxplot

Visualization – dendrogram
Statistics – NMDS


Affymetrix I

Quality control tools are run on raw data (CEL files).
• Dendrogram and NMDS on normalized data


Affymetrix II


Agilent


M versus A plot

A scatter plot of log intensity ratios $M = \log_2(R/G)$ versus average log intensities $A = \log_2\sqrt{R \cdot G}$, where R and G are the fluorescence intensities in the red (or signal, or Cy5) and green (or control, or Cy3) channels, respectively.
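As a minimal sketch of the M and A definitions (not the code Chipster runs), the two quantities can be computed per spot as follows. The function name `ma_values` and the example intensities are made up for illustration, and spots with non-positive intensities are simply skipped, which is an assumption of this sketch.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired red and green channel intensities."""
    m_values, a_values = [], []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            continue  # log2 is undefined here; this sketch skips such spots
        m_values.append(math.log2(r / g))        # M = log2(R/G)
        a_values.append(0.5 * math.log2(r * g))  # A = log2(sqrt(R*G))
    return m_values, a_values

# Hypothetical intensities for three spots
red = [1024.0, 512.0, 300.0]
green = [256.0, 512.0, 1200.0]
m, a = ma_values(red, green)
```

A spot with equal red and green intensities gets M = 0, so a well-behaved chip shows a cloud centred on the M = 0 line across the whole range of A.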


Some extreme examples here:
• scanner saturation
• high variation at low end
• high variation at high end
• curved (background effect)
• slope difference
• something went terribly wrong


General QC – dendrogram and NMDS


Scatterplots


Heatmaps (this took an hour to calculate)


QC-tools in Chipster

Quality control
• Affymetrix basic
• Affymetrix RLE and NUSE
• Agilent 2-color

Visualization
• Dendrogram
• Heatmap
• Correlogram

Statistics
• NMDS


Normalization


What is normalization?

Normalization is the process of removing systematic variation from the data.
Typically you would normalize your data so that all the chips become comparable.


Methods

Affymetrix
• Background correction + expression estimation + summarization
• RMA (default) uses only PM probes, fits a model to them, and gives out expression values after quantile normalization and median polishing

Agilent
• Background correction + averaging duplicate spots + normalization

After normalization the expression values are always expressed on log2-scale
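To make the quantile normalization step mentioned for RMA more concrete, here is a small sketch of that one step only. It is not RMA itself (no background correction, model fitting or median polish) and not the Bioconductor implementation; the function name and the toy data are illustrative, and ties are handled naively.

```python
def quantile_normalize(chips):
    """Quantile-normalize a list of chips (equal-length lists of log2 values)."""
    n_probes = len(chips[0])
    # Sort each chip, then average across chips rank by rank
    sorted_chips = [sorted(chip) for chip in chips]
    rank_means = [sum(column) / len(chips) for column in zip(*sorted_chips)]
    normalized = []
    for chip in chips:
        # Indices of the probes ordered from smallest to largest value
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, index in enumerate(order):
            out[index] = rank_means[rank]  # replace value by the mean for its rank
        normalized.append(out)
    return normalized

# Two hypothetical chips with three probes each
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(chips)
```

After this step every chip has exactly the same set of values, only in a different order, which is what makes the chips comparable.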


Affymetrix

Methods: MAS5, Plier, RMA, GCRMA, Li-Wong
• MAS5 is the older Affymetrix method, Plier is a newer one
• RMA is the default, and works rather nicely if you have more than a few chips
• GCRMA is similar to RMA, but also takes the GC content into account
• Li-Wong is the method implemented in dChip

Variance stabilization makes the variance over all the chips similar
• Works only with MAS5 and Plier, since all others output log2-transformed data by default (and are thus already corrected for the same phenomenon)

Custom chiptype
• If you want to use reannotated probes (probes assigned to the genes where they really belong), select one from this menu.


Agilent I

Background correction
• Background treatment
  None, Subtract, Edwards, Normexp
• Background offset
  0 or 50

Normalize chips
• None, median, loess

Normalize genes (not typically used)
• None, scale (to median), quantile

Chiptype
• Must always be set!


Agilent II

Background treatment typically generates many negative values that are coded as missing values after log2-transformation.
• The usual subtract option does this
• Using normexp + offset 50 will generate no negative values, and gives rather good estimates (best method reported)

Loess removes curvature from the data (suggested)
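The effect of plain background subtraction on log2 values can be shown with a few numbers. This is only a sketch: normexp is a model-based correction and is not implemented here, and the offset is added to a plain subtraction purely to show why a positive offset reduces the number of missing values. The function name and intensities are hypothetical.

```python
import math

def log2_after_subtract(foreground, background, offset=0.0):
    """Subtract background, add an offset, and take log2.

    Non-positive corrected values cannot be log-transformed and are
    returned as None (a missing value).
    """
    result = []
    for fg, bg in zip(foreground, background):
        corrected = fg - bg + offset
        result.append(math.log2(corrected) if corrected > 0 else None)
    return result

# Hypothetical spot intensities; the last two spots are at background level
foreground = [300.0, 120.0, 80.0]
background = [44.0, 130.0, 80.0]

plain = log2_after_subtract(foreground, background)              # two missing values
with_offset = log2_after_subtract(foreground, background, 50.0)  # no missing values
```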


Checking normalization


Filtering


Gene filtering

Removing probes for genes that are
• Not expressed
• Expressed at constant level (not changing)

Often a good idea, and necessary before multiple testing correction can be adequately applied
• Some controversy on this…

Non-specific filtering
• Expression, flags, SD, …

Specific filtering
• Statistical testing


Non-specific filtering

Often used for removing bad quality data:
• Intensity value too low
• Intensity value saturated
• Appearance of the spot is abnormal

Typically, non-changing genes are also removed
These can be removed using
• Filter by standard deviation
• Filter by interquartile range
• Filter by expression


Specific filtering

Selecting genes that are associated with some phenotype
Typically involves statistical testing

Biologists typically concentrate on fold change (magnitude of effect), statisticians on p-value.
• The two tell slightly different stories. Fold change ignores knowledge of variability, p-value ignores the size of the effect.
• Take both into account by combining the filters.
• Filter on expression value (what is biologically significant) and test for differences (what is statistically significant)


Unspecific filtering in Chipster

Pre-processing
• Filter by expression
  • Select the upper and lower cut-offs
  • Select the number of chips required to fulfil this rule
  • Select whether to return genes inside or outside the range
• Filter by SD
  • Select the percentage of genes to be filtered out
• Filter by interquartile range (IQR)
  • Select the IQR
• Filter by coefficient of variation (CV)
  • Median is used for filtering on CV (cannot be changed)

Utilities
1. Calculate descriptive statistics
2. Filter using a column
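The logic of two of these filters can be sketched in a few lines. This is an illustration of the idea under simple assumptions, not the Chipster tools themselves: the function names, parameter names and the example data are all hypothetical, and the IQR and CV filters would follow the same pattern with a different per-gene statistic.

```python
import statistics

def filter_by_sd(expr, fraction_to_remove):
    """Drop the given fraction of genes with the lowest standard deviation.

    expr maps a gene name to its list of log2 expression values (one per chip).
    """
    ranked = sorted(expr, key=lambda gene: statistics.stdev(expr[gene]))
    n_drop = int(len(ranked) * fraction_to_remove)
    return {gene: expr[gene] for gene in ranked[n_drop:]}

def filter_by_expression(expr, lower, upper, min_chips, inside=False):
    """Keep genes with at least min_chips values inside (or outside) [lower, upper]."""
    kept = {}
    for gene, values in expr.items():
        n_inside = sum(lower <= v <= upper for v in values)
        n_match = n_inside if inside else len(values) - n_inside
        if n_match >= min_chips:
            kept[gene] = values
    return kept

# Hypothetical log2 expression values for four genes on four chips
expr = {
    "geneA": [7.0, 7.1, 6.9, 7.0],   # constant
    "geneB": [5.0, 9.0, 4.5, 9.5],   # changing
    "geneC": [3.0, 3.1, 3.0, 2.9],   # constant and low
    "geneD": [8.0, 10.0, 8.5, 10.5], # changing
}
changing = filter_by_sd(expr, fraction_to_remove=0.5)
outside = filter_by_expression(expr, lower=4.0, upper=9.0, min_chips=2)
```

Here `changing` keeps geneB and geneD, while `outside` keeps the genes with at least two chips outside the 4 to 9 range (geneC and geneD).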


Venn diagram

Select three datasets in Chipster
Run the Venn diagram tool from the Visualization category

[Figure: Venn diagram with three sets labelled SD, CV and IQR.]


Statistics


Some terminology

Usually tests for comparing means of two or more groups are used
• Variance might be of interest too, but in practice this is never done.

Parametric tests (assume that the data is normally distributed)
• Typically used for microarray data

Non-parametric tests (assume no normality)

P-value
• Risk of saying that there is a difference when there really isn’t
• Traditionally 0.05 is used as a cut-off for significance
• False discovery rate is a p-value corrected for multiple tests (more on this later)


Mean and variance, an example for 1 gene

[Figure: two kernel density plots. Left: density.default(x = x1), N = 100000, bandwidth = 0.08956, x axis from -6 to 6. Right: density.default(x = y1), N = 100000, bandwidth = 0.09006, x axis from -10 to 10. y axis in both: Density, 0.0 to 0.4.]


Statistical testing

Needs replication (>2 chips per group)
• Replication makes it possible to estimate uncertainty or variability in the measurements. This is typically measured by standard deviation.

Comparing means (parametric tests)
• One-group tests
  • Compare to a known mean
  • Example: One-sample t-test
• Two-group tests
  • Compare two groups’ means
  • Example: Two-sample t-test
• Several group tests
  • Compare several groups’ means
  • Example: Analysis of variance (ANOVA)
• Two or more groups, two or more factors
  • Compare means in the groups according to both factors simultaneously
  • Example: multiple linear regression (linear modeling in Chipster)


t-test

Compares means of two groups
• If the p-value is small (<0.05), the data give evidence of a difference between the groups.
• If the p-value is large (>0.05), no difference between the groups could be shown (this does not prove that the groups are equal).
• p-value is the risk of saying that there is a difference when there actually isn’t.

A test for every gene is run separately -> thousands of tests and p-values

$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$$
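The t statistic is the difference of the two group means divided by the standard error of that difference. A minimal sketch for one gene is given below. The slide does not say which form of the standard error is used, so the unequal-variance (Welch) form here is an assumption, and the function name and data are illustrative. Turning t into a p-value needs the t distribution, which is left out of this sketch.

```python
import math
import statistics

def two_sample_t(group1, group2):
    """t = (mean1 - mean2) / SE, with the Welch (unequal variance) SE."""
    n1, n2 = len(group1), len(group2)
    se = math.sqrt(statistics.variance(group1) / n1 +
                   statistics.variance(group2) / n2)
    return (statistics.mean(group1) - statistics.mean(group2)) / se

# Hypothetical log2 expression values of one gene in two groups of four chips
control = [7.0, 7.2, 6.8, 7.0]
treated = [8.0, 8.2, 7.8, 8.0]
t = two_sample_t(control, treated)
```

The same difference in means gives a smaller absolute t when the chips within a group vary more, which is why replication is needed for the test.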


ANOVA

A generalization of t-test.
Compares means of several groups.
Tells whether the means are different, but not which means differ from each other.
• For this you can use linear modelling implemented in Chipster or post-hoc tests (not implemented in Chipster)

A test for every gene is run separately -> thousands of tests and p-values


Multiple testing correction I

After getting the results for all the genes, p-values are adjusted for the number of tests conducted.
When making several comparisons using the same test, some of the results will be chance findings.
• Example: if the p-value threshold is 0.05, on average every 20th test of a gene with no real difference gives a significant result by chance alone. If 10000 genes were tested, about 500 genes would be expected to be chance findings. If we found 550 genes to be significant, most of those (500) would be false positives, and only a minority (50) would be true positives.

This can be corrected for (to some extent) by using a multiple testing correction.
• Benjamini and Hochberg FDR: If the FDR threshold is 0.05, 5% of significant results are expected to be false positives (chance findings). If we tested 10000 genes, and 500 genes were significant after FDR correction, 25 of those are expected to be false positives, and 475 are expected to be true positives.
• Thus, the FDR threshold can be set much higher than a p-value threshold, and the results can still be meaningful and worth investigating.
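The Benjamini and Hochberg adjustment is simple enough to sketch directly: each p-value is multiplied by the number of tests and divided by its rank, and the results are then made monotone from the largest p-value downwards. The function name and the example p-values below are illustrative, and this is a sketch rather than the code Chipster runs.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping the running
    # minimum so that adjusted values never increase as p decreases
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical raw p-values for five genes
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(p_values)
```

In this example four genes pass a raw p-value threshold of 0.05, but only the first two pass an FDR threshold of 0.05.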


Multiple testing correction II

The ranking of the genes does not change after multiple testing correction!
• If you know that you can validate, say, 10 genes, then it makes no difference whether you select the most significant genes before or after the multiple testing correction.
• If there are no significant genes left after multiple testing correction, you probably have some differences, but not enough power in your experiment to detect those differences. In that case the top 10 genes are still the ones that are most likely to validate.


Gene set test (”global test”)

A typical result of a microarray experiment is a list of differentially expressed genes.
Biologically, grouping these genes in pathways or functional categories would be more interesting.
Are pathways associated with our endpoints of interest?
• Is there a difference in nucleotide metabolism between 5-FU-treated cancer patients and their healthy controls?

Works on the expression value data.


Gene enrichment analysis

A typical result of a microarray experiment is a list of differentially expressed genes.
Biologically, grouping these genes in pathways or functional categories would be more interesting.
Takes a list of differentially expressed genes, and tests whether they are enriched in any functional categories.
Works on the gene list (doesn’t use expression values).
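The slide does not name the test used for enrichment. One common choice is the hypergeometric test, and the sketch below assumes it: given how many genes on the chip belong to a category, it gives the probability of seeing at least the observed number of category genes in the gene list by chance. The function name and the numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(n_universe, n_category, n_list, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_universe: genes on the chip, n_category: genes in the category,
    n_list: genes in the differentially expressed list,
    n_overlap: genes in both the list and the category.
    """
    total = comb(n_universe, n_list)
    upper = min(n_category, n_list)
    tail = sum(comb(n_category, k) * comb(n_universe - n_category, n_list - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Hypothetical numbers: 20 genes on the chip, 5 in the category,
# 4 in the gene list, 3 of which belong to the category
p = enrichment_p_value(n_universe=20, n_category=5, n_list=4, n_overlap=3)
```

Because one such test is run per functional category, the resulting p-values are again subject to multiple testing.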


Statistical tests in Chipster

Statistics
• One sample tests
  • Are the genes expressed at all (different from 0)?
• Two group tests
• Several group tests
• Linear modeling

Visualization
• Volcano plot


Clustering


Clustering methods

Hierarchical clustering
Non-hierarchical clustering
• K-means
• QT-clustering
• Self-organizing maps

Classification aka class prediction
• K-nearest neighbor (KNN)


Unsupervised v. supervised


Hierarchical clustering

Two phases:
• Pick a distance measure
  • Euclidean distance
  • Standard / Pearson correlation
• Pick the dendrogram drawing method
  • Average linkage
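The two choices can be illustrated with small functions. This is a sketch under stated assumptions: the correlation is turned into a distance as 1 minus the Pearson correlation, which is a common convention but not stated on the slide, and the function names and example profiles are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shapes, 2 for opposite shapes."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (sd_a * sd_b)

def average_linkage(cluster1, cluster2, distance):
    """Mean of all pairwise distances between the members of two clusters."""
    pairs = [(p, q) for p in cluster1 for q in cluster2]
    return sum(distance(p, q) for p, q in pairs) / len(pairs)

# Three hypothetical gene profiles over three chips
g1 = [1.0, 2.0, 3.0]
g2 = [2.0, 4.0, 6.0]   # same shape as g1, larger values
g3 = [3.0, 2.0, 1.0]   # opposite shape to g1

d_shape = pearson_distance(g1, g2)   # close to 0
d_level = euclidean(g1, g2)          # clearly above 0
d_avg = average_linkage([g1], [g2, g3], euclidean)
```

The example shows why the choice matters: g1 and g2 are identical by correlation but far apart by Euclidean distance.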


Average linkage example


Hierarchical clustering - heatmap


K-means clustering

Finds K clusters from the data.
User has to specify the number of clusters (K).
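A minimal sketch of the standard K-means iteration (assign each point to the nearest centre, then move each centre to the mean of its points) is shown below. Using the first K points as starting centres is a simplification chosen here to keep the example deterministic; real implementations usually start from random centres. The function name and data are illustrative.

```python
def k_means(points, k, n_iter=20):
    """Cluster points (lists of numbers) into k groups; return labels and centres."""
    centers = [list(p) for p in points[:k]]  # naive start: the first k points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to the nearest centre
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Six hypothetical genes measured on two chips, forming two obvious groups
points = [[1.0, 1.0], [9.0, 9.0], [1.2, 0.8], [8.8, 9.2], [0.9, 1.1], [9.1, 8.9]]
labels, centers = k_means(points, k=2)
```

With a different K the same data would be split differently, which is why the choice of K is left to the user.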


K-means clustering


Clustering in Chipster

Clustering
• Hierarchical
  • Includes reliability checking of the resulting tree with bootstrapping
• K-means

Statistics
• PCA (principal component analysis)
• NMDS (non-metric multidimensional scaling)


Annotation


Annotation

Annotation = Descriptive text used for labeling features. For genes, extra information about their location in chromosomes, biological functions, etc.
Retrieved from multiple biological databases and stored as a single database in Chipster (generated by Bioconductor project).
Required by certain analysis tools (annotation, GO enrichment, promoter analysis, chromosomal plots)
• These tools don’t work for those chiptypes which don’t have Bioconductor annotation packages

In addition to Bioconductor annotations, Chipster also offers GeneCruiser annotations for Affymetrix data
• Access through the visualisation panel (right click/ annotate button)
• Liefeld et al (2005) Bioinformatics 21:3681-3682


Alternative CDF environments for Affy

CDF is a file that links individual probes to their location in genes (probesets)
Affymetrix default annotation uses old CDF files that map a sizable number of probes to wrong genes
Alternative CDFs fix this problem
In Chipster
• selecting ”custom chiptype” in Affymetrix normalization takes the alternative CDFs into use

Dai et al, (2005) Nuc Acids Res, 33(20):e175
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp


Promoter analysis


Promoter analysis with Chipster

Promoter sequences = sequences upstream of annotated transcription start site of RefSeq genes (from UCSC Golden Path)

Pattern discovery: Weeder
• looks for common sequence motifs in a set of promoters

Pattern matching: ClusterBuster
• looks for clusters of known transcription factor binding sites using the JASPAR matrices


Promoters from genes with similar expression patterns

Pattern discovery

Program to find common motifs
- Tool comparison: Nature Biotech. (2005) 23:137 => Weeder


Weeder

Enumerates all oligos of a given length, determines which appear in a significant fraction of the sequences, and ranks them according to statistical significance
Pavesi et al (2004) Nuc Acids Res. Jul (W199-203)

Species (human, mouse, rat, yeast, drosophila) [human]
• Background frequency files (oligo count of intergenic regions of a given organism)
Promoter size (short, medium, long) [short]
Analyze strands (single, both) [single]
Motif appears more than once per sequence (yes, no) [no]
Number of motifs to return (1-100) [10]
Percentage of sequences the motif should appear in (1-100) [50]
Transcription factor binding site size (small, medium) [small]
• Small = 6 (1 mismatch allowed) and 8 (2 mismatches allowed)
• Medium = 10 (3 mismatches allowed)
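The enumeration idea can be illustrated with a minimal Python sketch. This is not Weeder's implementation: it simply tries every oligo of length k, counts the fraction of sequences containing it with at most a given number of mismatches, and ranks by that fraction instead of by statistical significance against background frequencies. The function names and the toy sequences are assumptions made for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(oligo, seq, max_mismatches):
    """True if the oligo matches some window of seq within the mismatch limit."""
    k = len(oligo)
    return any(
        hamming(oligo, seq[i:i + k]) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )


def enumerate_motifs(seqs, k, max_mismatches, min_fraction):
    """Try all 4**k oligos; keep those found in at least min_fraction of seqs."""
    results = []
    for letters in product("ACGT", repeat=k):
        oligo = "".join(letters)
        hits = sum(occurs(oligo, s, max_mismatches) for s in seqs)
        fraction = hits / len(seqs)
        if fraction >= min_fraction:
            results.append((oligo, fraction))
    # Rank by fraction of sequences (a stand-in for statistical significance).
    results.sort(key=lambda item: (-item[1], item[0]))
    return results


if __name__ == "__main__":
    promoters = ["AATATAAGG", "CCTATAAAC", "GGGGGGGGG"]  # toy sequences
    for oligo, fraction in enumerate_motifs(promoters, 4, 0, 0.6):
        print(oligo, round(fraction, 2))
```

Exhaustive enumeration grows as 4^k, which is one reason the motif sizes offered by the tool are short.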


Collection of known binding motifs for TFs (Genomatix, Transfac, JASPAR)

Program to scan the sequence for binding sites

Pattern matching


TTTTTATA


ClusterBuster

Looks for clusters of transcription factor binding sites
Uses the JASPAR open access matrix database

• http://jaspar.cgb.ki.se/cgi-bin/jaspar_db.pl

Frith et al (2003) Nuc Acids Res, 31(13):3666-8

Species (human, mouse, rat, yeast, drosophila) [human]
Promoter size (short, medium, long) [short]
Cluster score threshold [5]
Motif score threshold [6]
Expected distance between motifs in a cluster [35]
Range for counting nucleotide frequencies [100]
Pseudocounts [0.375]
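To give a feel for what a motif score threshold and pseudocounts do, here is a minimal Python sketch of scanning a sequence with a single position count matrix. It is not ClusterBuster's algorithm, which scores whole clusters of sites; it only shows the simpler step of turning counts into log-odds scores and keeping windows above a threshold. The toy matrix, the uniform background, and the function names are assumptions.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=0.375, background=None):
    """Convert per-position base counts into natural-log odds scores."""
    background = background or {b: 0.25 for b in BASES}
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log((column[b] + pseudocount) / total / background[b])
            for b in BASES
        })
    return matrix


def scan(seq, matrix, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    k = len(matrix)
    hits = []
    for i in range(len(seq) - k + 1):
        score = sum(matrix[j][seq[i + j]] for j in range(k))
        if score >= threshold:
            hits.append((i, score))
    return hits


if __name__ == "__main__":
    # Hypothetical 4-column count matrix that strongly prefers T-A-T-A.
    counts = [
        {"A": 0, "C": 0, "G": 0, "T": 10},
        {"A": 10, "C": 0, "G": 0, "T": 0},
        {"A": 0, "C": 0, "G": 0, "T": 10},
        {"A": 10, "C": 0, "G": 0, "T": 0},
    ]
    pwm = log_odds_matrix(counts)
    for position, score in scan("GGTATAGG", pwm, threshold=5.0):
        print(position, round(score, 2))
```

The pseudocount keeps bases never seen in the matrix from getting a score of minus infinity, so a single mismatch lowers the score without ruling the site out entirely.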


ClusterBuster output


Extra material


Linear modeling in Chipster


Linear model

Y = a + b·x1 + c·x2 + d·x1·x2
• Like a normal multiple regression
• Intercept (a) is included by default
• Can contain both main effects (b, c) and interaction effects (d)

Linear modeling in Chipster can take into account at most three main effects, their interactions, one technical replication level, and one level of pairing

• This is enough for all the experiments I’ve encountered in GEO so far.
• Technical replication: one biological sample is hybridized on more than one array
• Pairing: before-after type of setting. Measurements are available prior to treatment and after it from exactly the same cell culture flasks.
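As an illustration of the model above, the following Python sketch fits Y = a + b·x1 + c·x2 + d·x1·x2 for one gene when both main effects have two groups coded 0 and 1. With two-level factors the least-squares coefficients follow directly from the four cell means, so no linear algebra library is needed. This is a teaching sketch, not the code Chipster runs; the function names and the example values are assumptions.

```python
from statistics import mean


def model_row(x1, x2):
    """One row of the model matrix: intercept, two main effects, interaction."""
    return [1, x1, x2, x1 * x2]


def fit_two_by_two(samples):
    """Fit a, b, c, d from (x1, x2, y) triples with x1, x2 coded as 0 or 1.

    All four combinations of x1 and x2 must be present at least once.
    """
    cells = {}
    for x1, x2, y in samples:
        cells.setdefault((x1, x2), []).append(y)
    m = {key: mean(values) for key, values in cells.items()}
    a = m[(0, 0)]                  # intercept: mean of the baseline group
    b = m[(1, 0)] - a              # main effect of x1
    c = m[(0, 1)] - a              # main effect of x2
    d = m[(1, 1)] - a - b - c      # interaction: departure from additivity
    return a, b, c, d


if __name__ == "__main__":
    # Hypothetical log-expression values for one gene.
    data = [(0, 0, 5.0), (0, 0, 7.0), (1, 0, 8.0), (0, 1, 9.0), (1, 1, 15.0)]
    print(fit_two_by_two(data))
    print(model_row(1, 1))
```

A nonzero d means the effect of one factor depends on the level of the other, which is exactly what the interaction term tests.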


Setting up the model I

All columns (max. three) in the phenodata can be either tested as linear (is there a trend towards higher numbers?) or as a factor (are there differences between the groups?).

• With 2 groups there’s no difference in these settings.

[Figure: three groups (1, 2, 3) modeled as linear and as a factor]
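The difference between the two settings can be sketched as two ways of turning a phenodata column into model matrix columns. In this Python sketch, "linear" keeps the group numbers as a single numeric column (one slope), while "factor" creates one 0/1 indicator column per group beyond the first. The function names are hypothetical and the coding shown (first group as baseline) is an assumption about one common convention.

```python
def code_linear(groups):
    """Intercept plus the group number itself: tests for a trend."""
    return [[1, g] for g in groups]


def code_factor(groups):
    """Intercept plus one indicator column per non-baseline group."""
    levels = sorted(set(groups))
    return [[1] + [1 if g == level else 0 for level in levels[1:]]
            for g in groups]


if __name__ == "__main__":
    print(code_linear([1, 2, 3]))   # two columns
    print(code_factor([1, 2, 3]))   # three columns
    print(code_factor([1, 2]))      # two columns, same as linear up to shift
```

With two groups both codings give an intercept and one further column, which is why the settings make no difference in that case.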


Linear modeling tool

Columns 1…3
• Main effects

Column 4
• Technical replication

Column 5
• Pairing


One main effect – 3 groups

linear

factor


Setting up the model

If you want to include more than one main effect, you need to add new columns to your phenodata.


Two main effects – both have two groups

No interactions

Two-way interactions, with significant genes returned for all effects (main effects and interactions)


Pairing or technical replication

All samples in the same pairing or replication groups are coded with the same number. Different groups are coded with a running number.
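A small Python sketch of this coding scheme: samples that share a pairing or replication group get the same number, and each new group gets the next running number. The flask labels and the function name are made up for illustration; in Chipster the numbers are typed into a phenodata column by hand.

```python
def running_codes(labels):
    """Map each distinct label to 1, 2, 3, ... in order of first appearance."""
    codes = {}
    return [codes.setdefault(label, len(codes) + 1) for label in labels]


if __name__ == "__main__":
    # Before/after measurements from three hypothetical culture flasks.
    flasks = ["flaskA", "flaskB", "flaskA", "flaskC", "flaskB", "flaskC"]
    print(running_codes(flasks))
```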


Result files

A model matrix and one result file are saved.


Experimental design


Some things to ponder

Bad experimental design is bad science!
• Wasted money
• More animal or human suffering
• Unreliable results

The main aspects of experimental design are
• Randomization and balancing (often neglected)
• Replication (usually rather well handled)
• Blocking (not even known of)
• Factorial experiments (sometimes considered)

You also need to consider
• Sample size
• Controls (direct or indirect measurements)


Before running the experiment

Define the principal hypothesis to test. Not everything can be tested!

• ”I ran this experiment to compare two treatments on Arabidopsis. Now that I come to think of it, these plants were of different ages. Can you also test for the effect of that?”

Which are the main sources of variability? They need to be taken into account in the experimental design!

• Laboratory personnel (more than one person involved?)
• Chips (from more than one batch?)
• Biological samples (inter- or intraindividual variability?)
• Hybridization conditions (is the method standardized?)
• Day (often the greatest source of variation)
  • Intermingled with variation from chips, biological samples, etc. if not properly taken into account


Replication

Technical replication:
• Take a sample per animal, and hybridize every sample to several chips.

Biological replication:
• Take a sample per animal, and hybridize every sample to one chip.

Replication does not mean taking repeated measurements from the same experimental units. That typically generates a time series.
A technical replicate that is analysed as a biological replicate is a pseudoreplicate. Pseudoreplication generates more problems than it solves.


Balancing

Balancing means that there should be an equal number of experimental units in all groups.
Balanced designs are statistically more powerful than unbalanced designs.
Example:

• In a study of breast cancer, 30 individuals were recruited from the cancer cohort, and 30 individuals as their healthy controls (balanced for the disease).

• 60 Affymetrix chips are available for hybridizing these samples. The Affymetrix station only takes 8 chips at a time, so 4 cancer patients and 4 healthy controls are randomly picked to be hybridized in every batch (balanced for day effect).

• Two laboratory technicians are making the hybridizations. Both process 30 samples, half being cancer patients and half healthy controls (balanced for the technician).
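The batch assignment in the example can be sketched in a few lines of Python. Patients and controls are shuffled separately and then dealt out four at a time, so every hybridization batch holds equal numbers of both. The sample names, the fixed seed, and the function name are assumptions for illustration; with 30 samples per group the last batch is smaller but still balanced.

```python
import random


def balanced_batches(cases, controls, per_group=4, seed=0):
    """Randomly fill batches with per_group cases and per_group controls."""
    rng = random.Random(seed)           # fixed seed only for reproducibility
    cases, controls = list(cases), list(controls)
    rng.shuffle(cases)
    rng.shuffle(controls)
    batches = []
    for start in range(0, max(len(cases), len(controls)), per_group):
        batch = cases[start:start + per_group] + controls[start:start + per_group]
        batches.append(batch)
    return batches


if __name__ == "__main__":
    patients = [f"cancer{i:02d}" for i in range(1, 31)]
    healthy = [f"control{i:02d}" for i in range(1, 31)]
    for day, batch in enumerate(balanced_batches(patients, healthy), start=1):
        print(day, batch)
```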


Randomization

Randomization is a way to control for effects of factors not explicitly taken care of in the experimental design.
In randomization, experimental units are randomly allocated to treatment groups.

• Sixty cell culture vials are randomly divided into control and treatment groups. They retain their places in the incubator regardless of the group (completely randomized trial).

Random does not mean haphazard. Randomization takes some effort. Use e.g. dice, playing cards, a random number generator, or random number tables for randomization.
In the best case the randomization is blind: the experimenter must not be able to identify the samples before the whole experiment has been concluded.


Completely randomized design

Let’s divide 16 samples into two groups of equal size. I’ve created the random number table below.
Reading the table from the top left to the bottom right, the cell culture vials are assigned to two groups.
We might then arrange the vials on the tray in the same order and put the tray in the incubator.

Row #   A   B   C   D
1       1   2   2   2
2       2   1   2   1
3       2   1   1   2
4       1   1   2   1
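A random number generator can play the role of the random number table. The Python sketch below, offered as one possible way to do it, shuffles eight 1s and eight 2s and lays them out as a 4 x 4 tray. The function names and the tray layout are assumptions.

```python
import random


def randomize_two_groups(n_samples=16, seed=None):
    """Return a shuffled list of group labels with equal group sizes."""
    if n_samples % 2:
        raise ValueError("n_samples must be even for two equal groups")
    labels = [1] * (n_samples // 2) + [2] * (n_samples // 2)
    random.Random(seed).shuffle(labels)
    return labels


def as_tray(labels, columns=4):
    """Split the label list into rows, read left to right, top to bottom."""
    return [labels[i:i + columns] for i in range(0, len(labels), columns)]


if __name__ == "__main__":
    for row_number, row in enumerate(as_tray(randomize_two_groups()), start=1):
        print(row_number, *row)
```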


Blocking

Blocking = arranging experimental units into similar groups.
Blocking is used for controlling for factors that cannot be manipulated, but are known.
Example:

• While studying a response to a drug treatment, both males and females were recruited for the study. Response might depend on sex, so individuals were first divided into two groups according to their sex, and then randomly assigned to treatment groups (randomized block design).
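The randomized block design in the example can be sketched as follows: individuals are first grouped by the blocking factor (here sex), and the random assignment to treatments is then done separately inside each block. The function name, the treatment labels, and the toy individuals are assumptions for illustration.

```python
import random


def randomized_block(units, block_of, treatments=("control", "treated"), seed=None):
    """Randomly assign treatments within each block, as evenly as possible."""
    rng = random.Random(seed)
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of(unit), []).append(unit)
    assignment = {}
    for members in blocks.values():
        members = list(members)
        rng.shuffle(members)                      # random order inside the block
        for i, unit in enumerate(members):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment


if __name__ == "__main__":
    people = [("male", i) for i in range(6)] + [("female", i) for i in range(6)]
    plan = randomized_block(people, block_of=lambda person: person[0])
    for person, treatment in sorted(plan.items()):
        print(person, treatment)
```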


Factorial designs

In a factorial design several factors are manipulated at the same time.
It is better to analyze the factors together than separately, because a factorial design allows one to assess the possible interaction.
Example:

• Cells were treated with vitamin C and hydrogen peroxide. Culturing cells with either chemical alone leads to missing the interaction where vitamin C prevents peroxide-induced cell death to some extent.

Main effects: vitamin C and peroxide
Interaction: vitamin C * peroxide


Sample size

We need to use a sufficient number of samples to reach reliable conclusions. Using too small or too big a sample size is a waste of resources.
Finding out the correct sample size for DNA microarray experiments is tricky. Use of previous experiments for the same chip type and biological material is often needed.
In epidemiological studies estimating the sample size is a must. It might be hard to get published otherwise.
To estimate the sample size, we need an estimate of

• Effect size
• Variability
• Desired false positive rate
• Desired false negative rate
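How the four quantities combine can be shown with the textbook normal-approximation formula for comparing two group means, n = 2·((z_alpha + z_beta)·sd / effect)² per group. The Python sketch below applies it to a single gene; it ignores the multiple testing that a microarray experiment involves, so treat the numbers as a rough lower bound. The function name and the example values are assumptions.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, sd, alpha=0.05, power=0.8):
    """Samples needed per group for a two-sided, two-group comparison.

    alpha is the desired false positive rate; 1 - power is the desired
    false negative rate. Uses the normal approximation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / effect_size) ** 2)


if __name__ == "__main__":
    # Difference of 1 (e.g. one log2 unit) with a standard deviation of 1.
    print(samples_per_group(effect_size=1.0, sd=1.0))   # 16
    # Halving the effect size roughly quadruples the requirement.
    print(samples_per_group(effect_size=0.5, sd=1.0))   # 63
```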


Sample size – a comparison of two experiments


Sample size – a rule of thumb

In statistics, variability is intrinsically associated with statistical significance. The lower the variability of replicates, the higher the significance.
Doubling the sample size halves the variance of the group mean (variance / n), which makes differences easier to detect.


Direct or indirect measurements?

[Figure: diagram with Reference, Sample and Sample 2, contrasting direct and indirect comparisons]


An example of a better…good design

Comparing two groups of samples.
• 20 samples in each group (40 in total).
• You’re interested in comparing the two states (diseased, healthy).
• Interindividual variability (due to sex) can be expected.
• Using Affymetrix chips (all from the same batch).
• You’re doing all the wetlab work.

Hybridize (randomly ordered):
• 12122211
• 22112112
• 21211212
• 22221111
• 12211212

Key: 1 = healthy, 2 = diseased; 1 = male, 2 = female