Microarray data analysis with Chipster 22.9.2008


Jarno Tuimala


Program – an analysis workflow

• Basic functionality of Chipster
• Data import
• Quality control
• Normalization
• Describing the experiment
• Filtering and missing value considerations
• Statistical testing
• Clustering and visualization
• Annotation


Introduction to Chipster


Chipster

Goal: Easy access to leading analysis tools such as those developed in the R/Bioconductor project

Features
• Easy to use graphical user interface
• Comprehensive selection of tools
• Support for different array types (Affymetrix, Agilent, Illumina, cDNA)
• Compatible with Windows, Linux and Mac OS X
• Easy to install and update
• Wizards and workflows
• Interactive graphics
• Transparency (as opposed to “black box”)
• Alternative annotations for Affymetrix arrays
• Automatic tracking of performed analyses

http://www.csc.fi/english/customers/university/useraccounts/scientificservices.pdf http://chipster.csc.fi


How does it work?

[Diagram of the client-server architecture. Labels: desktop client (Java Web Start installs and updates the client automatically), internet, SSL, security, front end, analyser, CSC, Corona/Murska, SOAP, international Web Services, ANALYSIS, VISUALISATION]


[Screenshot of the user interface. Labels: Data, Tools, Visualization]


Phenodata – describing your experiment

The phenodata file is created during normalization

Fill in the group column with numbers describing your experimental setup
• e.g. 1 = healthy control, 2 = cancer sample
• necessary for the statistical tests to work

If you bring in previously created normalized data and phenodata:
• Choose "import directly" in the import tool
• Right click on normalized data, choose "Link to" phenodata and link type "Annotation"

If you brought in normalized data and need to create phenodata for it:
• Utilities/ Generate phenodata (fill in the chiptype parameter!)
• Right click on normalized data, choose "Link to" phenodata and link type "Annotation"
• Fill in the group column
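As a rough illustration of what the group column encodes, the sketch below writes a small tab-delimited phenodata table. The column names (sample, chiptype, group), the file names and the chip type are assumptions made for the example, not taken from the slides.

```python
import csv
import io

# Hypothetical arrays: group 1 = healthy control, group 2 = cancer sample.
samples = [
    ("control_1.cel", 1),
    ("control_2.cel", 1),
    ("cancer_1.cel", 2),
    ("cancer_2.cel", 2),
]

def build_phenodata(samples, chiptype):
    """Return phenodata as tab-delimited text with one row per array."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "chiptype", "group"])  # assumed column names
    for name, group in samples:
        writer.writerow([name, chiptype, group])
    return buffer.getvalue()

# "hgu133a" is only a placeholder chip type.
phenodata = build_phenodata(samples, chiptype="hgu133a")
print(phenodata)
```

The statistical tests later use the numbers in the group column to decide which arrays are compared with which.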


Visualizing the data

Data visualization panel
• Maximize and redraw for better viewing

Two types of visualizations
1. Interactive visualizations produced by the client program
  • Select the visualization method from the pulldown menu of the data visualization panel
  • Save by right clicking on the image
2. Static images produced by R/Bioconductor, Weeder, etc.
  • Select from Analysis tools/ Visualisation
  • View by double clicking on the image file
  • Save by right clicking on the file name and choosing "Export"


Interactive visualizations by the client

• Spreadsheet
• Histogram
• Scatterplot
• 3D scatterplot
• Expression profiles
• Clustered profiles
• Hierarchical clustering
• SOM clustering
• Array pseudo-image
• Venn diagram

Available actions:
• Change titles, colors etc.
• Zoom in/out


Static images produced by R/Bioconductor

• Volcano plot
• Box plot
• Histogram
• Heatmap
• Venn diagram
• Idiogram
• Chromosomal position
• Correlogram
• Dendrogram
• QC stats plot
• RNA degradation plot
• K-means clustering
• SOM-clustering


Automatic tracking of analysis history


Running many analyses simultaneously

You can have at most 5 analysis jobs running at the same time

Use Task manager to
• view parameters, status, …
• cancel jobs


Workspace – continue later/elsewhere

Saving your workspace allows you to continue later
• File/ Save workspace
• File/ Load workspace

Currently it is possible to have only one workspace saved at a time

If you would like to continue your work on another computer, you need to transfer the workspace-snapshot folder to the corresponding location
• C:\Documents and Settings\ekorpela\nami-work-files\workspace-snapshot


Importing files

Affymetrix CEL-files are imported to Chipster automatically

Other files are imported using the Import tool


Import tool, step 1

Define
• Header
• Footer
• Title row
• Delimiter
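To make the four settings concrete, here is a minimal sketch of what they mean for a text file: header and footer lines are skipped, the title row gives the column names, and the delimiter splits each row. The function, the sample text and the probe names are hypothetical; this is not how the Import tool itself is implemented.

```python
import csv

def parse_table(text, header_lines, footer_lines, delimiter):
    """Skip header and footer lines, then split the title row and data rows."""
    lines = text.splitlines()
    end = len(lines) - footer_lines
    body = lines[header_lines:end]
    rows = list(csv.reader(body, delimiter=delimiter))
    title_row, data_rows = rows[0], rows[1:]
    return title_row, data_rows

# Made-up file content: two header lines, a title row, two spots, one footer line.
raw = (
    "# scanner export\n"
    "# protocol xyz\n"
    "ProbeName\tgMeanSignal\trMeanSignal\n"
    "probe_1\t120\t340\n"
    "probe_2\t80\t75\n"
    "end of file\n"
)

title_row, data_rows = parse_table(raw, header_lines=2, footer_lines=1, delimiter="\t")
print(title_row)
print(data_rows)
```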


Import tool, step 2

• Define columns
• Modify flags


Importing Agilent files (required fields)

• Sample (rMeanSignal)
• Sample background (rBGMedianSignal)
• Control (gMeanSignal)
• Control background (gBGMedianSignal)
• Identifier (ProbeName)
• Annotation (ControlType)
• Flag (IsManualFlag)

https://extras.csc.fi/biosciences/chipster-manual/data-formats.html


Quality control


Quality control tools

Quality control tools
• Affymetrix basic: RNA degradation + Affy QC
• Agilent: MA-plot + density plot + boxplot

Visualization – dendrogram

Statistics – NMDS


Affymetrix I

Quality control tools are run on raw data (CEL files).
• Dendrogram and NMDS on normalized data


Agilent


General QC – dendrogram and NMDS


Scatterplots


Heatmaps (this took an hour to calculate)


QC-tools in Chipster

Quality control
• Affymetrix basic
• Affymetrix RLE and NUSE
• Agilent 2-color

Visualization
• Dendrogram
• Heatmap
• Correlogram

Statistics
• NMDS


Normalization


What is normalization?

Normalization is the process of removing systematic variation from the data.

Typically you would normalize your data so that all the chips become comparable.


Methods

Affymetrix
• Background correction + expression estimation + summarization
• RMA (default) uses only PM probes, fits a model to them, and gives out expression values after quantile normalization and median polishing

Agilent
• Background correction + averaging duplicate spots + normalization

After normalization the expression values are always expressed on log2-scale
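Quantile normalization, one of the steps mentioned for RMA, can be sketched in a few lines. The example below is a simplified illustration only: it ignores ties, and it leaves out the background correction and median polish that RMA also performs. The input values are made up.

```python
def quantile_normalize(chips):
    """Give every chip the same distribution of values.

    chips: list of equal-length lists, one list of log2 intensities per chip.
    Ties are not treated specially in this simplified version.
    """
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Mean of the k-th smallest value across all chips.
    rank_means = [
        sum(chip[k] for chip in sorted_chips) / len(chips)
        for k in range(n_probes)
    ]
    normalized = []
    for chip in chips:
        # Probe indices ordered from smallest to largest intensity.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        result = [0.0] * n_probes
        for rank, probe in enumerate(order):
            result[probe] = rank_means[rank]
        normalized.append(result)
    return normalized

chips = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(chips)
for chip in normalized:
    print(chip)
```

After this step every chip contains exactly the same set of values, only assigned to different probes, which is what makes the chips comparable.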


Affymetrix

Methods: MAS5, Plier, RMA, GCRMA, Li-Wong
• MAS5 is the older Affymetrix method, Plier is a newer one
• RMA is the default, and works rather nicely if you have more than a few chips
• GCRMA is similar to RMA, but also takes GC% content into account
• Li-Wong is the method implemented in dChip

Variance stabilization makes the variance over all the chips similar
• Works only with MAS5 and Plier, since all others output log2-transformed data by default (and are thus already corrected for the same phenomenon)

Custom chiptype
• If you want to use reannotated probes (they are really assigned to the genes where they belong), select one from this menu.


Agilent I

Background correction
• Background treatment: None, Subtract, Edwards, Normexp
• Background offset: 0 or 50

Normalize chips
• None, median, loess

Normalize genes (not typically used)
• None, scale (to median), quantile

Chiptype
• Must always be set!


Agilent II

Background treatment typically generates many negative values that are coded as missing values after log2-transformation.
• The usual subtract option does this
• Using normexp + offset 50 will generate no negative values, and gives rather good estimates (best method reported)

Loess removes curvature from the data (suggested)
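The problem with plain background subtraction can be shown with a small sketch. Note that this is not the normexp method, which fits a statistical model to the foreground and background signals; the example only shows how subtraction can produce a missing value and how adding an offset changes that for one made-up spot. With plain subtraction plus an offset, negative values are still possible when the background exceeds the foreground by more than the offset.

```python
import math

def log_ratio(red, red_bg, green, green_bg, offset=0.0):
    """Background-subtract both channels and return log2(red / green).

    Returns None (a missing value) when either corrected signal is not
    positive, because the logarithm is undefined in that case.
    """
    r = red - red_bg + offset
    g = green - green_bg + offset
    if r <= 0 or g <= 0:
        return None
    return math.log2(r / g)

# Hypothetical spot where the red background exceeds the red foreground.
m_plain = log_ratio(red=90, red_bg=100, green=400, green_bg=100)
m_offset = log_ratio(red=90, red_bg=100, green=400, green_bg=100, offset=50)
print(m_plain, m_offset)
```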


Checking normalization


Filtering


Gene filtering

Removing probes for genes that are
• Not expressed
• Expressed at constant level (not changing)

Often a good idea, and necessary before multiple testing correction can be adequately applied
• Some controversy on this…

Non-specific filtering
• Expression, flags, SD, …

Specific filtering
• Statistical testing


Non-specific filtering

Often used for removing bad quality data:
• Intensity value too low
• Intensity value saturated
• Appearance of the spot is abnormal

Typically, non-changing genes are also removed. These can be removed using
• Filter by standard deviation
• Filter by interquartile range
• Filter by expression
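A filter by standard deviation can be sketched as follows: compute the SD of every gene across the chips and drop the least variable fraction. The gene names, values and the 50% cut-off are made up, and the function is an illustration of the idea rather than Chipster's implementation.

```python
import statistics

def filter_by_sd(expression, fraction_to_remove):
    """Drop the given fraction of genes with the lowest SD across chips.

    expression: dict mapping gene name -> list of log2 values, one per chip.
    Returns the set of gene names that are kept.
    """
    sds = {gene: statistics.stdev(values) for gene, values in expression.items()}
    ranked = sorted(sds, key=sds.get)  # least variable genes first
    n_remove = int(len(ranked) * fraction_to_remove)
    return set(ranked[n_remove:])

expression = {
    "geneA": [7.0, 7.1, 7.0, 6.9],  # nearly constant
    "geneB": [5.0, 8.0, 5.2, 8.1],  # changing
    "geneC": [9.0, 9.0, 9.1, 9.0],  # nearly constant
    "geneD": [4.0, 6.0, 4.1, 6.2],  # changing
}
kept = filter_by_sd(expression, fraction_to_remove=0.5)
print(sorted(kept))
```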


Specific filtering

Selecting genes that are associated with some phenotype

Typically involves statistical testing

Biologists typically concentrate on fold change (magnitude of effect), statisticians on p-value.
• Each tells a slightly different story. Fold change ignores knowledge of variability, p-value ignores the size of the effect.
• Take both into account by combining the filters.
• Filter on expression value (what is biologically significant) and test for differences (what is statistically significant)
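Combining the two filters amounts to requiring both conditions at once. In the sketch below the fold changes and p-values are made-up inputs, and the thresholds (an absolute log2 fold change of at least 1 and a p-value of at most 0.05) are assumptions chosen for the example.

```python
def select_genes(results, min_abs_log2_fc=1.0, max_p=0.05):
    """Keep genes that pass both the effect-size and the significance filter.

    results: dict mapping gene name -> (log2 fold change, p-value).
    """
    return sorted(
        gene
        for gene, (log2_fc, p) in results.items()
        if abs(log2_fc) >= min_abs_log2_fc and p <= max_p
    )

results = {
    "gene1": (2.3, 0.001),   # large effect and significant: kept
    "gene2": (0.2, 0.0004),  # significant but tiny effect: dropped
    "gene3": (-1.8, 0.30),   # large effect but not significant: dropped
    "gene4": (-1.2, 0.01),   # kept
}
selected = select_genes(results)
print(selected)
```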


Non-specific filtering in Chipster

Pre-processing
• Filter by expression
  • Select the upper and lower cut-offs
  • Select the number of chips this rule has to be fulfilled on
  • Select whether to return genes inside or outside the range
• Filter by SD
  • Select the percentage of genes to filter out
• Filter by interquartile range (IQR)
  • Select the IQR
• Filter by coefficient of variation (CV)
  • Median is used for filtering on CV (cannot be changed)

Utilities
1. Calculate descriptive statistics
2. Filter using a column
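The three parameters of the expression filter can be read as follows; this is one plausible interpretation written as a sketch, not a description of the actual tool. A gene is returned if its value falls inside (or outside) the range on at least the chosen number of chips. All names and values are hypothetical.

```python
def filter_by_expression(expression, lower, upper, min_chips, keep_inside=True):
    """Return genes whose values are inside (or outside) [lower, upper]
    on at least min_chips chips.

    expression: dict mapping gene name -> list of log2 values, one per chip.
    """
    kept = []
    for gene, values in expression.items():
        inside = sum(lower <= v <= upper for v in values)
        outside = len(values) - inside
        count = inside if keep_inside else outside
        if count >= min_chips:
            kept.append(gene)
    return kept

expression = {
    "g1": [-2.0, -1.8, 0.1],
    "g2": [0.1, -0.2, 0.3],
    "g3": [1.5, 2.0, 1.7],
}
outside_genes = filter_by_expression(expression, -1.0, 1.0, min_chips=2, keep_inside=False)
inside_genes = filter_by_expression(expression, -1.0, 1.0, min_chips=2, keep_inside=True)
print(outside_genes, inside_genes)
```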


Venn diagram

Select three datasets in Chipster

Run the Venn diagram tool from the Visualization tool category

[Figure: Venn diagram of three sets labelled SD, CV and IQR]


Statistics


Some terminology

Usually tests for comparing means of two or more groups are used
• Variance might be of interest too, but in practice it is rarely tested.

Parametric tests (assume the data are normally distributed)
• Typically used for microarray data

Non-parametric tests (do not assume normality)

P-value
• Risk of saying that there is a difference when there really isn’t
• Traditionally 0.05 is used as a cut-off for significance
• False discovery rate (FDR) is a p-value corrected for multiple tests (more on this later)


Mean and variance, an example for 1 gene

[Figure: two density plots, y-axis "Density" from 0.0 to 0.4. First panel: density.default(x = x1), N = 100000, bandwidth = 0.08956, x-axis from -6 to 6. Second panel: density.default(x = y1), N = 100000, bandwidth = 0.09006, x-axis from -10 to 10.]


Statistical testing

Needs replication (>2 chips per group)
• Replication makes it possible to estimate uncertainty or variability in the measurements. This is typically measured by standard deviation.

Comparing means (parametric tests)
• One-group tests
  • Compare to a known mean
  • Example: One-sample t-test
• Two-group tests
  • Compare two groups’ means
  • Example: Two-sample t-test
• Several group tests
  • Compare several groups’ means
  • Example: Analysis of variance (ANOVA)
• Two or more groups, two or more factors
  • Compare means in the groups according to both factors simultaneously
  • Example: multiple linear regression (linear modeling in Chipster)


t-test

Compares means of two groups
• If the p-value is small (<0.05), the data suggest that there is a difference between the groups.
• If the p-value is large (>0.05), there is no evidence of a difference between the groups. This does not prove that the groups are the same.
• The p-value is the risk of saying that there is a difference when there actually isn’t.

A test for every gene is run separately -> thousands of tests and p-values

$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$$
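The t statistic divides the difference between the two group means by its standard error. The slide does not say which form of the standard error is used, so the sketch below assumes the unequal-variance (Welch) form; the expression values are made up, and turning t into a p-value (which needs the t distribution) is left out.

```python
import math
import statistics

def t_statistic(group1, group2):
    """Difference in means divided by its standard error (Welch form, assumed)."""
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    se = math.sqrt(
        statistics.variance(group1) / len(group1)
        + statistics.variance(group2) / len(group2)
    )
    return mean_diff / se

# Hypothetical log2 expression values of one gene on four chips per group.
control = [7.1, 6.9, 7.0, 7.2]
cancer = [8.0, 8.3, 7.9, 8.2]
t = t_statistic(control, cancer)
print(round(t, 2))
```

In a microarray analysis the same calculation is repeated for every gene, which is where the thousands of tests and p-values come from.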


ANOVA

A generalization of t-test. Compares means of several groups.

Tells whether the means are different, but not which means differ from each other.
• For this you can use post-hoc tests (not implemented in Chipster) or linear modelling (implemented in Chipster)

A test for every gene is run separately -> thousands of tests and p-values


Multiple testing correction I

After getting the results for all the genes, p-values are adjusted for the number of tests conducted.

When making several comparisons using the same test, some of the results will be chance findings.

• Example: if the p-value threshold is 0.05, about one test in 20 will come out significant by chance alone even when there is no real difference. If 10000 genes were tested, about 500 genes would be expected to be chance findings. If we found 550 genes to be significant, most of those (about 500) would be false positives, and only a minority (about 50) true positives.

This can be corrected for (to some extent) by using a multiple testing correction.

• Benjamini and Hochberg FDR: If FDR threshold is 0.05, 5% of significant results are expected to be false positives (chance findings). If we tested 10000 genes, and 500 genes were significant after FDR correction, 25 of those are expected to be false positives, and 475 are expected to be true positives.

• Thus, an FDR value can be much higher than the p-value cut-offs you are used to, and the results can still be meaningful and worth investigating.
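The Benjamini and Hochberg adjustment itself is short enough to sketch. Each p-value is multiplied by the number of tests and divided by its rank, and the results are then made monotone starting from the largest p-value. The five p-values below are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
adjusted = benjamini_hochberg(p_values)
print(adjusted)
```

Because the adjusted values never reverse the order of the raw p-values, the ranking of the genes stays the same after the correction.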


Multiple testing correction II

The ranking of the genes does not change after multiple testing correction!

• If you know that you can validate, say, 10 genes, then there’s no difference if you select the most significant genes before or after the multiple testing correction.

• If there are no significant genes left after multiple testing correction, you probably have some differences, but not enough power in your experiment to detect those differences. In that case the top 10 genes are still the ones that are most likely to validate.


Gene set test (”global test”)

A typical result of a microarray experiment is a list of differentially expressed genes.

Biologically, grouping these genes in pathways or functional categories would be more interesting.

Are pathways associated with our endpoints of interest?

• Is there a difference in nucleotide metabolism between 5-FU-treated cancer patients and their healthy controls?

Works on the expression values data.


Gene enrichment analysis

A typical result of a microarray experiment is a list of differentially expressed genes.

Biologically, grouping these genes in pathways or functional categories would be more interesting.

Takes a list of differentially expressed genes, and tests whether they are enriched in any functional categories.

Works on the gene list.
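The slides do not name the test used for enrichment. A common choice is the hypergeometric test (equivalent to a one-sided Fisher's exact test), which the sketch below assumes: given how many genes on the chip belong to a category, how surprising is the number of category members in the gene list? All counts are made up.

```python
from math import comb

def enrichment_p_value(n_universe, n_in_category, n_selected, n_overlap):
    """Probability of seeing at least n_overlap category genes in the list
    under the hypergeometric distribution (random draws without replacement).
    """
    total = comb(n_universe, n_selected)
    upper = min(n_in_category, n_selected)
    tail = sum(
        comb(n_in_category, k) * comb(n_universe - n_in_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 10000 genes on the chip, 100 in the category,
# 200 differentially expressed genes, 10 of which are in the category.
p = enrichment_p_value(n_universe=10000, n_in_category=100, n_selected=200, n_overlap=10)
print(p)
```

With these numbers only about 2 category genes would be expected in the list by chance, so finding 10 gives a small p-value.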


Statistical tests in Chipster

Statistics
• One sample tests
  • Are the genes expressed at all (different from 0)?
• Two group tests
• Several group tests
• Linear modeling

Visualization
• Volcano plot


Clustering


Clustering methods

Hierarchical clustering

Non-hierarchical clustering
• K-means
• QT-clustering
• Self-organizing maps

Classification / class prediction
• K-nearest neighbor (KNN)


Hierarchical clustering

Two phases:
• Pick a distance measure
  • Euclidean distance
  • Standard / Pearson correlation
• Pick the dendrogram drawing method
  • Average linkage
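The two choices can be illustrated with a small sketch. It assumes the correlation-based distance is defined as 1 minus the Pearson correlation, and it treats average linkage as the mean of all pairwise distances between the members of two clusters. The three expression profiles are made up.

```python
import math
import statistics

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shapes, 2 for opposite ones.

    Fails with a division by zero for constant profiles.
    """
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)

def average_linkage(cluster_a, cluster_b, distance):
    """Mean of all pairwise distances between members of two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

gene1 = [1.0, 2.0, 3.0]
gene2 = [2.0, 4.0, 6.0]  # same shape as gene1, larger amplitude
gene3 = [3.0, 2.0, 1.0]  # opposite shape

print(euclidean(gene1, gene2), pearson_distance(gene1, gene2))
print(average_linkage([gene1], [gene2, gene3], euclidean))
```

The example shows why the choice matters: gene1 and gene2 are far apart in Euclidean distance but essentially identical by correlation, because correlation looks only at the shape of the profile.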


Average linkage example


Hierarchical clustering - heatmap


Annotation


Annotation

Annotation = Descriptive text used for labeling features. For genes, extra information about their location in chromosomes, biological functions, etc.

Retrieved from multiple biological databases and stored as a single database in Chipster (generated by the Bioconductor project).

Required by certain analysis tools (annotation, GO enrichment, promoter analysis, chromosomal plots)

• These tools don’t work for those chiptypes which don’t have Bioconductor annotation packages


Alternative CDF environments for Affy

CDF is a file that links individual probes to their location in genes (probesets)

The default Affymetrix annotation uses old CDF files that map a sizable number of probes to the wrong genes

Alternative CDFs fix this problem

In Chipster
• selecting "custom chiptype" in Affymetrix normalization takes the alternative CDFs (altCDFs) into use
• Note: if you have normalized using a custom chiptype, certain tools requiring annotation won’t work (GO term enrichment, promoter analysis, annotation)

Dai et al. (2005) Nucleic Acids Res, 33(20):e175
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp