Canadian Bioinformatics Workshops
www.bioinformatics.ca

Lecture 8
Microarrays II: Data Analysis

MBP1010

Dr. Paul C. Boutros
Winter 2014

Department of Medical Biophysics

This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others


Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)

Lecture 8: Microarrays Part II bioinformatics.ca

Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)

House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions

Topics For This Week
• Examples
• Attendance
• Pre-Processing
• QA/QC
• Microarray-Specific Statistics
• ProbeSet remapping
• Organizing -omics studies

Example #1

You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: tumours in mice lacking TS will be smaller than those in mice with amplification of OG, as assessed by post-mortem volume measurements of the primary tumour. Your data:

TS (cm3): 3.9, 7.1, 3.1, 4.4, 5.0

OG (cm3): 5.2, 1.9, 5.0, 6.1, 4.5, 4.8

Example #2

You are conducting a study of osteosarcomas using mouse models. You are studying transgenic animals with deletion of a tumour suppressor (TS), or with amplification of an oncogene (OG). You consider the penetrance of tumours in a set of 8 different mouse strains. Your hypothesis: some mouse strains lead to bigger tumours than others when OG is amplified, considering only animals in which tumours form. You measure tumour volume in mm3 using calipers.

Strain 1 (mm3)916983

Strain 2 (mm3)2017071

Strain 3 (mm3)153620

Strain 4 (mm3)525253

Strain 5 (weeks)11

53859

Strain 6 (mm3)6

6063

Strain 7 (mm3)857970

Strain 8 (mm3)100105121


Example #3

You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: mice lacking TS are less likely to respond to a novel targeted therapeutic (DX) than wildtype animals, as assessed by molecular imaging:

TS (imaging response): Yes, No, Yes, Yes, No

WT (imaging response): Yes, Yes, Yes, Yes, No, Yes

Example #4

You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Based on your previous data, you now hypothesize that mice lacking TS will show a similar molecular response to DX as those with amplification of OG. You use microarrays to study 20,000 genes in each line, and identify the following genes as changed between drug-treated and vehicle-treated:

TS (DX-responsive genes): MYC, KRAS, CD53, CDH1, FBW1, SEPT7, MUC1, MUC3, MUC9, RNF3

OG (DX-responsive genes): MYC, KRAS, CD53, CDH1, MUC1, MARCH1, PTEN, IDH3, ESR2, RHEB, CTCF, STK11, MLL3, KEAP1, NFE2L2, ARID1A

Example #5

You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice naturally susceptible to these tumours at ~20% penetrance. You are studying two transgenic lines, one with deletion of a tumour suppressor (TS), the other with amplification of an oncogene (OG). Tumour penetrance in these is 100%. Your hypothesis: you now wonder whether tumour size varies with the age of the animal, and suspect that tumour size differs between lines but that the comparison is confounded by age differences. Your data:

TS (cm3): 3.9 (17 weeks), 7.1 (15 weeks), 3.1 (15 weeks), 4.4 (22 weeks), 5.0 (22 weeks)

OG (cm3): 5.2 (17 weeks), 1.9 (9 weeks), 5.0 (15 weeks), 6.1 (15 weeks), 4.5 (21 weeks), 4.8 (20 weeks)

Wildtype (cm3): 1.1 (9 weeks), 1.5 (10 weeks), 2.1 (15 weeks), 2.5 (15 weeks), 0.3 (17 weeks), 2.2 (21 weeks)

Example #6

You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: mice lacking TS will acquire tumours sooner than wildtype mice. You test the mice weekly using ultrasound imaging. Your data:

TS (week of tumour)47765

OG (week of tumour)393243


Topics For This Week
• Examples
• Attendance
• Pre-Processing
• QA/QC
• Microarray-Specific Statistics
• ProbeSet remapping
• Organizing -omics studies

Summary Point #1:

Microarray data is analyzed with a pipeline of sequential algorithms.

This pipeline defines the standard workflow for microarray experiments.


[Figure: general microarray analysis workflow. Box labels: Quantitation; Cy3, Cy5; Spot Quality; Background; Intra-Array, Inter-Array; Significance Testing; Clustering; Spot List; Integration?]

Summary Point #2:

This is an active research area.

Summary Point #3:

These basic steps hold true for all microarray platforms and types.


What Is Bioconductor?

“Bioconductor is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data.”

- Bioconductor website

The vast majority of our analyses will use Bioconductor code, but there are clearly non-Bioconductor approaches.

Module 1 bioinformatics.ca

I’ve outlined the general workflow.

Each technology and application has its own unique characteristics to consider.

Let’s Define an Affymetrix-Specific Workflow

[Figure: the general workflow (Quantitation; Cy3, Cy5; Spot Quality; Background; Intra-Array, Inter-Array; Significance Testing; Clustering; Spot List; Integration?) annotated for Affymetrix arrays]

Annotations on the figure:
• Quantitation is done according to Affymetrix defaults with minimal user intervention.
• One-channel array
• Typically ignored
• Single-channel array, so one simultaneous normalization procedure

Let’s Collapse This a Bit And Re-Phrase Things

[Figure: collapsed Affymetrix workflow. Box labels: .CEL Files; Background; Normalization; ProbeSet Annotation; Statistics; Clustering; Spot List; Integration?]

First let’s go Back to Pre-Processing

What exactly is pre-processing (aka normalization)?

Why do we do it?

Sources of Technical Noise

Where does technical noise come from?

More Sources of Technical Noise

Any step in the experimental pipeline can introduce artifactual noise:

• Array design
• Array manufacturing
• Sample quality
• Sample identity (sequence effects?)
• Sample processing
• Hybridization conditions (ozone?)
• Scanner settings

Pre-processing tries to remove these systematic effects.

Important Note

Pre-processing is never a substitute for good experimental design. This is not a course on statistical design, but a few basic principles should be mentioned.

Always try to balance experimental groups.

Biological replicates are preferable to technical replicates.

If processing samples identically is not possible, include controls for processing-effects.


Affymetrix Pre-Processing Steps

1. Background Correction

2. Normalization

3. Probe-Specific Adjustment

4. Summarizing multiple Probes into a single ProbeSet

Let’s look at two common approaches.

Introducing Two Major Affymetrix Pre-Processing Methods

• The two most commonly used methods are:
  • RMA = Robust Multi-array Average
  • MAS5 = Microarray Suite version 5

• MAS5 has strengths & weaknesses
  • Sacrifices precision for accuracy
  • Can easily be used in clinical settings

• RMA has strengths & weaknesses
  • Sacrifices accuracy for precision
  • Challenging to integrate multiple studies
  • Reduces variance (critical for small-n studies)

• Both are well accepted by journals and reviewers, perhaps RMA a bit more so. We’ll talk about some of the mathematics later on in this course.

Approach #1: MAS5

• Affymetrix put significant effort into developing good data pre-processing approaches

• MAS5 was an attempt to develop a “standard” technique for 3’ expression arrays

• The flaws of MAS5 led to an influx of research in this area.

• The algorithm is best-described in an Affymetrix white-paper, and is actually quite challenging to reproduce exactly in R.

MAS5 Model

Observations = True Signal + Random Noise + Probe Effects

Assumptions?

MAS5: Background & Noise

Background

• Divide chip into zones
• Select lowest 2% intensity values
• stdev of those values is zone variability
• Background at any location is the sum of all zone backgrounds, weighted by 1/((distance^2) + fudge factor)

Noise

• Using same zones as above
• Select lowest 2% background
• stdev of those values is zone noise
• Noise at any location is the sum of all zone noise, weighted as above

From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf

MAS5: Adjusted Intensity

A = Intensity minus background; the final value should be > noise.

A: adjusted intensity
I: measured intensity
b: background
NoiseFrac: default 0.5 (another fudge factor)

And the value should always be >= 0.5 (log issues; another fudge factor)

From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
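The zone-weighted background and the adjusted intensity from the two MAS5 slides above can be sketched in Python. This is a minimal illustration under stated assumptions, not the Affymetrix implementation: the function names and the toy zone numbers are made up, the weighted sum is normalized by the total weight (so it is a weighted average), and `smooth=100` is an assumed value for the "fudge factor". The floor of 0.5 and the NoiseFrac of 0.5 are the values quoted on the slide.

```python
def weighted_zone_value(x, y, zone_centres, zone_values, smooth=100.0):
    """Weighted average of per-zone values at chip location (x, y).

    Each zone is weighted by 1 / (squared distance to its centre + smooth).
    """
    weights = [1.0 / ((x - cx) ** 2 + (y - cy) ** 2 + smooth)
               for cx, cy in zone_centres]
    return sum(w * v for w, v in zip(weights, zone_values)) / sum(weights)


def adjusted_intensity(intensity, background, noise, noise_frac=0.5):
    """Intensity minus background, never allowed below noise_frac * noise."""
    floored = max(intensity, 0.5)  # keeps later log transforms well defined
    return max(floored - background, noise_frac * noise)


# Toy example: four zones on a 100 x 100 chip (made-up numbers).
centres = [(25, 25), (75, 25), (25, 75), (75, 75)]
zone_background = [40.0, 60.0, 50.0, 70.0]
zone_noise = [4.0, 6.0, 5.0, 7.0]

b = weighted_zone_value(25, 25, centres, zone_background)
n = weighted_zone_value(25, 25, centres, zone_noise)
bright = adjusted_intensity(500.0, b, n)  # well above background
dim = adjusted_intensity(30.0, b, n)      # below background, so floored
```

A cell sitting at a zone centre is dominated by that zone's estimate, but every zone still contributes a little, which is what smooths the background across zone boundaries.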

MAS5: Ideal Mismatch

Because Sometimes MM > PM

From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
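The slide's equation did not survive in this transcript. A sketch of the rule as I understand it from the white paper cited above follows; treat the details as assumptions to check against that document. Here SB_i is the probe set's typical log2(PM/MM), computed as a Tukey biweight over its probe pairs, and the defaults are assumed to be contrastτ = 0.03 and scaleτ = 10.

```latex
\mathrm{IM}_{ij} =
\begin{cases}
\mathrm{MM}_{ij}, & \mathrm{MM}_{ij} < \mathrm{PM}_{ij} \\[4pt]
\dfrac{\mathrm{PM}_{ij}}{2^{\mathrm{SB}_i}}, & \mathrm{MM}_{ij} \ge \mathrm{PM}_{ij},\ \mathrm{SB}_i > \mathrm{contrast}\tau \\[8pt]
\dfrac{\mathrm{PM}_{ij}}{2^{\,\mathrm{contrast}\tau / \left(1 + (\mathrm{contrast}\tau - \mathrm{SB}_i)/\mathrm{scale}\tau\right)}}, & \mathrm{MM}_{ij} \ge \mathrm{PM}_{ij},\ \mathrm{SB}_i \le \mathrm{contrast}\tau
\end{cases}
```

In every case the ideal mismatch stays below PM, so PM minus IM is positive and can be log-transformed.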

MAS5: Signal

Value for each probe:

Modified mean of probe values:

Scaling Factor (Sc default 500)

ReportedValue(i) = nf * sf * 2^(SignalLogValue(i))

Signal (nf = 1)

Tbi = Tukey Biweight (mean estimate, resistant to outliers)
TrimMean = Mean less top and bottom 2%

From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
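To make the "resistant to outliers" point concrete, here is a minimal Python sketch of a one-step Tukey biweight and of the reported-value formula above. The tuning constants `c=5` and `epsilon=1e-4` are assumptions (I believe they match the white paper's defaults), the probe values are made up, and the scaling factor `sf` is simply passed in rather than derived from a trimmed mean.

```python
import statistics


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a mean estimate that down-weights outliers."""
    centre = statistics.median(values)
    mad = statistics.median([abs(v - centre) for v in values])
    weights = []
    for v in values:
        u = (v - centre) / (c * mad + epsilon)
        # Points far from the median (|u| >= 1) get zero weight.
        weights.append((1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def reported_value(signal_log_value, sf, nf=1.0):
    """ReportedValue = nf * sf * 2^SignalLogValue."""
    return nf * sf * 2.0 ** signal_log_value


probe_values = [10.0, 10.1, 9.9, 10.0, 50.0]  # made-up log2 values, one outlier
signal_log_value = tukey_biweight(probe_values)
plain_mean = sum(probe_values) / len(probe_values)
```

The plain mean is dragged towards the outlier at 50, while the biweight stays with the bulk of the probes.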

What is RMA?

RMA = Robust Multi-array Average

Why do we use a “robust” method?

Robust summaries improve on the standard ones by down-weighting outliers and leaving their effects visible in the residuals.

Why do we use “array”?

To put each chip’s values in the context of a set of similar values.

What is RMA?

Assumes all the chips have the same background distribution

Does not use the mismatch probe (MM) data from the microarray experiments

It is a log-scale linear additive model

Why?

What is RMA?

Mismatch probes (MM) definitely have information - about both signal and noise - but using it without adding more noise is a challenge

We should be able to improve the background correction using MM, without having the noise level blow up: topic of current research (GCRMA)

Ignoring MM decreases accuracy but increases precision


Methodology

Quantile Normalization – the goal of this method is to make the distribution of probe intensities for each array in a set of arrays the same. This method is motivated by the idea that a Q-Q plot shows that the distribution of two data vectors is the same if the plot is a straight diagonal line and not the same if it is anything else.
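A minimal Python sketch of quantile normalization, assuming each array is a plain list of probe intensities of the same length. The function name and the toy numbers are made up, and ties are broken by position, which is a simplification compared with production implementations.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution of probe intensities.

    `arrays` is a list of equal-length lists, one list per array.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: the mean across arrays at each rank.
    reference = [sum(s[rank] for s in sorted_arrays) / len(arrays)
                 for rank in range(n_probes)]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace each value by its rank's mean
        normalized.append(out)
    return normalized


raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
norm = quantile_normalize(raw)
```

After normalization every array contains exactly the same set of values, so a Q-Q plot of any two arrays is a straight diagonal line; only the order of the probes within each array differs.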


Methodology

Summarization: combining multiple probe intensities of each probeset to produce expression values

An additive linear model is fit to the normalized data to obtain an expression measure for each probeset on each GeneChip

Y_ij = a_j + β_i + ε_ij

Methodology

Y_ij = a_j + β_i + ε_ij

Y_ij denotes the background-corrected, normalized probe value corresponding to the ith GeneChip and the jth probe within the probeset [log2(PM-BG)*_ij]

ε_ij is the random error term

a_j is the probe affinity of the jth probe

β_i is the chip effect for the ith GeneChip (log-scale expression level)

Methodology

Y_ij = a_j + β_i + ε_ij

Estimate a_j (probe affinity) and β_i (chip effect) using a robust method:

• Tukey’s median polish (quick): fits iteratively, successively removing row and column medians and accumulating the terms, until the process stabilizes. The residuals are what is left at the end.
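The median polish described above can be sketched in Python for a single probeset. This is an illustration under assumptions, not the code used by any RMA implementation: rows are GeneChips (i), columns are probes (j), the values are made-up log2 intensities with no noise, and the probeset's expression on chip i is taken as the overall level plus that chip's effect.

```python
import statistics


def median_polish(y, max_iter=10, tol=1e-9):
    """Tukey's median polish on y[i][j] (chip i, probe j).

    Returns (overall, chip_effects, probe_effects, residuals).
    """
    n_chips, n_probes = len(y), len(y[0])
    resid = [list(row) for row in y]
    overall = 0.0
    chip = [0.0] * n_chips
    probe = [0.0] * n_probes
    for _ in range(max_iter):
        # Sweep the row (chip) medians out of the residuals.
        row_med = [statistics.median(r) for r in resid]
        for i in range(n_chips):
            chip[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = statistics.median(probe)
        probe = [p - shift for p in probe]
        overall += shift
        # Sweep the column (probe) medians out of the residuals.
        col_med = [statistics.median([resid[i][j] for i in range(n_chips)])
                   for j in range(n_probes)]
        for j in range(n_probes):
            probe[j] += col_med[j]
            for i in range(n_chips):
                resid[i][j] -= col_med[j]
        shift = statistics.median(chip)
        chip = [c - shift for c in chip]
        overall += shift
        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, chip, probe, resid


# Toy probeset: 3 chips x 3 probes, exactly additive (no noise).
y = [[7.0, 8.0, 9.0],
     [8.0, 9.0, 10.0],
     [9.0, 10.0, 11.0]]
overall, chip, probe, resid = median_polish(y)
expression = [overall + c for c in chip]  # log2 expression per chip
```

Because medians are used instead of means, a single aberrant probe on one chip ends up in the residuals rather than distorting the chip's expression estimate.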

RMA vs. MAS5

• RMA sacrifices accuracy for precision

• RMA is generally not appropriate for clinical settings

• RMA provides higher sensitivity/specificity in some tests

• RMA reduces variance (critical for small-n studies)

• RMA is better accepted by journals and reviewers

Topics For This Week
• Examples
• Attendance
• Pre-Processing
• QA/QC
• Microarray-Specific Statistics
• ProbeSet remapping
• Organizing -omics studies

One key detail has been omitted so far:

How do we know if our pre-processing actually worked?

Can we determine how well our pre-processing worked?

Or if our data looks good?

Let’s See Some “Bad” Data

[Figures: three array images]

Those Three Were From A Spike-In Experiment Done by Affymetrix

[Figures: three array images]

Those Last Three Were From An Experiment We Did On Rat Liver Samples

Were Those Bad Samples?

• Lots of evident spatial artifacts
• But in practice all samples were carried forward into analysis
• And validation (RT-PCR) confirmed the overall study results for many genes

Eye-ball Assessments Are Hard

• A couple of useful tricks:
  • Look at the distributions
    • Did quantile normalization work (for RMA)?
  • Look at the inter-sample correlations
    • Is one sample a strong outlier?
  • Look at the 3’ → 5’ trend across a ProbeSet

I know of no accepted, systematic QA/QC methods
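The inter-sample correlation check can be sketched in a few lines of Python. This is one informal way to flag a candidate outlier, not an accepted QA/QC standard; the sample names and intensities are made up, and Pearson correlation is an assumed choice.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_correlations(samples):
    """Mean correlation of each sample with all of the other samples."""
    return {
        name: sum(pearson(values, other)
                  for other_name, other in samples.items()
                  if other_name != name) / (len(samples) - 1)
        for name, values in samples.items()
    }


# Made-up normalized intensities for four hypothetical samples.
samples = {
    "s1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "s2": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
    "s3": [0.9, 2.1, 2.9, 4.2, 4.8, 6.1],
    "s4": [6.0, 1.0, 5.0, 2.0, 4.0, 3.0],
}
scores = mean_correlations(samples)
worst = min(scores, key=scores.get)  # candidate outlier to inspect by eye
```

A low mean correlation only says a sample looks different; whether it is a bad array or real biology still needs a human decision.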

Distributions (Raw)

Distributions (Normalized)

Inter-Sample Correlations

3’ → 5’ Signal Trend

What Do You Do If You Find a Bad Array?

• Repeat it?
• Drop the sample?
• Include it but account for the “noise” in another way?

In This Case

• We excluded a series of outlier samples
• We believed these samples had been badly degraded because they were derived from FFPE blocks

Final Distribution

Final Heatmap

Topics For This Week
• Examples
• Attendance
• Pre-Processing
• QA/QC
• Microarray-Specific Statistics
• ProbeSet remapping
• Organizing -omics studies

T-tests

• What are the assumptions of the t-test?

• When would you feel comfortable using a t-test?

T-Test Alternative: Wilcoxon Rank-Sum

• Also called:
  • U-test
  • Mann-Whitney (U) test

• Some argue that for continuous microarray data there is rarely a good reason to use this test:
  • Low n: tests of normality are not very powerful
  • High n: the central limit theorem provides support

• If the data are normal, its asymptotic efficiency relative to the t-test is 0.95
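As a small worked illustration, the rank-sum test can be run exactly on the tumour volumes from Example #1 by enumerating every way of assigning the pooled ranks to the TS group. This Python sketch assumes a one-sided alternative (TS smaller than OG) and uses midranks for ties; the function names are made up.

```python
from itertools import combinations


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def rank_sum_test(x, y):
    """Exact one-sided rank-sum test that x tends to be smaller than y."""
    ranks = midranks(list(x) + list(y))
    observed = sum(ranks[:len(x)])
    # Null distribution: every possible choice of len(x) ranks for group x.
    sums = [sum(c) for c in combinations(ranks, len(x))]
    p_lower = sum(s <= observed for s in sums) / len(sums)
    u = observed - len(x) * (len(x) + 1) / 2
    return observed, u, p_lower


ts = [3.9, 7.1, 3.1, 4.4, 5.0]       # Example #1, TS line (cm3)
og = [5.2, 1.9, 5.0, 6.1, 4.5, 4.8]  # Example #1, OG line (cm3)
rank_sum, u_stat, p_value = rank_sum_test(ts, og)
```

With 5 and 6 animals there are only 462 possible assignments, so full enumeration is cheap; for larger groups a normal approximation is the usual substitute.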

T-Test Alternative: Moderated Statistics

• A series of highly complex methods based on Bayesian statistical methodologies

• Gordon Smyth’s limma R package is by far the most widely used implementation of this technique

The variance term of the t-statistic is “shrunk” by borrowing information across all genes. This increases effective power.
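The shrinkage idea can be sketched in Python as a weighted average of a gene's own variance and a prior variance shared by all genes. This is a simplified illustration, not limma's code: the prior variance `s0_sq` and prior degrees of freedom `d0` are simply assumed here, whereas limma estimates them from all genes, and the numbers are made up.

```python
import math


def moderated_t(mean_diff, s2, df, n1, n2, s0_sq, d0):
    """Moderated t-statistic using a variance shrunk towards a prior value.

    s2, df    : this gene's sample variance and its degrees of freedom
    s0_sq, d0 : prior variance and prior degrees of freedom (assumed known)
    """
    s2_post = (d0 * s0_sq + df * s2) / (d0 + df)
    t = mean_diff / math.sqrt(s2_post * (1.0 / n1 + 1.0 / n2))
    return t, s2_post


# A gene whose sample variance happens to be tiny in a 3 vs 3 comparison.
t_mod, s2_post = moderated_t(mean_diff=0.5, s2=0.01, df=4,
                             n1=3, n2=3, s0_sq=0.25, d0=4)
t_ordinary = 0.5 / math.sqrt(0.01 * (1.0 / 3 + 1.0 / 3))
```

With few replicates a gene can show a very small variance by chance and produce a huge ordinary t-statistic; pulling the variance towards the typical value tempers that.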

T-Test Alternative: Permutation Tests

• SAM is the classic method
  • Most people suggest not using SAM today

• Empirically estimate the null distribution

[Figure: start with many samples, randomly sample, iterate]
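The "randomly sample, iterate" loop can be sketched in Python as a Monte Carlo permutation test on the difference in group means. This is a generic illustration and not SAM; the function name, the number of permutations, the seed and the expression values are all assumptions.

```python
import random


def permutation_p_value(x, y, n_perm=10000, seed=1):
    """Two-sided Monte Carlo permutation test on the difference in means."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign the group labels
        px, py = pooled[:len(x)], pooled[len(x):]
        if abs(sum(px) / len(px) - sum(py) / len(py)) >= observed:
            hits += 1
    # Add one so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


group_a = [1.0, 1.2, 0.9, 1.1, 1.0]  # made-up expression values
group_b = [3.0, 3.3, 2.9, 3.1, 3.2]
p_separated = permutation_p_value(group_a, group_b)
p_same = permutation_p_value(group_a, group_a)
```

The shuffled differences form the empirical null distribution; the p-value is simply the fraction of shuffles at least as extreme as the real labelling.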

Problems with Significance Testing

• What happens if there are NO changes?

• Imagine:
  • You analyzed 1,000 clinical samples
  • 20,000 genes in the genome
  • P < 0.05

• What if… somebody comes and randomizes all your data?

You had a lot of Data

20,000 genes / array x 1,000 patients = 20,000,000 data points

All randomized: genes are mixed up together, patients are mixed together

What happens if you analyze this data?

There should be NO real hits anymore!

What will you actually find?

Array: 20,000 genes

Threshold: p < 0.05

20,000 x 0.05 = 1,000 False Positives

This is called the “multiple testing” problem.

There is a solution
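The arithmetic above is easy to check by simulation. This Python sketch assumes that, with fully randomized data, each gene's p-value is an independent draw from a uniform distribution on [0, 1]; the seed is arbitrary.

```python
import random

rng = random.Random(0)
n_genes = 20000
alpha = 0.05

# Under the null hypothesis, p-values are uniform on [0, 1].
null_p_values = [rng.random() for _ in range(n_genes)]
false_positives = sum(p < alpha for p in null_p_values)
expected = n_genes * alpha
```

The simulated count lands close to the expected 1,000 even though not a single gene is truly changed.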

A “false-discovery rate adjustment” (FDR) for multiple testing considers all 20,000 p-values simultaneously

In this experiment there are lots of low p-values, so we can use this to “adjust” the p-values so we can find the true hits.

[Figure: distribution of p-values compared with the expected value; axis from 0% to 20%]

In this experiment, NO enrichment for low p-values, so no more hits than expected randomly.

This is what you get from randomized data…
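One widely used FDR adjustment is the Benjamini-Hochberg procedure; the slides do not say which procedure is meant, so treating it as the example here is an assumption. A minimal Python sketch with made-up p-values:

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)
```

Each adjusted value depends on the rank of its p-value among all the tests, which is the sense in which the adjustment considers all the p-values simultaneously.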

Topics For This Week
• Examples
• Attendance
• Pre-Processing
• QA/QC
• Microarray-Specific Statistics
• ProbeSet remapping
• Organizing -omics studies

The Mask Production Makes Affymetrix Designs Expensive To Change

Photolithographic mask

But… there are multiple probes per gene

We Can Change Those Mappings!

[Figure: hybridized chip]

CDF File

• Chip Definition File

• This file maps Probes (positions) into ProbeSets

• We can update those mappings
  • Ignore deprecated or cross-hybridizing probes
  • Merge multiple probes that recognize the same gene
  • Account for entirely new genes that were not known at the time of array-design
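Conceptually, remapping is just regrouping probes under an updated probe-to-gene table. The Python sketch below uses entirely hypothetical probe and gene identifiers and a plain dictionary; real CDF files and the published remappings are more involved.

```python
from collections import defaultdict

# Hypothetical probe -> gene assignments from a re-alignment of probe sequences.
# None marks a probe that is deprecated or cross-hybridizes.
updated_mapping = {
    "probe_001": "GENE_A",
    "probe_002": "GENE_A",
    "probe_003": None,
    "probe_004": "GENE_B",
    "probe_005": "GENE_A",  # originally in a second ProbeSet for the same gene
    "probe_006": "GENE_C",  # gene not known when the array was designed
}


def build_probesets(mapping):
    """Group probes into ProbeSets by gene, dropping unusable probes."""
    probesets = defaultdict(list)
    for probe, gene in mapping.items():
        if gene is not None:
            probesets[gene].append(probe)
    return dict(probesets)


remapped = build_probesets(updated_mapping)
```

The summarization step then runs on these new groups instead of the manufacturer's original ProbeSets.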

Sequence Mappings Are Slow

• Requires aligning millions of 25 bp probes against the transcriptome and identifying the best match for each

• Fortunately, other groups have done this for us, and regularly update their mappings

Many Probes Are Lost

But There Is Also A Major Benefit

Increased validation rates using RT-PCR (~10%)

Sandberg et al., BMC Bioinformatics, 2007

Topics For This Week
• Examples
• Attendance
• Pre-Processing
• QA/QC
• Microarray-Specific Statistics
• ProbeSet remapping
• Organizing -omics studies

What Are The Outputs of A Microarray Study?

• Primary Data
  • Raw image (.DAT file)
  • Quantitation (.CEL file)

• Secondary Data
  • Normalized data (usually an ASCII text file)
  • QA/QC plots

• Tertiary Data
  • Statistical analyses
  • Global visualization (e.g. heatmaps)
  • Downstream analyses (e.g. pathway, dataset-integration)

These files can be 10s of GB for a typical Affy study

How Do You Organize These Data?

/data/
I recommend you put things on a fast, backed-up network drive

/data/Project
Organize data by project

/data/Project/raw
/data/Project/QAQC
/data/Project/pre-processing
/data/Project/statistical
/data/Project/pathway
Create separate directories for each analysis
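Setting up this layout can be scripted so that every project starts out the same way. In this Python sketch a temporary directory stands in for the backed-up /data/ drive, which is an assumption made so the example can run anywhere; "Project" is the placeholder name from the slide.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the backed-up /data/ drive.
data_root = Path(tempfile.mkdtemp())
project = data_root / "Project"

# One directory per analysis, as listed on the slide.
analyses = ["raw", "QAQC", "pre-processing", "statistical", "pathway"]
for name in analyses:
    (project / name).mkdir(parents=True, exist_ok=True)

created = sorted(p.name for p in project.iterdir())
```

Because `exist_ok=True` is set, re-running the script on an existing project is harmless.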

How Do You Organize The Scripts?

I recommend you write a separate script for each analysis, and put those in a standardized (backed-up!) location, mirroring the directory structure and naming of your dataset directories.

Some sub-structure here is often useful:

/scripts/Project/pre-processing.R
/scripts/Project/statistical-univariate.R
/scripts/Project/statistical-multivariate.R
/scripts/Project/pathway/GOMiner.R
/scripts/Project/pathway/Reactome.R
/scripts/Project/integration/mRNA+CNV.R
/scripts/Project/integration/public-data.R

Why Many Small Scripts?

• Monolithic scripts are hard to maintain
  • Easier to make errors
    • Accidentally re-using the same variable name
  • Harder to debug
  • Harder for somebody else to learn

• Small scripts are more flexible
  • Quicker to modify/re-run a small part of your analysis
  • Easier to re-use the same code on another dataset

• This is akin to the “unix” mindset of systems design

What To Save?

• Everything!!
  • All QA/QC plots (common reviewer request)
  • All pre-processed data (needed for GEO uploads)
  • Gene-wise statistical analyses
    • Not just the statistically-significant genes
    • Collapse all analyses into one file, though
  • All plots/etc

• Using clear filenames is critical

• Disk-space is not usually a critical concern here
  • Your raw data will be much larger than your output!

Most Important Points

• Do not delete things:
  • Keep all old versions of your scripts by including the date in the filename (or using source-control)
  • Version output files by date
  • I have needed to go back to analyses done 7 years prior!

• Make regular (weekly) backups:
  • Try to pass this work off to professional sysadmins
  • External hard-drives/USBs are okay if you cannot get access to network drives, but try to automate

Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)