Day 5-2

Alistair Chalk, Elisabet Andersson. Stem Cell Biology and Bioinformatic Tools, DBRM, Karolinska Institutet, 18-24 September 2007. Day 5-2: What bioinformatics tools can be used for analysing ChIP data?

Page 1: Day 5-2

Day 5-2: What bioinformatics tools can be used for analysing ChIP data?

Page 2: Day 5-2

After this seminar

You should be able to:

Understand the differences between ChIP-chip and ChIP-Seq and identify key decision-making steps for choosing a platform

Identify bioinformatics steps needed for handling ChIP-chip and ChIP-Seq datasets

Understand underlying data from genome tiling arrays

Understand how to search for binding sites in genomic data

Understand the need for skills in handling large datasets

Page 3: Day 5-2

General problem

Find accessible regions of DNA that are bound to your protein.

What method is best?

What sort of bioinformatics skills are required?

What is real signal and what is noise?

What do you do with the regions once you have identified them?

Zheng, M. et al. (2007) ChIP-chip: data, model, and analysis. Biometrics, Vol 63, 787-796.

Page 4: Day 5-2

Experimental methods give different types of data

ChIP-chip: microarray data defining genomic regions

probe (with position usually defined) + signal intensity

ChIP-Seq: high-throughput DNA sequence

sequence fragments such as ACGATGTCA (from Solexa/SOLiD/454)

sequence position undefined (search required)

The same issues exist for microarray vs. deep sequencing in gene expression experiments

coverage

cost

practicality

Page 5: Day 5-2

Raw (sequence) data

Flat files, processed from base-calls to FASTA format

Solexa ~25-30 bp reads

A barcode is used to pool samples in one sequencing run

ACGT = Expt1

TGAC = Expt2

ACGT|Sequence

TGAC|Sequence
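
The barcode scheme above can be sketched in a few lines of Python. This is only an illustration: the `BARCODES` table copies the two example tags from the slide, the function name `demultiplex` and the "unassigned" bucket are assumptions, and real pipelines usually tolerate sequencing errors in the barcode as well.

```python
from collections import defaultdict

# Barcode-to-experiment table, mirroring the slide's example (illustrative only).
BARCODES = {"ACGT": "Expt1", "TGAC": "Expt2"}


def demultiplex(reads, barcodes=BARCODES):
    """Group reads by experiment and strip the leading barcode."""
    barcode_len = len(next(iter(barcodes)))  # assumes all barcodes share one length
    by_experiment = defaultdict(list)
    for read in reads:
        tag, insert = read[:barcode_len], read[barcode_len:]
        # Reads with an unrecognised barcode are kept apart rather than guessed.
        by_experiment[barcodes.get(tag, "unassigned")].append(insert)
    return dict(by_experiment)


pooled = ["ACGTACGATGTCA", "TGACTTGACCA", "GGGGACGATGT"]
groups = demultiplex(pooled)
```

Here `groups` maps each experiment name to the reads that carried its barcode, with the barcode itself removed.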

Page 6: Day 5-2

Choice of experiment

Choice of experiment depends on the focus you require:

Whole-genome broad coverage (of a known genome)?

Or a focused genomic region?

Or discovery based (known or unknown genome)?

How much coverage do you need?

Fewer broad experiments vs. many focused experiments?

Custom chips can be easily designed for focused regions and custom applications.

Page 7: Day 5-2

ChIP workflow

ChIP-chip

Select antibody

Select chip, or design and select probes

Map array probes to genomic position (BLAST/BLAT or lookup table from chip supplier)

Identify peaks from data and minimise false positives

Analyse peaks to predict binding sites

ChIP-seq

Select antibody

Decide how deep to sequence ($$$ vs. coverage)

Sequence fragments

Map sequences to genomic position (BLAST/BLAT)

Identify peaks from data and minimise false positives

Analyse peaks to predict binding sites

Page 8: Day 5-2

ChIP-chip: Ringo workflow example

Page 9: Day 5-2

ChIP-chip output

BMC Bioinformatics 2007, 8:219

Peaks on the genome

“Score” for each genomic position

Page 10: Day 5-2

Antibody selection

Success depends on your antibody

Select antibodies that are suitable for ChIP-chip experiments

Only a small number so far!

List available from

http://www.chiponchip.org/antibody.html

Page 11: Day 5-2

Microarray companies

DNA microarrays suitable for ChIP-chip assays:

Affymetrix: Human Chr21 & 22 tiling microarrays (oligonucleotide arrays); Human ENCODE tiling arrays (oligonucleotide arrays)

Agilent: Custom oligonucleotide arrays

NimbleGen Systems, Inc.: Human promoter microarrays; Human ENCODE microarrays; Custom oligonucleotide arrays

Aviva Systems Biology: Hu5K promoter arrays (PCR product arrays); Hu20K promoter arrays (oligo arrays)

Page 12: Day 5-2

Probe Design

Tiling: high-resolution arrays target genomic regions of interest

Whole genome or specific targeted regions?

Agilent eArray probe database: >21 million tiled CGH and ChIP-on-chip probes

Do it yourself: unassembled genomes, etc.
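
For the do-it-yourself case, the core of tiling is just cutting a region into fixed-length probes at a regular step. The sketch below shows only that idea; the function name `tile_probes` and the 50 bp length and 25 bp step are assumptions chosen for illustration. Real probe design also screens for repeats, melting temperature and cross-hybridisation, none of which is attempted here.

```python
def tile_probes(sequence, probe_len=50, step=25):
    """Return (start, probe) pairs covering `sequence` at a fixed step.

    `start` is a 0-based offset into `sequence`. A trailing stretch shorter
    than `probe_len` is left uncovered.
    """
    if probe_len <= 0 or step <= 0:
        raise ValueError("probe_len and step must be positive")
    probes = []
    for start in range(0, len(sequence) - probe_len + 1, step):
        probes.append((start, sequence[start:start + probe_len]))
    return probes


region = "ACGT" * 30  # 120 bp toy region
probes = tile_probes(region, probe_len=50, step=25)
```

A step smaller than the probe length gives overlapping probes (higher resolution, more probes); a step larger than the probe length leaves gaps between them.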

Page 13: Day 5-2

Mapping to genome

Genome assemblies are still being revised, especially for many organisms, so probe coordinates can change between releases

You must map the probe/sequence to a genomic location, either using

standard alignment software (BLAST/BLAT/vmatch/...)

or by relying on data files from the vendor (recommended for most cases)

R packages exist for annotating probes to genomic location
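
To make the mapping step concrete, here is a toy exact-match mapper in Python. It is a stand-in for illustration only, not how BLAST, BLAT or vmatch work internally: it indexes every k-mer of a small genome string and reports exact hits on both strands. All names (`build_index`, `map_exact`) and the toy genome are assumptions.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_index(genome, k):
    """Map every k-mer in `genome` to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def map_exact(query, genome, index, k):
    """Return sorted (position, strand) exact hits; query must be at least k long."""
    hits = []
    for strand, seq in (("+", query), ("-", reverse_complement(query))):
        # Seed with the first k bases, then verify the full-length match.
        for pos in index.get(seq[:k], []):
            if genome.startswith(seq, pos):
                hits.append((pos, strand))
    return sorted(hits)


genome = "TTGACCATGCAAGGTTTGACCA"
index = build_index(genome, k=4)
unique_hit = map_exact("CATGCA", genome, index, k=4)
multi_hit = map_exact("TGACCA", genome, index, k=4)
minus_hit = map_exact("CTTGCA", genome, index, k=4)
```

A query that returns more than one hit, such as `multi_hit` here, cannot be placed unambiguously, and an empty list means no exact match was found on either strand.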

Page 14: Day 5-2

Mapping to genome

For sequence-based methods this step is critical (and slow)

Need a Unix server to run (or VMware)

Do I need access to a computing cluster?

Choice of parameters for short sequences

Filter raw sequences -> representative sequence set

Do I need to pre-filter data? (some sequences will account for most of the compute time)

Must be aware of speed vs. specificity for large datasets
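
One simple way to get from raw sequences to a representative set is to collapse identical reads and keep a count for each, so every distinct sequence is aligned only once. The Python sketch below assumes that approach; the function name `collapse_reads`, the minimum length of 20 and the rule of dropping reads containing N are illustrative choices, not recommendations from the seminar.

```python
from collections import Counter


def collapse_reads(reads, min_len=20):
    """Collapse identical reads into (sequence, count) pairs, most frequent first.

    Reads shorter than `min_len` or containing an ambiguous base (N) are dropped.
    """
    counts = Counter(
        read.upper()
        for read in reads
        if len(read) >= min_len and "N" not in read.upper()
    )
    return counts.most_common()


raw_reads = [
    "ACGTACGTACGTACGTACGTAAAA",
    "acgtacgtacgtacgtacgtaaaa",  # same sequence, different case
    "ACGTNCGTACGTACGTACGTAAAA",  # contains an ambiguous base
    "ACGT",                      # too short
    "TTTTACGTACGTACGTACGTAAAA",
]
collapsed = collapse_reads(raw_reads)
```

The counts can be carried through the mapping step so that read depth per position is still recoverable afterwards.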

Page 15: Day 5-2

Normalisation

A normalization procedure:

(a) The MA plot before normalization shows a need for rotation to correct dye bias.

(b) To determine the correct angle of rotation, the σ(M) vs σ(A) plot of the differences between probes is generated. This circumvents the effect of binding signal in determining the rotation angle for the original MA plot in (a).

(c) The MA plot after rotation by the angle determined in (b). The green line is the fitted line after rotation.

(d) The MA plot after normalization.

BMC Bioinformatics. 2007; 8: 219.

An MA plot is a scatterplot with transformed axes. The x-axis (A) represents the average log intensity of the two channels, while the y-axis (M) represents the log-ratio.
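
As a minimal sketch, M and A can be computed per probe from the two channel intensities, and a rotation of the MA plot is then a plain 2-D rotation of the (A, M) points. The Python below assumes base-2 logarithms and takes the rotation angle as an input; it does not reproduce how the cited paper estimates that angle from the σ(M) vs σ(A) plot. The function names `ma_values` and `rotate_ma` are illustrative.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists: M = log2(R/G), A = mean of log2(R) and log2(G)."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


def rotate_ma(m, a, angle):
    """Rotate every (A, M) point clockwise by `angle` radians about the origin."""
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    a_rot = [x * cos_t + y * sin_t for x, y in zip(a, m)]
    m_rot = [-x * sin_t + y * cos_t for x, y in zip(a, m)]
    return m_rot, a_rot


red = [1024.0, 4096.0]
green = [256.0, 4096.0]
m, a = ma_values(red, green)

# Points lying on a line tilted by 45 degrees end up on the A axis (M close to 0).
m_flat, a_flat = rotate_ma([1.0, 2.0], [1.0, 2.0], math.pi / 4)
```

For the first probe, a four-fold excess in the red channel gives M = 2, while equal intensities in the second probe give M = 0.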

Page 16: Day 5-2

Peak detection

What regions of DNA contain signal peaks?

How to define a statistically significant peak?

Zheng, M. et al. (2007) ChIP-chip: data, model, and analysis. Biometrics, Vol 63, 787-796.
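
A deliberately naive way to see what "a peak" means in practice is to smooth the per-probe scores with a running mean and report stretches that stay above a cut-off. The Python sketch below does only that. The threshold is arbitrary and carries no statistical meaning, so it does not answer the significance question above and is not the model from Zheng et al.; the function name `call_peaks` and its defaults are assumptions.

```python
def call_peaks(positions, scores, window=3, threshold=1.0):
    """Return (start, end) regions whose running-mean score reaches `threshold`.

    `positions` must be sorted; `window` is counted in probes, not base pairs.
    """
    half = window // 2
    peaks = []
    current_start = None
    last_pos = None
    for i, pos in enumerate(positions):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        smoothed = sum(scores[lo:hi]) / (hi - lo)
        if smoothed >= threshold:
            if current_start is None:
                current_start = pos  # a new region opens here
            last_pos = pos
        elif current_start is not None:
            peaks.append((current_start, last_pos))  # region closes
            current_start = None
    if current_start is not None:
        peaks.append((current_start, last_pos))
    return peaks


positions = [100, 200, 300, 400, 500, 600, 700, 800]
scores = [0.1, 0.2, 1.8, 2.4, 2.0, 0.3, 0.0, 0.1]
peaks = call_peaks(positions, scores)
```

With these toy values the three neighbouring high-scoring probes are reported as a single region.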

Page 17: Day 5-2

Normalisation

Before normalization, the mock control appears to show the same differential enrichment between genic and intergenic regions as the histone occupancy, suggesting that the differential enrichment may be an artifact.

After normalization, the mock control no longer shows significant differential enrichment, while the H3 and H4 profiles still do.

Peng et al. BMC Bioinformatics 2007 8:219 doi:10.1186/1471-2105-8-219

Page 18: Day 5-2

Noise

Contamination: Do sequences match the expected genome?

Sequencing errors: Can you determine where a sequencing error is?

Multiple-mapping sequences: Many sequences do not have unique genome matches

Dye-specific bias

ChIP-chip data for chromatin-associated proteins and histone modifications present additional challenges, as they often display broad regions of enrichment. This is in contrast to the isolated and sharp peaks that are typical for the binding of transcription factors.

Page 19: Day 5-2

Peak detection - replicates

Use replicates to improve detection

Peaks that are consistent between replicates are more likely to be true

Zheng, M. et al. (2007) ChIP-chip: data, model, and analysis. Biometrics, Vol 63, 787-796.
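
One straightforward reading of "consistent between replicates" is to keep only regions where peaks from two replicates overlap. The Python sketch below takes that reading as an assumption; it uses closed (start, end) intervals on a single chromosome, and the function names are illustrative. Model-based approaches that combine replicate evidence statistically are not shown.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def reproducible_peaks(rep1, rep2):
    """Return the overlapping part of every peak pair found in both replicates."""
    shared = []
    for p in rep1:
        for q in rep2:
            if overlaps(p, q):
                shared.append((max(p[0], q[0]), min(p[1], q[1])))
    return sorted(shared)


replicate1 = [(300, 500), (1200, 1300), (2000, 2100)]
replicate2 = [(350, 550), (2050, 2200), (5000, 5100)]
shared_peaks = reproducible_peaks(replicate1, replicate2)
```

Peaks seen in only one replicate, such as (1200, 1300) and (5000, 5100) here, are dropped.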

Page 20: Day 5-2

What next?

Given that you've identified accessible regions in the genome, what information can be gathered from this sequence?

Use discovery methods to look for common patterns in the regions: MEME, etc.

Use TFBS databases to look for known transcription factor binding sites in the sequence:

TRANSFAC: high coverage, noisy database

JASPAR: low coverage, higher quality

Page 21: Day 5-2

R packages for ChIP-chip

Ringo: well-documented workflow and good tutorial

BAC (Bayesian Analysis of ChIP-chip data): perfect example of minimal documentation

Page 22: Day 5-2

Summary

You should be able to

Understand the differences between ChIP-chip and ChIP-Seq and identify key decision-making steps for choosing a platform

Identify bioinformatics requirements for handling ChIP-chip and ChIP-Seq datasets

Find transcription factor binding sites in genomic data

Understand the need for skills in handling large datasets