Next Generation Sequencing Data Analysis: From ChIP-Seq ...
Transcript of Next Generation Sequencing Data Analysis: From ChIP-Seq ...
![Page 1: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/1.jpg)
Next Generation Sequencing Data Analysis:
From ChIP-Seq Read Islands to Epigenomics Information
WBS Sommer Seminar, July 4th 2013 M. Jaritz, IMP
![Page 2: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/2.jpg)
Outline
• Brief intro to experimental techniques • Next generation sequencing raw results • Track quality measurements • Read densities, heatmaps & profiles • Peak calling & overlaps • Motif finding
![Page 3: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/3.jpg)
Immune System & B cell development
http://www.niaid.nih.gov/ by Bojan Vilagos
+ site specific recombinase technology (Cre-Lox system) -> KO
![Page 4: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/4.jpg)
Questions: For each cell stage, ...
... which chromatin marks are associated with each gene?
... is the gene’s DNA accessible to interacting proteins?
... which gene is bound by a transcription factor?
... which gene is expressed?
... how is the correlation between transcription factor binding, transcription and chromatin state?
... how does all this change between cell stages?
![Page 5: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/5.jpg)
Gene expression
Image: www.studyblue.com
Transcription factor
Chromatin marks
Accessible DNA
Expression
![Page 6: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/6.jpg)
Sequencing experiment protocols
6
![Page 7: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/7.jpg)
Associated Next Generation Sequencing methods
• RNA-Seq
• ChIP-Seq
• DNase-Seq
• GRO-Seq
• CAGE sample extraction
sample preparation
sequencing
cluster generation
sequencing reads (~10-100 million/sample)
![Page 8: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/8.jpg)
Bioinformatics Pipeline
• Image analysis & Base calling • Quality control (did the sequencing work?) • Sequence alignment (assembly) + QC • Read quantification (expressed? bound?
chromatin state? open chromatin? ...) • Genome wide summaries • Comparison between technical or biological
replicates • Comparison across cell/genotype boundaries
![Page 9: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/9.jpg)
Raw data ~115000 images per flowcell (36 bases)
Flowcell images: Whiteford et al., Bioinformatics 2009; Illumina; contig.worldpress.com, flxex 2010
@HW I-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1 TTAATTGGTAAATAAATCTCCTAATAGCTTAGATN +HW I-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1 efcf f f f fc feeff fcf f f f f fddf` feed]` ]_Ba_^__[YBBBBBBBB
Base calling
Alignment
Aligned data - genome browser
RNA-seq
DNase-seq
TF-seq
read
s
chromosome
![Page 10: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/10.jpg)
Aligned reads genome view
RNA-seq
DNase-seq
TF-seq
read
s
chromosome
TSS enhancer
expressed gene
![Page 11: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/11.jpg)
Some important key ChIP-Seq quality check numbers
Alignment statistics Read duplication
Strand cross correlation Correlation of technical replicates
![Page 12: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/12.jpg)
Spotfire webplayer & Integrated Genome Browser (IGB)
![Page 13: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/13.jpg)
TSS Read density across the genome ~
2100
0 TS
S
5kb
read density
![Page 14: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/14.jpg)
How to find piles of read densities: "peak calling"
Nature Reviews Genetics Pepke et al., Nature Methods, 2009 Zhang et al., Genome Biology, 2008
![Page 15: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/15.jpg)
Peak calling results re
ads
peaks
![Page 16: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/16.jpg)
Overlap of two track's peaks
1
2
![Page 17: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/17.jpg)
Peak calling & sample size
4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
0050—
040520
8900—
860000
7750—
680480
6600—
500960
5450—
320440
4300728140920
3150—
960400
2000—
780880
0850—
600360
700—
420840
550—
240320
537686302687
516382802483
495079302280
474075802075
452572301870
431068801667
410065301463
388761801260
367458301055
34605480850
32505130647
Marking:
Marking
Color by
sample_name
Pax5:Rag2--
Ebf1:biobio:Rag2--
Ebf1:Pax5--Rag2--Ebf1:Pax5--
E2A:biobio:Rag2--E2a:Pax5--
E2a:Ebf1--
random subsamples of million reads
peak
num
ber - different sample sizes
- different numbers of peaks called - saturation issue
![Page 18: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/18.jpg)
Peak overlaps of multiple tracks
0X0000
num
ber o
f pea
ks overlap
example 1 ----------- 2 -------- 3 ---- 4 ------
![Page 19: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/19.jpg)
Finding the most common sequence motif in a peak
For each peak: • get all reads, remove duplicates
• shift + extend reads (3 methods)
• get location and read count of summit (peak max)
• get ratio of reads around peaks over rest of peak (“peakiness”)
• recenter the peak
• extract associated peak sequences (masked/unmasked)
• select 300 “best” peaks, according to “peakiness” score (sort) and min deviation of shift method results (max location) >= 100
• meme-chip (meme [de novo], dreme [de novo], mast [scan], tomtom [motif compare], centrimo [centralenriched]
• search full peak sequence set with obtained motifs (mast)
• also for random(shuffled sequences)
![Page 20: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/20.jpg)
Bioinformatics pipeline & outlook
• Build a database of all our tracks (resource) • Compute quality values of ChIP-Seq tracks • Call peaks (TF/HS/Chromatin)
– calculate peak saturation • Assign peaks to genes • Find motifs (where applicable) • Compare (biological) replicates • Perform data analysis/mining
– Identify hotspots – Find interesting Heatmaps, densities, clusters ...
![Page 21: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/21.jpg)
Data != Publications
2008-2013 888 samples 234 sequencing requests 606 “lanes” 126 geno types 55 cell types
Revilla-I-Domingo R, Bilic I, Vilagos B, Tagoh H, Ebert A, Tamir IM, Smeenk L, Trupke J, Sommer A, Jaritz M, Busslinger M. EMBO J. 2012 Vilagos B, Hoffmann M, Souabni A, Sun Q, Werner B, Medvedovic J, Bilic I, Minnich M, Axelsson E, Jaritz M, Busslinger M. J Exp Med. 2012 McManus S, Ebert A, Salvagiotto G, Medvedovic J, Sun Q, Tamir I, Jaritz M, Tagoh H, Busslinger M. EMBO J. 2011 Ebert A, McManus S, Tagoh H, Medvedovic J, Salvagiotto G, Novatchkova M, Tamir I, Sommer A, Jaritz M, Busslinger M. Immunity. 2011
![Page 22: Next Generation Sequencing Data Analysis: From ChIP-Seq ...](https://reader030.fdocuments.in/reader030/viewer/2022012508/6184ce9b29cd663146452730/html5/thumbnails/22.jpg)
Acknowledgements
• Meinrad Busslinger & lab members, IMP • NGS Bioinformatics office
– Elin Axelsson, Malgorzata Goiser, Roman Stocsits
• Campus Science Support Facilities – Andreas Sommer (NGS) – Bioinformatics: Ido Tamir, Heinz Ekker (NGS) – Andras Aszodi (SCC)
• Bioinformatics Services – Wolfgang Lugmayr (linux cluster)
• Open source tools (R, NGS tools, peak callers, Galaxy ...) • Commercial tools: Spotfire, Ingenuity