Workshop NGS data analysis - 3
Sequencing data analysis
Workshop – part 3 / peak calling and annotation
Outline
Previously in this workshop…
Peak calling and annotation – the steps
Peak calling and annotation – the workshop
Maté Ongenaert
Previously in this workshop… Introduction – the real cost of sequencing
Data analysis
Raw machine reads… What’s next?
Preprocessing (machine/technology)
- adaptors, indexes, conversions, …
- machine/technology dependent
Reads with associated qualities (universal)
- FASTQ
- QC check
Depending on application (generally applicable)
- ‘de novo’ assembly of a genome (bacterial genomes, …)
- Mapping to a reference genome > mapped reads
- SAM/BAM/…
High-level analysis (application-specific)
- SNP calling
- Peak calling
Previously in this workshop… The workflow of NGS data analysis
Raw sequence reads:
- Represent the sequence ~ FASTA
- Extension: represent the quality, per base ~ FASTQ (Q for quality)
  Score ~ Phred ~ ASCII table ~ Phred + 33 = Sanger
FASTQ record:
@SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
FASTA record:
>SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
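The Phred + 33 (Sanger) encoding mentioned above can be decoded with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of any library.

```python
def phred33_scores(quality_string):
    """Decode a Sanger (Phred+33) quality string into integer Phred scores."""
    return [ord(char) - 33 for char in quality_string]


def error_probability(phred_score):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)


# Quality line of the FASTQ record shown above.
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred33_scores(quality)
print(scores[:5])                    # Phred scores of the first five bases
print(error_probability(scores[0]))  # error probability of the first base
```

A score of 20 corresponds to an error probability of 1 in 100, a score of 30 to 1 in 1000.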
Previously in this workshop… Main data formats
- Machine and platform independent and compressed: SRA (NCBI)
  Get the original FASTQ file using the SRA Toolkit (NCBI)
- Now moving to a common file format: SAM / BAM (Sequence Alignment/Map)
- BAM: binary (read: computer-readable, indexed, compressed) ‘form’ of SAM
DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION
# QNAME: template name
# FLAG
# RNAME: reference name
# POS: mapping position
# MAPQ: mapping quality
# CIGAR: CIGAR string
# RNEXT: reference name of the mate/next fragment
# PNEXT: position of the mate/next fragment
# TLEN: observed template length
# SEQ: fragment sequence
# QUAL: ASCII of Phred-scale base quality+33
# Headers
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
# Alignment block
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
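To make the 11 fields concrete, here is a small Python sketch that splits one alignment line into named fields and uses the CIGAR string to work out where the alignment ends on the reference. It assumes tab-separated input, as in a real SAM file; the helper names are illustrative, and in practice a dedicated library would be used instead.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]
    return record


def reference_span(cigar):
    """Reference bases covered: only M, D, N, = and X consume the reference."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in operations if op in "MDN=X")


# First alignment of the example block, written with tabs.
line = "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
record = parse_sam_line(line)
end = record["POS"] + reference_span(record["CIGAR"]) - 1  # 1-based, inclusive
print(record["QNAME"], record["POS"], end)
```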
- BED files (location / annotation / scores): Browser Extensible Data
  Used for mapping / annotation / peak locations
- Extension: bigBed (binary)
FIELDS USED:
# chr
# start
# end
# name
# score
# strand
track name=pairedReads description="Clone Paired Reads" useScore=1
#chr start end name score strand
chr22 1000 5000 cloneA 960 +
chr22 2000 6000 cloneB 900 -
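A short Python sketch of reading the six BED fields used above. It assumes whitespace-separated columns and skips track, browser and comment lines; the `BedRecord` type is illustrative. BED coordinates are 0-based and half-open, so the length of a region is simply end minus start.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    score: int
    strand: str


def parse_bed(lines):
    """Parse 6-column BED lines, ignoring track/browser/comment lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, name, score, strand = line.split()[:6]
        records.append(BedRecord(chrom, int(start), int(end), name, int(score), strand))
    return records


bed_lines = [
    'track name=pairedReads description="Clone Paired Reads" useScore=1',
    "#chr start end name score strand",
    "chr22 1000 5000 cloneA 960 +",
    "chr22 2000 6000 cloneB 900 -",
]
for record in parse_bed(bed_lines):
    print(record.name, record.end - record.start)
```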
- bedGraph files (location, combined with score)
  Used to represent peak scores
track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
#chr start end score
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50
- WIG files (location / annotation / scores): wiggle
  Used for visualization or for summarizing data, in most cases count data or normalized count data (RPKM)
- Extension: bigWig, the binary version (often used in GEO for ChIP-seq peaks)
browser position chr19:59304200-59310700
browser hide all
# 150 base wide bar graph at arbitrarily spaced positions,
# threshold line drawn at y=11.76
# autoScale off viewing range set to [0:25]
# priority = 10 positions this as the first graph
track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
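The variableStep layout can be turned into explicit intervals with a few lines of Python. This sketch handles only variableStep blocks (a fixedStep block would raise an error) and converts the 1-based WIG start positions into 0-based, half-open intervals as used by BED and bedGraph. The function name is illustrative.

```python
def variable_step_to_intervals(lines):
    """Convert variableStep WIG lines into (chrom, start, end, value) tuples.

    WIG positions are 1-based; the returned intervals are 0-based, half-open.
    """
    chrom, span = None, 1
    intervals = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        if fields[0] == "variableStep":
            options = dict(field.split("=", 1) for field in fields[1:])
            chrom = options["chrom"]
            span = int(options.get("span", 1))
            continue
        start, value = int(fields[0]), float(fields[1])
        intervals.append((chrom, start - 1, start - 1 + span, value))
    return intervals


wig_lines = [
    "browser hide all",
    'track type=wiggle_0 name="variableStep"',
    "variableStep chrom=chr19 span=150",
    "59304701 10.0",
    "59304901 12.5",
]
for interval in variable_step_to_intervals(wig_lines):
    print(*interval)
```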
- GFF format (General Feature Format) or GTF
  Used for annotation of genetic / genomic features, such as all coding genes in Ensembl
  Often used in downstream analysis to assign annotation to regions / peaks / …
FIELDS USED:
# seqname (the name of the sequence)
# source (the program that generated this feature)
# feature (the name of this type of feature, for example: exon)
# start (the starting position of the feature in the sequence)
# end (the ending position of the feature)
# score (a score between 0 and 1000)
# strand (valid entries include '+', '-', or '.')
# frame (if the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.)
# group (all lines with the same group are linked together into a single item)
track name=regulatory description="TeleGene(tm) Regulatory Regions"
#chr source feature start end scores tr fr group
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2
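As an illustration of the nine GFF columns, the sketch below parses one feature line into a dictionary. Real GFF files are tab-separated; splitting on any whitespace is an assumption made here so that the example lines work as they are written above. GFF coordinates are 1-based and inclusive, so the feature length is end minus start plus one.

```python
GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "group"]


def parse_gff_line(line):
    """Parse one GFF feature line into a dict with integer coordinates."""
    # Split at most 8 times so the last column (group) may contain spaces.
    record = dict(zip(GFF_COLUMNS, line.rstrip("\n").split(None, 8)))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["length"] = record["end"] - record["start"] + 1  # 1-based, inclusive
    return record


enhancer = parse_gff_line("chr22 TeleGene enhancer 1000000 1001000 500 + . touch1")
print(enhancer["feature"], enhancer["length"], enhancer["group"])
```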
Peak calling:
Identify genomic regions where the number of sequenced reads (coverage) in the IP sample is higher than expected from the input (control) samples. These enriched regions were possibly captured by the IP and therefore sequenced at higher coverage.
Peak annotation:
Once such enriched regions are identified: where are they located (intron/exon/…)? What is the closest gene or the closest promoter region?
Peak calling – the workflow
Peak calling:
Coverage
From the BAM file: mapping against the reference genome
Both the IP sample and the control (input) must be mapped; duplicates will be ignored by most peak callers
The peak caller will determine coverage for both samples
- Store them for visualisation (WIG files, bigWig files or similar)
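A very reduced picture of the coverage step, as a Python sketch: reads are counted per fixed-size window, and reads with identical chromosome, start and strand are treated as duplicates and counted once. The window size and the read representation are assumptions for illustration; real peak callers work from the BAM file and use more refined tag-density profiles.

```python
from collections import Counter


def window_counts(reads, window_size=200, drop_duplicates=True):
    """Count reads per (chrom, window index).

    reads: iterable of (chrom, start, strand) tuples with 0-based starts.
    """
    if drop_duplicates:
        reads = set(reads)  # identical chrom/start/strand collapse to one read
    return Counter((chrom, start // window_size) for chrom, start, _strand in reads)


reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 150, "-"), ("chr1", 450, "+")]
print(window_counts(reads))
print(window_counts(reads, drop_duplicates=False))
```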
Enriched
Find out which regions are enriched (either within the sample or versus a control (input) sample)
Statistics ~ model of tag distributions and normalisation strategy
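To illustrate what "enriched versus a control" can mean numerically, the sketch below scales the control count of a window to the sequencing depth of the IP sample and asks how surprising the IP count is under a Poisson model with that expectation. This is a simplified illustration, not the procedure of any particular peak caller; the pseudocount and the function names are assumptions.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower tail.

    Adequate for a sketch; very small p-values lose precision this way.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    lower_tail = term
    for i in range(1, k):
        term *= lam / i     # P(X = i) derived from P(X = i - 1)
        lower_tail += term
    return max(0.0, 1.0 - lower_tail)


def window_enrichment(ip_count, control_count, ip_total, control_total, pseudocount=1.0):
    """Return (fold enrichment, Poisson p-value) for one window."""
    scale = ip_total / control_total                  # depth normalisation
    expected = max(control_count, pseudocount) * scale
    return ip_count / expected, poisson_upper_tail(ip_count, expected)


fold, p_value = window_enrichment(ip_count=30, control_count=5,
                                  ip_total=1_000_000, control_total=500_000)
print(fold, p_value)
```

The pseudocount only prevents a division by zero in windows without control reads; how empty control windows are handled is a real design decision in peak callers.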
Comparison of peak callers (table; the per-program check marks could not be recovered from the slide)
Column groups: Density profiles; Peak assignment; Control data adjustment; Significance relative to control data; Statistical model / test
Columns: Program; Reference; Window-based; Tag clustering; Gaussian kernel; Strand-specific; Peak height or FE; Background subtract; Genomic dupl/deletions; FDR; Normalized control; Statistical model on control; Conditional binomial; Local Poisson; Chromosome Poisson; HMM; T-test
Programs: Cisgenome [73]; Minimal ChipSeq Peak Finder [74]; E-range [75]; MACS [76]; QuEST [77]; Hpeak [78]; Sole-Search [79]; PeakSeq [80]; SISSRS [81]; spp package [82]
Usage: macs14 <-t tfile> [-n name] [-g genomesize] [options]
Example: macs14 -t ChIP.bam -c Control.bam -f BAM -g h -n test -w --call-subpeaks
macs14 -- Model-based Analysis for ChIP-Sequencing
Options:
  --version
      show program's version number and exit
  -h, --help
      show this help message and exit.
  -t TFILE, --treatment=TFILE
      ChIP-seq treatment files. REQUIRED. When ELANDMULTIPET is selected, you must provide two files separated by comma, e.g. s_1_1_eland_multi.txt,s_1_2_eland_multi.txt
  -c CFILE, --control=CFILE
      Control files. When ELANDMULTIPET is selected, you must provide two files separated by comma, e.g. s_2_1_eland_multi.txt,s_2_2_eland_multi.txt
  -n NAME, --name=NAME
      Experiment name, which will be used to generate output file names. DEFAULT: "NA"
  -f FORMAT, --format=FORMAT
      Format of tag file, "AUTO", "BED" or "ELAND" or "ELANDMULTI" or "ELANDMULTIPET" or "ELANDEXPORT" or "SAM" or "BAM" or "BOWTIE". The default AUTO option will let MACS decide which format the file is. Please check the definition in 00README file if you choose ELAND/ELANDMULTI/ELANDMULTIPET/ELANDEXPORT/SAM/BAM/BOWTIE. DEFAULT: "AUTO"
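When several samples have to be processed, the command line can be assembled in a script. The Python sketch below only builds the argument list from options that appear in the usage text above (-t, -c, -f, -g, -n); the helper name is hypothetical, and actually running the command requires macs14 to be installed.

```python
import shlex


def build_macs14_command(treatment, control, name, genome_size="h", file_format="BAM"):
    """Assemble a macs14 argument list from the options shown in the usage text."""
    return ["macs14", "-t", treatment, "-c", control,
            "-f", file_format, "-g", genome_size, "-n", name]


command = build_macs14_command("ChIP.bam", "Control.bam", "test")
print(shlex.join(command))
# To execute: import subprocess; subprocess.run(command, check=True)
```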
Peak annotation
Enriched
Peak locations > in which features is my peak located? Is it close to a gene? Provide some statistics on how far my peaks are from annotated TSSs
R/Bioconductor: ChIPpeakAnno package
PeakAnalyzer
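The core question of peak annotation, "which TSS is closest to my peak and how far away is it", can be sketched in Python with a binary search over sorted TSS positions. Gene names and positions below are made up, strand is ignored, and this is not how the tools listed above are implemented.

```python
import bisect


def nearest_tss(chrom, summit, tss_by_chrom):
    """Return (gene, signed distance) of the TSS closest to a peak summit.

    tss_by_chrom maps a chromosome to a sorted list of (position, gene) tuples.
    The distance is summit minus TSS position; None if the chromosome is unknown.
    """
    tss_list = tss_by_chrom.get(chrom, [])
    if not tss_list:
        return None
    i = bisect.bisect_left(tss_list, (summit,))
    candidates = tss_list[max(0, i - 1):i + 1]  # neighbours left and right
    position, gene = min(candidates, key=lambda tss: abs(tss[0] - summit))
    return gene, summit - position


# Hypothetical annotation for illustration only.
tss_by_chrom = {"chr1": [(1000, "geneA"), (5000, "geneB")]}
print(nearest_tss("chr1", 1200, tss_by_chrom))
print(nearest_tss("chr1", 4000, tss_by_chrom))
```

Collecting the signed distances over all peaks gives the kind of distance-to-TSS statistics asked for above.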
Further downstream processing – Peak overlaps
Is the observed overlap larger than one would expect if the datasets were random?
The peak caller gives each peak a score
Randomly distribute these scores across the peaks of the same peak set (factor) and, for a given percentage of top peaks, calculate the number of overlapping peaks in the real dataset and with the randomly distributed scores
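The randomisation described above can be sketched in Python as follows. Peaks are (chrom, start, end) tuples with half-open coordinates, scores are shuffled within each peak set, and the overlap among the top-scoring fraction is counted for the real and the shuffled scores. All names, the number of randomisations and the overlap definition (at least one shared base) are assumptions for illustration.

```python
import random


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def top_fraction(peaks, scores, fraction):
    """Peaks with the highest scores, keeping the given fraction of the set."""
    n_top = max(1, int(len(peaks) * fraction))
    ranked = sorted(zip(scores, peaks), key=lambda pair: pair[0], reverse=True)
    return [peak for _score, peak in ranked[:n_top]]


def count_overlaps(peaks_a, peaks_b):
    """Number of peaks in A that overlap at least one peak in B."""
    return sum(any(overlaps(a, b) for b in peaks_b) for a in peaks_a)


def overlap_test(peaks_a, scores_a, peaks_b, scores_b, fraction, n_random=1000, seed=0):
    """Return (real overlap, mean random overlap, fraction of randomisations >= real)."""
    rng = random.Random(seed)
    real = count_overlaps(top_fraction(peaks_a, scores_a, fraction),
                          top_fraction(peaks_b, scores_b, fraction))
    random_counts = []
    for _ in range(n_random):
        shuffled_a, shuffled_b = list(scores_a), list(scores_b)
        rng.shuffle(shuffled_a)
        rng.shuffle(shuffled_b)
        random_counts.append(count_overlaps(top_fraction(peaks_a, shuffled_a, fraction),
                                            top_fraction(peaks_b, shuffled_b, fraction)))
    mean_random = sum(random_counts) / n_random
    at_least_real = sum(count >= real for count in random_counts)
    return real, mean_random, at_least_real / n_random
```

The ratio of the real overlap to the mean random overlap shows how much the top peaks of the two factors co-occur beyond what the score ranking alone would produce.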
Further downstream processing – Identify sequence motifs (the region around the ‘peak’ is searched for motifs)
Further downstream processing – Identify differentially bound regions between conditions/factors/…
Further downstream processing – Peak overlaps
             10%          15%          20%          30%          50%          75%
Real         7            18           25           52           102          201
Means        0.347        1.153        2.699        9.297        42.377       140.888
Factor diff  20.17291066  15.6114484   9.262689885  5.593202108  2.406966043  1.426665152
FDR          0            0            0            0            0            0
10%          282          333          506          907          1000         1000
20%          59           33           125          332          1000         1000
30%          4            2            9            27           981          1000
50%          2            0            0            0            95           1000
75%          0            0            0            0            0            148
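The "Factor diff" row is consistent with the real overlap divided by the mean overlap ("Means"), which the following Python lines reproduce from the numbers in the table. The variable names are illustrative.

```python
top_fractions = ["10%", "15%", "20%", "30%", "50%", "75%"]
real_overlaps = [7, 18, 25, 52, 102, 201]
mean_random_overlaps = [0.347, 1.153, 2.699, 9.297, 42.377, 140.888]


def factor_differences(real, means):
    """Ratio of the real overlap to the mean random overlap, per column."""
    return [real_count / mean for real_count, mean in zip(real, means)]


factors = factor_differences(real_overlaps, mean_random_overlaps)
for label, factor in zip(top_fractions, factors):
    print(f"{label}: {factor:.2f}")
```

The ratio is largest for the strictest cut-off (top 10%) and approaches 1 as more peaks are included.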