Ngs workshop passarelli-mapping-1

Read Processing and Mapping: From Raw to Analysis-ready Reads

Ben Passarelli, Stem Cell Institute Genome Center

NGS Workshop, 12 September 2012

Samples to Information

Variant calling, gene expression, chromatin structure, methylome, immunorepertoires, de novo assembly, …

Many Analysis Pipelines Start with Read Mapping

Genotyping (GATK)
http://www.broadinstitute.org/gsa/wiki/images/7/7a/Overall_flow.jpg
http://www.broadinstitute.org/gatk/guide/topic?name=intro

RNA-seq (Tuxedo)
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

From Raw to Analysis-ready Reads

Raw reads → Read assessment and prep → Mapping → Local realignment → Duplicate marking → Base quality recalibration → Analysis-ready reads

Session Topics

• Understand read data formats and quality scores
• Identify and fix some common read data problems
• Find and prepare a genomic reference for mapping
• Map reads to a genome reference
• Understand alignment output
• Sort, merge, index alignment for further analysis
• Locally realign at indels to reduce alignment artifacts
• Mark/eliminate duplicate reads
• Recalibrate base quality scores
• An easy way to get started

Instrument Output

Instruments: Illumina MiSeq, Illumina HiSeq, Ion Torrent PGM, Roche 454, Pacific Biosciences RS

Native output formats
• Illumina: images (.tiff), cluster intensity file (.cif), base call file (.bcl)
• Ion Torrent, Roche 454: standard flowgram file (.sff)
• Pacific Biosciences: movie, trace (.trc.h5), pulse (.pls.h5), base (.bas.h5)

All are converted to sequence data (FASTQ format).

FASTQ Format (Illumina Example)

@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ

Read record (four lines each):
1. Header
2. Read bases
3. Separator (with optional repeated header)
4. Read quality scores

Header fields include: flow cell ID, lane, tile, tile coordinates, barcode

NOTE: For paired-end runs, there is a second file with one-to-one corresponding headers and reads.
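
The four-line record layout can be read with a few lines of Python. This is a minimal sketch, assuming uncompressed text input with exactly four lines per record (no wrapped sequences); the names `FastqRecord` and `read_fastq` and the short example record are hypothetical and not from the slides.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str   # line 1 without the leading "@"
    bases: str    # line 2
    quality: str  # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(bases) != len(quality):
            raise ValueError(f"base/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], bases, quality)


# Hypothetical two-record input, for illustration only.
example = io.StringIO(
    "@read1 1:N:0:AGTCAA\nGATTACA\n+\nIIIIIII\n"
    "@read2 1:N:0:AGTCAA\nACGT\n+read2\nII5+\n"
)
records = list(read_fastq(example))
```

For gzip-compressed files such as `*.fastq.gz`, the same function can be given a handle from `gzip.open(path, "rt")`.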


Base Call Quality: Phred Quality Scores

Phred* quality score Q with base-calling error probability P: Q = -10 log10(P)

* Name of first program to assign accurate base quality scores. From the Human Genome Project.

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126

S - Sanger Phred+33 range: 0 to 40
I - Illumina 1.3+ Phred+64 range: 0 to 40
L - Illumina 1.8+ Phred+33 range: 0 to 41

Q score | Probability of base error | Base confidence | Sanger-encoded (Q score + 33) ASCII character
10 | 0.1 | 90% | "+"
20 | 0.01 | 99% | "5"
30 | 0.001 | 99.9% | "?"
40 | 0.0001 | 99.99% | "I"
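
The relationship between Q, P, and the ASCII encoding can be checked with a short Python sketch. The function names are hypothetical; the default offset of 33 corresponds to the Sanger / Illumina 1.8+ encoding, and 64 would be passed for Illumina 1.3+ data.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to a list of Q scores."""
    return [ord(ch) - offset for ch in quality]


def encode_quality(scores: List[int], offset: int = 33) -> str:
    """Convert Q scores to their ASCII-encoded quality string."""
    return "".join(chr(q + offset) for q in scores)


# The four rows of the table: Q = 10, 20, 30, 40.
table_chars = encode_quality([10, 20, 30, 40])
table_scores = decode_quality(table_chars)
```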


File Organization

[benpass@solexalign]$ ls
Sample_FS53_EPCAM+_CD10-_IL2270-18
Sample_FS53_EPCAM+_CD10+_IL2269-19
Sample_COH77_CD49F-_IL2275-13
Sample_COH77_CD49F+_CD66-_IL2274-14
Sample_COH77_CD49F+_CD66+_IL2273-15
Sample_COH74_EPCAM+_CD10-_IL2272-16
Sample_COH74_EPCAM+_CD10+_IL2271-17
Sample_COH69_EPCAM+_CD10-_IL2268-20
Sample_COH69_EPCAM+_CD10+_IL2267-21

[benpass@solexalign]$ ls Sample_COH77_CD49F-_IL2275-13
COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz

File name annotations: barcode (AGTCAA), read (R1 or R2), format (fastq), gzip compressed (.gz)
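
Because paired-end data arrive as matching R1 and R2 files, it can help to pair them programmatically before mapping. The sketch below assumes the naming scheme visible in the listing (sample, barcode, lane, read, chunk number); the regular expression and the function name `pair_fastq_files` are illustrative assumptions, not part of any Illumina tool.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed pattern: <sample>_<barcode>_L<lane>_R<read>_<chunk>.fastq.gz
NAME_PATTERN = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGT]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<chunk>\d{3})\.fastq\.gz$"
)


def pair_fastq_files(names: List[str]) -> List[Tuple[str, str]]:
    """Return (R1, R2) pairs sharing sample, barcode, lane and chunk."""
    groups: Dict[tuple, Dict[str, str]] = defaultdict(dict)
    for name in names:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # file does not follow the assumed naming scheme
        key = (match["sample"], match["barcode"], match["lane"], match["chunk"])
        groups[key][match["read"]] = name
    pairs = []
    for key in sorted(groups):
        reads = groups[key]
        if "1" in reads and "2" in reads:  # keep complete pairs only
            pairs.append((reads["1"], reads["2"]))
    return pairs


listing = [
    "COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz",
    "COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz",
    "COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz",
    "COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz",
]
pairs = pair_fastq_files(listing)
```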

Initial Read Assessment

Common problems that can affect analysis
• Low confidence base calls
  – typically toward ends of reads
  – criteria vary by application
• Presence of adapter sequence in reads
  – poor fragment size selection
  – protocol execution or artifacts
• Over-abundant sequence duplicates
• Library contamination

Initial Read Assessment: FastQC

• Free download
  Download: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y
• Samples reads (200K default): fast, low resource use

Read Assessment Examples

Sanger Quality Score by Cycle
Plot shows: median, interquartile range, 10-90 percentile range, mean

Read Duplication
http://proteo.me.uk/2011/05/interpreting-the-duplicate-sequence-plot-in-fastqc
• ~8% of sampled sequences occur twice
• ~6% of sequences occur more than 10x
• ~71.48% of sequences are duplicates

Note: Duplication is based on read identity, not alignment, at this point.
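
Duplication by read identity can be illustrated by counting exact sequence matches. This is a simplified sketch, not FastQC's implementation: the function name and the definition of "duplicate fraction" (reads beyond the first copy of each distinct sequence, divided by all reads) are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def duplication_summary(sequences: List[str]) -> Dict[str, object]:
    """Summarize exact-sequence duplication in a list of reads."""
    counts = Counter(sequences)  # sequence -> number of occurrences
    total = len(sequences)
    distinct = len(counts)
    return {
        "total": total,
        "distinct": distinct,
        # Reads beyond the first copy of each distinct sequence.
        "duplicate_fraction": (total - distinct) / total if total else 0.0,
        # Occurrence count -> number of distinct sequences seen that often.
        "occurrence_histogram": dict(Counter(counts.values())),
    }


# Hypothetical reads: one sequence seen twice, one once, one three times.
summary = duplication_summary(["ACGT", "ACGT", "TTTT", "GGGG", "GGGG", "GGGG"])
```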




Read Assessment Example (Cont’d)

Per base sequence content should resemble this… (plot not reproduced in transcript)

TruSeq Adapter, Index 9: 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG

• Trim for base quality or adapters (run or library issue)
• Trim leading bases (library artifact)
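
Both kinds of trimming can be sketched in a few lines. The functions below are hypothetical and deliberately naive: quality trimming simply drops trailing bases below a threshold, and adapter clipping uses exact matching only, whereas dedicated tools tolerate mismatches and use more careful scoring.

```python
from typing import Tuple


def trim_low_quality_3prime(
    bases: str, quality: str, min_q: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_q."""
    end = len(bases)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return bases[:end], quality[:end]


def clip_adapter(
    bases: str, quality: str, adapter: str, min_overlap: int = 8
) -> Tuple[str, str]:
    """Remove an exactly matching adapter and everything after it."""
    pos = bases.find(adapter)  # whole adapter inside the read
    if pos == -1:
        # Otherwise look for an adapter prefix running off the 3' end.
        longest = min(len(adapter), len(bases))
        for overlap in range(longest, min_overlap - 1, -1):
            if bases.endswith(adapter[:overlap]):
                pos = len(bases) - overlap
                break
    if pos == -1:
        return bases, quality  # no adapter found
    return bases[:pos], quality[:pos]


# First 12 bases of the TruSeq adapter shown above.
ADAPTER_PREFIX = "GATCGGAAGAGC"
clipped = clip_adapter("ACGTACGTAC" + "GATCGGAAGA", "I" * 20, ADAPTER_PREFIX)
trimmed = trim_low_quality_3prime("ACGTAC", "IIII##")
```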

Selected Tools to Process Reads

Fastx toolkit* http://hannonlab.cshl.edu/fastx_toolkit/ (partial list)
• FASTQ Information: Chart Quality Statistics and Nucleotide Distribution
• FASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise)
• FASTQ Clipper: Removing sequencing adapters
• FASTQ Quality Filter: Filters sequences based on quality
• FASTQ Quality Trimmer: Trims (cuts) sequences based on quality
• FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality
* Defaults to old Illumina FASTQ (ASCII offset 64). Use the -Q33 option.

SeqPrep https://github.com/jstjohn/SeqPrep
• Adapter trimming
• Merge overlapping paired-end reads

Biopython http://biopython.org, http://biopython.org/DIST/docs/tutorial/Tutorial.html (for Python programmers)
• Especially useful for implementing custom/complex sequence analysis/manipulation

Galaxy http://galaxy.psu.edu
• Great for beginners: upload data, point and click
• Just about everything you’ll see in today’s presentations

Read Mapping

http://www.broadinstitute.org/igv/




Read Mapping: Aligning to a Reference

Feature | SOAP2 (2.20) | Bowtie (0.12.8) | BWA (0.6.2) | Novoalign (2.07.00)
License | GPL v3 | LGPL v3 | GPL v3 | Commercial
Mismatch allowed | exactly 0, 1, 2 | 0-3 max in read | user specified; max is function of read length and error rate | up to 8 or more
Alignments reported per read | random/all/none | user selected | user selected | random/all/none
Gapped alignment | 1-3bp gap | no | yes | up to 7bp
Pair-end reads | yes | yes | yes | yes
Best alignment | minimal number of mismatches | minimal number of mismatches | minimal number of mismatches | highest alignment score
Trim bases | 3’ end | 3’ and 5’ end | 3’ and 5’ end | 3’ end

Read Mapping: BWA

BWA Features
• Uses Burrows-Wheeler Transform
  – fast
  – modest memory footprint (<4GB)
• Accurate
• Tolerates base mismatches
  – increased sensitivity
  – reduces allele bias
• Gapped alignment for both single- and paired-end reads
• Automatically adjusts parameters based on read lengths and error rates
• Native BAM/SAM output (the de facto standard)
• Large installed base, well-supported
• Open-source (no charge)
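
To give a feel for why the Burrows-Wheeler Transform supports fast matching, here is a toy Python sketch. It builds the transform from sorted rotations and counts exact occurrences of a pattern by backward search. It is not how BWA is implemented (BWA uses compressed indexes and handles mismatches and gaps); the function names are hypothetical and the approach only suits short strings.

```python
def burrows_wheeler_transform(text: str, terminator: str = "$") -> str:
    """Return the BWT of text, built naively from sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    text += terminator  # "$" sorts before letters in ASCII
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_matches(bwt: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search."""
    # Start offset of each character in the sorted first column.
    first = {}
    total = 0
    for ch in sorted(set(bwt)):
        first[ch] = total
        total += bwt.count(ch)
    lo, hi = 0, len(bwt)  # half-open range of matching rotations
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + bwt[:lo].count(ch)
        hi = first[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


toy_bwt = burrows_wheeler_transform("banana")
dna_bwt = burrows_wheeler_transform("ACGTACGT")
```

Each pattern character narrows the range of candidate rotations, so the search cost depends on the pattern length rather than on scanning the whole reference.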

Sequence References and Annotations

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml
http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome
• Comprehensive reference information

http://hgdownload.cse.ucsc.edu/downloads.html
• Comprehensive reference, annotation, and translation information

ftp://[email protected]/bundle
• References and SNP information data by GATK
• Human only

http://cufflinks.cbcb.umd.edu/igenomes.html
• Pre-indexed references and gene annotations for Tuxedo suite
• Human, Mouse, Rat, Cow, Dog, Chicken, Drosophila, C. elegans, Yeast

http://www.repeatmasker.org/

Fasta Sequence Format

>chr1
…TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggtagccaccaggcacatgcagccactgagcacttgaaatgtggatagtctgaattgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctagaaaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaataaccatatttgggatatactggattttaaaaatatatcactaatttcat…
>chr2
…
>chr3
…

• One or more sequences per file
• “>” denotes beginning of sequence or contig
• Subsequent lines up to the next “>” define sequence
• Lowercase base denotes repeat masked base
• Contig ID may have comments delimited by “|”
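
The rules above translate directly into a small reader. This is a minimal sketch with hypothetical function names; it loads whole sequences into memory, so for a full human reference an indexed reader would be the more practical choice. The example input is made up for illustration.

```python
import io
from typing import Dict, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Return {sequence name: sequence} from a FASTA text stream."""
    sequences: Dict[str, str] = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            name = line[1:].strip()  # keep the full ID, including any "|" comments
            chunks = []
        elif name is None:
            raise ValueError("sequence data before the first '>' line")
        else:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def soft_masked_fraction(sequence: str) -> float:
    """Fraction of bases written in lowercase (repeat masked)."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base.islower()) / len(sequence)


# Hypothetical two-contig input.
genome = read_fasta(io.StringIO(">chr1\nTGGACTTgaaat\ncctt\n>chr2\nACGT\n"))
```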

Running BWA

Input files: reference.fasta, read1.fastq.gz, read2.fastq.gz

Step 1: Index the genome (~3 CPU hours for a human genome reference):

    bwa index -a bwtsw reference.fasta

Step 2: Generate alignments in Burrows-Wheeler transform suffix array coordinates:

    bwa aln reference.fasta read1.fastq.gz > read1.sai
    bwa aln reference.fasta read2.fastq.gz > read2.sai

Apply option -q <quality threshold> to trim poor quality bases at 3'-ends of reads

Step 3: Generate alignments in the SAM format (paired-end):

    bwa sampe reference.fasta read1.sai read2.sai \
        read1.fastq.gz read2.fastq.gz > alignment_output.sam

http://bio-bwa.sourceforge.net/bwa.shtml




Running BWA (Cont’d)

Simple Form:

    bwa sampe reference.fasta read1.sai read2.sai \
        read1.fastq.gz read2.fastq.gz > alignment.sam

Output to BAM:

    bwa sampe reference.fasta read1.sai read2.sai \
        read1.fastq.gz read2.fastq.gz | samtools view -Sbh - > alignment.bam

With Read Group Information:

    bwa sampe -r "@RG\tID:readgroupID\tLB:libraryname\tSM:samplename\tPL:ILLUMINA" \
        reference.fasta \
        read1.sai read2.sai \
        read1.fastq.gz read2.fastq.gz | samtools view -Sbh - > alignment.bam

SAM (BAM) Format

Sequence Alignment/Map format
– Universal standard
– Human-readable (SAM) and compact (BAM) forms

Structure
– Header: version, sort order, reference sequences, read groups, program/processing history
– Alignment records

SAM/BAM Format: Header

[benpass align_genotype]$ samtools view -H allY.recalibrated.merge.bam
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:chrM LN:16571
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
…
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
…
@RG ID:86-191 PL:ILLUMINA LB:IL500 SM:86-191-1
@RG ID:BsK010 PL:ILLUMINA LB:IL501 SM:BsK010-1
@RG ID:Bsk136 PL:ILLUMINA LB:IL502 SM:Bsk136-1
@RG ID:MAK001 PL:ILLUMINA LB:IL503 SM:MAK001-1
@RG ID:NG87 PL:ILLUMINA LB:IL504 SM:NG87-1
…
@RG ID:SDH023 PL:ILLUMINA LB:IL508 SM:SDH023
@PG ID:GATK IndelRealigner VN:2.0-39-gd091f72 CL:knownAlleles=[] targetIntervals=tmp.intervals.list LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
@PG ID:bwa PN:bwa VN:0.6.2-r126

Annotations
• samtools view -H: view the BAM header
• @HD: sort order
• @SQ: reference sequence names with lengths
• @RG: read groups with platform, library and sample information
• @PG: program (analysis) history

SAM/BAM Format: Alignment Records

[benpass align_genotype]$ samtools view allY.recalibrated.merge.bam

Each record is a single tab-separated line; the two records below are wrapped for display.

HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188
TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC
=7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/
RG:Z:86-191

HW-ST605:127:B0568ABXX:3:1104:21059:173553 83 chr1 27682 60 101M = 27664 -119
ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA
8;8.7::<?=BDHFHGFFDCGDAACCABHCCBDFBE</BA4//BB@BCAA@CBA@CB@ABA>A??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0
RG:Z:SDH023

* Many fields after column 12 (e.g., recalibrated base scores) have been deleted for improved readability.

Columns (per the SAM specification): 1 read name, 2 flag, 3 reference name, 4 position, 5 mapping quality, 6 CIGAR, 7 mate reference name ("=" means same reference), 8 mate position, 9 template length, 10 read bases, 11 base qualities, 12 onward optional tags (e.g., RG)

http://samtools.sourceforge.net/SAM1.pdf
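
The flag column packs several yes/no properties into one integer. The sketch below decodes the flag bits defined in the SAM specification (through 0x400) and splits a record into its eleven mandatory columns; the function names and the short example line are hypothetical.

```python
from typing import Dict, List

# Bit values from the SAM specification; the labels are shorthand chosen here.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
]

COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
           "rnext", "pnext", "tlen", "seq", "qual"]


def decode_flag(flag: int) -> List[str]:
    """List the properties whose bits are set in a SAM flag."""
    return [label for bit, label in SAM_FLAGS if flag & bit]


def parse_alignment(line: str) -> Dict[str, object]:
    """Split one tab-separated SAM record into named columns plus tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(COLUMNS):
        raise ValueError("a SAM record needs at least 11 columns")
    record: Dict[str, object] = dict(zip(COLUMNS, fields))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = fields[len(COLUMNS):]
    return record


# Shortened, made-up record using the same layout as the examples above.
record = parse_alignment("read1\t147\tchr1\t27675\t60\t4M\t=\t27588\t-91\tACGT\tIIII\tRG:Z:86-191")
```

Applied to the two records shown, flag 147 decodes to a paired, properly paired, reverse-strand, second-in-pair read, and flag 83 to the same properties for a first-in-pair read.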


Preparing for Next Steps

• Subsequent steps require sorted and indexed BAMs
  – Sort orders: karyotypic, lexicographical
  – Indexing improves analysis performance
• Picard tools: fast, portable, free
  http://picard.sourceforge.net/command-line-overview.shtml
  – Sort: SortSam.jar
  – Merge: MergeSamFiles.jar
  – Index: BuildBamIndex.jar
• Order: sort, merge (optional), index
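
The difference between the two sort orders is easy to see on chromosome names. In this sketch, "karyotypic" is taken to mean the order in which sequences appear in the reference (as in the @SQ header lines shown earlier); that interpretation and the function names are assumptions made for illustration.

```python
from typing import List


def lexicographic_order(names: List[str]) -> List[str]:
    """Plain string sort: chr10 sorts before chr2."""
    return sorted(names)


def order_by_reference(names: List[str], reference_names: List[str]) -> List[str]:
    """Sort names by their position in the reference sequence list."""
    rank = {name: index for index, name in enumerate(reference_names)}
    return sorted(names, key=lambda name: rank[name])


reference = ["chrM", "chr1", "chr2", "chr10", "chrX", "chrY"]
contigs = ["chr10", "chr2", "chrY", "chr1", "chrM", "chrX"]

lexicographic = lexicographic_order(contigs)
karyotypic = order_by_reference(contigs, reference)
```

Mixing the two orders between a reference and its alignments is a common source of errors in downstream tools, which is one reason to sort consistently before merging and indexing.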

Local Realignment

• BWT-based alignment is fast for matching reads to reference
• Individual base alignments often sub-optimal at indels
• Approach
  – Fast read mapping with BWT-based aligner
  – Realign reads at indel sites using gold standard (but much slower) Smith-Waterman [1] algorithm
• Benefits
  – Refines location of indels
  – Reduces erroneous SNP calls
  – Very high alignment accuracy in significantly less time, with fewer resources

[1] Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5. PMID 7265238
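
For reference, the Smith-Waterman recurrence itself is compact. The sketch below is a textbook version with a linear gap penalty and hypothetical scoring values (match 2, mismatch -1, gap -2); it is not the realignment procedure used by any particular tool, and its quadratic cost is the reason it is applied only around candidate indel sites.

```python
from typing import Tuple


def smith_waterman(
    a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2
) -> Tuple[int, str, str]:
    """Return (best local score, aligned a, aligned b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Made-up reference window and a read carrying a one-base deletion.
sw_score, ref_aligned, read_aligned = smith_waterman("ACGTTGCA", "ACGTGCA")
```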

Local Realignment

Figure panels: raw BWA alignment; post re-alignment at indels

DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889


Duplicate Marking

• Covered in genotyping presentation
• Note that this is done after alignment



Base Quality Recalibration

STEP 1: Find covariates at non-dbSNP sites using:
• Reported quality score
• The position within the read
• The preceding and current nucleotide (sequencer properties)

    java -Xmx4g -jar GenomeAnalysisTK.jar \
        -T BaseRecalibrator \
        -I alignment.bam \
        -R hg19/ucsc.hg19.fasta \
        -knownSites hg19/dbsnp_135.hg19.vcf \
        -o alignment.recal_data.grp

STEP 2: Generate BAM with recalibrated base scores:

    java -Xmx4g -jar GenomeAnalysisTK.jar \
        -T PrintReads \
        -R hg19/ucsc.hg19.fasta \
        -I alignment.bam \
        -BQSR alignment.recal_data.grp \
        -o alignment.recalibrated.bam
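
The idea behind Step 1 can be illustrated with a toy calculation: group bases by their covariates, count mismatches against the reference at sites not known to be variant, and convert the observed error rate back to a Phred score. This is a deliberately simplified sketch, not GATK's model; the function names, the input tuple layout, and the +1/+2 pseudocounts are assumptions made for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (reported Q, position in read, preceding+current nucleotide)
Covariates = Tuple[int, int, str]


def empirical_quality(errors: int, observations: int) -> float:
    """Phred-scaled observed error rate, with assumed pseudocounts to avoid log(0)."""
    return -10 * math.log10((errors + 1) / (observations + 2))


def build_recalibration_table(
    observations: Iterable[Tuple[int, int, str, bool]]
) -> Dict[Covariates, float]:
    """Map each covariate bin to its empirical quality.

    Each observation is (reported_q, cycle, dinucleotide, is_mismatch) for a
    base at a site that is not a known variant.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [errors, total]
    for reported_q, cycle, dinucleotide, is_mismatch in observations:
        tally = counts[(reported_q, cycle, dinucleotide)]
        tally[0] += int(is_mismatch)
        tally[1] += 1
    return {key: empirical_quality(e, n) for key, (e, n) in counts.items()}


# Made-up data: 998 bases reported as Q30 in one bin, 9 of them mismatches.
data = [(30, 5, "AC", index < 9) for index in range(998)]
recal_table = build_recalibration_table(data)
```

In this made-up bin the bases were reported as Q30 but behave like Q20, which is the kind of discrepancy recalibration is meant to correct.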

Base Quality Recalibration (Cont’d)

Getting Started

Is there an easier way to get started?!!

http://galaxy.psu.edu/ Click “Use Galaxy”

Q&A
