Steve Newhouse 28 Jan 2011. Practical guide to processing next generation sequencing data No...

44
Next Generation Sequence Alignment & Variant Discovery on the BRC-MH Linux Cluster Steve Newhouse 28 Jan 2011

Transcript of Steve Newhouse 28 Jan 2011. Practical guide to processing next generation sequencing data No...

Page 1: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Next Generation Sequence Alignment & Variant Discovery on the BRC-MH Linux

Cluster

Steve Newhouse28 Jan 2011

Page 2: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Practical guide to processing next generation sequencing data

No details on the inner workings of the software/code & theory

Based on the 1KG pipeline from the Broad Institute using their Genome Analysis Tool Kit (GATK).

Focus on Illumina paired-end sequence data Alignment with BWA or Novoalign SNP & Indel calling with GATK NB: This is one way processing the data that

works well

Overview

Page 3: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

BRC Cluster Software : http://compbio.brc.iop.kcl.ac.uk/cluster/software.php Maq: http://maq.sourceforge.net/ Fastqc : http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ Fastx: http://hannonlab.cshl.edu/fastx_toolkit/ cmpfastq.pl :  http://compbio.brc.iop.kcl.ac.uk/software/cmpfastq.php BWA: http://bio-bwa.sourceforge.net/bwa.shtml Novoalign: http://www.novocraft.com Genome Analysis Toolkit:

http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit PICARD TOOLS: http://picard.sourceforge.net/ SAMTOOLS: http://samtools.sourceforge.net/ VCFTOOLS: http://vcftools.sourceforge.net/ FASTQ Files : http://en.wikipedia.org/wiki/FASTQ_format, SAM/BAM Format : http://samtools.sourceforge.net/SAM1.pdf PHRED Scores: http://en.wikipedia.org/wiki/Phred_quality_score Next Generation Sequencing Library: http://ngslib.genome.tugraz.at http://seqanswers.com http://www.broadinstitute.org/gsa/wiki/index.php/File:Ngs_tutorial_depristo_121

0.pdf

Main Tools/Resources

Page 4: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Convert Illumina Fastq to sanger Fastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert Sequence Alignment/Map (SAM) to BAM

Local realignment around Indels Remove duplicates Base Quality Score Recalibration Variant Discovery

Analysis Pipeline

Page 5: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Work flow Illumina Raw fastq

Convert Illumina Fastq to sanger Fastq

QC raw data

Mapping (BWA, QC-BWA, Novoalign)

Convert SAM to BAM

Local realignment around Indels

Remove duplicates

Base Quality Score Recalibration

Analysis-readyreads

Indels & SNPs

Page 6: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Fastq Format :*_sequence.txt; ◦ s_1_1_sequence.txt = lane 1, read 1◦ s_1_2_sequence.txt = lane 1, read 2

Text file storing both nucleotide sequence and quality scores.

Both the sequence letter and quality score are encoded with a single ASCII character for brevity.

Standard for storing the output of high throughput sequencing instruments such as the Illumina Genome Analyzer

http://en.wikipedia.org/wiki/FASTQ_format

What does the raw data look like?

Page 7: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Raw Data :-

@315ARAAXX090414:8:1:567:552#0

TGTTTCTTTAAAAAGGTAAGAATGTTGTTGCTGGGCTTAGAAATATGAATAACCATATGCCAGATAGATAGATGGA

+

;<<=<===========::==>====<<<;;;:::::99999988887766655554443333222211111000//

@315ARAAXX090414: the unique instrument name 8: flowcell lane 1: tile number within the flowcell lane 567: 'x'-coordinate of the cluster within the tile 552: 'y'-coordinate of the cluster within the tile # :index number for a multiplexed sample (0 for no indexing) 0 :the member of a pair, /1 or /2 (paired-end or mate-pair reads only) http://en.wikipedia.org/wiki/FASTQ_format

FASTQ Format

Page 8: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Illumina Raw fastq

Convert Illumina Fastq to sanger Fastq

QC raw data

Mapping (BWA, QC-BWA, Novoalign)

Convert SAM to BAM

Local realignment around Indels

Remove duplicates

Base Quality Score Recalibration

Analysis-readyreads

Indels & SNPs

Convert Illumina Fastq to sanger FastqQC raw data

Page 9: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Convert Illumina Fastq to sanger Fastq

$: maq ill2sanger s_1_1_sequence.txt foo.1.sanger.fastq$: maq ill2sanger s_1_2_sequence.txt foo.2.sanger.fastq

Quality Control & Pre-Alignment Processing.1

Page 10: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

FastQC: Provides a simple way to do some quality control checks on raw sequence data. ◦ Quick impression of whether the data has problems.◦ Import of data from BAM, SAM or FastQ◦ Summary graphs and tables to quickly assess your data◦ Export of results to an HTML based permanent report◦ Offline operation to allow automated generation of reports without running the

interactive application

$: fastqc foo.1.sanger.fastq;$: fastqc foo.2.sanger.fastq;

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

Quality Control & Pre-Alignment Processing.2

Page 11: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

FastQC

Page 12: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Illumina Raw fastq

Convert Illumina Fastq to sanger Fastq

QC raw data

Mapping (BWA, QC-BWA, Novoalign)

Convert SAM to BAM

Local realignment around Indels

Remove duplicates

Base Quality Score Recalibration

Analysis-readyreads

Indels & SNPs

Mapping (BWA, QC-BWA, Novoalign)Convert SAM to BAM

Page 13: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Available genomes◦ Homo_sapiens_assembly18.fasta ◦ human_b36_both.fasta ◦ human_g1k_v37.fasta (1000 genomes)

Indexed for use with BWA or Novoalign

Location: /scratch/data/reference_genomes/human

Human reference sequences and dbSNP reference metadata are available in a tarball: ◦ ftp://ftp.broadinstitute.org/pub/gsa/gatk_resources.tgz

Reference genomes

Page 14: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

## Align with BWA

$: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta

$: bwa aln -t 8 $REF foo.1.sanger.fastq > foo.1.sai;

$: bwa aln -t 8 $REF foo.2.sanger.fastq > foo.2.sai;

## Generate alignment in the SAM format

$: bwa sampe $REF foo.1.sai foo.2.sai foo.1.sanger.fastq foo.2.sanger.fastq > foo.bwa.raw.sam;

## Sort bwa SAM file using PICARD TOOLS SortSam.jar - this will also produce the BAM file

$: java -jar SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT \

INPUT= foo.bwa.raw.sam OUTPUT= foo.bwa.raw.bam;

## samtools index

$: samtools index foo.novo.raw.bam;

Use option -q15 if the quality is poor at the 3' end of reads http://bio-bwa.sourceforge.net/bwa.shtml

BWA: Burrows-Wheeler Aligner

Page 15: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Fastx: http://hannonlab.cshl.edu/fastx_toolkit◦ QC filter raw sequence data ◦ always use -Q 33 for sanger phred scaled data (-Q 64)

$: cat foo.1.sanger.fastq | \

fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT | \

fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT | \

fastx_clipper -Q 33 -l 20 -v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC | \

fastx_clipper -Q 33 -l 20 -v –a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC | \

fastq_quality_trimmer -Q 33 -t 20 -l 20 -v | \

fastx_artifacts_filter -Q 33 -v | \

fastq_quality_filter -Q 33 -q 20 -p 50 -v -o foo.1.sanger.qc.fastq;

$: cat foo.2.sanger.fastq | \

fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT | \

fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT | \

fastx_clipper -Q 33 -l 20 –v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC | \

fastx_clipper -Q 33 -l 20 -v –a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC | \

fastq_quality_trimmer -Q 33 -t 20 -l 20 -v | \

fastx_artifacts_filter -Q 33 -v | \

fastq_quality_filter -Q 33 -q 20 -p 50 -v -o foo.2.sanger.qc.fastq;#

BWA : with pre-alignment QC.1

Page 16: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Compare QCd fastq files◦ One end of each read could be filtered out in QC◦ BWA cant deal with mixed PE & SE data◦ Need to id reads that are still paired after QC◦ Need to id reads that are no longer paired after QC

$: perl cmpfastq.pl foo.1.sanger.qc.fastq foo.2.sanger.qc.fastq

Reads matched on presence/absence of id's in each file : ◦ foo.1.sanger.qc.fastq : @315ARAAXX090414:8:1:567:552#1◦ foo.2.sanger.qc.fastq : @315ARAAXX090414:8:1:567:552#2

Output: 2 files for each *QC.fastq file

◦ foo.1.sanger.qc.fastq-common-out (reads in foo.1.* == reads in foo.2.*) ◦ foo.1.sanger.qc.fastq-unique-out (reads in foo.1.* not in foo.2.*)◦ foo.2.sanger.qc.fastq-commont-out◦ foo.2.sanger.qc.fastq-unique-out

BWA : with pre-alignment QC.2

 http://compbio.brc.iop.kcl.ac.uk/software/cmpfastq.php

Page 17: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Align with BWA

$: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta$: bwa aln -t 8 $REF foo.1.sanger.qc.fastq-common-out > foo.1.common.sai;$: bwa aln -t 8 $REF foo.2.sanger.qc.fastq-common-out > foo.2.common.sai;$: bwa aln -t 8 $REF foo.1.sanger.qc.fastq-unique-out > foo.1.unique.sai;$: bwa aln -t 8 $REF foo.1.sanger.qc.fastq-unique-ou > foo.2.unique.sai;

Multi threading option : -t N http://bio-bwa.sourceforge.net/bwa.shtml

BWA : with pre-alignment QC.3

Page 18: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Generate alignments in the SAM/BAM format

$: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta

## bwa sampe for *common*

$: bwa sampe $REF foo.1.common.sai foo.2.common.sai foo.1.sanger.qc.fastq-common-out foo.2.sanger.qc.fastq-common-out > foo.common.sam;

## bwa samse for *unique*

$: bwa samse $REF foo.1.unique.sai foo.1.sanger.qc.fastq-unique-out > foo.1.unique.sam;

$: bwa samse $REF foo.2.unique.sai foo.2.sanger.qc.fastq-unique-out > foo.2.unique.sam;

## merge SAM files using PICARD TOOLS MergeSamFiles.jar - this will also sort the BAM file

$: java -jar MergeSamFiles.jar INPUT=foo.common.sam INPUT=foo.1.unique.sam INPUT=foo.2.unique.sam ASSUME_SORTED=false VALIDATION_STRINGENCY=SILENT OUTPUT=foo.bwa.raw.bam;

## samtools index

samtools index foo.bwa.raw.bam;

Details SAM/BAM Format : http://samtools.sourceforge.net/SAM1.pdf

BWA : with pre-alignment QC.4

Page 19: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Has options for adaptor stripping and quality filters – and much more

More accurate than BWA but slower unless running MPI version

$1,990/year for full set of tools – worth it!

$: REF=/scratch/data/reference_genomes/human/human_g1k_v37

$: novoalign -d $REF -F STDFQ -f foo.1.sanger.fastq foo.2.sanger.fastq \

-a GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG ACACTCTTTCCCTACACGACGCTCTTCCGATCT \

-r Random -i PE 200,50 -c 8 --3Prime -p 7,10 0.3,10 -k -K foo.novo.test \

-o SAM $'@RG\tID:foo\tPL:Illumina\tPU:Illumina\tLB:tumour\tSM:foo' \

> foo.novo.stats > foo.novo.raw.sam;

http://www.novocraft.com

Novoalign : quality aware gapped aligner.1

Page 20: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Novoalign produces a name sorted SAM file which needs to be co-ordinate sorted for downstream processing

## Sort novo SAM file using PICARD TOOLS SortSam.jar - this will also produce the BAM file

$: java -jar SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT \ INPUT= foo.novo.raw.sam OUTPUT= foo.novo.raw.bam;

## samtools index

$: samtools index foo.novo.raw.bam;

Novoalign : quality aware gapped aligner.2

Page 21: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Local realignment around Indels Remove duplicate reads Base Quality Score Recalibration

GATK: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

PICARD TOOLS: http://picard.sourceforge.net SAMTOOLS: http://samtools.sourceforge.net

Many other quality stats/options for processing files using these tools : see web documentation

Post Aligment processing of BAM files

Page 22: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Illumina Raw fastq

Convert Illumina Fastq to sanger Fastq

QC raw data

Mapping (BWA, QC-BWA, Novoalign)

Convert SAM to BAM

Local realignment around Indels

Remove duplicates

Base Quality Score Recalibration

Analysis-readyreads

Indels & SNPs

Local realignment around Indels

Page 23: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Sequence aligners are unable to perfectly map reads containing insertions or deletions◦ Alignment artefacts ◦ False positives SNPs

Steps to the realignment process: ◦ Step 1: Determining (small) suspicious intervals

which are likely in need of realignment◦ Step 2: Running the realigner over those intervals◦ Step 3: Fixing the mate pairs of realigned reads

http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels

Local realignment around Indels.1

Page 24: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Local realignment around Indels.2

Original BAM file

forRealigner.intervals (interval file)

Realigned BAM file

RealignerTargetCreator (GATK)

IndelRealigner (GATK)

Co-ordinate sorted Realigned BAM file

SortSam (PICARD)

Co-ordinate sorted Realigned fixed BAM file

FixMateInformation (PICARD)

Page 25: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

$: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta$:

ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod

$: TMPDIR=~/scratch/tmp$: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar

## 1. Creating Intervals : RealignerTargetCreator $: java –Xmx5g -jar $GATK -T RealignerTargetCreator -R $REF -D $ROD \-I foo.novo.bam -o foo.novo.bam.forRealigner.intervals;

## 2. Realigning : IndelRealigner $: java -Djava.io.tmpdir=$TMPDIR –Xmx5g -jar $GATK -T IndelRealigner \-R $REF -D $ROD -I foo.novo.bam -o foo.novo.realn.bam \-targetIntervals foo.novo.bam.forRealigner.intervals;

## samtools index$: samtools index foo.novo.realn.bam;

Local realignment around Indels.3

Page 26: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

## 3. Sort realigned BAM file using PICARD TOOLS SortSam.jar

## GATK IndelRealigner produces a name sorted BAM

$: java –Xmx5g -jar SortSam.jar \

INPUT= foo.novo.realn.bam OUTPUT=foo.novo.realn.sorted.bam \

SORT_ORDER=coordinate TMP_DIR=$TMPDIR VALIDATION_STRINGENCY=SILENT;

## samtools index

$: samtools index foo.novo.realn.soretd.bam;

## 4. Fixing the mate pairs of realigned reads using Picard tools FixMateInformation.jar

$: java -Djava.io.tmpdir=$TMPDIR -jar -Xmx6g FixMateInformation.jar \

INPUT= foo.novo.realn.sorted.bam OUTPUT= foo.novo.realn.sorted.fixed.bam \

SO=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=$TMPDIR;

## samtools index

samtools index foo.novo.realn.sorted.fixed.bam ;

Local realignment around Indels.3

Page 27: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Illumina Raw fastq

Convert Illumina Fastq to sanger Fastq

QC raw data

Mapping (BWA, QC-BWA, Novoalign)

Convert SAM to BAM

Local realignment around Indels

Remove duplicates

Base Quality Score Recalibration

Analysis-readyreads

Indels & SNPs

Remove duplicates

Page 28: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Examine aligned records in the supplied SAM or BAM file to locate duplicate molecules and remove them

$: TMPDIR=~/scratch/tmp## Remove duplicate reads with Picard tools MarkDuplicates.jar $: java -Xmx6g –jar MarkDuplicates.jar \INPUT= foo.novo.realn.sorted.fixed.bam \OUTPUT= foo.novo.realn.duperemoved.bam \METRICS_FILE=foo.novo.realn.Duplicate.metrics.file \REMOVE_DUPLICATES=true \ASSUME_SORTED=false TMP_DIR=$TMPDIR \VALIDATION_STRINGENCY=SILENT;## samtools indexsamtools index foo.novo.realn.duperemoved.bam;

Remove duplicate reads

Page 29: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Illumina Raw fastq

Convert Illumina Fastq to sanger Fastq

QC raw data

Mapping (BWA, QC-BWA, Novoalign)

Convert SAM to BAM

Local realignment around Indels

Remove duplicates

Base Quality Score Recalibration

Analysis-readyreads

Indels & SNPs

Base Quality Score RecalibrationAnalysis-ready reads

Page 30: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Correct for variation in quality with machine cycle and sequence context

Recalibrated quality scores are more accurate Closer to the actual probability of mismatching the reference

genome

Done by analysing the covariation among several features of a base ◦ Reported quality score◦ The position within the read◦ The preceding and current nucleotide (sequencing chemistry effect) observed

by the sequencing machine◦ Probability of mismatching the reference genome◦ Known SNPs taken into account (dbSNP 131)

Covariates are then used to recalibrate the quality scores of all reads in a BAM file

http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

Base Quality Score Recalibration.1

Page 31: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Base Quality Score Recalibration.2

Co-ordinate sorted Realigned fixed BAM file

Covariates table (.csv)

Final Recalibrated BAM file

CountCovariates

TableRecalibraion

Recalibrated covariatestable (.csv)

CountCovariates

AnalyzeCovariates

Post-recalibrationanalysis plots

Pre-recalibrationanalysis plots

AnalyzeCovariates

Page 32: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

## set env variables

$: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar

$: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta

$: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod

## 1. GATK CountCovariates

java -Xmx8g -jar $GATK -T CountCovariates -R $REF --DBSNP $ROD \

-I foo.novo.realn.duperemoved.bam \-recalFile foo.novo.realn.duperemoved.bam.recal_data.csv \

--default_platform Illumina \

-cov ReadGroupCovariate \

-cov QualityScoreCovariate \

-cov CycleCovariate \

-cov DinucCovariate \

-cov TileCovariate \

-cov HomopolymerCovariate \

-nback 5;

http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

Base Quality Score Recalibration.3

Page 33: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar$: GATKacov=/share/apps/gatk_20100930/Sting/dist/AnalyzeCovariates.jar$: GATKR=/share/apps/gatk_20100930/Sting/R$: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta$: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod$: Rbin=/share/apps/R_current/bin/Rscript

## 2. GATK AnalyzeCovariatesjava -Xmx5g –jar $GATKacov \-recalFile foo.novo.realn.duperemoved.bam.recal_data.csv \-outputDir foo.novo.realn.duperemoved.bam.recal.plots \-resources $GATKR \-Rscript $Rbin;

http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

Base Quality Score Recalibration.4

Page 34: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Generate the final analysis ready BAM file for Variant Discovery and Genotyping

## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar$: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta$: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod

## 3. GATK TableRecalibration $: java –Xmx6g -jar $GATK -T TableRecalibration -R $REF \-I foo.novo.realn.duperemoved.bam \--out foo.novo.final.bam \-recalFile foo.novo.realn.duperemoved.bam.recal_data.csv \--default_platform Illumina;

##samtools index$: samtools index foo.novo.final.bam;

http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

Base Quality Score Recalibration.5

Page 35: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Illumina Raw fastq

Convert Illumina Fastq to sanger Fastq

QC raw data

Mapping (BWA, QC-BWA, Novoalign)

Convert SAM to BAM

Local realignment around Indels

Remove duplicates

Base Quality Score Recalibration

Analysis-readyreads

Indels & SNPsSNP & Indel calling with GATK

Page 36: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

SNP & Indel calling with GATKFinal Recalibrated BAM file

IndelGenotyperV2

gatk.raw.indels.verbose.output.bed

gatk.raw.indels.bed

gatk.raw.indels.detailed.output.vcf

gatk.indels.filtered.bed

filterSingleSampleCalls.pl

gatk.filtered.indels.simple.bed...chr1 556817 556817 +G:3/7chr1 3535035 3535054 -TTCTGGGAGCTCCTCCCCC:9/21 chr1 3778838 3778838 +A:15/48...

UnifiedGenotyper

makeIndelMask.py

gatk.raw.snps.vcf

VariantFiltration

indels.mask.bed

gatk.filtered.snps.vcf

Page 37: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

## set env variables

$: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar

$: GATKPERL=/share/apps/gatk_20100930/Sting/perl

$: GATKPYTHON=/share/apps/gatk_20100930/Sting/python

$: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta

$: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod

## Generate raw indel calls

$: java -Xmx6g -jar $GATK -T IndelGenotyperV2 -R $REF --DBSNP $ROD \

-I foo.novo.final.bam \

-bed foo.gatk.raw.indels.bed \

-o foo.gatk.raw.indels.detailed.output.vcf \

--metrics_file foo.gatk.raw.indels.metrics.file \

-verbose foo.gatk.raw.indels.verbose.output.bed \

-minCoverage 8 -S SILENT –mnr 1000000;

## Filter raw indels

$: perl $GATKPERL/filterSingleSampleCalls.pl --calls foo.gatk.raw.indels.verbose.output.bed \

--max_cons_av_mm 3.0 --max_cons_nqs_av_mm 0.5 --mode ANNOTATE > foo.gatk.filtered.indels.bed

http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0

Short Indels (GATK IndelGenotyperV2)

Page 38: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

The output of the IndelGenotyper is used to mask out SNPs near indels.

“makeIndelMask.py” creates a bed file representing the masking intervals based on the output of IndelGenotyper.

$: GATKPYTHON=/share/apps/gatk_20100930/Sting/python

## Create indel mask file

$: python $GATKPYTHON/makeIndelMask.py foo.gatk.raw.indels.bed 5 indels.mask.bed

The number in this command stands for the number of bases that will be included on either side of the indel.

http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0

Creating an indel mask file

Page 39: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

## set env variables

$: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar

$: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta

$: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod

## SNP Calling

$: java -Xmx5g -jar $GATK -T UnifiedGenotyper -R $REF -D $ROD \

-baq CALCULATE_AS_NECESSARY -baqGOP 30 -nt 8 \

-A DepthOfCoverage -A AlleleBalance -A HaplotypeScore -A HomopolymerRun -A MappingQualityZero -A QualByDepth -A RMSMappingQuality -A SpanningDeletions \

-I foo.novo.final.bam -o foo.gatk.raw.snps.vcf \

-verbose foo.gatk.raw.snps.vcf.verbose -metrics foo.gatk.raw.snps.vcf.metrics;

This results in a VCF (variant call format) file containing raw SNPs.◦ VCF is a text file format. It contains meta-information lines, a header line, and then

data lines each containing information about a position in the genome (SNP/INDEL calls).

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40

http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper

SNP Calling (GATK UnifiedGenotyper)

Page 40: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

VariantFiltration is used annotate suspicious calls from VCF files based on their failing given filters. It annotates the FILTER field of VCF files for records that fail any one of several filters:

◦ Variants that lie in clusters, using the specified values to define a cluster (i.e. the number of variants and the window size).

◦ Any variant which overlaps entries from a masking rod. ◦ Any variant whose INFO field annotations match a specified expression (i.e. the expression is used to describe

records which should be filtered out).

## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar$: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta$: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod## VariantFiltration & annotation$: java –Xmx5g -jar $GATK -T VariantFiltration -R $REF -D $ROD \-o foo.gatk.VariantFiltration.snps.vcf \-B variant,VCF, foo.gatk.raw.snps.vcf \-B mask,Bed, indels.mask.bed --maskName InDel \--clusterSize 3 --clusterWindowSize 10 \--filterExpression "DP <= 8" --filterName "DP8" \--filterExpression "SB > -0.10" --filterName "StrandBias" \--filterExpression "HRun > 8" --filterName "HRun8" \--filterExpression "QD < 5.0" --filterName "QD5" \--filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" --filterName "hard2validate";

More information on the parameters used can be found in: http://www.broadinstitute.org/gsa/wiki/index.php/VariantFiltrationWalker http://www.broadinstitute.org/gsa/wiki/index.php/Using_JEXL_expressions

SNP Filtering & annotation (GATK VariantFiltration)

Page 41: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

VCFTOOLS: methods for working with VCF files: filtering,validating, merging, comparing and calculate some basic population genetic statistics.

Example of some basic hard filtering:

## filter poor quality & suspicious SNP calls

vcftools --vcf foo.gatk.VariantFiltration.snps.vcf \

--remove-filtered DP8 --remove-filtered StrandBias --remove-filtered LowQual \

--remove-filtered hard2validate --remove-filtered SnpCluster \

--keep-INFO AC --keep-INFO AF --keep-INFO AN --keep-INFO DB \

--keep-INFO DP --keep-INFO DS --keep-INFO Dels --keep-INFO HRun --keep-INFO HaplotypeScore --keep-INFO MQ --keep-INFO MQ0 --keep-INFO QD --keep-INFO SB --out foo.gatk.good.snps ;

this produces the file " foo.gatk.good.snps.recode.vcf"

SNP Filtering: VCFTOOLS

Page 42: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

VCFTOOLS can be used to generate useful QC measures, PLINK ped/map, Impute input and more....

## QC & info$: for MYQC in missing freq2 counts2 depth site-depth site-mean-depth geno-depth het

hardy singletons;dovcftools --vcf foo.gatk.good.snps.recode.vcf --$MYQC \--out foo.gatk.good.snps.QC;done## write genotypes, genotype qualities and genotype depth to separate files$: for MYFORMAT in GT GQ DP;dovcftools --vcf foo.gatk.good.snps.recode.vcf \--extract-FORMAT-info $MYFORMAT --out foo.gatk.good.snps;done## make PLINK ped and map filesvcftools --vcf foo.gatk.good.snps.recode.vcf --plink --out foo.gatk.good.snps

http://vcftools.sourceforge.net/

VCFTOOLS for SNP QC and statistics

Page 43: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

See : http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_fennell.pdf

Picard Tools : Post Alignment Summary reports

Page 44: Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Email : [email protected] More useful links:

◦ http://www.broadinstitute.org/gsa/wiki/index.php/Prerequisites

◦ http://www.broadinstitute.org/gsa/wiki/index.php/Building_the_GATK

◦ http://www.broadinstitute.org/gsa/wiki/index.php/Downloading_the_GATK

◦ http://www.broadinstitute.org/gsa/wiki/index.php/Input_files_for_the_GATK

◦ http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_the_essential_GATK_input_files:_the_reference_genome

◦ http://www.broadinstitute.org/gsa/wiki/index.php/The_DBSNP_rod

Questions?