Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.


Page 1: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Next Generation Sequencing Bioinformatics

Stephen Taylor

Computational Biology Research Group

Page 2: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

History

• Sanger
  • Dominant for the last ~30 years
  • Longest read 1000 bp
  • Based on primers, so not good for repetitive regions or SNP sites
• Next Generation Sequencing
  • Much shorter reads, 25 to 300 bp
  • Higher throughput
  • Lower cost per Mb
  • Single molecule sequencing (no cloning step)
  • Since Jan 2008, more DNA has been sequenced than in all previous years

Page 3: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Hence We Need High Throughput Bioinformatics


Page 4: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Sanger

• Fred Sanger (1980)
• Dye-terminator sequencing
• Amplify the DNA fragment by PCR
• Separate into 2 strands
• Polymerase elongates DNA
• Incorporation of a fluorescently labelled ddNTP terminates elongation at each base
• Run DNA fragments on gel/capillary
• Peak generated for each base

Page 5: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Illumina (Solexa)

Page 6: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Illumina (Solexa)

Page 7: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Illumina (Solexa)

Page 8: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Illumina (Solexa) Applications

Resequencing
• Characterise different related species or strains

Transcriptome analysis
• No chip/array required!
• Random priming of RNA

DNA methylation analysis
• Sequencing of bisulfite-converted DNA, or of fragments enriched by methylation-sensitive restriction digest

Examine chromatin modifications
• Quantify in vivo protein-DNA interactions using the combination of chromatin immunoprecipitation and sequencing (ChIP-Seq)

Page 9: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Price Comparison

Page 10: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Processing and management

Page 11: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Assemble Data - Illumina

Generates short reads (~35-75 bp)

Good for resequencing

De novo assembly is difficult for all but the smallest organisms

Page 12: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Mapping Illumina Reads

• Acquire and process images and convert to FASTQ*
• Get data
• Quality control**
• Map to genome
• Visualisation
• Post processing
  • Peak finding
  • SNP calling

* Not covered today

Page 13: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

FASTQ format

@HWUSI-EAS100R:6:73:941:1973#0/1
TATACAATGCACTTAGTCATCCGCGTATCACTTTAT
+
IIIIIIIIIIIIIIIIIIGIIIIIIIIII4IIII:I

1. HWUSI-EAS100R: the unique instrument name
2. 6: flowcell lane
3. 73: tile number within the flowcell lane
4. 941: 'x'-coordinate of the cluster within the tile
5. 1973: 'y'-coordinate of the cluster within the tile
6. #0: index number for a multiplexed sample (0 for no indexing)
7. /1: the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
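The identifier fields listed for the example read can be pulled apart programmatically. The following is a minimal Python sketch, assuming the identifier follows exactly the instrument:lane:tile:x:y#index/pair layout shown on this slide; the names HEADER_RE, ReadId and parse_read_id are hypothetical and not part of any tool mentioned in the talk.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: @<instrument>:<lane>:<tile>:<x>:<y>#<index>/<pair>
HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


class ReadId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    pair: Optional[int]  # None for single-end reads


def parse_read_id(header: str) -> ReadId:
    """Split a FASTQ identifier line into its named fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised read identifier: {header!r}")
    pair = match.group("pair")
    return ReadId(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        index=match.group("index"),
        pair=int(pair) if pair is not None else None,
    )


read_id = parse_read_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(read_id.lane, read_id.tile, read_id.pair)
```

Identifiers from other pipeline versions may use a different layout, so the pattern would need adjusting for them.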

Page 14: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

FASTQ format

Quality Score

1. The score for each base is stored as a single ASCII character, e.g. I
2. Convert the character to its ASCII code, e.g. 73
3. Subtract the offset <a value> used by the format, e.g. 33 for standard FASTQ
4. The result is the original Qphred = 40

See http://en.wikipedia.org/wiki/FASTQ_format
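The character-to-score conversion above can be written in a few lines. This is a small Python sketch, assuming the offset is passed in by the caller (33 reproduces the Qphred = 40 example); decode_qualities is a hypothetical helper name.

```python
from typing import List


def decode_qualities(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to a list of integer scores.

    Each character's ASCII code minus the format offset gives the score.
    """
    return [ord(char) - offset for char in quality]


# 'I' is ASCII 73, so with an offset of 33 it decodes to 40.
print(decode_qualities("IIG4:"))
```

The offset has to match the format the file was written in, otherwise every score is shifted by the same amount.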

Page 15: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Formats – warning!

FASTQ format appears ‘standard’ but there are 3 types, which differ in how the base-call error probabilities are scored and encoded…

Qphred = -10 x log10(error_prob)
Qsolexa = -10 x log10(error_prob/(1-error_prob))

1. Standard fastq: ASCII( Qphred + 33 )
2. Illumina pre v1.3: ASCII( Qsolexa + 64 )
3. Illumina post v1.3: ASCII( Qphred + 64 )

Option 3 should be the main one for the foreseeable future!
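The two score definitions are related algebraically: solving the Qsolexa formula for error_prob and substituting into the Qphred formula gives Qphred = 10 x log10(10^(Qsolexa/10) + 1). The Python sketch below illustrates this and the re-encoding between offsets; the function names are hypothetical and this is not the conversion tool referred to in the talk.

```python
import math


def phred_from_error(error_prob: float) -> float:
    """Qphred = -10 x log10(error_prob)."""
    return -10 * math.log10(error_prob)


def solexa_from_error(error_prob: float) -> float:
    """Qsolexa = -10 x log10(error_prob / (1 - error_prob))."""
    return -10 * math.log10(error_prob / (1 - error_prob))


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def illumina13_to_standard(quality: str) -> str:
    """Re-encode a Phred+64 quality string (type 3) as Phred+33 (type 1)."""
    return "".join(chr(ord(char) - 64 + 33) for char in quality)


# The two scales differ most at low quality and converge at high quality.
for q in (-5, 0, 10, 40):
    print(q, round(solexa_to_phred(q), 2))
```

Because the scales nearly coincide for high scores, mixing them up is easy to miss, which is why the encoding of each file should be checked before mapping.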

Page 16: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Convert between formats

Use sol2std2

Page 17: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Get Data

May be supplied in a variety of formats

.prb .txt files
• Contain probabilities for each base
• Some SNP callers use this
• Usually convert to FASTQ

FASTQ
• Like FASTA but with a quality score associated with each base

Page 18: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

WTCHG

• If your data is from WTCHG you are likely to get an email
• E.g. http://www.well.ox.ac.uk/htseq/1T3qcHwk6jmlZeVtSnQO/
• wget the FASTQ file in the GERALD directory
• http://www.well.ox.ac.uk/htseq/1T3qcHwk6jmlZeVtSnQO/GERALD_24-09-2009_johnb/s_2_sequence.txt.gz

Page 19: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Processing reads - Illumina

Mapping Tools
• MAQ
  • From the Sanger Institute
  • Uses quality scores
• ELAND
  • Comes with the machine and runs as standard
  • Very fast
• NOVOALIGN
  • Slower, more accurate
  • Output options include pairwise alignments (handy for following up SNP calls)
• TOPHAT
  • For RNA-Seq
  • Can map splice junctions

Page 20: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Notes on Mapping

• What genome?
• Masking?
• Some tools disregard reads that map to multiple locations, e.g. ELAND
• Some tools map such reads to one location and adjust the probability score, e.g. MAQ
• Can be confusing…
• For ChIP-Seq we normally use DNA heavily masked for repeats (simple/complex/ribosomal)

Page 21: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Databanks Indices

• We have many indexed databanks
• Under /databank/indices/<tool>, e.g. for maq
  • ens_human_chrs/
  • ens_human_chrs_ucsc_rmfull_2/
  • ens_mouse_chrs/
  • ens_mouse_chrs_ucsc_rmfull/
  • ens_human_cdna/
  • ens_mouse_masked_chrs/
• Indices for both maq and novoalign
• If an index you need is not there please ask; don’t make a local one in your account!

Page 22: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

ChIP-Seq Pipeline

ChIP-Sequencing Advantages
• Less DNA needed
• Not limited by micro-array content
• More precise site mapping
• More reads increase sensitivity
• Produces higher quality data

Page 23: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

ChIP-Seq example

1. NGS reads
2. Map (maq)
3. Peak pick (cisgenome)
4. Extract sequences from features (Motif extract)
5. MEME
6. Weblogo

Page 24: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

MAQ

For simple runs use the ‘easyrun’ option…

nohup /proj/hts/bin/maq.pl easyrun <db> <fastq> -d <results-directory> > maq.log

In <results-directory> the main file is all.map

To convert the binary file to something usable:

maq pileup <db> all.map > all.pileup

These are quite large files…

Page 25: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Visualization

The all.map file is converted to wig format using a CBRG custom tool

maq wig <db> all.map > all.wig

Then we convert to GFF format using custom scripts
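For readers unfamiliar with the wig format, the Python sketch below writes per-base read depth as a UCSC variableStep wiggle track. It is only an illustration of the output format, not the CBRG custom tool; depth_to_wiggle and the input layout (position, depth pairs with 1-based positions) are assumptions made for this example.

```python
from typing import Iterable, Tuple


def depth_to_wiggle(
    chrom: str,
    depths: Iterable[Tuple[int, int]],
    track_name: str = "coverage",
) -> str:
    """Format (position, depth) pairs as a variableStep wiggle track.

    Positions are 1-based; positions with zero depth are skipped to
    keep the file small.
    """
    lines = [
        f'track type=wiggle_0 name="{track_name}"',
        f"variableStep chrom={chrom}",
    ]
    for position, depth in depths:
        if depth > 0:
            lines.append(f"{position} {depth}")
    return "\n".join(lines) + "\n"


print(depth_to_wiggle("chr3", [(1300, 4), (1301, 0), (1302, 7)]), end="")
```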

Page 26: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

GFF format

• General Feature Format
• Developed at the Sanger Institute
• http://www.sanger.ac.uk/Software/formats/GFF/
• Format for describing features associated with DNA, RNA and protein sequences
• Easy to parse
• More tools, e.g. EMBOSS, are starting to use this as standard
• GFF3 is more standardised and works best with GBrowse

Page 27: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

##gff-version 3
chr3 src exon 1300 1500 . + . ID=exon00001
chr3 src exon 1050 1500 . + . ID=exon00002
chr3 src exon 3000 3902 . + . ID=exon00003
chr3 src exon 5000 5500 . + . ID=exon00004
chr3 src exon 7000 9000 . + . ID=exon00005

The feature type in column 3 (here exon) is a SOFA term. Note the ‘=’ in the attributes column.

http://gmod.org/wiki/GFF3
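To show why GFF3 is easy to parse, here is a minimal Python sketch that reads one feature line into named fields. It assumes the nine columns are tab-separated, as the GFF3 specification requires, and it ignores details such as URL-escaped characters; Feature and parse_gff3_line are hypothetical names.

```python
from typing import Dict, NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    feature_type: str  # a SOFA term such as "exon"
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Feature:
    """Parse one tab-separated GFF3 feature line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 columns, found {len(columns)}")
    # Attributes are key=value pairs separated by ';'
    attributes = dict(
        pair.split("=", 1) for pair in columns[8].split(";") if pair
    )
    return Feature(
        columns[0], columns[1], columns[2],
        int(columns[3]), int(columns[4]),
        columns[5], columns[6], columns[7],
        attributes,
    )


feature = parse_gff3_line("chr3\tsrc\texon\t1300\t1500\t.\t+\t.\tID=exon00001")
print(feature.feature_type, feature.start, feature.end, feature.attributes["ID"])
```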

Page 28: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Wig binary files

Scripts and modules to handle:
• UCSC wiggle format (1 column; 2 column; 4 column) or gff3
• binary (.wib)

GMOD script

wiggle_to_wigBinary.pl gff file

Function:
• wiggle_to_wigBinary.pl variables (source / method / trackname / paths / input & output filenames)
• command line to load binary / gff data into GBrowse (bp_seqfeature_load.pl + all variables: database name, filenames, paths etc)
• a conf file stanza to display the loaded data
• construct an intermediate wiggle format file (if input was gff3 or maq binary)

Page 29: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Peak Calling

• Lots of algorithms to do this
• Problems with identifying a good cut-off score
• Over- and under-prediction

F-Seq
• Based on a training set of peaks identified by the researcher in a specific region
• Iterate over parameter space until the best TP/FP score is achieved

cisgenome
• Uses IP and non-IP ChIP-Seq data, which increases the accuracy of predictions
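The cut-off problem can be seen with even the simplest possible peak caller. The Python sketch below reports every contiguous run of positions whose read depth reaches a threshold. It is a deliberately naive illustration and is not how F-Seq or cisgenome work; call_peaks and its inputs are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def call_peaks(
    depths: Sequence[int], cutoff: int, start_pos: int = 1
) -> List[Tuple[int, int, int]]:
    """Return (start, end, max_depth) for each run with depth >= cutoff.

    depths[0] is the depth at coordinate start_pos; coordinates are
    inclusive.
    """
    peaks = []
    run_start = None
    for i, depth in enumerate(depths):
        if depth >= cutoff:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            peaks.append(
                (start_pos + run_start, start_pos + i - 1,
                 max(depths[run_start:i]))
            )
            run_start = None
    if run_start is not None:  # a run that extends to the last position
        peaks.append(
            (start_pos + run_start, start_pos + len(depths) - 1,
             max(depths[run_start:]))
        )
    return peaks


coverage = [0, 1, 5, 7, 6, 1, 0, 8, 9]
for cutoff in (1, 5, 8):
    print(cutoff, call_peaks(coverage, cutoff))
```

Lowering the cut-off widens and merges regions (over-prediction), while raising it drops the weaker peak entirely (under-prediction).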

Page 30: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Motif Extraction

• Extract underlying DNA from peak calls
• Run using web based motif finders
  • Weeder
  • MEME
• May need to do successive rounds to find weaker motifs
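Extracting the underlying DNA amounts to slicing the reference sequence at each peak's coordinates and writing the result as FASTA for the motif finder. A minimal Python sketch follows, assuming the genome is already held in memory as a dictionary of chromosome name to sequence and that peak coordinates are 1-based and inclusive; extract_peak_sequences is a hypothetical helper, not one of the CBRG scripts.

```python
from typing import Dict, Iterable, Tuple


def extract_peak_sequences(
    genome: Dict[str, str], peaks: Iterable[Tuple[str, int, int]]
) -> str:
    """Return FASTA text with one record per (chrom, start, end) peak."""
    records = []
    for chrom, start, end in peaks:
        # Convert 1-based inclusive coordinates to a Python slice.
        sequence = genome[chrom][start - 1:end]
        records.append(f">{chrom}:{start}-{end}\n{sequence}\n")
    return "".join(records)


toy_genome = {"chr1": "ACGTACGTAC"}
print(extract_peak_sequences(toy_genome, [("chr1", 3, 6)]), end="")
```

For a real genome the sequences would be read from an indexed FASTA file rather than held in a dictionary.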

Page 31: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Quick note: SNP Calling

Often finds errors introduced in the PCR amplification step

• maq cns2snp (run during the easyrun option)
• SNPseeker
• Novoalign + CBRG script

Worth trying all of the above!

Page 32: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Molbiol Data Structure

Analyse your data on deva.molbiol.ox.ac.uk

CBRG set up /proj/hts/data/<username>

Suggested structure:

batch/
  fastq/
  dbname/

Contact us if you want a GBrowse database for your data

Page 33: Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.

Future

Problem
• In depth analysis after mapping = bottleneck
• Need to empower the users to do their own analysis

Solution
• Makefiles for bulk data analysis
• Allow access to NGS data via GBrowse ‘workbench’
• GBrowse plugins to export data to other tools
• Galaxy http://main.g2.bx.psu.edu/ looks promising