IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis


Page 1: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

IMGS 2012 Bioinformatics Workshop:

File Formats for Next Gen Sequence Analysis

Page 2: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

[Chart: Cost and Throughput. Gigabases sequenced and cost per Kb, by year, 1990-2009.]

Lucinda Fulton, The Genome Center at Washington University

Page 3: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

Sequencing Technologies

http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png

Page 4: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

Sequence “Space”

• Roche 454 – Flow space
  – Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain
  – Flow space describes sequence in terms of these base incorporations
  – http://www.youtube.com/watch?v=bFNjxKHP8Jc

• AB SOLiD – Color space
  – Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye
  – Each base is sequenced twice
  – http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related

• Illumina/Solexa – Base space
  – Single-base extensions of fluorescent-labeled nucleotides with protected 3' OH groups
  – Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH
  – http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related

• GenomeTV – Next Generation Sequencing (lecture)
  – http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related

http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html

Page 5: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis
Page 6: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis
Page 7: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

Flexible
– Good: with rapidly changing data/tech
– Poor: validation

Human Readable
– Convenient for debugging
– Computer doesn’t care!

Page 8: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

Sequences: FASTA, FASTQ, SAM/BAM

Alignments: SAM/BAM, MAF

Annotations: BED, GTF, GFF3, GVF, VCF

http://genome.ucsc.edu/FAQ/FAQformat.html

http://www.sequenceontology.org/

Page 9: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

FASTQ

FASTA

Page 10: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

FASTQ: Data Format

• FASTQ
  – Text based
  – Encodes sequence calls and quality scores with ASCII characters
  – Stores minimal information about the sequence read
  – 4 lines per sequence
    • Line 1: begins with “@”, followed by the sequence identifier and an optional description
    • Line 2: the sequence
    • Line 3: begins with “+”, followed by the sequence identifier and description (both are optional)
    • Line 4: encoding of quality scores for the sequence in line 2

• References/Documentation
  – http://maq.sourceforge.net/fastq.shtml
  – Cock et al. (2009). Nuc Acids Res 38:1767-1771.

Sequence data format
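The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, not a reference parser: it assumes each record occupies exactly four lines (no wrapped sequences, no trailing blank lines), and the names `FastqRecord` and `read_fastq` are invented for this illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # everything after the leading "@" on line 1
    sequence: str    # line 2
    qualities: str   # line 4, still ASCII-encoded


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly 4 lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qualities = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(qualities):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, qualities)


# Made-up record, used only to exercise the function.
example = io.StringIO("@read1 demo\nGATTACA\n+\nIIIIHHG\n")
records = list(read_fastq(example))
```

The length check reflects the rule that line 4 carries one quality character per base in line 2.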

Page 11: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

FASTQ Example

FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.

For analysis, it may be necessary to convert to the Sanger form of FASTQ.

Page 12: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

FASTQ: Details

• FASTQ
  – Text based
  – Encodes sequence calls and quality scores with ASCII characters
  – Stores minimal information about the sequence read
  – 4 lines per sequence
    • Line 1: begins with “@”, followed by the sequence identifier and an optional description
    • Line 2: the sequence
    • Line 3: begins with “+”, followed by the sequence identifier and description (both are optional)
    • Line 4: encoding of quality scores for the sequence in line 2

• References/Documentation
  – http://maq.sourceforge.net/fastq.shtml
  – Cock et al. (2009). Nuc Acids Res 38:1767-1771.

Page 13: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

Phred Quality Score    Probability of incorrect base call    Base call accuracy
10                     1 in 10                               90 %
20                     1 in 100                              99 %
30                     1 in 1000                             99.9 %
40                     1 in 10000                            99.99 %
50                     1 in 100000                           99.999 %

Q = Phred quality score
P = Base-calling error probability

Quality scores
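The table follows the standard Phred definition, Q = -10 * log10(P). The sketch below converts in both directions; the function names are chosen for this example only.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Return P, the probability that the base call is wrong, for Phred score Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Return the Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


# Reproduce the rows of the table: score, error probability, accuracy in percent.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(q, p, 100 * (1 - p))
```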

Page 14: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126

S - Sanger          Phred+33,  raw reads typically (0, 40)
X - Solexa          Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+   Phred+64,  raw reads typically (0, 40)
J - Illumina 1.5+   Phred+64,  raw reads typically (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+   Phred+33,  raw reads typically (0, 41)

Format/Platform    Quality score type    ASCII encoding
Sanger             Phred: 0 to 93        33-126
Solexa             Solexa: -5 to 62      59-126
Illumina 1.3       Phred: 0 to 62        64-126
Illumina 1.5       Phred: 0 to 62        64-126
Illumina 1.8       Phred: 0 to 62        33-126 *** Sanger format!

Quality score encodings differ among platforms

Most analysis tools require Sanger fastq quality score encoding
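Because the Phred+64 and Phred+33 encodings differ only in their ASCII offset, re-encoding a quality string is a matter of shifting each character. The following is a rough sketch with hypothetical function names. It covers Illumina 1.3+/1.5+ input only; Solexa+64 scores use a different scale and would need an extra score conversion that is not shown here.

```python
from typing import List


def decode_qualities(quality_string: str, offset: int = 33) -> List[int]:
    """Turn an ASCII-encoded quality string into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def illumina13_to_sanger(quality_string: str) -> str:
    """Re-encode a Phred+64 quality string (Illumina 1.3+) as Phred+33 (Sanger)."""
    scores = decode_qualities(quality_string, offset=64)
    if any(q < 0 for q in scores):
        # Characters below "@" (ASCII 64) cannot occur in Phred+64 data.
        raise ValueError("input does not look like Phred+64")
    return "".join(chr(q + 33) for q in scores)


# "h" is ASCII 104, i.e. Phred 40 at offset 64; "@" is ASCII 64, i.e. Phred 0.
converted = illumina13_to_sanger("hhh@")
```

In practice an established converter (for example the FASTQ tools available in Galaxy) is the safer choice; the sketch is only meant to show what such a conversion does.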

Page 15: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

http://main.g2.bx.psu.edu/

Page 16: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis
Page 17: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

SAM (Sequence Alignment/Map)

• SAM is the output of aligners that map reads to a reference genome
  – Tab delimited, with a header section and an alignment section
    • Header lines begin with @ (the header section is optional)
    • Alignment section has 11 mandatory fields
  – BAM is the binary format of SAM

http://samtools.sourceforge.net/

Alignment data format

Page 18: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

http://samtools.sourceforge.net/SAM1.pdf

Mandatory Alignment Fields
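As an illustration of the 11 mandatory fields, the sketch below splits one tab-delimited alignment line into named columns. The field names follow the current SAM specification (older versions of the specification call RNEXT, PNEXT and TLEN by the names MRNM, MPOS and ISIZE). The function name and the sample line are assumptions made for this example.

```python
from typing import Any, Dict

SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one SAM alignment line into its 11 mandatory fields plus tags."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment record")
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError(f"expected at least 11 columns, got {len(columns)}")
    record: Dict[str, Any] = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    record["TAGS"] = columns[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE fields
    return record


# Illustrative alignment line; the values are for demonstration only.
sample = "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
alignment = parse_sam_line(sample)
```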

Page 19: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

http://samtools.sourceforge.net/SAM1.pdf

Alignment Examples

Alignments in SAM format

CIGAR string -> 8M2I4M1D3M
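A CIGAR string is a run of length/operation pairs. The sketch below, with helper names invented for illustration, splits the string shown above and counts how many read bases and how many reference bases the alignment covers. It relies on the operation semantics in the SAM specification: M, I, S, = and X consume read bases, while M, D, N, = and X consume reference bases.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern silently skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


ops = parse_cigar("8M2I4M1D3M")
```

For `8M2I4M1D3M` this gives 8 + 2 + 4 + 3 = 17 read bases and 8 + 4 + 1 + 3 = 16 reference bases: the insertion adds bases to the read only, and the deletion adds a base to the reference only.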

Page 20: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

Annotation Formats

• Mostly tab-delimited files that describe the location of genome features (e.g., genes)
• Also used for displaying annotations on standard genome browsers
• Important for associating alignments with specific genome features and descriptions
• Knowing format details can be important for translating results!
  – BED is zero based
  – GTF/GFF are one based

Page 21: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

GTF

http://useast.ensembl.org/info/website/upload/gff.html

Annotation data format

Page 22: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

chr1    86114265    86116346    nsv433165
chr2    1841774     1846089     nsv433166
chr16   2950446     2955264     nsv433167
chr17   14350387    14351933    nsv433168
chr17   32831694    32832761    nsv433169
chr17   32831694    32832761    nsv433170
chr18   61880550    61881930    nsv433171

chr1    16759829    16778548    chr1:21667704     270866    -
chr1    16763194    16784844    chr1:146691804    407277    +
chr1    16763194    16784844    chr1:144004664    408925    -
chr1    16763194    16779513    chr1:142857141    291416    -
chr1    16763194    16779513    chr1:143522082    293473    -
chr1    16763194    16778548    chr1:146844175    284555    -
chr1    16763194    16778548    chr1:147006260    284948    -
chr1    16763411    16784844    chr1:144747517    405362    +

BED format

Annotation data format

Page 23: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

BED: zero based, start inclusive, stop exclusive
Length = stop - start

GTF/GFF: one based, inclusive
Length = stop - start + 1
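The two coordinate conventions above differ only in the start position, so converting an interval between them means shifting the start by one and leaving the stop unchanged. A small sketch follows, with function names invented for this example and the first interval of the BED example (chr1 86114265 86116346) as test data.

```python
from typing import Tuple


def bed_length(start: int, stop: int) -> int:
    """Feature length for zero-based, half-open (BED) coordinates."""
    return stop - start


def gff_length(start: int, stop: int) -> int:
    """Feature length for one-based, inclusive (GTF/GFF) coordinates."""
    return stop - start + 1


def bed_to_gff(start: int, stop: int) -> Tuple[int, int]:
    """Convert a BED interval to GTF/GFF coordinates."""
    return start + 1, stop


def gff_to_bed(start: int, stop: int) -> Tuple[int, int]:
    """Convert a GTF/GFF interval to BED coordinates."""
    return start - 1, stop


bed_interval = (86114265, 86116346)
gff_interval = bed_to_gff(*bed_interval)
# The feature length must be the same in either coordinate system.
assert bed_length(*bed_interval) == gff_length(*gff_interval)
```

Forgetting this shift is a common source of off-by-one errors when intersecting BED annotations with GTF/GFF features.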

Page 24: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

GRCh37

NCBI36