Design Goals

MICHAEL STRÖMBERG Boston College Data Club April 2008



Page 1: Design Goals

MICHAEL STRÖMBERG

Boston College Data Club

April 2008

Page 2: Design Goals
Page 3: Design Goals

Design Goals

Page 4: Design Goals

Crash Course: Reference-guided Assembly

Page 5: Design Goals

Crash Course: Reference-guided Assembly

Page 6: Design Goals

Crash Course: Reference-guided Assembly

Page 7: Design Goals

Sequencing Technologies

future

Page 8: Design Goals

Next-Gen Sequence Lengths

[Figure: sequence length (bp) by sequencing technology, with max, mean, and min shown for each. Panel 1: Capillary (Sanger), Roche 454 FLX; axis 0 to 1600 bp. Panel 2: Illumina, AB SOLiD, Helicos; axis 0 to 80 bp.]

Page 9: Design Goals
Page 10: Design Goals

Unique Genome Coverage (H. sapiens)

[Figure: unique genome coverage (0% to 100%) versus sequence length (3 to 33).]
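One way to read the plot above is as the fraction of genome positions whose sequence of a given length occurs exactly once. The sketch below computes that fraction for a toy string. The function name `unique_fraction`, the toy sequence, and the definition itself are assumptions made for illustration (forward strand only, exact matches only); the slide does not say how its curve was computed.

```python
from collections import Counter

def unique_fraction(genome, k):
    """Fraction of start positions whose k-mer occurs exactly once in genome.

    Assumption for illustration: forward strand only, exact matches only.
    """
    n = len(genome) - k + 1
    if n <= 0:
        return 0.0
    counts = Counter(genome[i:i + k] for i in range(n))
    unique = sum(1 for i in range(n) if counts[genome[i:i + k]] == 1)
    return unique / n

# Hypothetical toy sequence; a real genome would be read from a FASTA file.
toy = "ACGTACGTTTGACCA"
for k in (2, 4, 6):
    print(k, round(unique_fraction(toy, k), 2))
```

Longer sequences are less likely to repeat elsewhere, so the fraction generally rises with k.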

Page 11: Design Goals

Mixing It Up: Paired-end Reads

[Figure: read pairs (count), 0 to 1800, versus fragment length (bp), 0 to 350.]

Page 12: Design Goals
Page 13: Design Goals

How Does It Work?

Page 14: Design Goals

How Does It Work?

Page 15: Design Goals
Page 16: Design Goals

C. elegans: a case for INDELs

SPEED
100 million Illumina reads
Alignment time: 93 min (17,800 reads/s)
Assembly time: 100 min

INDELS
INDEL validation rate: 89.3 % (216)
SNP validation rate: 97.8 % (229)

Page 17: Design Goals

P. stipitis: Co-assembly

Capillary

454 FLX

454 GS20

Illumina

Page 18: Design Goals

Scaling Up

[Figure: reference sequence length (bp), 10,000 to 10,000,000,000, versus project date, Dec-05 to Jun-08. Projects shown: C. elegans, M. musculus, H. sapiens, P. stipitis, M. musculus mtDNA, H. sapiens CAPON region, D. melanogaster, H. sapiens ENCODE region.]

Page 19: Design Goals

Performance: Aligners

Page 20: Design Goals

Aligners: Feature Set

ELAND, MAQ, Newbler, SHRiMP, SOAP

Sequencing Platforms
Illumina, 454, SOLiD, capillary
Illumina
Illumina, SOLiD
454
Illumina, SOLiD
Illumina

Alignment Algorithm
Smith-Waterman
Hash-based
Hash-based
FlowMapper
Smith-Waterman
Hash-based

Co-assembly Creation
?

Gapped Alignments ?

Paired-end Reads

Platform Binaries
Windows, Mac, Linux, Sun, iPhone
Mac, Linux
Linux
Mac, Linux
Mac, Linux

Page 21: Design Goals

Performance: Aligner

Illumina 35 bp (X Chromosome)

program    aligned reads/s
MOSAIK     180 - 16,658
ELAND      7,716
SOAP       1,637
MAQ        1,376
SHRIMP     39

[Figure: aligned reads/s, 0 to 16,000, for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP, SHRIMP.]

Page 22: Design Goals

Performance: Aligner

Roche 454 FLX ~250 bp

program              aligned reads/s
Roche 454 Newbler    1,176
MOSAIK               317 - 616

Using P. stipitis (15.4 Mbp) 454 FLX data set. 932,565 reads basecalled by PyroBayes†.

† Quinlan et al. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods (2008)

Page 23: Design Goals

Accuracy: Synthetic Data Sets

1 per 1.3 kb 1 per 7.2 kb

H. sapiens X chromosome

1 million

Page 24: Design Goals
Page 25: Design Goals

Accuracy: Classification

[Figure: share of unique reads and non-unique reads, 0% to 100%, for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP.]

Page 26: Design Goals

Accuracy: Unique Read Alignment

[Figure: reads, INDELs, and SNPs, 0% to 100%, for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP.]

Page 27: Design Goals

Reasons to use MOSAIK?

• FAST
• Accurate
• Multiprocessor (OPENMP)
• Co-assemblies
• Gapped alignments
• Widely used

“One tool, many technologies, many applications”

Page 28: Design Goals

(Near) Future Development

• All technologies
– Pacific BioSciences
– Helicos
• All application areas
– Adapter trimming
– Coverage graphs
• Optimization
• Improved paired-end read support
• File format standardization (SAF & SRF)

Page 29: Design Goals
Page 30: Design Goals

1000 Genomes Project

• Many samples with light coverage (1000 dg)
– 100 samples from 10 populations at 2x coverage
– Find 90% of the 1 % frequency variants per population
• Trios with moderate coverage (990 dg)
– 30 trios at 11x coverage
• If you’re looking for SNPs, are your tools and methods robust?

Page 31: Design Goals

Scaling Up: Disk Footprint

• Current situation: files created by MOSAIK are not optimized for speed or size
– Assembly can take a long time (slow disk speed)
• Hypothetical solution
– Optimize the file formats
– Ditch the built-in index
– Keep data sorted by aligned location
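The last two bullets fit together: once records are sorted by aligned location, a region can be found by binary search instead of through a separate index. The sketch below shows the idea on an in-memory list. The record layout, the read names, and the helper `reads_starting_in` are hypothetical and are not MOSAIK's file format.

```python
from bisect import bisect_left, bisect_right

# Hypothetical alignment records: (aligned start position, read name).
alignments = [(520, "r3"), (15, "r1"), (300, "r2"), (980, "r5"), (610, "r4")]

# Keep data sorted by aligned location.
alignments.sort(key=lambda rec: rec[0])
starts = [pos for pos, _ in alignments]

def reads_starting_in(lo, hi):
    """Names of reads whose aligned start lies in the closed range [lo, hi]."""
    i = bisect_left(starts, lo)   # first record with start >= lo
    j = bisect_right(starts, hi)  # one past the last record with start <= hi
    return [name for _, name in alignments[i:j]]

print(reads_starting_in(300, 700))
```

On disk the same search would need fixed-size records or some other way to seek to the middle of the file, which is a design decision the slide leaves open.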

Page 32: Design Goals

Scaling Up: Disk Footprint

Page 33: Design Goals

Scaling Up: Memory Footprint

• Current situation: the entire human genome is stored with all associated hash locations
– Optimized hash table ≈ 55 GB RAM
– File-based hash table (BerkeleyDB)
  • User selects how much RAM to use
  • Dreadfully slow performance
  • Large disk footprint ≈ 65 GB file

Page 34: Design Goals

Scaling Up: Memory Footprint

Page 35: Design Goals

Scaling Up: Memory Footprint

[Figure: JumpDB Memory Usage (Human Genome). Memory used (GB RAM), 0 to 70, versus hash size (bp), 9 to 18, for JumpDB and the MOSAIK hash table.]

[Figure: Alignment Performance with 35bp human reads. Reads/s, 0 to 20, for Berkeley (all positions in database), Berkeley (1 position in database), Jump (all positions in file-based database), and the Mosaik hash table.]

Page 36: Design Goals

Scaling Up: Speed & Sensitivity

• Current situation: as the hash size increases, speed increases but sensitivity decreases

• Hypothetical solution: use small hash sizes and require a clustering of a predefined length.

• Status: Implemented but not tested.
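The hypothetical solution above can be sketched as follows: index the reference with a small hash size, look up every hash in the read, and keep only candidate locations where the hits cluster over a minimum length. This is a minimal illustration and not MOSAIK's implementation. The names `build_index`, `candidate_regions`, and `min_cluster_len` are invented here, and the cluster length is approximated by the span between the first and last hit on one diagonal.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_regions(read, index, k, min_cluster_len):
    """Return (diagonal, span) pairs for clusters of seed hits.

    Hits are grouped by diagonal (reference position minus read position).
    The span runs from the start of the first hit to the end of the last hit
    on that diagonal; gaps between hits are not checked (a simplification).
    """
    diagonals = defaultdict(list)
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            diagonals[i - j].append(j)
    regions = []
    for diag, offsets in diagonals.items():
        span = max(offsets) + k - min(offsets)
        if span >= min_cluster_len:
            regions.append((diag, span))
    return sorted(regions)

# Hypothetical toy data: the read is an exact copy of reference[5:15].
reference = "GATTACAGGCTTAACGTCCA"
index = build_index(reference, 4)
print(candidate_regions(reference[5:15], index, 4, 10))
```

A small hash size yields many chance hits, and the cluster requirement is what filters them out: a read that shares only a single 4-mer with the reference produces a span of 4 and is rejected at any larger threshold.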

Page 37: Design Goals

BORK! BORK! BORK!

(translated: when will MOSAIK get published?)

Page 38: Design Goals

Acknowledgements

Boston College
Gabor Marth
Derek Barnett
Michele Busby
Weichun Huang
Aaron Quinlan
Chip Stewart
Thomas Seyfried
Mike Kiebish

Washington University School of Medicine
Elaine Mardis
Jarret Glasscock
Vincent Magrini

Agencourt
Douglas Smith
Wei Tao