ISB

ISB Ravi Pandya | Bill Bolosky Microsoft June 28 2012


Transcript of ISB

Page 1: ISB

ISB
Ravi Pandya | Bill Bolosky
Microsoft
June 28 2012

Page 2: ISB

Genomics project

Collaboration with UC Berkeley AMP Lab
  Dave Patterson, Armando Fox, Michael Jordan (ML), Taylor Sittler (UCSF Med), students, …

Long term: Cancer genomics
  David Haussler (UC Santa Cruz): Cancer Genomics Hub / Cancer Genome Atlas (TCGA)
  500 Tb (growing to 20 Pb) of tumor/normal genomes at San Diego Supercomputer Center

Near term: Genome sequencing pipeline
  Motivated by Archon Genomics X-Prize (September 2013)
  100 samples of DNA from centenarians (>105 years old)
  Sequence with best coverage, accuracy, and cost in 1 month
  Goal: 98% coverage, 99.9999% accuracy, $1000/genome
  Current tools (GATK, CLC) are not sufficient to meet the goal

Page 3: ISB

Genomics pipeline

Fast, accurate, scalable
  Apply state-of-the-art computer science to the sequencing problem
  Machine learning, distributed systems, high-performance computing
  Open source for Windows+Linux | Windows Azure cloud service

SNAP (available now)
  Fast aligner using hash-based index of entire genome
  10-40x faster than BWA

FLASH (in progress)
  Comprehensive probabilistic model
  Reference-based alignment + targeted de novo contig assembly + scaffold assembly

Page 4: ISB

Genomics pipeline

[Pipeline diagram. Labels: SNAP; aligned reads; unaligned reads; hash clustering; de novo assembly; optimization; scaffold assembly; call SNPs, indels, SVs; FLASH]

Page 5: ISB

SNAP

CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCA...GTTTAGCTCAAAGAG...

Reference genome

AGCTCAAA GAAAGAA

CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCAG

Read sequence

Hash index of seed {locations}

1. Lookup seeds
2. Map locations
3. Score matches

~15 core-hours for 30x coverage
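The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not SNAP's implementation: the 8-base seed length, the names `build_index` and `align`, the toy reference (the two slide fragments joined, with the elided middle simply dropped), and the use of Hamming distance for scoring are all assumptions made for the example.

```python
from collections import defaultdict

SEED_LEN = 8  # assumed seed length, chosen to match the 8-base seed on the slide

# Toy reference: the two slide fragments joined; the elided part is NOT reconstructed.
REFERENCE = "CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCA" + "GTTTAGCTCAAAGAG"
READ = "CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCAG"


def build_index(reference, seed_len=SEED_LEN):
    """Hash index: seed string -> list of reference positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def align(read, reference, index, seed_len=SEED_LEN):
    """Return (start, mismatches) for the best candidate, or None if no seed hits."""
    # 1. Look up seeds taken from the read at non-overlapping offsets.
    # 2. Map each hit back to the read start position it implies.
    candidates = set()
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        for pos in index.get(read[offset:offset + seed_len], ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # 3. Score every candidate location (Hamming distance stands in for real scoring).
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


INDEX = build_index(REFERENCE)
print(INDEX["AGCTCAAA"])              # every reference position of the slide's seed
print(align(READ, REFERENCE, INDEX))  # best (start, mismatches) for the read
```

In this toy reference the seed AGCTCAAA occurs twice, which is why step 3 is needed: a seed hit only nominates a location, and scoring decides between the nominees.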

Page 6: ISB

FLASH

[Diagram. Labels: candidate assembly; coverage; pair distance; alignment; depth; separation; overlap; likelihood; optimize; SNAP aligner; genomic prior knowledge; machine learning models; sparse matrices; SNAP]

Page 7: ISB

Read alignment

[Figure: two matrices labeled "candidate assembly", with axes 3B bp genome, 1B reads, and strands. One shows entries of 1; the other shows entries 0.9, 0.6, 0.7, 0.2, 0.8.]

RGS = Read-Genome-Strand candidate assembly

LRG = Likelihood of Read-Genome alignment

SNAP alignment, sequencing error, mutation frequency, variant databases
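One plausible reading of the two matrices is that LRG stores a likelihood for every candidate (read, genome position, strand) alignment, and RGS marks with a 1 the alignment that the candidate assembly picks for each read. The sketch below assumes that reading. The dictionary-as-sparse-matrix layout, the function names, and the read names and positions are hypothetical; the likelihood values reuse the numbers on the slide, but their placement is invented.

```python
import math

# Hypothetical sparse LRG: (read, genome position, strand) -> alignment likelihood.
LRG = {
    ("read1", 1000, "+"): 0.9,
    ("read1", 5000, "-"): 0.2,
    ("read2", 1010, "+"): 0.6,
    ("read3", 2000, "+"): 0.7,
    ("read3", 7000, "-"): 0.8,
}


def most_likely_assignment(lrg):
    """Build a sparse RGS that puts a 1 on each read's most likely alignment."""
    best = {}
    for key, likelihood in lrg.items():
        read = key[0]
        if read not in best or likelihood > lrg[best[read]]:
            best[read] = key
    return {key: 1 for key in best.values()}


def log_likelihood(rgs, lrg):
    """Sum of log-likelihoods over the alignments selected by RGS."""
    return sum(weight * math.log(lrg[key]) for key, weight in rgs.items())


RGS = most_likely_assignment(LRG)
print(RGS)
print(log_likelihood(RGS, LRG))
```

Only the non-zero entries are stored, which is the point of a sparse representation when the dense matrix would be 1B reads by 3B positions.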

Page 8: ISB

Coverage distribution

[Figure: matrices labeled "assembly", with axes 3B bp genome and strands. Values shown: 22, 24, 29, 35, 34 and 0.1, 0.12, 0.14, 0.12, 0.1. Label: Coverage. Input shown: assembly (RGS).]

GSC = Genome-Strand Coverage

LC = Likelihood of Coverage

Sequencer characteristics, alignment data
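The slide does not say which distribution LC uses. As a rough sketch only, the code below assumes a Poisson model of read depth with a mean of 30 (borrowed from the "30x coverage" figure on the SNAP slide) and scores the five example depths shown in the figure. The function names and the Poisson choice are assumptions, not something the deck states.

```python
import math

ASSUMED_MEAN_DEPTH = 30.0          # assumption: expected depth for 30x coverage
SLIDE_DEPTHS = [22, 24, 29, 35, 34]  # example depths from the figure


def poisson_log_pmf(depth, mean_depth):
    """log P(depth) under a Poisson distribution with the given mean."""
    return depth * math.log(mean_depth) - mean_depth - math.lgamma(depth + 1)


def coverage_log_likelihood(depths, mean_depth=ASSUMED_MEAN_DEPTH):
    """Sum of per-position log-likelihoods, treating positions as independent."""
    return sum(poisson_log_pmf(d, mean_depth) for d in depths)


print(coverage_log_likelihood(SLIDE_DEPTHS))
```

A candidate assembly whose depths sit close to the expected mean scores higher than one with piled-up or thin regions, which is how a coverage term can penalize misplaced reads.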

Page 9: ISB

Hash clustering

Cluster unaligned reads with overlapping bases
Starting point for assembling contigs

CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC

GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA

CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA

1. Count seeds
2. Bucket reads by seed
3. Connect overlapping reads
4. Cluster connected components
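The four steps above can be sketched with a seed table and a union-find structure. This is an illustration under stated assumptions, not the FLASH code: the 16-base seed length, the function name `cluster_reads`, and the rule that any shared seed connects two reads are all choices made for the example. The three reads are the ones shown on the slide.

```python
from collections import Counter, defaultdict

READS = [
    "CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC",
    "GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA",
    "CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA",
]


def seeds_of(read, seed_len):
    """All distinct substrings of length seed_len in the read."""
    return {read[i:i + seed_len] for i in range(len(read) - seed_len + 1)}


def cluster_reads(reads, seed_len=16):
    """Group read indices into clusters of reads connected by shared seeds."""
    # 1. Count seeds across all reads.
    counts = Counter(s for read in reads for s in seeds_of(read, seed_len))
    # 2. Bucket reads by seed, keeping only seeds seen in more than one read.
    buckets = defaultdict(list)
    for idx, read in enumerate(reads):
        for s in seeds_of(read, seed_len):
            if counts[s] > 1:
                buckets[s].append(idx)
    # 3. Connect reads that share a bucket (union-find with path halving).
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in buckets.values():
        for other in members[1:]:
            a, b = find(members[0]), find(other)
            if a != b:
                parent[max(a, b)] = min(a, b)
    # 4. Collect connected components as clusters.
    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


print(cluster_reads(READS))
```

The first two slide reads overlap with a 4-base shift, so they share many seeds and land in one cluster, while the third read shares none and stays on its own.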

Page 10: ISB

Targeted de novo assembly

[Diagram. Labels: contig "genome"; calc hash clusters; candidate assembly; coverage; pair distance; alignment; depth; separation; overlap; likelihood; optimize; infer; update; genomic prior knowledge; machine learning models]

Page 11: ISB

Scaffold assembly

Maximum likelihood model
Optimized reference contigs + de novo unaligned contigs

Explore space of possible arrangements into a sample genome

Optimize P(observed reads | candidate genome)

= sequencing error + coverage depth + pair distance

Incremental calculation using sparse matrix model
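To make the search over arrangements concrete, here is a toy version that tries every ordering of three contigs and keeps the one with the highest pair-distance log-likelihood. Everything in it is hypothetical: the contig names and lengths, the read pairs, the Gaussian model of pair distance, and the brute-force search. It covers only the pair-distance term; sequencing-error and coverage terms would be added to the same sum, and a real implementation would need the incremental sparse-matrix calculation rather than full enumeration.

```python
from itertools import permutations

# Hypothetical contigs (name -> length) and read pairs linking them.
CONTIG_LENGTHS = {"A": 100, "B": 100, "C": 100}
# Each pair: (contig of mate 1, offset in it, contig of mate 2, offset in it).
READ_PAIRS = [("A", 80, "B", 30), ("B", 90, "C", 40)]
MEAN_DISTANCE = 50.0  # assumed expected distance between mates
STD_DISTANCE = 10.0   # assumed spread of that distance


def pair_log_likelihood(order):
    """Gaussian log-likelihood (constants dropped) of pair distances for an ordering."""
    starts, cursor = {}, 0
    for name in order:  # lay contigs end to end, with no gaps or reversals
        starts[name] = cursor
        cursor += CONTIG_LENGTHS[name]
    total = 0.0
    for contig_a, offset_a, contig_b, offset_b in READ_PAIRS:
        distance = (starts[contig_b] + offset_b) - (starts[contig_a] + offset_a)
        z = (distance - MEAN_DISTANCE) / STD_DISTANCE
        total += -0.5 * z * z
    return total


def best_arrangement():
    """Exhaustively score every contig ordering and return the most likely one."""
    return max(permutations(CONTIG_LENGTHS), key=pair_log_likelihood)


print(best_arrangement())
```

Exhaustive enumeration grows factorially with the number of contigs, which is why the slide's incremental calculation matters at genome scale.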

Page 12: ISB

Next steps? …

SNAP
  Apply to more datasets / platforms / organisms
  Validate accuracy / coverage

FLASH
  Use Kaviar for population priors
  Different approaches to assembly / structural variation

Biology
  What interesting research could this enable – scale, speed, accuracy, analysis?