ISB
Ravi Pandya | Bill Bolosky
Microsoft
June 28 2012
Genomics project
Collaboration with UC Berkeley AMP Lab
  Dave Patterson, Armando Fox, Michael Jordan (ML), Taylor Sittler (UCSF Med), students, …
Long term: Cancer genomics
  David Haussler (UC Santa Cruz): Cancer Genomics Hub / Cancer Genome Atlas (TCGA)
  500 TB (growing to 20 PB) of tumor/normal genomes at San Diego Supercomputer Center
Near term: Genome sequencing pipeline
  Motivated by Archon Genomics X-Prize (September 2013)
  100 samples of DNA from centenarians (>105 years old)
  Sequence with best coverage, accuracy, and cost in 1 month
  Goal: 98% coverage, 99.9999% accuracy, $1000/genome
  Current tools (GATK, CLC) are not sufficient to meet the goal
Genomics pipeline
Fast, accurate, scalable
  Apply state-of-the-art computer science to the sequencing problem
  Machine learning, distributed systems, high-performance computing
  Open source for Windows+Linux | Windows Azure cloud service
SNAP (available now)
  Fast aligner using hash-based index of entire genome
  10-40x faster than BWA
FLASH (in progress)
  Comprehensive probabilistic model
  Reference-based alignment + targeted de novo contig assembly + scaffold assembly
Genomics pipeline
[Figure: pipeline diagram. Labels: SNAP; FLASH; aligned reads; unaligned reads; hash clustering; de novo assembly; optimization; scaffold assembly; call SNPs, indels, SVs]
SNAP
CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCA...GTTTAGCTCAAAGAG...
Reference genome
AGCTCAAA GAAAGAA
CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCAG
Read sequence
Hash index of seed {locations}
1. Lookup seeds
2. Map locations
3. Score matches
~15 core-hours for 30x coverage
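The three steps above (lookup seeds, map locations, score matches) can be sketched in a few lines of Python. This is only a toy illustration of seed-based alignment against a hash index, not SNAP's actual implementation: the function names, the 8-base seed length, the non-overlapping seed stride, and the Hamming-distance scoring are all assumptions made for the example.

```python
from collections import defaultdict

SEED_LEN = 8  # assumed seed length for this sketch


def build_index(reference, seed_len=SEED_LEN):
    """Hash index: map every seed in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching bases between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align(read, reference, index, seed_len=SEED_LEN, max_mismatches=3):
    """Return (position, mismatches) for the best candidate, or None."""
    candidates = set()
    # 1. Lookup seeds: take non-overlapping seeds from the read.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        # 2. Map locations: each hit implies a start position for the read.
        for hit in index.get(seed, ()):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # 3. Score matches: compare the whole read at each candidate location.
    best = None
    for start in sorted(candidates):
        mismatches = hamming(read, reference[start:start + len(read)])
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best
```

Calling `align(read, reference, build_index(reference))` returns the best-scoring start position, or `None` when no seed from the read occurs in the index. A real aligner would also handle reverse complements, indels, and ambiguous hits.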
FLASH
[Figure: FLASH optimization diagram. Labels: candidate assembly; coverage; pair distance; alignment; depth; separation; overlap; likelihood; optimize]
SNAP aligner
Genomic prior knowledge
Machine learning models
Sparse Matrices
RGS = Read-Genome-Strand candidate assembly
  [Figure: sparse matrix, 1B reads x 3B bp genome x strands; sample entries: 1 1 1 1]
LRG = Likelihood of Read-Genome alignment
  [Figure: sparse matrix, 1B reads x 3B bp genome; sample entries: 0.9 0.6 0.7 0.2 0.8]
  SNAP alignment
  Sequencing error
  Mutation frequency
  Variant databases
[Diagram labels: SNAP; read alignment; candidate assembly]
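As a rough illustration of how the two matrices might combine, the sketch below stores both sparsely as Python dictionaries and sums the log-likelihoods of the placements selected by the candidate assembly. The dictionary layout, the function name, and the rule "score = sum of log LRG over chosen RGS entries" are assumptions made for this example; the slides do not spell out the actual computation, and the meaning of the strand index is taken from the slide without interpretation.

```python
import math


def assembly_log_likelihood(rgs, lrg):
    """Sum log-likelihoods of the read placements chosen in a candidate assembly.

    rgs: {(read_id, genome_pos, strand): 1} for chosen placements only (sparse).
    lrg: {(read_id, genome_pos): probability} for scored placements only (sparse).
    """
    total = 0.0
    for (read_id, genome_pos, _strand), chosen in rgs.items():
        if chosen:
            total += math.log(lrg[(read_id, genome_pos)])
    return total


# Hypothetical toy data: two reads, each with one chosen placement.
rgs = {(0, 100, 0): 1, (1, 250, 1): 1}
lrg = {(0, 100): 0.9, (0, 5000): 0.2, (1, 250): 0.6}
score = assembly_log_likelihood(rgs, lrg)
```

Only the non-zero entries are stored, which is the point of a sparse representation when the dense matrix would be 1B reads by 3B positions.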
Coverage distribution
GSC = Genome-Strand Coverage
LC = Likelihood of Coverage
[Figure: matrices over strands and the 3B bp genome; values shown: 22 24 29 35 34 and 0.1 0.12 0.14 0.12 0.1. Diagram labels: assembly; RGS; coverage]
Sequencer characteristics
Alignment data
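One way to read this slide is that coverage is derived from the candidate assembly and then scored against an expected depth distribution. The sketch below follows that reading under loudly stated assumptions: the fixed read length, the dictionary layout for RGS, and especially the Poisson depth model are choices made for illustration only, since the slides do not say which distribution FLASH uses.

```python
import math
from collections import Counter


def genome_strand_coverage(rgs, read_len):
    """Count how many chosen reads cover each (strand, position).

    rgs: {(read_id, genome_pos, strand): 1} for chosen placements (sparse).
    """
    coverage = Counter()
    for (_read_id, genome_pos, strand), chosen in rgs.items():
        if chosen:
            for pos in range(genome_pos, genome_pos + read_len):
                coverage[(strand, pos)] += 1
    return coverage


def poisson_log_pmf(k, mean):
    """log P(K = k) for a Poisson distribution (assumed depth model)."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def coverage_log_likelihood(depths, mean_depth=30.0):
    """Sum of per-position log-likelihoods of the observed depths."""
    return sum(poisson_log_pmf(k, mean_depth) for k in depths)


# Depths like those shown on the slide, scored against an assumed 30x mean.
example = coverage_log_likelihood([22, 24, 29, 35, 34])
```

Under this model, depths near the expected mean score higher than depths far from it, which is what lets coverage act as evidence for or against a candidate assembly.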
Hash clustering
Cluster unaligned reads with overlapping bases
Starting point for assembling contigs
CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC
GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA
CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA
[Figure annotations: 1 1 2 3]
1. Count seeds
2. Bucket reads by seed
3. Connect overlapping reads
4. Cluster connected components
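The four steps above can be sketched with a seed-to-reads hash table and a union-find structure. This is a minimal illustration, not the FLASH implementation: the function name, the seed length, and the rule that two reads "overlap" when they share at least `min_shared_seeds` seeds are assumptions made for the example.

```python
from collections import defaultdict


def cluster_reads(reads, seed_len=8, min_shared_seeds=2):
    """Group reads that share seeds into connected components."""
    # 1-2. Count seeds and bucket reads by seed: each bucket holds the ids
    # of the reads containing that seed, so its size is the seed's count.
    buckets = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - seed_len + 1):
            buckets[read[i:i + seed_len]].add(read_id)

    # 3. Connect overlapping reads: count shared seeds for each read pair.
    shared = defaultdict(int)
    for read_ids in buckets.values():
        ordered = sorted(read_ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1

    # Union-find over read ids.
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in shared.items():
        if count >= min_shared_seeds:
            parent[find(a)] = find(b)

    # 4. Cluster connected components.
    clusters = defaultdict(list)
    for read_id in range(len(reads)):
        clusters[find(read_id)].append(read_id)
    return sorted(clusters.values())
```

Each returned cluster is a list of read indices. The pairwise loop inside a bucket is quadratic in bucket size, so a production version would need to cap or skip very common seeds.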
Targeted de novo assembly
[Figure: targeted de novo assembly diagram. Labels: contig "genome"; calc hash clusters; candidate assembly; coverage; pair distance; alignment; depth; separation; overlap; likelihood; optimize; infer; update]
Genomic prior knowledge
Machine learning models
Scaffold assembly
Maximum likelihood model
Optimized reference contigs + de novo unaligned contigs
Explore space of possible arrangements into a sample genome
Optimize P(observed reads | candidate genome)
= sequencing error + coverage depth + pair distance
Incremental calculation using sparse matrix model
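The idea of incremental calculation can be illustrated with a running sum: if the objective decomposes into independent log-likelihood terms, changing one placement only requires replacing that placement's term rather than rescoring everything. The sketch below assumes such a decomposition (reading the "+" on the slide as a sum of log terms); the class name, method name, and key scheme are hypothetical, and the sparse matrix machinery itself is not modeled here.

```python
class IncrementalLikelihood:
    """Maintain a total log-likelihood as a running sum of keyed terms."""

    def __init__(self):
        self.terms = {}   # key -> current log-likelihood contribution
        self.total = 0.0

    def update(self, key, log_term):
        """Set or replace one term and adjust the total by the difference."""
        self.total += log_term - self.terms.get(key, 0.0)
        self.terms[key] = log_term
        return self.total


# Hypothetical usage: keys name the kind of term and the object it scores.
model = IncrementalLikelihood()
model.update(("error", 0), -1.0)          # sequencing-error term for read 0
model.update(("coverage", 100), -0.5)     # coverage-depth term at position 100
model.update(("pair_distance", 0), -0.25) # pair-distance term for read pair 0
model.update(("error", 0), -0.5)          # read 0 moved: only its term changes
```

After the last call, `model.total` equals the sum of the current terms, obtained without revisiting the coverage or pair-distance entries.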
Next steps? …
SNAP
  Apply to more datasets / platforms / organisms
  Validate accuracy / coverage
FLASH
  Use Kaviar for population priors
  Different approaches to assembly / structural variation
Biology
  What interesting research could this enable – scale, speed, accuracy, analysis?