David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting...

26
David Patterson July 16, 2013

Transcript of David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting...

Page 1: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

David Patterson July 16, 2013

Page 2: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and
Page 3: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

Adaptive/Active

Machine Learning

and Analytics

Cloud Computing

CrowdSourcing/

Human

Computation

Massive

and

Diverse

Data

• 2011-2017

• Machine Learning,

Databases, Systems,

+ Networking

• Release Berkeley

Data Analysis Stack

(BDAS)

Page 4: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and
Page 5: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

MLBase (Declarative Machine Learning) BlinkDB (approx QP)

HDFS

Shark (SQL) + Streaming

AMPLab (released) 3rd party AMPLab (in progress)

Streaming

Hadoop MR

MPI

Graphlab

etc.

Spark

Shared RDDs (distributed memory) Mesos (cluster resource manager)

Page 6: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

$0.1

$1.0

$10.0

$100.0

$1,000.0

$10,000.0

$100,000.0

2001 - 2014

$K per genome

Page 7: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and
Page 8: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

UC Students/

Post-Docs – Ma’ayan Bresler

– Kristal Curtis

– Jesse Liptrap

– Ameet Talwalkar

– Jonathan Terhorst

– Matei Zaharia

– Yuchen Zhang

UC Faculty – Michael Jordan

– David Patterson

– Satish Rao

– Scott Shenker

– Yun Song

– Ion Stoica

Expertise – Computational Biology/Medicine

– Machine Learning

– Systems

External – Bill Bolosky (MS/MSR)

– Christopher Hartl (Broad)

– Mishali Naik (Intel)

– Paolo Narvaez (Intel)

– Ravi Pandya (MS)

– Abirami Prabhakaran (Intel)

– Taylor Sittler (UCSF)

– Gans Srinivasa (Intel)

– Arun Wiita (UCSF)

Collaborators

– David Haussler, UCSC

– Gaddy Getz, The Broad

– Mark Depristo, The Broad

Page 9: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

“Computational science: ...Error…why scientific programming does

not compute,” by Zeeya Merali, 13 October 2010, Nature 467, 775-777

Page 10: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

http://snap.cs.berkeley.edu/

Page 11: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

SNAP 2.5

Novo 33.5

Bowtie2 7.5

BWA 28.0

http://snap.cs.berkeley.edu/

Novo

SNAP

Bowtie2

BWA

Error Rate

% A

lig

ned

Page 12: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

Step # of Threads Runtime (hours)

Read Alignment 16 16

Sampe 1 59

Import 1 19

Sort + Index 1 25

MarkDuplicates + Index 1 21

UnifiedGenotyper* 16 12

SomaticIndelDetector 1 5.5

RealignerTargetCreator 16 1.5

IndelRealigner* + Index 1 32.5

BaseRecalibrator* 1 96

PrintReads* + Index + Flagstat 1 44.5

TOTAL (hours) 332

# of Threads/Tasks Improved Runtime (hours)

16 16 16 12 1 19 1 25 1 21 16 12 1 5.5 16 1.5 64 16 64 15 64 26

169

Introduced Cluster-level

Parallelism

Introduced Thread-level

Parallelism

SNAP+ does in 10 hours

vs. 93 hours

Page 13: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

1. Create easy-to-use, fast, accurate, reliable genetic analysis software pipelines

Page 14: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

Emperor of All Maladies,

page 464

Page 15: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

(UCSF)

(UCSC)

(UCSC)

(UCSC) (UCSC)

Page 16: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

• create and maintain the interoperability of technology platform standards

for managing and sharing genomic data in clinical samples;

• develop guidelines and harmonizing procedures for privacy and ethics in

the international regulatory context;

• engage stakeholders across sectors to encourage the responsible and

voluntary sharing of data and of methods.

Global Alliance to Enable Responsible Sharing of Genomic and Clinical Data

Page 17: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

Sequence graphs Interpretation Layer

Variation Layer

Read Layer

Page 18: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and
Page 19: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and
Page 20: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

1. Create easy-to-use, fast, accurate, reliable genetic analysis software pipelines

2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and outcomes

Page 21: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

21

If you cannot measure it,

you cannot improve it.

- Lord Kelvin

Page 22: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

22

Page 23: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

23

1. Reference

2. Sample

3. Validation

1. Reference

2. Sample

3. Validation

1. Reference

2. Sample

3. Validation

Page 24: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

24

Precision Recall

GATK mpileup GATK mpileup

Synthetic 98.3±0.0 96.8±0.0 91.3±0.00 96.9±0.00

Mouse 98.6±0.3 98.4±0.3 89.7±0.20 87.6±0.20

Sampled Human

pending pending 98.4±0.04 80.9±0.04

Time/cost on AWS EC2 (cc2.8xlarge, 16 cores, 60GB RAM)

$/Genome Hours/Genome

GATK mpileup GATK mpileup

Synthetic $67 $5 28 2

Mouse $72 $17 30 7

Sampled Human $192 $22 80 9

Page 25: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and
Page 26: David Pattersonanalysis software pipelines 2. Create massive, cheap, easy-to-use, privacy-protecting repository for cancer treatments showing tumor genomes over time, therapies, and

chance

aren’t we obligated to try

26