Persona: A High-Performance Bioinformatics Framework

Stuart Byma¹, Sam Whitlock¹, Laura Flueratoru², Ethan Tseng³, Christos Kozyrakis⁴, Edouard Bugnion¹, James Larus¹
EPFL¹, U. Politehnica of Bucharest², CMU³, Stanford⁴
12/07/2017


Agenda

• Motivation
• Bioinformatics Data and Tools
• Persona
  • AGD
  • Dataflow Engine
• Performance Results

Sequencing cost

Not a wet lab problem anymore → IT/systems problem

Implications

~300 GB, ~hours

Need efficient systems that scale well


What kind of data?

• Common sequencers produce reads
  • Snippets of DNA → AACCGCTAGCGCGCTAGCTCGAGCTAGAA
  • 100-200 bases

@sequence name, metadata
ACGTTTCGATCGCGCCAGGAGGCTAG
+
-+*''))**55CCF@>>>>>CCCCCC

times a few hundred million…
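
The four-line record on this slide (name line, bases, a "+" separator, one quality character per base) can be read with a few lines of Python. This is a minimal sketch, assuming well-formed, unwrapped four-line records; `parse_fastq` and `Read` are illustrative names and not part of Persona.

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    name: str       # text after the leading '@'
    bases: str      # the DNA snippet
    qualities: str  # one quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield reads from FASTQ text lines, four lines per record."""
    rows = [line.rstrip("\n") for line in lines]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(rows), 4):
        header, bases, separator, qualities = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(bases) != len(qualities):
            raise ValueError("bases and qualities differ in length")
        yield Read(header[1:], bases, qualities)


# The record shown on the slide.
record = [
    "@sequence name, metadata",
    "ACGTTTCGATCGCGCCAGGAGGCTAG",
    "+",
    "-+*''))**55CCF@>>>>>CCCCCC",
]
reads = list(parse_fastq(record))
```

A real dataset repeats this record a few hundred million times, which is why the later slides care about chunking and I/O.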

Alignment

Reference genome:  ...TGACCTATAGCGATATAGCTTATTATTGGG-CAAAAATGGAATCGATTGATCG...
                                          |||||||||| ||||| |||
Read:                                     TATTATTGGGATAAAA-TGG

Differences marked on the slide: insertion, deletion, mismatch

times a few hundred million… (~hours)
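
The three kinds of difference labelled on this slide can be read off a gapped alignment column by column. The sketch below is illustrative only: `classify_columns` is a hypothetical helper, and the operation letters follow the SAM CIGAR convention (`=` match, `X` mismatch, `I` insertion, `D` deletion).

```python
def classify_columns(ref_row: str, read_row: str) -> str:
    """Summarise a gapped pairwise alignment as run-length-encoded operations.

    A '-' in the reference row means the read has an extra base (insertion);
    a '-' in the read row means the read lacks a reference base (deletion).
    """
    if len(ref_row) != len(read_row):
        raise ValueError("alignment rows must have equal length")
    runs = []  # list of [operation, run length]
    for ref_base, read_base in zip(ref_row, read_row):
        if ref_base == "-":
            op = "I"
        elif read_base == "-":
            op = "D"
        elif ref_base == read_base:
            op = "="
        else:
            op = "X"
        if runs and runs[-1][0] == op:
            runs[-1][1] += 1
        else:
            runs.append([op, 1])
    return "".join(f"{length}{op}" for op, length in runs)


# The aligned region from the slide.
summary = classify_columns("TATTATTGGG-CAAAAATGG", "TATTATTGGGATAAAA-TGG")
print(summary)  # 10=1I1X4=1D3=
```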

Aligned Reads

• Stored in SAM/BAM

read_name 16 chr12 85500011 70 18M * 0 0 TTTTACACACATTATCTC CDDFAEEC>EDDFFBCDEED?FCC@ PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm

• Followed by
  • Duplicate marking
  • Sorting
  • Recalibrations, analysis (variant calling)

~hours
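
To make the SAM record on this slide easier to read, here is a small sketch that names its columns. It assumes tab-separated input as in a real SAM file (the slide's spaces come from the transcript); `parse_sam_line` is an illustrative helper, and the field names are the eleven mandatory SAM columns.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INT_FIELDS = ("flag", "pos", "mapq", "pnext", "tlen")


def parse_sam_line(line: str) -> dict:
    """Split one SAM alignment line into named fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM line needs at least 11 columns")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    # Optional fields have the form TAG:TYPE:VALUE.
    record["tags"] = {}
    for col in cols[11:]:
        tag, _type, value = col.split(":", 2)
        record["tags"][tag] = value
    # Bit 0x10 of FLAG means the read maps to the reverse strand.
    record["reverse_strand"] = bool(record["flag"] & 0x10)
    return record


line = "\t".join([
    "read_name", "16", "chr12", "85500011", "70", "18M", "*", "0", "0",
    "TTTTACACACATTATCTC", "CDDFAEEC>EDDFFBCDEED?FCC@",
    "PL:Z:Illumina", "PU:Z:pu", "LB:Z:lb", "SM:Z:sm",
])
rec = parse_sam_line(line)
```

Here `rec["reverse_strand"]` is true because FLAG 16 sets bit 0x10, and `rec["cigar"]` is `18M`, meaning 18 aligned bases.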

Data and Tool Issues

FASTQ, SAM/BAM, BED, VCF

Persona – Bioinformatics, Unified

Aggregate Genomic Data


Aggregate Genomic Data

[Diagram labels: Manifest; Storage Subsystem; columns: Bases, Q-Scores, Metadata; chunk layout: Header, Index, Data (compressed)]
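
The chunk layout named on this slide (header, index, compressed data) can be illustrated with a toy encoder. This is a sketch under stated assumptions, not the real AGD on-disk format: the `MAGIC` tag, the field widths, the use of zlib, and the function names are all hypothetical.

```python
import struct
import zlib
from typing import List

MAGIC = b"CHNK"  # hypothetical 4-byte tag, not the real AGD header


def write_chunk(records: List[bytes]) -> bytes:
    """Pack records as header | index of record sizes | compressed data."""
    index = struct.pack(f"<{len(records)}I", *(len(r) for r in records))
    data = zlib.compress(b"".join(records))
    header = MAGIC + struct.pack("<II", len(records), len(data))
    return header + index + data


def read_chunk(blob: bytes) -> List[bytes]:
    """Inverse of write_chunk: use the index to slice the decompressed data."""
    if blob[:4] != MAGIC:
        raise ValueError("not a chunk")
    count, data_len = struct.unpack_from("<II", blob, 4)
    sizes = struct.unpack_from(f"<{count}I", blob, 12)
    data_start = 12 + 4 * count
    data = zlib.decompress(blob[data_start:data_start + data_len])
    records, offset = [], 0
    for size in sizes:
        records.append(data[offset:offset + size])
        offset += size
    return records


# One column (here: bases) stored as its own chunk.
bases_column = [b"ACGTTTCGATCG", b"TTGACA", b"GGCTAG"]
restored = read_chunk(write_chunk(bases_column))
```

Keeping bases, quality scores, and metadata in separate chunks like this lets a tool read only the columns it needs.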


Dataflow

• Dataflow execution framework
  • Based on the TensorFlow engine
  • But no machine learning
• Operators perform computation on AGD chunks

[Diagram label: AGD Chunks]

Dataflow

• Modularity
• Balance/tuning
• (Bounded) queueing

[Diagram label: AGD Chunks]
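
The bounded queueing mentioned above can be sketched with the Python standard library: each stage runs in its own thread, and small queues between stages make a fast producer block instead of buffering without limit. This is a generic illustration, not Persona's or TensorFlow's queue implementation; `stage`, `run_pipeline`, and `capacity` are made-up names.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))


def run_pipeline(items, fns, capacity=2):
    """Run items through fns, one thread per stage, with bounded queues."""
    queues = [queue.Queue(maxsize=capacity) for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]

    def feed():
        for item in items:
            queues[0].put(item)  # blocks while the first queue is full
        queues[0].put(SENTINEL)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


out = run_pipeline(["acgt", "ttga"], [str.upper, lambda s: s[::-1]])
```

Because every stage is a single thread, results come out in input order; `capacity` is the tuning knob that trades memory for slack between stages.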

Fine-grained Threading

• AGD chunks optimized for storage
  • Too coarse for some tasks
• Split into subchunks
  • Delegate to executor shared resource
  • Task queue + thread pool

[Diagram labels: Aligners, Notify, AGD Buf, Thread Pool]
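
Splitting a chunk into subchunks and handing them to a shared thread pool might look like the sketch below. It uses `concurrent.futures` as a stand-in for the executor on the slide; `split`, `process_subchunk`, `process_chunk`, and `subchunk_size` are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor


def split(chunk, size):
    """Cut a chunk (a list of records) into subchunks of at most `size`."""
    return [chunk[i:i + size] for i in range(0, len(chunk), size)]


def process_subchunk(work, subchunk):
    return [work(record) for record in subchunk]


def process_chunk(chunk, work, executor, subchunk_size):
    """Submit each subchunk as one task and reassemble results in order."""
    futures = [
        executor.submit(process_subchunk, work, sub)
        for sub in split(chunk, subchunk_size)
    ]
    results = []
    for future in futures:
        results.extend(future.result())
    return results


# One pool shared by all callers; len() stands in for aligning a read.
with ThreadPoolExecutor(max_workers=4) as pool:
    lengths = process_chunk(["ACGT", "AC", "ACGTAC", "A", "ACG"], len, pool,
                            subchunk_size=2)
```

Collecting the futures in submission order keeps the output aligned with the input records even when tasks finish out of order.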

Aligner Graph

[Diagram: GetChunk → Decompress/Parse → AlignReads → Compress → PutChunk. Other labels: AGD Chunks, Alignment Executor]

Graph Construction

c = persona.read_chunk(path)   # read an AGD chunk
d = persona.decompress(c)      # decompress it
o = persona.align(d)           # align its reads

sess = tf.Session()
result = sess.run([o])         # nothing executes until the session runs the graph
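
The build-then-run pattern in this snippet can be shown without TensorFlow. The toy below is an assumption-laden stand-in, not the TensorFlow or Persona API: `Node` is a made-up class, and splitting the text into reads takes the place of real alignment.

```python
import zlib


class Node:
    """A deferred operation: nothing runs until run() is called."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def run(self):
        # Evaluate the upstream nodes first, then this node's function.
        return self.fn(*(node.run() for node in self.inputs))


payload = zlib.compress(b"ACGT\nTTGA\n")

# Building the graph only wires nodes together.
c = Node(lambda: payload)             # stands in for reading a chunk
d = Node(zlib.decompress, c)          # decompress
o = Node(lambda raw: raw.split(), d)  # placeholder for the alignment step

# Execution happens here, analogous to sess.run([o]).
result = o.run()
```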

Persona Shell

[Diagram: Persona Shell with align, sort, and import commands, on a local runtime and a dist runtime]

$ persona align local -i hg19 data/my_agd.json
$ persona sort local data/my_agd.json

Distributed Computation

[Diagram labels: Client, Queue Service, Server 0, Server 1, …, Server N, Storage Subsystem]

Client: $ persona client bwa-align

Current Features

• Import data from FASTQ/BAM/SRA, export to BAM
• Sequence alignment with BWA-MEM, SNAP
• Dataset sorting
• Duplicate marking
• Dataset statistics (samtools flagstat)
• Read coverage (depth)


Evaluation -- Setup

• Focused on sequence alignment using SNAP
• Throughput in bases aligned per second
• Data
  • 223 million 101-base reads (~16 GB)
  • AGD chunks of 100K records
• Hardware
  • 32X Ubuntu 16.04, [email protected]
  • Data on 6-disk RAID 0 and single spindle drive
  • 7-server Ceph object store for distributed execution

Evaluation -- Questions

• What are the bandwidth-saving effects of AGD?
• What is the overhead of the Persona framework?
• How well do Persona and AGD scale?

Performance – AGD

[Chart comparing SNAP and Persona SNAP; *single disk]

Significantly less I/O → more efficient use of HW, BW

Persona Overhead

[Chart; *RAID-0]

Negligible overhead!

Scaling

Full dataset aligned in ~17 seconds
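
To put the ~17 seconds in the deck's own metric (bases aligned per second), the arithmetic can be worked from the evaluation setup figures: 223 million reads of 101 bases, in chunks of 100K records. The result is a back-of-the-envelope estimate, since both the read count and the elapsed time are approximate on the slides.

```python
READS = 223_000_000          # from the evaluation setup slide
BASES_PER_READ = 101
RECORDS_PER_CHUNK = 100_000
ELAPSED_SECONDS = 17         # approximate, from the scaling slide

total_bases = READS * BASES_PER_READ
chunks = -(-READS // RECORDS_PER_CHUNK)  # ceiling division
throughput = total_bases / ELAPSED_SECONDS

print(f"{total_bases / 1e9:.2f} Gbases in {chunks} chunks")
print(f"about {throughput / 1e9:.2f} Gbases aligned per second")
```

This comes to roughly 22.5 billion bases in 2,230 chunks, or about 1.3 billion bases aligned per second across the cluster.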

Scaling Limits

Persona – Scalable Bioinformatics

Aggregate Genomic Data

https://github.com/epfl-vlsc/persona

Backup

Performance – Sort and Dup. Mark

• Sort
  • By metadata or aligned location
  • 1.54x speedup over samtools
  • 5.15x speedup over Picard
• Dataset stats
  • 2x speedup
• Duplicate marking
  • Same algorithm as samblaster
  • 3.73x faster than samblaster
• Coverage (depth)
  • 2x speedup

Profiling

Read/Write Single Disk

Alignment

• Example: SNAP
  • Build hash index of reference
  • To align a read:
    • Hash a portion (seed)
    • Look up
    • Evaluate each hit
    • Edit distance computation
• Cores align reads in parallel

[Diagram: the read TATTATTGGGATAAAATGGTTT is looked up in the Reference Genome Index (40 GB) and compared by edit distance against the reference location ...TATTACTGGGCAAAAATGGTTTATG...]
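
The steps listed on this slide (index the reference, hash a seed, look it up, score each hit by edit distance) can be sketched in a few functions. This is a toy illustration of the general seed-and-extend idea, not SNAP's actual algorithm: it uses a single seed taken from the start of the read, a plain dictionary as the index, and textbook Levenshtein distance. The seed length `k = 5` is arbitrary.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def align(read: str, reference: str, index: dict, k: int):
    """Return (position, distance) of the best seed hit, or None."""
    best = None
    for pos in index.get(read[:k], []):
        candidate = reference[pos:pos + len(read)]
        dist = edit_distance(read, candidate)
        if best is None or dist < best[1]:
            best = (pos, dist)
    return best


# Sequences from the slide.
REFERENCE = "TATTACTGGGCAAAAATGGTTTATG"
READ = "TATTATTGGGATAAAATGGTTT"
K = 5
hit = align(READ, REFERENCE, build_index(REFERENCE, K), K)
```

Each read is aligned independently of the others, which is what lets the cores work on reads in parallel.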

Shared Data

• Sometimes need to share data between ops
  • E.g. multi-GB index of reference genome
• Use TF session resource manager
  • [string, string] → refcount object
• Op can create objects, provide handle to other ops

[Diagram: Provider Op calls LookupOrCreate() on the Resource Manager and passes the handle [c, n]; Consumer Op calls Lookup()]
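
The sharing pattern on this slide, where a provider op creates an object under a two-string key and a consumer op looks it up by the same key, can be mimicked in plain Python. This is a toy stand-in, not TensorFlow's resource manager: the class, the method names, and the example key are assumptions, and Python's own reference counting takes the place of explicit refcounts.

```python
import threading


class ResourceManager:
    """Shared objects keyed by a (container, name) pair of strings."""

    def __init__(self):
        self._lock = threading.Lock()
        self._resources = {}

    def lookup_or_create(self, container, name, factory):
        """Return the object for the key, building it once if missing."""
        key = (container, name)
        with self._lock:
            if key not in self._resources:
                self._resources[key] = factory()
            return self._resources[key]

    def lookup(self, container, name):
        """Return an existing object; raises KeyError if it was never created."""
        with self._lock:
            return self._resources[(container, name)]


manager = ResourceManager()

# Provider op: builds the (here tiny) reference index once.
index = manager.lookup_or_create("genome", "hg19_index",
                                 lambda: {"TATTA": [0]})

# Consumer op: receives only the handle ("genome", "hg19_index").
same_index = manager.lookup("genome", "hg19_index")
```

Passing the small handle between ops instead of the object itself is what keeps a multi-GB index from being copied.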

Data Movement

• Tensors not amenable to bioinfo data
• Leverage TF shared resources
• Implement reusable buffers
  • Stable memory use
  • Avoid syscalls

[Diagram: Buffer Pool Op → [container, name] → Pool]
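
Reusable buffers can be illustrated with a small pool that hands out preallocated `bytearray` objects and takes them back after use. This is a generic sketch, not Persona's buffer pool: `BufferPool`, `acquire`, and `release` are made-up names, and the sketch is single-threaded for brevity.

```python
class BufferPool:
    """Hand out preallocated buffers and recycle them after use."""

    def __init__(self, buffer_size: int, count: int):
        self._buffer_size = buffer_size
        self._free = [bytearray(buffer_size) for _ in range(count)]

    def acquire(self) -> bytearray:
        # Reuse a free buffer if one exists; allocate only when the pool is empty.
        if self._free:
            return self._free.pop()
        return bytearray(self._buffer_size)

    def release(self, buf: bytearray) -> None:
        self._free.append(buf)


pool = BufferPool(buffer_size=1024, count=2)

first = pool.acquire()
first[:4] = b"ACGT"    # fill the buffer with chunk data
pool.release(first)

second = pool.acquire()  # the same object comes back, no new allocation
```

In steady state the pool allocates nothing new, which is the source of the stable memory use noted on the slide.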

Bioinformatics?

• Biology, computer science, math, statistics
• Started mid 90's with Human Genome Project
• Broad field
  • Genomics, proteomics, systems biology
• This talk: Whole Genome Sequence (WGS) analysis
  • Reading the letters of your DNA (ATCG…)