Persona: A High-Performance Bioinformatics Framework
Stuart Byma¹, Sam Whitlock¹, Laura Flueratoru², Ethan Tseng³, Christos Kozyrakis⁴, Edouard Bugnion¹, James Larus¹
¹EPFL, ²U. Politehnica of Bucharest, ³CMU, ⁴Stanford
12/07/2017
Agenda
• Motivation
• Bioinformatics Data and Tools
• Persona
  • AGD
  • Dataflow Engine
• Performance Results
Sequencing cost
Not a wet lab problem anymore → IT/Systems problem
Implications
~300 GB, ~hours
Need efficient systems that scale well
What kind of data?
• Common sequencers produce reads
  • Snippets of DNA → AACCGCTAGCGCGCTAGCTCGAGCTAGAA
  • 100-200 bases
@sequence name, metadata
ACGTTTCGATCGCGCCAGGAGGCTAG
+
-+*''))**55CCF@>>>>>CCCCCC
times a few hundred million…
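The four-line record above is the FASTQ layout: a header starting with "@", the bases, a "+" separator, and one quality character per base. A minimal parsing sketch follows; it assumes unwrapped four-line records and the Phred+33 quality encoding, and the names `FastqRecord`, `parse_fastq`, and `phred_scores` are illustrative, not part of Persona.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str        # header line without the leading "@"
    bases: str       # the read itself
    qualities: str   # one ASCII-encoded quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ lines, assuming four lines per record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        bases, plus, quals = next(it, None), next(it, None), next(it, None)
        if quals is None:
            raise ValueError("truncated FASTQ record")
        bases, plus, quals = bases.rstrip("\n"), plus.rstrip("\n"), quals.rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(bases) != len(quals):
            raise ValueError("bases and qualities differ in length")
        yield FastqRecord(header[1:], bases, quals)


def phred_scores(qualities: str, offset: int = 33) -> List[int]:
    """Decode quality characters, assuming the Phred+33 encoding."""
    return [ord(c) - offset for c in qualities]


if __name__ == "__main__":
    # The record shown on the slide.
    record_lines = [
        "@sequence name, metadata",
        "ACGTTTCGATCGCGCCAGGAGGCTAG",
        "+",
        "-+*''))**55CCF@>>>>>CCCCCC",
    ]
    for rec in parse_fastq(record_lines):
        print(rec.name, len(rec.bases), phred_scores(rec.qualities)[:3])
```

A generator keeps memory use flat, which matters when the input holds a few hundred million records.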
Alignment
Reference genome: ...TGACCTATAGCGATATAGCTTATTATTGGG-CAAAAATGGAATCGATTGATCG...
                                         |||||||||| ||||| |||
Read:                                    TATTATTGGGATAAAA-TGG
[Figure labels: Insertion, Deletion, Mismatch]
Aligned Reads
• Stored in SAM/BAM
read_name 16 chr12 85500011 70 18M * 0 0 TTTTACACACATTATCTC CDDFAEEC>EDDFFBCDEED?FCC@ PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm
• Followed by
  • Duplicate marking
  • Sorting
  • Recalibrations, analysis (variant calling)
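The SAM record shown above can be split into its mandatory fields with a few lines of code. This is a sketch only: it assumes tab-separated fields as in real SAM files (the slide shows them with spaces), and `SamRecord`, `parse_sam_line`, `parse_cigar`, and `is_reverse` are illustrative names, not Persona functions.

```python
import re
from typing import Dict, List, NamedTuple, Tuple

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


class SamRecord(NamedTuple):
    qname: str            # read name
    flag: int             # bitwise flags
    rname: str            # reference sequence name, e.g. chr12
    pos: int              # 1-based leftmost mapping position
    mapq: int             # mapping quality
    cigar: str            # alignment description, e.g. 18M
    seq: str              # read bases
    qual: str             # base qualities
    tags: Dict[str, str]  # optional TAG:TYPE:VALUE fields, keyed by TAG


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; RNEXT, PNEXT and TLEN are skipped here."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    tags = {}
    for field in f[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[9], f[10], tags)


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn '18M' into [(18, 'M')]."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def is_reverse(record: SamRecord) -> bool:
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)
```

For the record above, the flag value 16 is exactly the 0x10 bit, so `is_reverse` would report a reverse-strand alignment, and `parse_cigar("18M")` describes 18 aligned bases.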
Data and Tool Issues
FASTQ, SAM/BAM, BED, VCF, …
Persona – Bioinformatics, Unified
Aggregate Genomic Data
Aggregate Genomic Data
Header
Index
Data (compressed)
Manifest
Storage Subsystem
Bases, Q-Scores, Metadata
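To make the Header / Index / Data (compressed) labels concrete, here is a hypothetical chunk layout. It is not the actual AGD on-disk format: the magic value, field widths, and the use of zlib are all assumptions chosen for illustration. The index stores one length per record so that individual records can be located after decompression.

```python
import struct
import zlib
from typing import List

MAGIC = b"CHNK"       # hypothetical magic value, not AGD's
HEADER_FMT = "<4sII"  # magic, record count, compressed data size (assumed fields)


def pack_chunk(records: List[bytes]) -> bytes:
    """Serialize records as header + index of record lengths + compressed data."""
    index = struct.pack(f"<{len(records)}I", *(len(r) for r in records))
    data = zlib.compress(b"".join(records))
    header = struct.pack(HEADER_FMT, MAGIC, len(records), len(data))
    return header + index + data


def unpack_chunk(blob: bytes) -> List[bytes]:
    """Recover the records by walking the index over the decompressed data."""
    magic, count, data_size = struct.unpack_from(HEADER_FMT, blob, 0)
    if magic != MAGIC:
        raise ValueError("not a chunk")
    offset = struct.calcsize(HEADER_FMT)
    lengths = struct.unpack_from(f"<{count}I", blob, offset)
    offset += 4 * count
    data = zlib.decompress(blob[offset:offset + data_size])
    records, pos = [], 0
    for length in lengths:
        records.append(data[pos:pos + length])
        pos += length
    return records
```

Under this sketch, each column (bases, quality scores, metadata) would be stored as its own sequence of such chunks, so a task that needs only bases never reads the other columns.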
AGD Chunks
Dataflow
• Dataflow execution framework
  • Based on TensorFlow engine
  • But no machine learning
• Operators perform computation on AGD chunks
• Modularity
• Balance/tuning
• (Bounded) queueing
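The idea of operators connected by bounded queues can be sketched in plain Python. This is not how Persona or TensorFlow implement it; `run_pipeline`, `stage`, and the `capacity` parameter are illustrative names. A bounded queue blocks a fast producer once it is full, which keeps memory use stable when stages run at different speeds.

```python
import queue
import threading
from typing import Callable, Iterable, List

SENTINEL = object()  # marks the end of the stream


def stage(fn: Callable, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Run one operator: take an item, apply fn, pass the result on."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown downstream
            return
        outbox.put(fn(item))


def run_pipeline(chunks: Iterable, fns: List[Callable], capacity: int = 2) -> list:
    """Chain the operators in fns with bounded queues, one thread per operator."""
    queues = [queue.Queue(maxsize=capacity) for _ in range(len(fns) + 1)]
    workers = [threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate(fns)]

    def feed() -> None:
        for chunk in chunks:
            queues[0].put(chunk)  # blocks while the first queue is full
        queues[0].put(SENTINEL)

    workers.append(threading.Thread(target=feed))
    for w in workers:
        w.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results
```

For example, `run_pipeline(["acgt", "ttga"], [str.upper, len])` passes each chunk through both operators in order. Tuning then amounts to choosing queue capacities and how many threads serve each stage.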
Fine-grained Threading
• AGD chunks optimized for storage
  • Too coarse for some tasks
• Split into subchunks
  • Delegate to executor shared resource
  • Task queue + thread pool
[Figure labels: Aligners, Notify, AGD Buf, Thread Pool]
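Splitting a chunk into subchunks and handing them to a shared pool might look like the sketch below. It uses Python's standard `ThreadPoolExecutor` as a stand-in for the executor resource; `split_into_subchunks`, `process_chunk`, and `subchunk_size` are hypothetical names, and Persona's own executor is not shown in the slides.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_into_subchunks(records: Sequence, subchunk_size: int) -> List[Sequence]:
    """Cut one storage-sized chunk into smaller units of work."""
    return [records[i:i + subchunk_size]
            for i in range(0, len(records), subchunk_size)]


def process_chunk(executor: ThreadPoolExecutor, records: Sequence,
                  work_fn: Callable, subchunk_size: int = 4) -> list:
    """Submit each subchunk to the shared pool and merge results in order."""
    futures = [
        executor.submit(lambda sub: [work_fn(r) for r in sub], sub)
        for sub in split_into_subchunks(records, subchunk_size)
    ]
    merged = []
    for future in futures:  # iterate in submission order to keep record order
        merged.extend(future.result())
    return merged
```

Because the pool is created once and passed in, several operators can share it, and the number of worker threads stays fixed no matter how many chunks are in flight.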
Aligner Graph
[Figure: dataflow graph with nodes Get Chunk, Decompress/Parse, Align Reads, Compress, Put Chunk; also labeled AGD Chunks and Alignment Executor]
Graph Construction
c = persona.read_chunk(path)
d = persona.decompress(c)
o = persona.align(d)
sess = tf.Session()
result = sess.run([o])
Persona Shell
[Figure: Persona Shell with subcommands align, sort, import, …; local runtime, dist runtime]
$ persona align local -i hg19 data/my_agd.json
$ persona sort local data/my_agd.json
Distributed Computation
[Figure: Client, Queue Service, Server 0, Server 1 … Server N, Storage Subsystem]
$ persona client bwa-align
Current Features
• Import data from FASTQ/BAM/SRA, export to BAM
• Sequence alignment with BWA-MEM, SNAP
• Dataset sorting
• Duplicate marking
• Dataset statistics (samtools flagstat)
• Read coverage (depth)
Evaluation – Setup
• Focused on sequence alignment using SNAP
• Throughput in bases aligned per second
• Data
  • 223 million 101-base reads (~16 GB)
  • AGD chunks of 100K records
• Hardware
  • 32X Ubuntu 16.04, [email protected]
  • Data on 6-disk RAID0 and single spindle drive
  • 7-server Ceph object store for distributed execution
Evaluation – Questions
• What are the bandwidth-saving effects of AGD?
• What is the overhead of the Persona framework?
• How well do Persona and AGD scale?
Performance – AGD
[Figure: SNAP vs. Persona SNAP, *single disk]
Significantly less I/O → more efficient use of HW, BW
Persona Overhead
[Figure: *RAID-0]
Negligible overhead!
Scaling
Full dataset aligned in ~17 seconds
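As a back-of-the-envelope check, the ~17 second figure can be turned into the throughput metric used in the evaluation (bases aligned per second). This assumes the full dataset is the 223 million 101-base reads from the setup slide; since 17 seconds is itself approximate, the result is only a rough estimate.

```python
reads = 223_000_000   # reads in the dataset (from the setup slide)
bases_per_read = 101  # read length in bases
seconds = 17          # approximate wall-clock time for the full dataset

total_bases = reads * bases_per_read
throughput = total_bases / seconds  # bases aligned per second

print(f"{total_bases / 1e9:.1f} Gbases in {seconds} s "
      f"is about {throughput / 1e9:.2f} Gbases/s")
```

That works out to roughly 1.3 billion bases aligned per second across the cluster.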
Scaling Limits
Persona – Scalable Bioinformatics
Aggregate Genomic Data
https://github.com/epfl-vlsc/persona
Backup
Performance – Sort and Dup. Mark
• Sort
  • By metadata or aligned location
  • 1.54x speedup over samtools
  • 5.15x speedup over Picard
• Dataset stats
  • 2x speedup
• Duplicate marking
  • Same algorithm as samblaster
  • 3.73x faster than samblaster
• Coverage (depth)
  • 2x speedup
Profiling
Read/Write Single Disk
Alignment
• Example: SNAP
  • Build hash index of reference
  • To align a read:
    • Hash a portion (seed)
    • Lookup
  • Evaluate each hit
    • Edit distance computation
• Cores align reads in parallel
[Figure: read TATTATTGGGATAAAATGGTTT looked up in the Reference Genome Index (40 GB), then compared by edit distance against the reference ...TATTACTGGGCAAAAATGGTTTATG...]
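The steps listed above (index the reference, hash a seed, look it up, score each hit by edit distance) can be illustrated with a heavily simplified sketch. This is not SNAP's actual algorithm: it uses a single seed taken from the start of the read, a plain Python dictionary as the hash index, and a textbook dynamic-programming edit distance. The function names and the parameter `k` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or mismatch
        prev = cur
    return prev[-1]


def align_read(index: Dict[str, List[int]], reference: str,
               read: str, k: int) -> Optional[Tuple[int, int]]:
    """Return (position, edit distance) of the best seed hit, or None if no hit."""
    best = None
    for pos in index.get(read[:k], []):         # seed = first k bases of the read
        window = reference[pos:pos + len(read)]  # candidate location
        dist = edit_distance(read, window)
        if best is None or dist < best[1]:
            best = (pos, dist)
    return best
```

Because `align_read` only reads the shared index, many reads can be aligned concurrently, which is what "cores align reads in parallel" relies on.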
Shared Data
• Sometimes need to share data between ops
  • E.g. multi-GB index of reference genome
• Use TF session resource manager
  • [string, string] → refcount object
• Op can create objects, provide handle to other ops
[Figure: Resource Manager, LookupOrCreate() [c, n], Lookup(), Provider Op, Consumer Op]
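The pattern described here, a registry keyed by a pair of strings that hands out reference-counted shared objects, can be sketched in Python. This is an analogy only, not the TensorFlow resource manager API: the class, its method names, and the explicit `release` call are assumptions made for illustration.

```python
import threading
from typing import Any, Callable, Dict, Tuple


class ResourceManager:
    """Toy registry mapping (container, name) to a shared, ref-counted object."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[Tuple[str, str], list] = {}  # key -> [object, refcount]

    def lookup_or_create(self, container: str, name: str,
                         factory: Callable[[], Any]) -> Any:
        """Return the object, building it with factory() only on first use."""
        with self._lock:
            entry = self._entries.get((container, name))
            if entry is None:
                entry = [factory(), 0]
                self._entries[(container, name)] = entry
            entry[1] += 1
            return entry[0]

    def lookup(self, container: str, name: str) -> Any:
        """Return an existing object; raises KeyError if it was never created."""
        with self._lock:
            entry = self._entries[(container, name)]
            entry[1] += 1
            return entry[0]

    def release(self, container: str, name: str) -> None:
        """Drop one reference and free the object when none remain."""
        with self._lock:
            entry = self._entries[(container, name)]
            entry[1] -= 1
            if entry[1] == 0:
                del self._entries[(container, name)]

    def refcount(self, container: str, name: str) -> int:
        with self._lock:
            entry = self._entries.get((container, name))
            return entry[1] if entry else 0
```

A provider op would call `lookup_or_create` with a factory that loads the multi-GB index once, and consumer ops would call `lookup` with the same key, so the index exists in memory exactly once.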
Data Movement
• Tensors not amenable to bioinfo data
• Leverage TF shared resources
• Implement reusable buffers
  • Stable memory use
  • Avoid syscalls
[Figure: Buffer Pool Op, [container, name], Pool]
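A reusable buffer pool of the kind described here can be sketched as follows. The class and its parameters are hypothetical and written in Python for illustration; Persona's pool lives inside the TensorFlow runtime. The point is that a fixed set of buffers is allocated once and recycled, so memory use stays flat and no new allocation is needed per chunk.

```python
import queue
from contextlib import contextmanager
from typing import Iterator


class BufferPool:
    """Fixed set of preallocated buffers handed out and returned for reuse."""

    def __init__(self, count: int, size: int) -> None:
        self._free: queue.Queue = queue.Queue()
        for _ in range(count):
            self._free.put(bytearray(size))  # allocate everything up front

    def acquire(self) -> bytearray:
        """Take a free buffer, blocking until one is returned if none is free."""
        return self._free.get()

    def release(self, buf: bytearray) -> None:
        """Hand a buffer back so another operator can reuse it."""
        self._free.put(buf)

    @contextmanager
    def borrow(self) -> Iterator[bytearray]:
        """Acquire a buffer for the duration of a with-block."""
        buf = self.acquire()
        try:
            yield buf
        finally:
            self.release(buf)
```

Usage would look like `with pool.borrow() as buf: ...`, with the operator writing decompressed data into `buf`. Because `acquire` blocks when the pool is empty, the pool size also caps how much data is in flight.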
Bioinformatics?
• Biology, computer science, math, statistics
• Started in the mid-90s with the Human Genome Project
• Broad field
  • Genomics, proteomics, systems biology
• This talk: Whole Genome Sequence (WGS) analysis
  • Reading the letters of your DNA (ATCG…)