B. Distributed Application Runtime Environment (sc16.supercomputing.org/sc-archive/tech_poster/...)

Post on 25-May-2020


Abstract

We present a distributed computing platform that effectively resolves significant roadblocks in Next-Generation Sequencing (NGS) data analytics associated with ever-growing and noisy data sets. The merits and advantages of the platform are exemplified by a specific application, de novo genome sequence reconstruction, a fundamentally critical task for studies of transcriptomes and metagenomes.

The core underlying framework is built upon the pilot system Radical Pilot, allowing a developer to rapidly implement pipelines in which access to and use of heterogeneous distributed computing environments are natively supported. This aspect is further strengthened by the distributed application runtime environment, which aims at effective management of massive distributed workloads and data processing tasks over heterogeneous resources. Consequently, the platform becomes an efficient distributed parallel task management system that leverages high-end HPC technologies as well as emerging Hadoop-based software models. Docker container technology is also available as an option, further broadening the repertoire of flexible and scalable runtime environment scenarios.

Based on the platform, a novel pipeline has been developed for metagenome and transcriptome research. The pipeline is capable of novel de novo genome sequence reconstruction with Multiple Assembler Multiple Parameter (MAMP).

Shayan Shams1,2,3, Nayong Kim1,2, Ming-Tai Ha4, Shantenu Jha4, Jian Tao1, Ramesh Subramanian2, Vladmir Chouljenko2, K. Gus Kousoulas2, Ram J. Ramanujam1, Seung-Jong Park1,3,*, Joohyun Kim1,2,*
1. Center for Computation and Technology, LSU, 2. Center for Experimental Infectious Disease Research, 3. Division of Computer Science & Eng., LSU, 4. Electrical and Computer Engineering, Busch Campus, Rutgers University

A. Hierarchical structure of heterogeneous distributed resources, their local runtime environments, and common programming models of NGS applications

III. Scalable Transcriptome and Metagenome Pipeline

Acknowledgement

We are grateful for collaborations and constructive discussions with Zhong Wang (JGI) and Xuan Guo (UTK). This work is supported in part by funding from NIH P20 GM103458-10. We also thank LBRN for partial support.

MAMP Performance

Distributed Application Workflows

Solution: Multiple Assembler Multiple Parameter (MAMP) facilitated by distributed computing techniques

References
1. Maddineni, Sharath, et al. "Distributed application runtime environment (DARE): a standards-based middleware framework for science-gateways." Journal of Grid Computing 10.4 (2012): 647-664.
2. Shams, S., Kim, N., Meng, X., Ha, M. T., Jha, S., Wang, Z., & Kim, J. A Scalable Pipeline For Transcriptome Profiling Tasks With On-demand Computing Clouds. HiCOMB 2016, IEEE IPDPS (2016).
3. Ragothaman A, Boddu SC, Kim N, Feinstein W, Brylinski M, Jha S, Kim J. Developing eThread pipeline using SAGA-pilot abstraction for large-scale structural bioinformatics. BioMed Research International. 2014 Jun 9;2014.

I. Distributed Platform Architecture

B. Distributed Application Runtime Environment

II. Genome Sequence Reconstruction

Question: Challenges in de novo transcript and metagenome assembly with NGS Data
• Genome reconstruction without a reference sequence is fundamental but challenging.
• NGS raw data are intrinsically noisy, so analyses are error-prone and difficult to evaluate.
• Datasets for NGS are often larger than the capacity of a single-node solution and will grow at a rapid pace.

Overall Pipeline Workflow

Summary
• A pipeline tool for transcriptome profiling tasks and metagenome assembly can
  • Process massive data volumes thanks to distributed computing models.
  • Increase the accuracy of analytics despite the noisy nature of the data, through a novel method with MAMP, a kind of ensemble learning method.
• Distributed Application Runtime Environment (DARE) is a set of programs built upon Radical Pilot and open source bioinformatics tools for NGS data analytics, primarily for effectively executing large-scale workloads and data processing.
• Heterogeneous computing resources such as our local HPCs (SuperMic (Intel cluster) and DELTA (IBM Power8 cluster)) and cloud systems (Amazon EC2, OpenStack-based Chameleon, and IBM Bluemix) are successfully integrated.
• Multiple common local runtime environments, such as those provided in HPCs with local schedulers (PBS/TORQUE and LSF), virtualization environments (EC2 and OpenStack), and Docker container technology, are seamlessly utilized in execution scenarios for the pipeline.

Conclusions
• Our distributed platform over heterogeneous HPCs and Clouds is an efficient solution for developing tools that are capable of overcoming challenges in NGS data analytics.
• The new pipeline for transcriptome and metagenome analysis, along with a novel approach for de novo sequence reconstruction with MAMP, underscores the benefits of the platform.

MAMP-based De Novo Genome Sequence Reconstruction Method
• Contig generation using multiple de novo assemblers (Ray, ABySS, Contrail, etc.) along with multiple k-mers
• Main strategy for merging those contigs in two steps: graph-based clustering and consensus-based merging. This scheme also allows various flexible optimization strategies for a broad range of related problems.
• Clustering is implemented with the weighted edge graph (DIME)
• Merging can be done with VMATCH/minimus2, CAP3, and DIME
• Promising method for controlling false negatives and false positives, but notably computationally demanding due to the massive heterogeneous tasks required
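The two-step merge listed above (graph-based clustering, then a per-cluster merge) can be illustrated with a minimal Python sketch. This is not the DIME, VMATCH/minimus2, or CAP3 implementation: the edge weight (Jaccard similarity of k-mer sets), the threshold, and the "keep the longest contig" merge are stand-in assumptions chosen only to show the overall shape of the procedure, and all function names are hypothetical.

```python
from itertools import combinations

def kmers(seq, k):
    """Return the set of k-mers contained in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def edge_weight(a, b, k):
    """Jaccard similarity of k-mer sets (assumed stand-in for a real overlap score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def cluster_contigs(contigs, k=4, threshold=0.3):
    """Group contigs into connected components of the weighted edge graph,
    keeping only edges whose weight is at least `threshold`."""
    parent = list(range(len(contigs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(contigs)), 2):
        if edge_weight(contigs[i], contigs[j], k) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, contig in enumerate(contigs):
        clusters.setdefault(find(i), []).append(contig)
    return list(clusters.values())

def merge_cluster(cluster):
    """Placeholder merge: keep the longest contig (ties broken lexicographically)."""
    return max(cluster, key=lambda s: (len(s), s))

def mamp_merge(contigs, k=4, threshold=0.3):
    """Cluster the contigs, then reduce each cluster to one representative."""
    return sorted(merge_cluster(c) for c in cluster_contigs(contigs, k, threshold))

if __name__ == "__main__":
    pooled = ["ACGTACGTTAGC", "ACGTACGTTAG", "GGGCCCAAATTT", "GGCCCAAATTT"]
    print(mamp_merge(pooled))
```

In a real pipeline the pooled contigs would come from every assembler and k-mer combination, and the merge step would build a consensus rather than pick a representative.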

Method Category | Assembler used (Clustering and Merging method) | Precision, Recall, F1 (Nucleotide-level) | Weighted k-mer recall, kc score | RSEM-eval (higher is better)
Standalone | Trinity | 0.51(7), 0.35(6), 0.42(6) | 0.84, 0.83(2) | -236,024,028.9(6)
Single Assembler Multiple K-mer | Ray (V-m2: VMATCH/minimus2) | 0.84(1), 0.26(7), 0.40(7) | 0.86, 0.86(1) | -239,866,658.0(7)
 | ABySS (V-m2) | 0.82(2), 0.42(5), 0.55(5) | 0.79, 0.78(4) | -225,154,884.4(4)
 | Contrail (V-m2) | 0.78(5), 0.43(3), 0.56(2) | 0.84, 0.83(2) | -222,219,796.1(1)
Multiple Assembler Multiple K-mer | Ray+Contrail (V-m2) | 0.78(5), 0.43(3), 0.56(2) | 0.78, 0.77(5) | -225,887,718.1(5)
 | Ray+Contrail+ABySS (V-m2) | 0.79(3), 0.44(1), 0.57(1) | 0.77, 0.76(6) | -224,433,916.1(3)
 | Ray+Contrail+ABySS (Cap3) | 0.79(3), 0.44(1), 0.56(2) | 0.77, 0.76(6) | -224,236,495.7(2)
 | Ray+Contrail+ABySS (MAMP) 1. Before Clustering | N/A, 0.47, N/A | 0.93, 0.92 |
 | Ray+Contrail+ABySS (MAMP) 2. Clustering/CAP3 | 0.79, 0.44, 0.56 | 0.78, 0.78 |
 | Ray+Contrail+ABySS (MAMP) 3. Clustering/V-m2 | 0.80, 0.40, 0.52 | 0.79, 0.78 |
RSEM-eval, Ray+Contrail+ABySS (MAMP): -223,943,006.9, -229,153,749.0
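As a quick check on how the F1 column relates to the other two, the standard definition of F1 as the harmonic mean of precision and recall can be applied to the reported values. The sketch below uses the rounded precision and recall from the Trinity and Ray rows; because the table reports rounded inputs, recomputed values for other rows may differ from the printed F1 by 0.01. The function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded (precision, recall) pairs copied from the Trinity and Ray rows.
rows = {"Trinity": (0.51, 0.35), "Ray (V-m2)": (0.84, 0.26)}

for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
# prints:
# Trinity: F1 = 0.42
# Ray (V-m2): F1 = 0.40
```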

Schematic of the pipeline for transcriptome/metagenome analysis

The pilot system (Radical Pilot) enables efficient task-level and data-level parallel execution of subtasks on distributed resources. Three different patterns show how the subtasks belonging to each stage of a pipeline are managed.
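The pilot idea can be pictured with a toy example that uses only the Python standard library. It does not use the Radical Pilot API and does not reproduce the three patterns in the figure: a fixed pool of workers stands in for a pilot that has already acquired resources, and (assembler, k-mer) subtasks of one stage are bound to free workers as they become available. The k-mer values, `run_subtask`, and `run_stage` are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product

ASSEMBLERS = ["Ray", "ABySS", "Contrail"]  # assemblers named on the poster
KMERS = [21, 31, 41]                       # hypothetical k-mer sizes

def run_subtask(assembler, k):
    """Placeholder for launching one assembly; returns a fake output label."""
    return f"{assembler}-k{k}.contigs"

def run_stage(assemblers, kmers, workers=4):
    """Run every (assembler, k) subtask of a stage on a fixed worker pool."""
    tasks = list(product(assemblers, kmers))
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the subtask that produced it.
        futures = {pool.submit(run_subtask, a, k): (a, k) for a, k in tasks}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

if __name__ == "__main__":
    outputs = run_stage(ASSEMBLERS, KMERS)
    print(len(outputs), "subtasks finished")
```

With a real pilot system the workers would be cores or nodes held by a pilot job on an HPC or cloud resource rather than local threads.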

Scale-out Performance of Distributed Application

Large-scale datasets can be processed with distributed parallel application models such as MPI (Ray and ABySS) and Hadoop (Contrail). Full support of these programming models, regardless of the local runtime environment (HPC/PBS, HPC/LSF, EC2, OpenStack, Bluemix, and Docker), is key to our platform.
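One way to picture this separation is a small lookup from assembler to programming model, combined with a check against the supported runtime environments. The assembler-to-model mapping and the list of environments are taken from the text; the function `plan_launch` and the launcher strings are hypothetical placeholders, not real command lines or platform code.

```python
# Programming model per assembler, as stated in the text.
PROGRAMMING_MODEL = {"Ray": "MPI", "ABySS": "MPI", "Contrail": "Hadoop"}

# Local runtime environments listed in the text.
RUNTIMES = {"HPC/PBS", "HPC/LSF", "EC2", "OpenStack", "Bluemix", "Docker"}

def plan_launch(assembler, runtime):
    """Describe how a task would be launched (placeholder launcher string only)."""
    if assembler not in PROGRAMMING_MODEL:
        raise ValueError(f"unknown assembler: {assembler}")
    if runtime not in RUNTIMES:
        raise ValueError(f"unsupported runtime environment: {runtime}")
    model = PROGRAMMING_MODEL[assembler]
    return {
        "assembler": assembler,
        "model": model,
        "runtime": runtime,
        # Hypothetical placeholder; a real platform would build an actual command here.
        "launcher": f"<{model} launcher for {runtime}>",
    }

if __name__ == "__main__":
    for name in PROGRAMMING_MODEL:
        print(plan_launch(name, "EC2"))
```

Keeping the programming model separate from the runtime environment means the same task description can be reused when the target resource changes.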

Heterogeneous Distributed Resources

An example of the pilot-based workflow with EC2. A wide range of options for scale-out and scale-across scenarios with multiple distributed resources and their corresponding local runtime environments is supported in a systematic manner.

Figure labels: Clustering, Transcript Identification, Pre-processing, Contig generation with de novo assembly, MAMP Reconstruction