B. Distributed Application Runtime Environment
sc16.supercomputing.org/sc-archive/tech_poster/...

Abstract

We present a distributed computing platform with which significant roadblocks in Next-Generation Sequencing (NGS) data analytics, associated with ever-growing and noisy data sets, can be effectively resolved. The merits and advantages of the platform are exemplified with the specific application of de novo genome sequence reconstruction, a fundamentally critical task for studies of transcriptomes and metagenomes.

The core underlying framework is built upon the pilot system Radical Pilot, which lets a developer rapidly implement pipelines in which access to and use of heterogeneous distributed computing environments are natively supported. This aspect is further strengthened by the distributed application runtime environment, which aims at effective management of massive distributed workloads and data processing tasks over heterogeneous resources. Consequently, the platform becomes an efficient distributed parallel task management system that leverages high-end HPC technologies as well as emerging Hadoop-based software models. The DOCKER container technology is also available as an option, which further broadens the repertoire of flexible and scalable runtime environment scenarios.

Based on the platform, a novel pipeline has been developed for metagenome and transcriptome research. The pipeline is capable of de novo genome sequence reconstruction with a novel method, Multiple Assembler Multiple Parameter (MAMP).

Shayan Shams1,2,3, Nayong Kim1,2, Ming-Tai Ha4, Shantenu Jha4, Jian Tao1, Ramesh Subramanian2, Vladmir Chouljenko2, K. Gus Kousoulas2, Ram J. Ramanujam1, Seung-Jong Park1,3,*, Joohyun Kim1,2,*
1. Center for Computation and Technology, LSU, 2. Center for Experimental Infectious Disease Research, 3. Division of Computer Science & Eng., LSU, 4. Electrical and Computer Engineering, Busch Campus, Rutgers University

A. Hierarchical structure of heterogeneous distributed resources, their local runtime environments, and common programming models of NGS applications

III. Scalable Transcriptome and Metagenome Pipeline

Acknowledgement

We are grateful for collaborations and constructive discussions with Zhong Wang (JGI) and Xuan Guo (UTK). This work is supported in part by funding from NIH P20 GM103458-10. We also thank LBRN for partial support.

MAMP Performance

Distributed Application Workflows

Solution: Multiple Assembler Multiple Parameter (MAMP), facilitated by distributed computing techniques


I. Distributed Platform Architecture

B. Distributed Application Runtime Environment

II. Genome Sequence Reconstruction

Question: Challenges in de novo transcript and metagenome assembly with NGS Data
• Genome reconstruction without a reference sequence is fundamental but challenging.
• NGS raw data are intrinsically noisy, so analyses are error-prone and difficult to evaluate.
• Data sets for NGS are often larger than the capacity of a single-node solution and will grow at a rapid pace.

Overall Pipeline Workflow

Summary
❖ A pipeline tool for transcriptome profiling tasks and metagenome assembly can
  • Process massive data volumes thanks to distributed computing models.
  • Increase the accuracy of analytics on inherently noisy data through a novel method, MAMP, a kind of ensemble learning method.
❖ Distributed Application Runtime Environment (DARE) is a set of programs built upon Radical Pilot and open source bioinformatics tools for NGS data analytics, primarily for effectively executing large-scale workloads and data processing.
❖ Heterogeneous computing resources such as our local HPCs (SuperMic (Intel cluster) and DELTA (IBM Power8 cluster)) and cloud systems (Amazon EC2, OpenStack-based Chameleon, and IBM Bluemix) are successfully integrated.
❖ Multiple common local runtime environments, such as those provided in HPCs with local schedulers (PBS/TORQUE and LSF), virtualization environments (EC2 and OpenStack), and the DOCKER container technology, are seamlessly utilized in execution scenarios for the pipeline.

Conclusions
❖ Our distributed platform over heterogeneous HPCs and clouds is an efficient solution for developing tools that are capable of overcoming challenges in NGS data analytics.
❖ The new pipeline for transcriptome and metagenome analysis, along with a novel approach for de novo sequence reconstruction with MAMP, underscores the benefits of the platform.

MAMP-based De Novo Genome Sequence Reconstruction Method
• Contig generation using multiple de novo assemblers (Ray, ABySS, Contrail, etc.) along with multiple k-mers.
• The main strategy for merging those contigs has two steps: graph-based clustering and consensus-based merging. This scheme also allows various flexible optimization strategies for a broad range of related problems.
• Clustering is implemented with the weighted edge graph (DIME).
• Merging can be done with VMATCH/minimus2, CAP3, and DIME.
• A promising method for controlling false negatives and false positives, but notably computationally demanding due to the massive heterogeneous tasks required.
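The two-step strategy above (graph-based clustering, then merging) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the Jaccard similarity of k-mer sets as the edge weight, the `min_weight` threshold, and the function names are hypothetical, and the merge step simply keeps the longest contig as a stand-in for consensus-based merging. It is not the DIME, VMATCH/minimus2, or CAP3 implementation.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_weighted_graph(contigs, k=4, min_weight=0.5):
    """Connect contigs whose k-mer sets overlap strongly.

    The edge weight is the Jaccard similarity of the two k-mer sets
    (an illustrative choice, not necessarily the weight DIME uses).
    """
    profiles = [kmers(c, k) for c in contigs]
    edges = {}
    for i, j in combinations(range(len(contigs)), 2):
        union = profiles[i] | profiles[j]
        if not union:
            continue
        weight = len(profiles[i] & profiles[j]) / len(union)
        if weight >= min_weight:
            edges[(i, j)] = weight
    return edges


def cluster(n, edges):
    """Group node indices 0..n-1 into connected components (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


def merge_cluster(contigs, members):
    """Stand-in for consensus merging: keep the longest contig."""
    return max((contigs[m] for m in members), key=len)


# Toy contigs, as if produced by different assemblers and k-mer sizes.
contigs = ["ACGTACGTGGA", "ACGTACGTGG", "TTTTCCCCAAAA", "TTTCCCCAAAA"]
edges = build_weighted_graph(contigs)
clusters = cluster(len(contigs), edges)
merged = [merge_cluster(contigs, members) for members in clusters]
```

In this toy input the first two contigs and the last two contigs share most of their 4-mers, so they fall into two clusters, and each cluster is reduced to a single representative sequence.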

| Method Category | Assembler used (Clustering and Merging method) | Precision, Recall, F1 (Nucleotide-level) | Weighted k-mer recall, kc score | RSEM-eval (higher is better) |
| --- | --- | --- | --- | --- |
| Standalone | Trinity | 0.51 (7), 0.35 (6), 0.42 (6) | 0.84, 0.83 (2) | -236,024,028.9 (6) |
| Single Assembler Multiple K-mer | Ray (V-m2: VMATCH/minimus2) | 0.84 (1), 0.26 (7), 0.40 (7) | 0.86, 0.86 (1) | -239,866,658.0 (7) |
| Single Assembler Multiple K-mer | ABySS (V-m2) | 0.82 (2), 0.42 (5), 0.55 (5) | 0.79, 0.78 (4) | -225,154,884.4 (4) |
| Single Assembler Multiple K-mer | Contrail (V-m2) | 0.78 (5), 0.43 (3), 0.56 (2) | 0.84, 0.83 (2) | -222,219,796.1 (1) |
| Multiple Assembler Multiple K-mer | Ray+Contrail (V-m2) | 0.78 (5), 0.43 (3), 0.56 (2) | 0.78, 0.77 (5) | -225,887,718.1 (5) |
| Multiple Assembler Multiple K-mer | Ray+Contrail+ABySS (V-m2) | 0.79 (3), 0.44 (1), 0.57 (1) | 0.77, 0.76 (6) | -224,433,916.1 (3) |
| Multiple Assembler Multiple K-mer | Ray+Contrail+ABySS (Cap3) | 0.79 (3), 0.44 (1), 0.56 (2) | 0.77, 0.76 (6) | -224,236,495.7 (2) |
| | Ray+Contrail+ABySS (MAMP): 1. Before Clustering; 2. Clustering/CAP3; 3. Clustering/V-m2 | N/A, 0.47, N/A; 0.79, 0.44, 0.56; 0.80, 0.40, 0.52 | 0.93, 0.92; 0.78, 0.78; 0.79, 0.78 | -223,943,006.9; -229,153,749.0 |
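The F1 values reported above can be related to precision and recall with a short Python sketch. It assumes the poster uses the standard definition of F1 as the harmonic mean of precision and recall; the function name is hypothetical. Because the inputs are the rounded values from the table, a recomputed F1 may differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Standalone Trinity row: precision 0.51, recall 0.35, reported F1 0.42.
trinity_f1 = f1_score(0.51, 0.35)
```

The harmonic mean penalizes an imbalance between the two inputs, which is why the Ray row, with the highest precision but the lowest recall, also has the lowest reported F1.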

Schematic of the pipeline for transcriptome/metagenome analysis

The pilot system (Radical Pilot) for efficient task-level and data-level parallel executions of subtasks with distributed resources. Three different patterns for how the subtasks belonging to each stage of a pipeline are managed
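The pilot idea of acquiring resources once and then feeding many subtasks to them can be sketched with Python's standard `concurrent.futures` module. This is only an analogy under stated assumptions: the thread pool stands in for a pilot, the stage functions `preprocess` and `annotate` are hypothetical, and none of this is the Radical Pilot API. It shows one simple arrangement (stages run in order, subtasks within a stage run concurrently), not the three patterns from the figure.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(executor, task, inputs):
    """Run one pipeline stage: apply task to every input concurrently."""
    # executor.map returns results in the same order as the inputs.
    return list(executor.map(task, inputs))


def run_pipeline(stages, inputs, workers=4):
    """Execute stages in order; subtasks inside a stage run concurrently.

    The worker pool plays the role of a pilot: it is acquired once and
    reused for every subtask instead of being requested per subtask.
    """
    data = list(inputs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for task in stages:
            data = run_stage(pool, task, data)
    return data


# Hypothetical stand-in stages operating on short strings.
def preprocess(read):
    """Trim whitespace and normalize case."""
    return read.strip().upper()


def annotate(read):
    """Pair each read with its length."""
    return (read, len(read))


results = run_pipeline([preprocess, annotate], [" acgt ", "ttgaa\n"])
```

Each stage finishes before the next begins, so a later stage always sees the complete output of the earlier one.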

Scale-out Performance of Distributed Application

Large-scale data sets can be processed with distributed parallel application models such as MPI (Ray and ABySS) and Hadoop (Contrail). Full support of these programming models, regardless of the local runtime environment (HPC/PBS, HPC/LSF, EC2, OpenStack, Bluemix, and DOCKER), is the key for our platform.
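As a rough illustration of why the workload is heterogeneous, the following Python sketch enumerates one assembly task per combination of assembler and k-mer size and tags each task with the programming model named above. The k-mer sizes, the dictionary layout, and the function name are assumptions made for the example, not values or structures taken from the poster.

```python
from itertools import product

# From the poster: Ray and ABySS are MPI applications, Contrail runs on Hadoop.
PROGRAMMING_MODEL = {"Ray": "MPI", "ABySS": "MPI", "Contrail": "Hadoop"}


def build_tasks(assemblers, kmer_sizes):
    """Return one task description per (assembler, k-mer size) pair."""
    return [
        {"assembler": name, "k": k, "model": PROGRAMMING_MODEL[name]}
        for name, k in product(assemblers, kmer_sizes)
    ]


# Hypothetical k-mer sizes chosen only for illustration.
tasks = build_tasks(["Ray", "ABySS", "Contrail"], [21, 31, 41])
hadoop_tasks = [t for t in tasks if t["model"] == "Hadoop"]
```

Three assemblers and three k-mer sizes already give nine independent tasks spread over two programming models, and the count grows multiplicatively as assemblers or parameters are added.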

Heterogeneous Distributed Resources

An example of the pilot-based workflow with EC2. A wide range of options for scale-out and scale-across scenarios, with multiple distributed resources and corresponding local runtime environments, is supported in a systematic manner.

Pipeline stage labels from the workflow figure: Clustering; Transcript Identification; Pre-processing; Contig generation with de novo assembly; MAMP Reconstruction
