F. Alex Feltus, Ph.D....2019/03/06  · Just out of the Box: Slurm SciApp A collection of Docker...


Page 1

It takes [gene, dtn, human] networks to understand biological systems

F. Alex Feltus, Ph.D.
Clemson University Dept. of Genetics & Biochemistry (Professor)
CU-MUSC Biomedical Data Science & Informatics Program (Member)
CU Center for Human Genetics (Member)
Allele Systems LLC (CEO)
Internet2 Board of Trustees (Member)
[email protected]

I2 Global Summit @ March 7 2018 1.15pm EST

Page 2

My BIOLOGY Lab = 1/3 Animal; 1/3 Plant; 1/3 Computational

Angiosperms

Vertebrates

Bioinformatics/Cyberinfrastructure

Page 3

Biological Measurements are Getting Too Big For Paper and Laptops

DNA Sequencing Costs Are Dropping

DNA Databases Are Swelling

Source: NCBI SRA

Source: NIH

20th Century Storage Rack

21st Century Datastore

25.7 PB in 10 yrs

Page 4

Quick Overview of Genomics

774,746 Words; 26 letter alphabet

58,721 Words (Genes); 4 letter alphabet (ATGC)

John 3:16: For God so loved the world that he gave his one and only Son, that whoever believes in him shall not perish but have eternal life.

John 3:16*: For God so loved the world that he gave his one and only Sun, that whoever believes in him shall not perish but have eternal life.

One letter change = confused congregation!

One letter change = higher risk for breast and ovarian cancer!

AATGGAGCCACATAACACATTCAAACTTACTTGCAAAATAT

AATGGAGCCACATAACACATGCAAACTTACTTGCAAAATAT
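To make the "one letter change" idea concrete, here is a minimal sketch that compares two equal-length strings and reports where they differ. The function name `point_differences` is hypothetical, and real variant calling works on aligned reads rather than simple string comparison.

```python
from __future__ import annotations


def point_differences(reference: str, variant: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, reference letter, variant letter) for each mismatch."""
    if len(reference) != len(variant):
        raise ValueError("this sketch only compares equal-length strings")
    return [
        (i, ref_letter, var_letter)
        for i, (ref_letter, var_letter) in enumerate(zip(reference, variant))
        if ref_letter != var_letter
    ]


# The two DNA strings shown on the slide
reference = "AATGGAGCCACATAACACATTCAAACTTACTTGCAAAATAT"
variant = "AATGGAGCCACATAACACATGCAAACTTACTTGCAAAATAT"

dna_changes = point_differences(reference, variant)
text_changes = point_differences("one and only Son", "one and only Sun")

print(dna_changes)   # the single T -> G substitution
print(text_changes)  # the single o -> u substitution
```

The same function handles both the 26-letter and the 4-letter alphabet, which is the point of the analogy: a genome can be treated as text.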

Page 5

What is a genome? What is a gene?

Human Genome “Facts”
Initiation instructions for you!
(3.2 billion*2) ATGC
23 chromosome pairs (23andME!)
24 unique chromosomes (1-22, X, Y)
24 strings (Chr1 = 250 million ATGCN)
58,721 gene intervals

GeneGGTTTCTATCTTACTAAAGGATATTTCAGAAAATCTATATTCACTGAGGAGGATGATAATTGGGTCTACTAACATTGAGACTGAACTGAGGCCCAGCAATAATTTAAACTTATTATCCTTTGAAGATTCAACTACTGGGGGAGTACAACAGAAACAAATTAGAGAACATGAAGTTTTAATTCACGTTGAAGATGAAACATGGGACCCAACACTTGATCATTTAGCTAAACATGATGGAGAAGATGTACTTGGAAATAAAGTGGAACGAAAAGAAGATGGATTTGAAGATGGAGTAGAAGACAACAAATTGAAAGAGAATATGGAAAGAGCTTGTTTGATGTCGTTAGATATTACAGAACATGAACTCCAAATTTTGGAACAGCAGTCTCAGGAAGAATATCTTAGTGATATTGCTTATAAATCTACTGAGGTACTAAATAAAGAGGAAGCACATTTTTAGTTATTAGTAGGTTCTGGCAGACTTTATTCCCGTAAAGAGACAGATAGTAAATATTTTAGGCTTTGTGGGACATACAGTCTCTGTTGCAGTAACTCAGTTCTGTTATTGTAGTGTGAAAGCAGCCACAGACAAATGTAAACAAATGTGCCTGTCTTTCAATAAAACTTTATCTACAAATACAGATAGTGGGCCACATTTGGCTTGCGAGCTGTAGTTTGCTGTTCCTGACTTAGTATGATATGCAGCATTGTGTAGTGGAAAGATTTTATAGTATCAACAGACATTTGTAGGAGTTTGCTATTTTATAATTTCGTATGCTTTTTTCATAATAGATGTCTCTGCTTTTTGTGGATTCTCTTCATATCTACTAGTGTTTACTCTTCTGACTTTGATTTTTGGGGGTGTATTTTATTTCTTTATACAGAATTTTTTTCTTTTTTCTTTCTTTTTTTGAGACAAGGTTTTGCTGTGaTCACCCCAGCTGGAGTGCAGTGGTGGGATCTTGGCCCCCTGCGGTCTTCAACTCCTGGGCTTAAGCGATCCTGTCACCTCAGCCTCCCTAGTAGCTGGGACCACAGGCCTGCACCACTACACTCAGCTAATTTTTGTATTTTTTGTAGAGATGGGGTTTCGCTCTGTTGCCCAGACTGGTCTCCAACTCCTGGACTCAAGTGATCCACCTGCCTTGGCCTCCCAAAGTGTTGGAATTATAGGTGTGAGTCACTGTGCCTGGCCCAGAATTTCTTTACTTGCTCATGTGAGAGAGTTTCTAATAAACTCCTGATGCTTTTGCTTGACACTATAGTTTTACTGTCATCTCAGAGTCAGATTTTTTCCTTAATTTCTTCATTAAATTGATTTCTTAATTTCTCATTTGTTTTGCATTAGTTTACAAAATTACTTAAGATGCCTATATTCTCTGAAATTTTATATTTTATTTCACGATAAATAAGTTTAAAAAATTAGATTTGACCATTTAATTAGAGTTATTTCCCCATCATTTTTCTAATGGCAACAATAATTTTAGCATTTGAGCAATTCTCTGTATTTTGGTAACTATTAGTAATCACTTTTATCAGTATAAACATGCATTTCATTTTAAAATGGTTTTCAAAATATTCTCATACCAACCATTCATAAAAGATGTTCCCCTTCTGGGAAATATCAGTTTTTGCTAACATCTGCATTTCCCTTTGATTTTCTAGGAATTTCTTATTAGGTATACACTTTAAGTGATATGTTTCATATAATATTATAGGATTATCAGGATTTCATGATTTTTTTTTTTTGAGATGGAATCTATCTCTGTCGCCCAAGCTGGAGTGCCGTGGCGCAGTCTTGGCTCACTACAACCTCTGCCTCCCCGGTTCAAGTGATTCTCCTACCTCAGCCTCCCGAGTCACTGGGATTACAGGTGCCCGCCACCACACCCGGCTAATTTTTGTATTTTTAGTAGAGAAAGTGCTGGGATTACAGGTGTGAGCCACTGCACCTGGCCTTGGATTTCGTGATTTGACTGACATCTTTTACACCAAT
TATTGCCATTTTTTTATTGGGTACTAATTATTAGTATTGCTTTCTTTATTCGTTAAATAAGAAGTAAAAGAACTTTTAAAAAATAAAATAACTTTTTTTCCTAGTACAAAGCTTATTTTCAATTTTTTTTATTGCTTAAATTTTGCAGGTTAATTAAGGCTGTACTTCTTTTTCCTACTTCCTGTTTAAATATTTGGCATTTTTTGAGGGTTTTCTTTCTATGGATTTTTGTGGGGGATCCCATCTCGTTGGACACTTGGTTCCTGGAGATTGCCAAATTATTGAAAAGTTTCCTCATACCCAAACATTGGCAATAGACTGAATAAATAACATCTCTGACATTATTTTTCAATTATATAGAGAATGCTTTTTGTAAAATACTGATTTATTTAAAGATATTACAATCATAGTAGTTTTTTGCAATCTTTTTTAAATCAGTGGATGTCAAGGAACAACTATTGTTCTTTGTACGGCATTTTTCAACATTGTGAAGAGTTCTTATACTCAAAAAGTTTGGGAGACACAGTGTGACATTGTTTAACCTGAGTTGAACTTGTCATTTTGTATTCTTGTTTAGAGCAGGGATTTTATTTTGTCTCTTTATCTAACTTTGTATTCCTAACTATTTCTTTCTTTCTCTTTGGTTCTTTGTTTCTTTGTTCATTCTTCTCTCTCTCCCTTTTTTATCTATCATATAAATATGTAAATATATATTATATATCCATTAGGGAAGAGGAAGTTGTCTAAAGATATCTAGTATATAGGAGCTTTGTTCTGCAGAAAAGAGGATATGAAGTCAATTATATTGGAAATTAATGCTTAATACTTTTTTTAAAGCATTTATCTCCCAATGATAATGAAAACGATACGTCCTATGTAATTGAGAGTGATGAAGATTTAGAAATGGAGATGCTTAAGGTATGTTTACAATTATAAAAATATTACTTCAAGTTCTTTCCAAAGGACATTTAATTAAGTAAAATATTAACTAATTCTAAACTAGGTTCTACCACAATGAAATTGCTACTAATTATGTAACATTAGATTTCACATTTTCCAATTCATGTTTCTTTCATGTAGTCTATAAATAATGGGTTAGAGGTAATTTACTAATTTTAAATGTGCTTTCCTTGTTCCTTCTTATTTTTTATTTTGAAGACAGGGTCTCACTCTATCACCCAGGCTGGAGTGCAGTTGCTCCATCTCGGCTCACTGCAACCTTCACCTCCTGGGCTCAAGTGATCCTCCTGCATAAGCCTCCCGAGTAGCTGGGATTATGGGCGTGCACCACCATGCCCGGCTAATTTTTATAGTTTTAGTAGAGACAGGGTCTCACCATGTTGTCCTTGCTGGTCTCGAACTCTTTACCTCAAGTAATCCACCCGCCTTGGCCTCCCAAAGTGCTGGGATTACAGGTGTGAGCCACCACACCTGGCCTTGGATTTCATGATTTGACTGACATCTTTTACACAAATTATTGCCATTTTTTTATTGGGTACCAATTATTAGTATTGTTTTCTTTATTCGTTAAATAAGAAGTAAAATAACTTTTTTTTAAAAAAAGAACTTTTTTCCTAGTAAAAAGCTTATTTTCAATTTTTTTATTGCTTAAATTTTGCAGGTTAATTAAGGCTGTGCTTCCTTTTCCTACTTTCTGTTTAAATGTTTGGCATTTTTTGAGGGTTTTCTTTCTATGGATTTTTGTGGGGGATCCCATCTTGTTGGACACTTGGTTCCTGGAGATTGCCAAATTATTGAAAAGTTTCCTCATACCCAAACATTGGCAGTAGACTGAATAAATAACATCTCTGACACTATTTTTCAATTATATAGAGAATGCTATTTGTAAAATACTGATTTATTTAAAGATATTACAGTCAGTAGTTTTTTGCAATCTTTTTAAAATCAGTGGATGTCAAAAACAGCTATTGTTCTTTGTACAGCATTTTTCAACATTGTGAAGAGTTCTTATACTCAAAAAGTTTGGGAGACACAGTTTG
ATATTGTTTAACCTGAGTTGAACTTGTCATTTTGTATTCTTGTTTAGAGCAGGGATTTTATTTTGTCTCTTTATCTAACTTTGTATTCCTATTTCTTTCTTTCTCTTTGGTTCTTTGTTTCTTTGTTCATTCTTCTCTCTCTCCCTTTTTTATCTATCATATAAATATGTAAATATATATTATATATCCATTAGGGAAGAGGAAGCTGTCTAAAGATATCTAGTATATAGGAGCTTTGTTCTGCAGAAAAGAGGATATGAAGTCAATTATATTGGAAATTAATGCTTAATACTTTTTTTTAAAGCATTTATCTCCCAATGATAATGAAAACGATACGTCCTATGTAATTGAGAGTGATGAAGATTTAGAAATGGAGATGCTTAAG
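The "24 strings" and "gene intervals" view of the genome maps directly onto a simple data structure. The sketch below is illustrative only: the toy sequences, the coordinate convention (0-based, end-exclusive), and the names `GeneInterval` and `gene_sequence` are assumptions, not part of the talk.

```python
from typing import Dict, NamedTuple


class GeneInterval(NamedTuple):
    """A gene described as an interval on one chromosome string."""
    chromosome: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive (assumed convention)
    name: str


# Toy genome: real chromosomes are tens to hundreds of millions of letters long
toy_genome: Dict[str, str] = {
    "chr1": "NNNNAATGGAGCCACATAACNNNN",
    "chrX": "TGTGGGCACAGGAC",
}


def gene_sequence(genome: Dict[str, str], gene: GeneInterval) -> str:
    """Slice the gene's letters out of its chromosome string."""
    return genome[gene.chromosome][gene.start:gene.end]


toy_gene = GeneInterval(chromosome="chr1", start=4, end=20, name="toy_gene")
extracted = gene_sequence(toy_genome, toy_gene)
print(toy_gene.name, extracted)
```

Annotation formats such as GFF3 store essentially this: a chromosome name plus start and end coordinates for each feature.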

Page 6

Chromosome Information is Accessed with Enzymes; Blu Ray Information is Accessed with Lasers

http://www.bigbrownboxblog.com.au

http://www.rcsb.org/pdb/101/motm.do?momID=40

The DNA molecule can be copied (Replication), transcribed (Transcription), and converted into protein (Translation)

DNA

PROTEIN
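The three operations named on this slide can be sketched on strings. This is a deliberately simplified model: it treats the input as the coding strand, ignores introns and regulation, and uses the standard genetic code table. The function names and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complementary_strand(dna: str) -> str:
    """Replication, simplified: build the reverse-complement strand."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(coding_dna: str) -> str:
    """Transcription, simplified: the mRNA matches the coding strand with U for T."""
    return coding_dna.replace("T", "U")


def translate(mrna: str) -> str:
    """Translation: read codons three letters at a time until a stop codon (*)."""
    coding = mrna.replace("U", "T")
    protein = []
    for i in range(0, len(coding) - 2, 3):
        amino_acid = CODON_TABLE[coding[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


toy_dna = "ATGGCCTGGTAA"  # toy sequence, not a real gene
copy_strand = complementary_strand(toy_dna)
mrna = transcribe(toy_dna)
protein = translate(mrna)
print(copy_strand, mrna, protein)
```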

Page 7

Change DNA and Change The Person: Coffee Example

https://www.23andme.com/health/Caffeine-Metabolism/

TGTGGGCACAGGAC

TGTGGGCACAGGAC

TGTGGGCCCAGGAC

TGTGGGCACAGGAC

TGTGGGCCCAGGAC

TGTGGGCCCAGGAC

The sequence of DNA contains information on how an organism responds to the world.

Page 8

Public Data Source: The Cancer Genome Atlas

https://portal.gdc.cancer.gov/

PubMed “the cancer genome atlas” yields 5,470 publications

In April 2018, The Cancer Genome Atlas (TCGA) Research Network marked the end of the TCGA program by publishing the Pan-Cancer Atlas: a collection of cross-cancer analyses delving into overarching themes on cancer, including cell-of-origin patterns, oncogenic processes and signaling pathways. The data remains available to the public for further mining through the Genomic Data Commons. In light of this milestone, the following events will provide further discussion of our current understanding of the disease, how basic research is changing patient treatment, and the future of multi-omic studies in cancer.

Page 9

We mine genomics data for complex gene expression patterns

William L. Poehlman, James Hsieh, and F. Alex Feltus. “Linking Binary Gene Relationships to Drivers of Renal Cell Carcinoma Reveals Convergent Function in Alternate Tumor Progression Paths.” Scientific Reports (in press)

Kidney Cancer Biomarker Network

~1000 kidney tumors

Page 10

What are the complex genetic interactions underlying Autism and other IDs?

Emily L. Casanova, Zachary Gerstner, Julia L. Sharp, Manuel F. Casanova, F. Alex Feltus. “Widespread Genotype-Phenotype Correlations in Intellectual Disability.” Frontiers in Psychiatry. 29;9:535, 2018. https://doi.org/10.3389/fpsyt.2018.00535

SPARK (Simons Foundation Powering Autism Research for Knowledge)

{Sensitive human data that needs HPC}

https://sparkforautism.org/

Page 11

We like plants too!

https://genelab.nasa.gov/about/

Josh Vandenbrink @ Louisiana Tech
John Kiss @ UNC-Greensboro

Julia Frugoli lab @ Clemson University

Organism: Medicago truncatula
Trait: In plant fertilizer production.

http://lasernode.org/

Organism: Arabidopsis thaliana
Trait: <1g plant development

Page 12

My Group Processes Data at the Petascale on Democratized Systems

SciDAS (Scientific Data Analysis at Scale): NSF CC*

Distributed Cloud Computing

In 2017 on OSG:
8.43 Million Wall Hours (962 years on a laptop)
4.50 Million CPU Hours
8.92 Million Jobs
16.6 Million Transfers
4.07 PB

PRP/TNRP Kubernetes Cluster

Clemson Palmetto Cluster

11/2018 Top500
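The "962 years on a laptop" figure on this slide follows from the wall-hour total. A quick check, assuming one wall hour equals one hour of a laptop running nonstop and a 365-day year (both are assumptions of this sketch, not statements from the talk):

```python
HOURS_PER_YEAR = 24 * 365  # assumes a 365-day year

wall_hours = 8.43e6  # 2017 OSG wall hours reported on the slide
laptop_years = wall_hours / HOURS_PER_YEAR

print(f"{laptop_years:.0f} years")
```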

Page 13

Data Grids Feed HPC workflows with Tera-/Petabytes of DNA Data

Diagram labels: DNA; Image Analysis; FASTQ raw; Preprocessing; FASTQ clean; Genome Alignment; BAM; Counting; Normalize; Gene Expression Matrix (GEM); Reference Genome; GFF3; Public Databases (NCBI; EBI; GDC); BDSS; iRODS; NDN; hICN

Diverse Downstream Workflows
• Gene Expression Quantification
• Differential Gene Expression Analysis
• DNA Polymorphism Discovery
• Biomarker Analysis
• Gene Co-expression Analysis
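The last two steps of the diagram, Counting and Normalize, turn per-gene read counts into a Gene Expression Matrix (GEM). The slide does not say which normalization is used, so the sketch below substitutes counts per million (CPM) purely as an illustration; the sample names, gene names, and counts are made up.

```python
from typing import Dict

Counts = Dict[str, Dict[str, float]]  # {sample: {gene: value}}


def counts_per_million(raw_counts: Counts) -> Counts:
    """Scale each sample so its counts sum to one million (CPM).

    CPM is a stand-in here; the normalization used in the talk is not stated.
    """
    gem: Counts = {}
    for sample, gene_counts in raw_counts.items():
        total = sum(gene_counts.values())
        gem[sample] = {
            gene: (count * 1_000_000 / total if total else 0.0)
            for gene, count in gene_counts.items()
        }
    return gem


# Made-up read counts for two samples and two genes
raw_counts = {
    "tumor_1": {"geneA": 300, "geneB": 700},
    "normal_1": {"geneA": 50, "geneB": 150},
}

gem = counts_per_million(raw_counts)
for sample, row in gem.items():
    print(sample, row)
```

Scaling by library size makes samples sequenced at different depths comparable, which matters before any of the downstream workflows listed on this slide.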

Page 14

iRODS Distributed Datagrid
• Integrated Rule Oriented Data System (iRODS) provides a distributed unified namespace over SciDAS storage infrastructure across Clemson, RENCI and WSU (4.2 petabyte Data Grid; 1232 indexed genomes; GEM-GCN Storage)
• iRODS enables policy-driven management critical to data-sharing collaborations in SciDAS

Virtualization System Metadata to encode rich information

Rule engine programmed with rules to enact policies

Data Federation

Terrell Russell, Michael Stealey, Jason Coposky, Ben Keller, Claris Castillo, Ray Idaszak, Alex Feltus. Distributing the iRODS Catalog: A Way Forward. iRODSUGM 2017 Proceedings. Page 35, 2017.

Page 15

Pushing Genomics Data onto Named Data Networking (NDN) Framework
• 1,232 Genomes pulled from iRODS-SciDAS Data Grid (NSF CC* 1659300)
• >500 Gigabytes in Aggregate (tar-gz compressed)
• WSU indexed; CSU-CU moved to NDN
• Naming: ‘Genus->Species->Intraspecies->Assembly->Files’
• NDN Caching near Genomics Workflows!

Hierarchical Named Data Format
[genus][species]{[infraspecific name]}/[assembly_name] that contains these files:
> [genus]_[species]{_[infraspecific name]}-[assembly_name].{n}.ht2
> [genus]_[species]{_[infraspecific name]}-[assembly_name].gff3
> [genus]_[species]{_[infraspecific name]}-[assembly_name].fasta
> [genus]_[species]{_[infraspecific name]}-[assembly_name].gtf
> [genus]_[species]{_[infraspecific name]}-[assembly_name].Splice_sites
> [genus]_[species]{_[infraspecific name]}-[assembly_name].meta.json

Genome Example
/scidasZone/sysbio/PynomeGenomes/Genome/Zymoseptoria_tritici/MG2:
> Zymoseptoria_tritici-MG2.1.ht2
> Zymoseptoria_tritici-MG2.2.ht2
> Zymoseptoria_tritici-MG2.3.ht2
> Zymoseptoria_tritici-MG2.4.ht2
> Zymoseptoria_tritici-MG2.5.ht2
> Zymoseptoria_tritici-MG2.6.ht2
> Zymoseptoria_tritici-MG2.7.ht2
> Zymoseptoria_tritici-MG2.8.ht2
> Zymoseptoria_tritici-MG2.fa
> Zymoseptoria_tritici-MG2.gff3
> Zymoseptoria_tritici-MG2.gtf
> Zymoseptoria_tritici-MG2.meta.json
> Zymoseptoria_tritici-MG2.Splice_sites
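The naming template can be expressed as a small function that builds the collection path and file names from the genus, species, optional infraspecific name, and assembly. This is a sketch under assumptions: the function name `genome_names` and the `n_index_files` parameter are hypothetical, the root path is taken from the example listing, and the sketch follows the template's `.fasta` suffix even though the example listing shows `.fa`.

```python
from typing import List, Optional, Tuple

# Root collection as shown in the example listing
ROOT = "/scidasZone/sysbio/PynomeGenomes/Genome"
# Non-index suffixes from the naming template
SUFFIXES = ["gff3", "fasta", "gtf", "Splice_sites", "meta.json"]


def genome_names(
    genus: str,
    species: str,
    assembly: str,
    infraspecific: Optional[str] = None,
    n_index_files: int = 8,  # the example listing shows 8 .ht2 files
) -> Tuple[str, List[str]]:
    """Return (collection path, file names) following the hierarchical template."""
    organism = "_".join(part for part in (genus, species, infraspecific) if part)
    prefix = f"{organism}-{assembly}"
    directory = f"{ROOT}/{organism}/{assembly}"
    files = [f"{prefix}.{n}.ht2" for n in range(1, n_index_files + 1)]
    files += [f"{prefix}.{suffix}" for suffix in SUFFIXES]
    return directory, files


directory, files = genome_names("Zymoseptoria", "tritici", "MG2")
print(directory)
for name in files:
    print(">", name)
```

Because every name is derived from the same four fields, a workflow can request a genome by name from the nearest NDN cache without knowing where the bytes are stored.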

Fungal Genet Biol. 2015 Jun; 79: 17–23.

Christos Papadopoulos (CSU)
Susmit Shannigrahi (CSU)
Chengyu Fan (CSU)
Stephen Ficklin (WSU)

https://named-data.net/

Page 16

NDN Data Is Discovered via a Web-Based UI and Moved (or Copied from Cache) to an Endpoint on the Network

http://atmos-sac.es.net/genome/

Page 17

We are exploring ways to move huge genomics datasets onto the I2/Cisco Content-Centric Networking (CCN) Testbed using Hybrid Information-Centric Networking (hICN)

https://www.cisco.com/c/dam/en_us/solutions/industries/docs/education/information-centric-networking-education.pdf

https://fd.io/2019/02/introducing-hybrid-information-centric-networking-hicn-a-new-fd-io-project/

CCN Mailing List: Cisco; Internet2; Clemson; Colorado State; Florida International University; George Washington; NIST; Northwestern; Rutgers; U. Michigan; U. Utah; U. Washington; UC-Santa Cruz; UT-San Antonio

Page 18

SciDAS Ecosystem: CI, clouds and community platforms

Page 19

SciApps: Towards Reproducible Science
• Scientific applications are available in the form of SciApps “virtual appliances” (CC-ADAMANT)
• Borrowed concept from ‘virtual appliance’, i.e., virtual machine image
• A SciApp is configured with the application software needed to reproduce an experiment with the highest fidelity possible
• A SciApp may consist of multiple containers spanning over a virtual network across multiple clouds and CI facilities
• Parameterizable templates will be provided with sane defaults to meet the needs of the scientists

[CC-ADAMANT] Enabling Workflow Repeatability with Virtualization Support, Fan Jiang et al. Workshop on Workflows of Large-Scale Science, Supercomputing Conference (SC15), Austin, Texas, 2015.

Page 20

{
  "id": "pegasus-htc",
  "containers": [
    {
      "id": "submitter",
      "image": "scidas/kinc-submitter",
      "resources": {"cpus": 2, "mem": 4096, "disk": 10240},
      "cluster": "chameleon",
      "port_mappings": [{"container_port": 22, "host_port": 0, "protocol": "tcp"}],
      "args": [
        "-f", "chameleon-master,aws-master,azure-master",
        "-k", "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC303e2y8aUaMQ1IkHWnGFyb5XykxOM5pLK83XFxWZMKsbYcgmkoODZ4w4COratlQPyMXSz7yaFUbYUccXjIjz8SDZf/9c3xI0UuILOiVfb5Ql/dsfssgsfvxcvfdsss321nksnvsnvlkvlksvkkdkddvlkssvn/xk+TORZYK3CE3Oqu9p77nrFM7W3M5khsb5Qg/z0W1TQmVWvo5/i3QbDK6YaWhw/0DXjfCeEtdlTVdIq1EJxMWuJnm5IptB1EtG9GBhuHq5Ct2XkUh",
        "-u", "irodsuser",
        "-p", "fdsfsfsdczxv3rr3r",
        "-h", "irods-renci.scidas.org",
        "-z", "irodsZone"
      ]
    },
    {
      "id": "chameleon-master",
      "image": "scidas/htcondor-worker-centos7:1",
      "cluster": "chameleon",
      "resources": {"cpus": 48, "mem": 49152, "disk": 10240},

Input Output

HTCondor-Pegasus SciApp
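A descriptor like the one on this slide can be read with any JSON parser. The sketch below totals the requested resources per cluster. It uses only the fields visible on the slide (the descriptor is cut off after the second container, so anything beyond that is unknown), the units of `mem` and `disk` are not stated, and the function name `resources_by_cluster` is hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict

# Trimmed copy of the slide's descriptor: only fields visible on the slide are used
descriptor_text = """
{
  "id": "pegasus-htc",
  "containers": [
    {"id": "submitter", "image": "scidas/kinc-submitter", "cluster": "chameleon",
     "resources": {"cpus": 2, "mem": 4096, "disk": 10240}},
    {"id": "chameleon-master", "image": "scidas/htcondor-worker-centos7:1",
     "cluster": "chameleon",
     "resources": {"cpus": 48, "mem": 49152, "disk": 10240}}
  ]
}
"""


def resources_by_cluster(descriptor: dict) -> Dict[str, Dict[str, int]]:
    """Sum cpus, mem and disk over all containers, grouped by target cluster."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"cpus": 0, "mem": 0, "disk": 0}
    )
    for container in descriptor["containers"]:
        for key in ("cpus", "mem", "disk"):
            totals[container["cluster"]][key] += container["resources"].get(key, 0)
    return dict(totals)


sciapp = json.loads(descriptor_text)
totals = resources_by_cluster(sciapp)
print(sciapp["id"], totals)
```

A check like this is one way to see, before launch, how much capacity a multi-container SciApp asks of each cloud.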

Page 21

Just out of the Box: Slurm SciApp

A collection of Docker images that instantiates a CentOS-based Slurm cluster.
● 1 head, 1 database, N compute nodes.
● Uses GlusterFS as a shared file system (NFS in development).
● Utilizes Lmod module system to provide a dynamic execution environment.
● Currently configured to run any Nextflow, CWL, or Toil workflows.

Users get a personalized HPC cluster with root access!
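On a cluster like this, a workflow run usually starts from a batch script submitted with `sbatch`. The sketch below renders such a script as text. The `#SBATCH` options, `module load`, and `nextflow run` are standard Slurm, Lmod, and Nextflow usage, but the module name `nextflow`, the pipeline file `main.nf`, the job name, and the function `render_sbatch` are placeholders, not details from the SciApp itself.

```python
from typing import List


def render_sbatch(
    job_name: str,
    pipeline: str,
    modules: List[str],
    nodes: int = 1,
    walltime: str = "01:00:00",
) -> str:
    """Render a minimal Slurm batch script that loads modules and runs Nextflow."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={walltime}",
        "",
    ]
    # Lmod: load each requested module before running the workflow
    lines += [f"module load {module}" for module in modules]
    lines.append(f"nextflow run {pipeline}")
    return "\n".join(lines) + "\n"


# Placeholder job name, pipeline file and module name
script = render_sbatch("gem-demo", "main.nf", ["nextflow"])
print(script)
```

Writing the result to a file and passing it to `sbatch` on the head node would queue the job; the same pattern would apply to CWL or Toil runners with a different final command.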

Page 22

CloudyCluster

cloudycluster.com

Page 23

The most important network is people!

Feltus Lab: Yuqing Hang (<PhD, G&B); Benafsh Husain (<PhD co-train, BDSI); Allison Hickman (<PhD, G&B); Ben Shealy (<PhD co-train, ECE); Colin Targonski (<MSc, co-train, ECE); Yueyao Gao (<PhD, G&B); Rachel Eimen (<BSc, ECE); Courtney Shearer (<BSc, CS); Cole McKnight (<BSc, CS); Jordan Little (<BSc, G&B); Ethan Bensman (<BSc, G&B); Reed Bender (<BSc, G&B)

Recent alumni: Will Poehlman (PhD, G&B); Melissa Judge (<BSc, Bioengineering); Keerti Kosana (<BSc, CS); Henry Randall (<BSc, Bioengineering); Leland Dunwoodie (<MD, G&B); Olivia Feltus (<BSc, Intern); Nick Watts (Programmer, CCIT); Zach Gerstner (<MS, Microbiology); Jack Fletcher (<BSc, REU); Kim Roche (<PhD, CCIT, G&B); Brittany Rosener (BSc, G&B); Michael Sullivan (<BSc, G&B)

@ Clemson: Karan Sapra (ECE); Melissa Smith (ECE); KC Wang (ECE/CCIT); Walt Ligon (ECE); Nick Mills (ECE); John Calhoun (ECE); Brian Dean (CS); Marc Birtwistle (ChemE); Julia Frugoli (G&B); Suchitra Chavan (G&B); Elsie Schnabel (G&B); Susan Duckett (AVS); Jessi Britt (AVS); Markus Miller (AVS); Stephen Kresovich (PES); Zach Brenton (PES); Randy Martin (CCIT); Corey Ferrier (CCIT); Jim Pepin (CCIT); Clemson Networking (CCIT); Clemson CITI (CCIT)

@ Earth: Stephen Ficklin (WSU); Josh Burns (WSU); Tyler Biggs (WSU); Dorrie Main (WSU); Sook Jung (WSU); Joe Breen (Utah); Jill Wegrzyn (UCONN); Meg Staton (UT-Knoxville); Jim Bottum (Internet2); Ana Hunsinger (Internet2); Marvin Weinstein (Quantum Insights LLC); Ken Matusow (Synergity); Don Preuss (Starfish Storage); Mats Rynge (USC-OSG); Bala Desinghu (U Chicago-OSG); Andrew Paterson (UGA); Claris Castillo (RENCI); Ray Idaszak (RENCI); Paul Ruth (RENCI); Michael Stealey (RENCI); Fan Jiang (RENCI); Mert Cevik (RENCI); Emily Casanova (USC-GHS); Manuel Casanova (USC-GHS); Alex Bowers (Columbia U.); Josh Vandenbrink (Louisiana Tech); Ann Loraine (UNCC); Colleen Doherty (NCSU); John Graham (UCSD); Wallace Chase (REANNZ); Christos Papadopoulos (CSU); Susmit Shannigrahi (CSU); Chengyu Fan (CSU); Mike Shepard (Cisco); Mike Kowal (Cisco)

Many many more

Page 24

Thank You Funding Agencies!!!!!

● “CC*Data: National Cyberinfrastructure for Scientific Data Analysis at Scale (SciDAS).” Source: NSF-CC* [1659300] (A. Feltus, PI)
● “Tripal Gateway: Platform for Next-Generation Data Analysis and Sharing.” Source: NSF-DIBBS [1443040] (S. Ficklin, PI)
● “MCA-PGR: Spatial and Temporal Resolution of mRNA Profiles During Early Nodule Development.” Source: NSF-PGRP [1444461] (J. Frugoli, PI)
● “BIGDATA: F: DKM: Collaborative Research: PXFS: ParalleX Based Transformative I/O System for Big Data.” Source: NSF-BIGDATA [1447771] (W. Ligon, PI)
● “Genomic and Breeding Foundations for Bioenergy Sorghum Hybrids.” Source: Plant Feedstock Genomics for Bioenergy [DE-FOA-000041] (S. Kresovich, PI)
● “Big Data Visualization REU.” Source: National Science Foundation [1359223] (V. Byrd, PI)
● “MRI: Acquisition of a High Performance Computing Instrument for Collaborative Data-Enabled Science.” Source: National Science Foundation [1228312] (A. Apon, PI)
● “CC-NIE Integration: Clemson-NextNet.” Source: National Science Foundation [1245936] (KC Wang, PI)
● “Building non-model species genome curation communities.” Source: National Evolutionary Synthesis Center (NESCent) (A. Papanicolaou, PI)
● “Big Data Analysis Tools for Agricultural Genomics.” Source: Clemson University Experiment Station (USDA Hatch Project) [SC-1700492] (Feltus, PI)

Page 25

Genomics Scale Up Observations

Issues:::Solutions

• Unpredictable time to compute result (queue times, queue times, queue times, broken nodes, segfaults, OOM, data geography, short walltimes):::Software optimization; Real Parallel + Redneck Parallel Computing on GPUs/CPUs; SciDAS

• Not enough computational resources:::OSG, XSEDE, PRP/TNRP, SciDAS, negotiated Cloud credits

• Not enough in-lab ACI knowledge::: IT Engineer Lunch Dates, Governance committees, Research Facilitators, Software Carpentry, Collaborations: CS/CE/Engineering Departments/NRT

• Not enough storage:::Shared Data Grids; Negotiate cheaper storage with campus IT; Move to Cloud; Leverage /scratch space for intermediate files

• Poor use of advanced networks:::Avoid Commercial Internet; Perform data life cycle analysis and push data close to network; Data caching

• Data Organization:::iRODS DataGrid; Tripal Databases; NDN/hICN

Prediction: Giga-/Tera scale genomics experiments will move into the peta-/exa scale in this PhD generation.