F. Alex Feltus, Ph.D....2019/03/06 · Just out of the Box: Slurm SciApp A collection of Docker...
It takes [gene, dtn, human] networks to understand biological systems
F. Alex Feltus, Ph.D.
Clemson University Dept. of Genetics & Biochemistry (Professor)
CU-MUSC Biomedical Data Science & Informatics Program (Member)
CU Center for Human Genetics (Member)
Allele Systems LLC (CEO)
Internet2 Board of Trustees (Member)
[email protected] Global Summit @ March 7 2018 1.15pm EST
Angiosperms
My BIOLOGY Lab = 1/3 Animal; 1/3 Plant; 1/3 Computational
Vertebrates
Bioinformatics/ Cyberinfrastructure
Biological Measurements are Getting Too Big For Paper and Laptops
DNA Sequencing Costs Are Dropping; DNA Databases are Swelling
Source: NCBI SRA; Source: NIH
20th Century Storage Rack
21st Century Datastore
25.7 PBs in 10 yrs
Quick Overview of Genomics
774,746 Words; 26 letter alphabet
58,721 Words (Genes); 4 letter alphabet (ATGC)
John 3:16: For God so loved the world that he gave his one and only Son, that whoever believes in him shall not perish but have eternal life.
John 3:16*: For God so loved the world that he gave his one and only Sun, that whoever believes in him shall not perish but have eternal life.
One letter change = confused congregation!
One letter change = higher risk for breast and ovarian cancer!
AATGGAGCCACATAACACATTCAAACTTACTTGCAAAATAT
AATGGAGCCACATAACACATGCAAACTTACTTGCAAAATAT
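The single-letter change between the two sequences above can be located with a position-by-position comparison. The following is a minimal Python sketch, assuming the two strings are already aligned and of equal length; the function name `point_differences` is ours for illustration and does not come from the talk.

```python
def point_differences(ref: str, alt: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, ref letter, alt letter) for every mismatch."""
    if len(ref) != len(alt):
        raise ValueError("sequences must have equal length for a positional comparison")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


# The two sequences shown on the slide.
REF = "AATGGAGCCACATAACACATTCAAACTTACTTGCAAAATAT"
ALT = "AATGGAGCCACATAACACATGCAAACTTACTTGCAAAATAT"

if __name__ == "__main__":
    for pos, r, a in point_differences(REF, ALT):
        print(f"position {pos}: {r} -> {a}")
```

Real variant calling also has to handle insertions, deletions and sequencing error, which this sketch deliberately ignores.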
What is a genome? What is a gene?
Human Genome “Facts”
Initiation instructions for you!
(3.2 billion*2) ATGC
23 chromosome pairs (23andME!)
24 unique chromosomes (1-22, X, Y)
24 strings (Chr1 = 250 million ATGCN)
58,721 gene intervals
GeneGGTTTCTATCTTACTAAAGGATATTTCAGAAAATCTATATTCACTGAGGAGGATGATAATTGGGTCTACTAACATTGAGACTGAACTGAGGCCCAGCAATAATTTAAACTTATTATCCTTTGAAGATTCAACTACTGGGGGAGTACAACAGAAACAAATTAGAGAACATGAAGTTTTAATTCACGTTGAAGATGAAACATGGGACCCAACACTTGATCATTTAGCTAAACATGATGGAGAAGATGTACTTGGAAATAAAGTGGAACGAAAAGAAGATGGATTTGAAGATGGAGTAGAAGACAACAAATTGAAAGAGAATATGGAAAGAGCTTGTTTGATGTCGTTAGATATTACAGAACATGAACTCCAAATTTTGGAACAGCAGTCTCAGGAAGAATATCTTAGTGATATTGCTTATAAATCTACTGAGGTACTAAATAAAGAGGAAGCACATTTTTAGTTATTAGTAGGTTCTGGCAGACTTTATTCCCGTAAAGAGACAGATAGTAAATATTTTAGGCTTTGTGGGACATACAGTCTCTGTTGCAGTAACTCAGTTCTGTTATTGTAGTGTGAAAGCAGCCACAGACAAATGTAAACAAATGTGCCTGTCTTTCAATAAAACTTTATCTACAAATACAGATAGTGGGCCACATTTGGCTTGCGAGCTGTAGTTTGCTGTTCCTGACTTAGTATGATATGCAGCATTGTGTAGTGGAAAGATTTTATAGTATCAACAGACATTTGTAGGAGTTTGCTATTTTATAATTTCGTATGCTTTTTTCATAATAGATGTCTCTGCTTTTTGTGGATTCTCTTCATATCTACTAGTGTTTACTCTTCTGACTTTGATTTTTGGGGGTGTATTTTATTTCTTTATACAGAATTTTTTTCTTTTTTCTTTCTTTTTTTGAGACAAGGTTTTGCTGTGaTCACCCCAGCTGGAGTGCAGTGGTGGGATCTTGGCCCCCTGCGGTCTTCAACTCCTGGGCTTAAGCGATCCTGTCACCTCAGCCTCCCTAGTAGCTGGGACCACAGGCCTGCACCACTACACTCAGCTAATTTTTGTATTTTTTGTAGAGATGGGGTTTCGCTCTGTTGCCCAGACTGGTCTCCAACTCCTGGACTCAAGTGATCCACCTGCCTTGGCCTCCCAAAGTGTTGGAATTATAGGTGTGAGTCACTGTGCCTGGCCCAGAATTTCTTTACTTGCTCATGTGAGAGAGTTTCTAATAAACTCCTGATGCTTTTGCTTGACACTATAGTTTTACTGTCATCTCAGAGTCAGATTTTTTCCTTAATTTCTTCATTAAATTGATTTCTTAATTTCTCATTTGTTTTGCATTAGTTTACAAAATTACTTAAGATGCCTATATTCTCTGAAATTTTATATTTTATTTCACGATAAATAAGTTTAAAAAATTAGATTTGACCATTTAATTAGAGTTATTTCCCCATCATTTTTCTAATGGCAACAATAATTTTAGCATTTGAGCAATTCTCTGTATTTTGGTAACTATTAGTAATCACTTTTATCAGTATAAACATGCATTTCATTTTAAAATGGTTTTCAAAATATTCTCATACCAACCATTCATAAAAGATGTTCCCCTTCTGGGAAATATCAGTTTTTGCTAACATCTGCATTTCCCTTTGATTTTCTAGGAATTTCTTATTAGGTATACACTTTAAGTGATATGTTTCATATAATATTATAGGATTATCAGGATTTCATGATTTTTTTTTTTTGAGATGGAATCTATCTCTGTCGCCCAAGCTGGAGTGCCGTGGCGCAGTCTTGGCTCACTACAACCTCTGCCTCCCCGGTTCAAGTGATTCTCCTACCTCAGCCTCCCGAGTCACTGGGATTACAGGTGCCCGCCACCACACCCGGCTAATTTTTGTATTTTTAGTAGAGAAAGTGCTGGGATTACAGGTGTGAGCCACTGCACCTGGCCTTGGATTTCGTGATTTGACTGACATCTTTTACACCAAT
TATTGCCATTTTTTTATTGGGTACTAATTATTAGTATTGCTTTCTTTATTCGTTAAATAAGAAGTAAAAGAACTTTTAAAAAATAAAATAACTTTTTTTCCTAGTACAAAGCTTATTTTCAATTTTTTTTATTGCTTAAATTTTGCAGGTTAATTAAGGCTGTACTTCTTTTTCCTACTTCCTGTTTAAATATTTGGCATTTTTTGAGGGTTTTCTTTCTATGGATTTTTGTGGGGGATCCCATCTCGTTGGACACTTGGTTCCTGGAGATTGCCAAATTATTGAAAAGTTTCCTCATACCCAAACATTGGCAATAGACTGAATAAATAACATCTCTGACATTATTTTTCAATTATATAGAGAATGCTTTTTGTAAAATACTGATTTATTTAAAGATATTACAATCATAGTAGTTTTTTGCAATCTTTTTTAAATCAGTGGATGTCAAGGAACAACTATTGTTCTTTGTACGGCATTTTTCAACATTGTGAAGAGTTCTTATACTCAAAAAGTTTGGGAGACACAGTGTGACATTGTTTAACCTGAGTTGAACTTGTCATTTTGTATTCTTGTTTAGAGCAGGGATTTTATTTTGTCTCTTTATCTAACTTTGTATTCCTAACTATTTCTTTCTTTCTCTTTGGTTCTTTGTTTCTTTGTTCATTCTTCTCTCTCTCCCTTTTTTATCTATCATATAAATATGTAAATATATATTATATATCCATTAGGGAAGAGGAAGTTGTCTAAAGATATCTAGTATATAGGAGCTTTGTTCTGCAGAAAAGAGGATATGAAGTCAATTATATTGGAAATTAATGCTTAATACTTTTTTTAAAGCATTTATCTCCCAATGATAATGAAAACGATACGTCCTATGTAATTGAGAGTGATGAAGATTTAGAAATGGAGATGCTTAAGGTATGTTTACAATTATAAAAATATTACTTCAAGTTCTTTCCAAAGGACATTTAATTAAGTAAAATATTAACTAATTCTAAACTAGGTTCTACCACAATGAAATTGCTACTAATTATGTAACATTAGATTTCACATTTTCCAATTCATGTTTCTTTCATGTAGTCTATAAATAATGGGTTAGAGGTAATTTACTAATTTTAAATGTGCTTTCCTTGTTCCTTCTTATTTTTTATTTTGAAGACAGGGTCTCACTCTATCACCCAGGCTGGAGTGCAGTTGCTCCATCTCGGCTCACTGCAACCTTCACCTCCTGGGCTCAAGTGATCCTCCTGCATAAGCCTCCCGAGTAGCTGGGATTATGGGCGTGCACCACCATGCCCGGCTAATTTTTATAGTTTTAGTAGAGACAGGGTCTCACCATGTTGTCCTTGCTGGTCTCGAACTCTTTACCTCAAGTAATCCACCCGCCTTGGCCTCCCAAAGTGCTGGGATTACAGGTGTGAGCCACCACACCTGGCCTTGGATTTCATGATTTGACTGACATCTTTTACACAAATTATTGCCATTTTTTTATTGGGTACCAATTATTAGTATTGTTTTCTTTATTCGTTAAATAAGAAGTAAAATAACTTTTTTTTAAAAAAAGAACTTTTTTCCTAGTAAAAAGCTTATTTTCAATTTTTTTATTGCTTAAATTTTGCAGGTTAATTAAGGCTGTGCTTCCTTTTCCTACTTTCTGTTTAAATGTTTGGCATTTTTTGAGGGTTTTCTTTCTATGGATTTTTGTGGGGGATCCCATCTTGTTGGACACTTGGTTCCTGGAGATTGCCAAATTATTGAAAAGTTTCCTCATACCCAAACATTGGCAGTAGACTGAATAAATAACATCTCTGACACTATTTTTCAATTATATAGAGAATGCTATTTGTAAAATACTGATTTATTTAAAGATATTACAGTCAGTAGTTTTTTGCAATCTTTTTAAAATCAGTGGATGTCAAAAACAGCTATTGTTCTTTGTACAGCATTTTTCAACATTGTGAAGAGTTCTTATACTCAAAAAGTTTGGGAGACACAGTTTG
ATATTGTTTAACCTGAGTTGAACTTGTCATTTTGTATTCTTGTTTAGAGCAGGGATTTTATTTTGTCTCTTTATCTAACTTTGTATTCCTATTTCTTTCTTTCTCTTTGGTTCTTTGTTTCTTTGTTCATTCTTCTCTCTCTCCCTTTTTTATCTATCATATAAATATGTAAATATATATTATATATCCATTAGGGAAGAGGAAGCTGTCTAAAGATATCTAGTATATAGGAGCTTTGTTCTGCAGAAAAGAGGATATGAAGTCAATTATATTGGAAATTAATGCTTAATACTTTTTTTTAAAGCATTTATCTCCCAATGATAATGAAAACGATACGTCCTATGTAATTGAGAGTGATGAAGATTTAGAAATGGAGATGCTTAAG
Chromosome Information is Accessed with Enzymes; Blu Ray Information is Accessed with Lasers
http://www.bigbrownboxblog.com.au
http://www.rcsb.org/pdb/101/motm.do?momID=40
The DNA molecule can be copied (Replication) and transcribed into RNA (Transcription), which is then converted into protein (Translation)
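The transcription and translation steps named above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the input is the coding strand read in frame from its first letter, it uses the standard genetic code, and `TOY_GENE` is a made-up nine-letter sequence, not data from the talk.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Transcription: the mRNA matches the coding strand with U in place of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translation: read three letters at a time until a stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


TOY_GENE = "ATGGCCTAA"  # hypothetical example: start codon, one codon, stop codon

if __name__ == "__main__":
    mrna = transcribe(TOY_GENE)
    print(mrna, "->", translate(mrna))
```

Real genes add complications the sketch skips, such as introns, the template versus coding strand, and finding the correct reading frame.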
DNA
PROTEIN
Change DNA and Change The Person: Coffee Example
https://www.23andme.com/health/Caffeine-Metabolism/
TGTGGGCACAGGAC
TGTGGGCACAGGAC
TGTGGGCCCAGGAC
TGTGGGCACAGGAC
TGTGGGCCCAGGAC
TGTGGGCCCAGGAC
The sequence of DNA contains information on how an organism responds to the world.
Public Data Source: The Cancer Genome Atlas
https://portal.gdc.cancer.gov/
PubMed “the cancer genome atlas” yields 5,470 publications
In April 2018, The Cancer Genome Atlas (TCGA) Research Network marked the end of the TCGA program by publishing the Pan-Cancer Atlas: a collection of cross-cancer analyses delving into overarching themes on cancer, including cell-of-origin patterns, oncogenic processes and signaling pathways. The data remains available to the public for further mining through the Genomic Data Commons. In light of this milestone, the following events will provide further discussion of our current understanding of the disease, how basic research is changing patient treatment, and the future of multi-omic studies in cancer.
We mine genomics data for complex gene expression patterns
William L. Poehlman, James Hsieh, and F. Alex Feltus. “Linking Binary Gene Relationships to Drivers of Renal Cell Carcinoma Reveals Convergent Function in Alternate Tumor Progression Paths.” Scientific Reports (in press)
Kidney Cancer Biomarker Network
~1000 kidney tumors
What are the complex genetic interactions underlying Autism and other IDs?
Emily L. Casanova, Zachary Gerstner, Julia L. Sharp, Manuel F. Casanova, F. Alex Feltus. “Widespread Genotype-Phenotype Correlations in Intellectual Disability.” Frontiers in Psychiatry 9:535, 2018. https://doi.org/10.3389/fpsyt.2018.00535
SPARK (Simons Foundation Powering Autism Research for Knowledge)
{Sensitive human data that needs HPC}
https://sparkforautism.org/
We like plants too!
https://genelab.nasa.gov/about/
Josh Vandenbrink @ Louisiana Tech
John Kiss @ UNC-Greensboro
Julia Frugoli lab @ Clemson University
Organism: Medicago truncatula
Trait: In plant fertilizer production.
http://lasernode.org/
Organism: Arabidopsis thaliana
Trait: <1g plant development
My Group Processes Data at the Petascale on Democratized Systems
SciDAS (Scientific Data Analysis at Scale): NSF CC*
Distributed Cloud Computing
In 2017 on OSG …
8.43 Million Wall Hours (962 years on a laptop)
4.50 Million CPU Hours
8.92 Million Jobs
16.6 Million Transfers
4.07 PB
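The "962 years on a laptop" figure follows from dividing the wall hours by the hours in a year. A short Python check, assuming a single laptop running one job at a time without interruption and a 365-day year (both assumptions are ours; the slide does not state how the figure was derived):

```python
WALL_HOURS_2017 = 8.43e6   # wall hours reported on the slide
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year, machine never idle


def wall_hours_to_years(hours: float) -> float:
    """Convert accumulated wall-clock hours to years of continuous running."""
    return hours / HOURS_PER_YEAR


if __name__ == "__main__":
    print(f"{wall_hours_to_years(WALL_HOURS_2017):.1f} years")
```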
PRP/TNRP Kubernetes Cluster
Clemson Palmetto Cluster
11/2018 Top500
[Workflow figure. Labels: DNA; Image Analysis; FASTQ raw; Preprocessing; FASTQ clean; Genome Alignment; BAM; Counting; Normalize; Gene Expression Matrix (GEM); Public Databases (NCBI; EBI; GDC); Reference Genome; GFF3; BDSS; NDN]
Diverse Downstream Workflows
• Gene Expression Quantification
• Differential Gene Expression Analysis
• DNA Polymorphism Discovery
• Biomarker Analysis
• Gene Co-expression Analysis
Data Grids Feed HPC workflows with Tera-/Petabytes of DNA Data
iRODS
hICN
Virtualization System Metadata to encode rich information
Rule engine programmed with rules to enact policies
Data Federation
iRODS Distributed Datagrid
• Integrated Rule Oriented Data System (iRODS) provides a distributed unified namespace over SciDAS storage infrastructure across Clemson, RENCI and WSU (4.2 petabyte Data Grid; 1232 indexed genomes; GEM-GCN Storage)
• iRODS enables policy-driven management critical to data-sharing collaborations in SciDAS
Terrell Russell, Michael Stealey, Jason Coposky, Ben Keller, Claris Castillo, Ray Idaszak, Alex Feltus. Distributing the iRODS Catalog: A Way Forward. iRODSUGM 2017 Proceedings. Page 35, 2017.
Pushing Genomics Data onto Named Data Networking (NDN) Framework
• 1,232 Genomes pulled from iRODS-SciDAS Data Grid (NSF CC* 1659300)
• >500 Gigabytes in Aggregate (tar-gz compressed)
• WSU indexed; CSU-CU moved to NDN
• Naming: ‘Genus->Species->Intraspecies->Assembly->Files’
• NDN Caching near Genomics Workflows!
Hierarchical Named Data Format
[genus]_[species]{_[infraspecific name]}/[assembly_name] that contains these files:
> [genus]_[species]{_[infraspecific name]}-[assembly_name].{n}.ht2
> [genus]_[species]{_[infraspecific name]}-[assembly_name].gff3
> [genus]_[species]{_[infraspecific name]}-[assembly_name].fasta
> [genus]_[species]{_[infraspecific name]}-[assembly_name].gtf
> [genus]_[species]{_[infraspecific name]}-[assembly_name].Splice_sites
> [genus]_[species]{_[infraspecific name]}-[assembly_name].meta.json
/scidasZone/sysbio/PynomeGenomes/Genome/Zymoseptoria_tritici/MG2:
> Zymoseptoria_tritici-MG2.1.ht2
> Zymoseptoria_tritici-MG2.2.ht2
> Zymoseptoria_tritici-MG2.3.ht2
> Zymoseptoria_tritici-MG2.4.ht2
> Zymoseptoria_tritici-MG2.5.ht2
> Zymoseptoria_tritici-MG2.6.ht2
> Zymoseptoria_tritici-MG2.7.ht2
> Zymoseptoria_tritici-MG2.8.ht2
> Zymoseptoria_tritici-MG2.fa
> Zymoseptoria_tritici-MG2.gff3
> Zymoseptoria_tritici-MG2.gtf
> Zymoseptoria_tritici-MG2.meta.json
> Zymoseptoria_tritici-MG2.Splice_sites
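The hierarchical naming scheme can be expressed as a small path builder. This Python sketch is an assumption-laden illustration: the function names are hypothetical, the default of eight `.ht2` index files is taken from the Zymoseptoria example, and the sequence suffix follows the template (`.fasta`) even though the example listing shows `.fa`.

```python
from typing import Optional

# Suffixes taken from the naming template on the slide.
SUFFIXES = ("gff3", "fasta", "gtf", "Splice_sites", "meta.json")


def organism_name(genus: str, species: str, infraspecific: Optional[str] = None) -> str:
    """Join genus, species and the optional infraspecific name with underscores."""
    return "_".join(part for part in (genus, species, infraspecific) if part)


def genome_paths(genus: str, species: str, assembly: str,
                 infraspecific: Optional[str] = None, n_index: int = 8) -> list[str]:
    """Build hierarchical names: organism/assembly/organism-assembly.suffix."""
    organism = organism_name(genus, species, infraspecific)
    stem = f"{organism}-{assembly}"
    names = [f"{stem}.{n}.ht2" for n in range(1, n_index + 1)]  # numbered index files
    names += [f"{stem}.{suffix}" for suffix in SUFFIXES]
    return [f"{organism}/{assembly}/{name}" for name in names]


if __name__ == "__main__":
    for path in genome_paths("Zymoseptoria", "tritici", "MG2"):
        print(path)
```

Because every component of the name is derived from metadata, a consumer can request a genome by name without knowing which host or cache holds it, which is the property NDN relies on.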
Genome Example
Fungal Genet Biol. 2015 Jun; 79: 17–23.
Christos Papadopoulos (CSU); Susmit Shannigrahi (CSU); Chengyu Fan (CSU); Stephen Ficklin (WSU)
https://named-data.net/
NDN Data Discovery Via a Web Based UI and Moved (or copied from cache) to an Endpoint on the Network
http://atmos-sac.es.net/genome/
We are exploring ways to move huge genomics data onto the I2/Cisco Content-Centric Networking (CCN) Testbed using Hybrid Information-Centric Networking (hICN)
https://www.cisco.com/c/dam/en_us/solutions/industries/docs/education/information-centric-networking-education.pdf
https://fd.io/2019/02/introducing-hybrid-information-centric-networking-hicn-a-new-fd-io-project/
CCN Mailing List: Cisco; Internet2; Clemson; Colorado State; Florida International University; George Washington; NIST; Northwestern; Rutgers; U. Michigan; U. Utah; U. Washington; UC-Santa Cruz; UT-San Antonio
SciDAS Ecosystem: CI, clouds and community platforms
SciApps: Towards Reproducible Science
• Scientific applications are available in the form of SciApps “virtual appliances” (CC-ADAMANT)
• Borrowed concept from ‘virtual appliance’, i.e., virtual machine image
• A SciApp is configured with the application software needed to reproduce an experiment with the highest fidelity possible
• A SciApp may consist of multiple containers spanning a virtual network across multiple clouds and CI facilities
• Parameterizable templates will be provided with sane defaults to meet the needs of the scientists
[CC-ADAMANT] Enabling Workflow Repeatability with Virtualization Support, Fan Jiang et al. Workshop on Workflows of Large-Scale Science, Supercomputing Conference (SC15), Austin, Texas, 2015.
{"id": "pegasus-htc",
 "containers": [
  {
   "id": "submitter",
   "image": "scidas/kinc-submitter",
   "resources": {"cpus": 2, "mem": 4096, "disk": 10240},
   "cluster": "chameleon",
   "port_mappings": [{"container_port": 22, "host_port": 0, "protocol": "tcp"}],
   "args": [
    "-f", "chameleon-master,aws-master,azure-master",
    "-k", "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC303e2y8aUaMQ1IkHWnGFyb5XykxOM5pLK83XFxWZMKsbYcgmkoODZ4w4COratlQPyMXSz7yaFUbYUccXjIjz8SDZf/9c3xI0UuILOiVfb5Ql/dsfssgsfvxcvfdsss321nksnvsnvlkvlksvkkdkddvlkssvn/xk+TORZYK3CE3Oqu9p77nrFM7W3M5khsb5Qg/z0W1TQmVWvo5/i3QbDK6YaWhw/0DXjfCeEtdlTVdIq1EJxMWuJnm5IptB1EtG9GBhuHq5Ct2XkUh",
    "-u", "irodsuser", "-p", "fdsfsfsdczxv3rr3r", "-h", "irods-renci.scidas.org", "-z", "irodsZone"
   ]
  },
  {
   "id": "chameleon-master",
   "image": "scidas/htcondor-worker-centos7:1",
   "cluster": "chameleon",
   "resources": {"cpus": 48, "mem": 49152, "disk": 10240
   },
Input Output
HTCondor-Pegasus SciApp
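A SciApp descriptor like the one on this slide can also be assembled programmatically rather than written by hand. The Python sketch below is hypothetical: the field names (`id`, `image`, `cluster`, `resources`, `port_mappings`, `args`) are copied from the slide, but the helper functions, the uniqueness check, and the idea that these are the only required fields are our assumptions, since the slide does not give the schema. Credentials and keys are deliberately left out.

```python
import json


def container(cid, image, cluster, cpus, mem, disk, args=None, port_mappings=None):
    """Build one container entry using the field names shown on the slide."""
    spec = {
        "id": cid,
        "image": image,
        "cluster": cluster,
        "resources": {"cpus": cpus, "mem": mem, "disk": disk},
    }
    if port_mappings:
        spec["port_mappings"] = port_mappings
    if args:
        spec["args"] = args
    return spec


def sciapp(app_id, containers):
    """Bundle containers into one descriptor; reject duplicate ids (our assumption)."""
    ids = [c["id"] for c in containers]
    if len(ids) != len(set(ids)):
        raise ValueError("container ids must be unique")
    return {"id": app_id, "containers": containers}


descriptor = sciapp("pegasus-htc", [
    container("submitter", "scidas/kinc-submitter", "chameleon", 2, 4096, 10240,
              port_mappings=[{"container_port": 22, "host_port": 0, "protocol": "tcp"}],
              args=["-f", "chameleon-master,aws-master,azure-master"]),
    container("chameleon-master", "scidas/htcondor-worker-centos7:1",
              "chameleon", 48, 49152, 10240),
])

if __name__ == "__main__":
    print(json.dumps(descriptor, indent=2))
```

Generating the descriptor from parameters is one way to realize the "parameterizable templates" idea: the same function call with a different cluster name or CPU count yields a new, valid descriptor.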
Just out of the Box: Slurm SciApp
A collection of Docker images that instantiates a CentOS-based Slurm cluster.
● 1 head, 1 database, N compute nodes.
● Uses GlusterFS as a shared file system (NFS in development).
● Utilizes Lmod module system to provide a dynamic execution environment.
● Currently configured to run any Nextflow, CWL, or Toil workflows.
Users get a personalized HPC cluster with root access!
CloudyCluster
cloudycluster.com
Feltus Lab: Yuqing Hang (<PhD, G&B); Benafsh Husain (<PhD co-train, BDSI); Allison Hickman (<PhD, G&B); Ben Shealy (<PhD co-train, ECE); Colin Targonski (<MSc, co-train, ECE); Yueyao Gao (<PhD, G&B); Rachel Eimen (<BSc, ECE); Courtney Shearer (<BSc, CS); Cole McKnight (<BSc, CS); Jordan Little (<BSc, G&B); Ethan Bensman (<BSc, G&B); Reed Bender (<BSc, G&B)
Recent alumni: Will Poehlman (PhD, G&B); Melissa Judge (<BSc, Bioengineering); Keerti Kosana (<BSc, CS); Henry Randall (<BSc, Bioengineering); Leland Dunwoodie (<MD, G&B); Olivia Feltus (<BSc, Intern); Nick Watts (Programmer, CCIT); Zach Gerstner (<MS, Microbiology); Jack Fletcher (<BSc, REU); Kim Roche (<PhD, CCIT, G&B); Brittany Rosener (BSc, G&B); Michael Sullivan (<BSc, G&B)
The most important network is people!
@ Clemson: Karan Sapra (ECE); Melissa Smith (ECE); KC Wang (ECE/CCIT); Walt Ligon (ECE); Nick Mills (ECE); John Calhoun (ECE); Brian Dean (CS); Marc Birtwistle (ChemE); Julia Frugoli (G&B); Suchitra Chavan (G&B); Elsie Schnabel (G&B); Susan Duckett (AVS); Jessi Britt (AVS); Markus Miller (AVS); Stephen Kresovich (PES); Zach Brenton (PES); Randy Martin (CCIT); Corey Ferrier (CCIT); Jim Pepin (CCIT); Clemson Networking (CCIT); Clemson CITI (CCIT)
@ Earth: Stephen Ficklin (WSU); Josh Burns (WSU); Tyler Biggs (WSU); Dorrie Main (WSU); Sook Jung (WSU); Joe Breen (Utah); Jill Wegrzyn (UCONN); Meg Staton (UT-Knoxville); Jim Bottum (Internet2); Ana Hunsinger (Internet2); Marvin Weinstein (Quantum Insights LLC); Ken Matusow (Synergity); Don Preuss (Starfish Storage)
@ Earth: Mats Rynge (USC-OSG); Bala Desinghu (U Chicago-OSG); Andrew Paterson (UGA); Claris Castillo (RENCI); Ray Idaszak (RENCI); Paul Ruth (RENCI); Michael Stealey (RENCI); Fan Jiang (RENCI); Mert Cevik (RENCI); Emily Casanova (USC-GHS); Manuel Casanova (USC-GHS); Alex Bowers (Columbia U.); Josh Vandenbrink (Louisiana Tech); Ann Loraine (UNCC); Colleen Doherty (NCSU); John Graham (UCSD); Wallace Chase (REANNZ); Christos Papadopoulos (CSU); Susmit Shannigrahi (CSU); Chengyu Fan (CSU); Mike Shepard (Cisco); Mike Kowal (Cisco)
Many many more
● “CC*Data: National Cyberinfrastructure for Scientific Data Analysis at Scale (SciDAS).” Source: NSF-CC* [1659300] (A. Feltus, PI)
● “Tripal Gateway: Platform for Next-Generation Data Analysis and Sharing.” Source: NSF-DIBBS [1443040] (S. Ficklin, PI)
● “MCA-PGR: Spatial and Temporal Resolution of mRNA Profiles During Early Nodule Development.” Source: NSF-PGRP [1444461] (J. Frugoli, PI)
● “BIGDATA: F: DKM: Collaborative Research: PXFS: ParalleX Based Transformative I/O System for Big Data.” Source: NSF-BIGDATA [1447771] (W. Ligon, PI)
● “Genomic and Breeding Foundations for Bioenergy Sorghum Hybrids.” Source: Plant Feedstock Genomics for Bioenergy [DE-FOA-000041] (S. Kresovich, PI)
● “Big Data Visualization REU.” Source: National Science Foundation [1359223] (V. Byrd, PI)
● “MRI: Acquisition of a High Performance Computing Instrument for Collaborative Data-Enabled Science.” Source: National Science Foundation [1228312] (A. Apon, PI)
● “CC-NIE Integration: Clemson-NextNet.” Source: National Science Foundation [1245936] (KC Wang, PI)
● “Building non-model species genome curation communities.” Source: National Evolutionary Synthesis Center (NESCent) (A. Papanicolaou, PI)
● “Big Data Analysis Tools for Agricultural Genomics.” Source: Clemson University Experiment Station (USDA Hatch Project) [SC-1700492] (Feltus, PI)
Thank You Funding Agencies!!!!!
Genomics Scale Up Observations
Issues:::Solutions
• Unpredictable time to compute result (queue times, queue times, queue times, broken nodes, segfaults, OOM, data geography, short walltimes):::Software optimization; Real Parallel + Redneck Parallel Computing on GPUs/CPUs; SciDAS
• Not enough computational resources:::OSG, XSEDE, PRP/TNRP, SciDAS, negotiated Cloud credits
• Not enough in-lab ACI knowledge:::IT Engineer Lunch Dates, Governance committees, Research Facilitators, Software Carpentry, Collaborations: CS/CE/Engineering Departments/NRT
• Not enough storage:::Shared Data Grids; Negotiate cheaper storage with campus IT; Move to Cloud; Leverage /scratch space for intermediate files
• Poor use of advanced networks:::Avoid Commercial Internet; Perform data life cycle analysis and push data close to network; Data caching
• Data Organization:::iRODS DataGrid; Tripal Databases; NDN/hICN
Prediction: Giga-/Tera scale genomics experiments will move into the peta-/exa scale in this PhD generation.