A Pervasive Technology Institute Center What is The National Center for Genome Analysis Support?...


a Pervasive Technology Institute Center

What is The National Center for Genome Analysis Support?

• NCGAS is a national center dedicated to providing scientists easy and ready access to the software and supercomputers necessary for the important work of genomics research.

• Initially funded by the National Science Foundation Advances in Biological Informatics (ABI) program, grant # 1062432

• Provides access to memory-rich supercomputers customized for genomics studies, including Mason and other XSEDE systems.

• A Cyberinfrastructure Service Center affiliated with the Pervasive Technology Institute at Indiana University (http://pti.iu.edu)

• Provides distributions of hardened versions of popular codes

• Particularly dedicated to genome assembly software such as:

• de Bruijn graph methods: SOAPdeNovo, Velvet, ABySS

• consensus methods: Celera, Arachne 2

• For more information, see http://ncgas.org

Mason – an HP ProLiant DL580 G7 provided by NCGAS

• 16 node cluster

• 10GE interconnect

– Cisco Nexus 7018

– Compute nodes are oversubscribed 4:1

– This is the same switch that we use for DC and other 10G connected equipment.

• Quad socket nodes

– 8 core Xeon L7555, 1.87 GHz base frequency

– 32 cores per node

– 512 GByte of memory per node!

• Rated at 3.383 TFLOPs (G-HPL benchmark)
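
As a rough sanity check on the Mason figures above, the aggregate core count, total memory, and a theoretical peak can be worked out from the per-node numbers. The sketch below assumes 4 double-precision floating-point operations per core per cycle, a value that is not stated on the slide, so the resulting peak and HPL efficiency are estimates only.

```python
# Figures quoted on the slide.
NODES = 16
SOCKETS_PER_NODE = 4
CORES_PER_SOCKET = 8
BASE_GHZ = 1.87
MEM_GB_PER_NODE = 512
MEASURED_TFLOPS = 3.383  # G-HPL result quoted on the slide

# ASSUMPTION (not from the slide): 4 double-precision FLOPs per core per cycle.
FLOPS_PER_CYCLE = 4


def mason_summary():
    """Return aggregate figures derived from the per-node specification."""
    cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
    mem_tb = NODES * MEM_GB_PER_NODE / 1024
    # cores * GHz * FLOPs/cycle gives GFLOPS; divide by 1000 for TFLOPS.
    peak_tflops = cores * BASE_GHZ * FLOPS_PER_CYCLE / 1000
    return {
        "cores": cores,
        "mem_tb": mem_tb,
        "peak_tflops": peak_tflops,
        "hpl_efficiency": MEASURED_TFLOPS / peak_tflops,
    }


if __name__ == "__main__":
    for name, value in mason_summary().items():
        print(f"{name}: {value}")
```

Under that assumption the cluster totals 512 cores and 8 TB of memory, and the quoted G-HPL figure corresponds to a little under 90% of the estimated peak.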

NCGAS Sandbox Demo at Supercomputing 11

• STEP 1: data pre-processing, to evaluate and improve the quality of the input sequence

• STEP 2: sequence alignment to a known reference genome

• STEP 3: SNP detection to scan the alignment result for new polymorphisms

Early Users: Metagenomics Sequence Analysis

Yuzhen Ye Lab (IU Bloomington School of Informatics)

Early Users: Genome Assembly and Annotation

Michael Lynch Lab (IU Bloomington, Department of Biology)

• Assembles and annotates genomes in the Paramecium aurelia species complex in order to eventually study the evolutionary fates of duplicate genes after whole-genome duplication. This project has also been performing RNAseq on each genome, which is currently used to aid in genome annotation and subsequently to detect expression differences between paralogs.

• The assembler used is based on an overlap-layout-consensus method instead of a de Bruijn graph method (like some of the newer assemblers). It is more memory intensive because it requires performing pairwise alignments between all pairs of reads.

• The annotation of the genome assemblies involves programs such as GMAP, GSNAP, PASA, and Augustus. To use these programs, we need to load in millions of RNAseq and EST reads and map them back to the genome.
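
To illustrate why comparing all pairs of reads is costly, here is a minimal sketch of the overlap step of an overlap-layout-consensus approach. It is not the assembler used by the lab: it uses exact suffix/prefix matching where a real assembler would use error-tolerant alignment, and the function names and the `min_len` threshold are hypothetical.

```python
from itertools import combinations


def pair_count(n_reads):
    """Number of unordered read pairs an all-vs-all comparison must consider."""
    return n_reads * (n_reads - 1) // 2


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_overlaps(reads, min_len=3):
    """Map (i, j) to the overlap length where read i's suffix matches read j's prefix."""
    overlaps = {}
    for i, j in combinations(range(len(reads)), 2):
        # Each unordered pair is checked in both orientations.
        for x, y in ((i, j), (j, i)):
            n = suffix_prefix_overlap(reads[x], reads[y], min_len)
            if n:
                overlaps[(x, y)] = n
    return overlaps


if __name__ == "__main__":
    reads = ["ATGCGT", "CGTTAC", "TACGGA"]
    print(all_overlaps(reads))
    print(pair_count(1_000_000))
```

The number of pairs grows quadratically with the number of reads, so one million reads already imply roughly 5 × 10^11 candidate pairs; practical assemblers prune this with indexing, but the overlap information still has to be held somewhere.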

Early Users: Genome Informatics for Animals and Plants

Genome Informatics Lab (IU Bloomington Department of Biology)

• This project aims to find genes in animals and plants, using the vast amounts of new gene information coming from next-generation sequencing technology.

• The resulting improvements in gene finding are applied to newly deciphered genomes for an environmental sentinel animal, the waterflea (Daphnia); an agricultural pest insect, the pea aphid; the evolutionarily interesting jewel wasp (Nasonia); and the chocolate plant (Th. cacao), which will bring genomics to sustainable cacao agriculture.

• Large-memory compute systems are needed for biological genome and gene transcript assembly because assembling genomic DNA or gene RNA sequence reads (in billions of fragments) into full genomic or gene sequences requires a minimum of 128 GB of shared memory, and more depending on the data set. These programs build graph matrices of sequence alignments in memory.

Early Users: Imputation of Genotypes And Sequence Alignment

Tatiana Foroud Lab (IU School of Medicine, Medical and Molecular Genetics)

• Study complex disorders by using imputation of genotypes, typically for genome-wide association studies, as well as sequence alignment and post-processing of whole genome and whole exome sequencing.

• Requires analysis of markers in a genetic region (such as a chromosome) in several hundred representative individuals genotyped for the full reference panel of SNPs, with extrapolation of the inferred haplotype structures.

• More memory allows the imputation algorithms to evaluate haplotypes across much broader genomic regions, reducing or eliminating the need to partition the chromosomes into segments. This increases the accuracy and speed of imputed genotypes, allowing for improved evaluation of detailed within-study results as well as communication and collaboration (including meta-analysis) using the disease study results with other researchers.

Early Users: Daphnia Population Genomics

Michael Lynch Lab (IU Bloomington Department of Biology)

• This project involves whole genome shotgun sequencing of more than 20 diploid genomes, with genome sizes >200 megabases each.

• With each genome sequenced to over 30x coverage, the full project involves both the mapping of reads to a reference genome and the de novo assembly of each individual genome.

• The genome assembly of millions of small reads often requires very large amounts of memory, for which we once turned to Dash at SDSC. With Mason now online at IU, we have been able to run our assemblies and analysis programs here at IU.
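
The scale of this project can be estimated from the standard coverage relation C = N × L / G (coverage equals number of reads times read length divided by genome size). The sketch below uses the lower bounds quoted above (20 genomes, 200 megabases, 30x); the 100 bp read length is an assumption for illustration and is not given in the source.

```python
import math


def bases_required(genome_size_bp, coverage):
    """Total sequenced bases needed to reach the given average coverage."""
    return genome_size_bp * coverage


def reads_required(genome_size_bp, coverage, read_length_bp):
    """Number of reads N satisfying coverage = N * read_length / genome_size."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


if __name__ == "__main__":
    GENOMES = 20                 # lower bound quoted in the text
    GENOME_SIZE_BP = 200_000_000 # lower bound quoted in the text
    COVERAGE = 30                # lower bound quoted in the text
    READ_LENGTH_BP = 100         # ASSUMPTION: not stated in the source

    per_genome = bases_required(GENOME_SIZE_BP, COVERAGE)
    print("bases per genome:", per_genome)
    print("reads per genome:", reads_required(GENOME_SIZE_BP, COVERAGE, READ_LENGTH_BP))
    print("bases for project:", GENOMES * per_genome)
```

With these inputs each genome needs at least 6 × 10^9 sequenced bases, or about 60 million reads at the assumed read length, and the whole project at least 1.2 × 10^11 bases.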

http://ncgas.org

IU's NCGAS partners include the Texas Advanced Computing Center (TACC) and the San Diego Supercomputer Center (SDSC). NCGAS will support software running on supercomputers at TACC and SDSC, as well as on other supercomputers available as part of XSEDE (the new NSF-funded Extreme Science and Engineering Discovery Environment). NCGAS will also further campus-based integration, known as "campus bridging."

Thomas G. Doak, Le-Shin Wu, Craig A. Stewart, Robert Henschel, William K. Barnett

A specific goal is to provide dedicated access to large memory supercomputers, such as IU's new Mason system. Each Mason compute node has 512GB of random access memory, critical for data-intensive science applications such as genome assembly.

Environmental sequencing

– Sampling DNA sequences directly from the environment

– Since the sequences consist of DNA fragments from hundreds or even thousands of species, the analysis is far more difficult than traditional sequence analysis that involves only one species.

• Assembling metagenomic sequences and deriving genes from the dataset

• Dynamic programming to optimally map consecutive contigs from the assembly.

Since the number of contigs is enormous for most metagenomic datasets, a large-memory computing system is required to hold the dynamic programming tables in memory, so that the task can be completed in polynomial time.
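
The slide does not give the recurrence used for contig mapping, so the following is only a generic illustration of why dynamic programming is memory hungry: a textbook global-alignment score that fills a full (n+1) × (m+1) table, plus a helper that estimates the table's size. The scoring values and the 4 bytes per cell are assumptions.

```python
def dp_table_bytes(n, m, bytes_per_cell=4):
    """Memory for a full (n+1) x (m+1) table; bytes_per_cell is an assumption."""
    return (n + 1) * (m + 1) * bytes_per_cell


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Textbook global alignment score computed with a full DP table."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
    return table[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
    print(dp_table_bytes(100_000, 100_000) / 1e9, "GB")
```

Running time is polynomial (proportional to n × m), but so is the table: two inputs of 100,000 elements each already need about 40 GB at 4 bytes per cell, which is why keeping the table resident in a large shared memory matters.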

For Indiana University’s Supercomputing 11 research sandbox demo, NCGAS implemented a biological application to simulate a sequence alignment and SNP (single nucleotide polymorphism) identification pipeline (shown above). The goal is to demonstrate that, with network bridging between NCGAS computing nodes at IU and a remote storage file system, we are able to run a data-intensive pipeline without repeatedly moving data files.
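
The three pipeline steps (pre-processing, alignment to a reference, SNP detection) can be sketched end to end on toy data. This is a deliberately simplified stand-in, not the software used in the demo: trimming is by a fixed quality threshold, alignment is ungapped brute force, and a SNP is called when at least `min_depth` reads agree on a non-reference base. All function names and thresholds are hypothetical.

```python
from collections import Counter


def trim_low_quality(read, quals, min_q=20):
    """STEP 1: drop trailing bases whose quality score is below min_q."""
    end = len(read)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return read[:end]


def best_alignment(read, reference):
    """STEP 2: ungapped placement with the fewest mismatches, as (offset, mismatches)."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        mismatches = sum(1 for r, g in zip(read, reference[offset:]) if r != g)
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


def call_snps(reads_with_quals, reference, min_depth=2):
    """STEP 3: positions where the most common read base differs from the reference."""
    pileup = {}  # reference position -> Counter of observed bases
    for read, quals in reads_with_quals:
        trimmed = trim_low_quality(read, quals)
        placement = best_alignment(trimmed, reference) if trimmed else None
        if placement is None:
            continue
        offset, _ = placement
        for i, base in enumerate(trimmed):
            pileup.setdefault(offset + i, Counter())[base] += 1
    snps = {}
    for pos, counts in sorted(pileup.items()):
        base, depth = counts.most_common(1)[0]
        if base != reference[pos] and depth >= min_depth:
            snps[pos] = (reference[pos], base)
    return snps


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    reads = [
        ("GTATGTT", [30] * 7),
        ("ATGTTAG", [30] * 7),
        ("TATGTTAN", [30] * 7 + [5]),  # last base is low quality and gets trimmed
    ]
    print(call_snps(reads, reference))
```

In the toy input all three reads carry a T where the reference has a C at position 5 (0-based), so that position is the only one reported.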