A Data Analysis and Coordination Center for the Human Microbiome Project

Owen White, Institute for Genome Sciences, University of Maryland School of Medicine


HMP Initiatives

Initiative 1: Data Resource Generation - sequencing of 400 strains of prokaryotic microbes from different body regions; recruitment of donors; collection of samples; metagenomic sequence analysis;

Initiative 2: Demonstration Projects - relationship between changes in the human microbiome and health or disease onset;

Initiative 3: Technology Development - development of improved culturing techniques; individual microbe sequencing;

Initiative 4: Ethical, Legal, and Social Implications Research - clinical and health; forensics; uses of new technologies; ownership of microbiome;

Initiative 5: Data Analysis and Coordinating Center - tracking, storing and distributing data; data retrieval tools; coordination of analyses and metadata standards; creation of a portal for international activities; and

Initiative 6: Computational Tool Development - new tool development; next generation sequencing platforms; large, complex sequence data; functional data and metadata.

DACC Roles and Responsibilities

Tracking, storing and distributing data

Data and metadata standardization

Distribution of software tools and pipelines

Support for data analysis

Providing a repository of protocols and SOPs

Development of a comprehensive web portal

DACC Collaborators

The Institute for Genome Sciences: Project Coordination; Web Portal; Core Pipelines; Data and Metadata Management

The Joint Genome Institute: HMP Project Catalog (GOLD); Metagenome Analysis Strategies

Lawrence Berkeley National Lab: 16S Data Management (greengenes); HMP Data Analysis System (IMG)

University of Colorado at Boulder: Metadata Standards; Statistical and Analytical Tools

In partnership with.

www.hmpdacc.org


Reference Genome Sequence & Annotation Download


HMP Project Catalog

Relational data model

Tracks project status

Stores comprehensive metadata

Links to public data resources

Provides search/filtering options

HMP Project Catalog: Breakdown by Primary Body Site *

* Includes active and targeted projects

Contains a complete list of all Reference Strains along with detailed metadata about each. Provides both quick and advanced search and download options.

Reference Genomes: MIGS Compliance

DACC Management Web Interface

Enforces the population of required fields

Restricts contents of fields with controlled vocabularies

Provides both individual and bulk update options

Followed by QC steps prior to incorporation into the Catalog
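The required-field and controlled-vocabulary checks described for the DACC management interface can be illustrated with a minimal Python sketch. This is not the DACC implementation: the field names, vocabulary terms, and function names below are hypothetical placeholders chosen only to show the idea of validating one record or a batch of records before a QC step.

```python
# Hypothetical required fields and controlled vocabularies; the actual
# DACC field list and MIGS terms are not given in the slides.
REQUIRED_FIELDS = ("strain_name", "body_site", "project_status")
CONTROLLED_VOCAB = {
    "body_site": {"oral", "skin", "gastrointestinal tract", "urogenital tract", "airways"},
    "project_status": {"targeted", "in progress", "complete"},
}


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    # Required fields must be present and non-blank.
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            problems.append(f"missing required field: {name}")
    # Populated vocabulary-controlled fields must use an allowed term.
    for name, allowed in CONTROLLED_VOCAB.items():
        value = record.get(name)
        if value and value not in allowed:
            problems.append(f"{name}: '{value}' is not in the controlled vocabulary")
    return problems


def bulk_validate(records: list) -> dict:
    """Validate many records at once; map failing record index to its problems."""
    failures = {}
    for index, record in enumerate(records):
        problems = validate_record(record)
        if problems:
            failures[index] = problems
    return failures
```

Records that come back with no problems would then move on to the QC steps; anything reported by `bulk_validate` would be returned to the submitter for correction.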

Genome Analysis at IMG

Metagenomic WGS Data

Sarah Young, John Martin, HMP Data Processing Working Group

Reference context for metagenome analysis

This coming year

Victor Markowitz, Nikos Kyrpides

JGI

WGS submission to NCBI

Centers and DACC are working with NCBI to use a common schema and relevant metadata.

A submission guide and usage notes for the Aspera client and QIIME are available.

NCBI Projects

HMP Top Level (43021)

Characterizing microbiome of healthy individuals (Source = HMP Centers): 16S (48489), WGS (43017)

Associating microbiome with disease (Source = Demo Projects) (46305): WGS, 16S

Reference Genome Top Level (28331)

HMP-Wide Patient Phenotype

Columns: IHMC Variable | Total, Fraction Identical, Mappable, Not mappable, Not present | P1, P2, PN (values in column order, empty cells omitted)

SUBJID | 1.00 0.88 0.13 | SUBJID, SUBJID
Gender | 0.94 0.94 0.06 | Gender
Age | 0.88 0.81 0.06 0.06 0.06 | Age_at_first_visit, AgeAtEnrollment
Race | 0.81 0.44 0.38 0.19 | Race, Race_Other_Text
Other Race | 0.56 0.31 0.25 0.44 | Other Race, Race_Other
Smoking | 0.38 0.31 0.06 0.63 | Smoking_status
Lab | 0.31 0.19 0.13 0.06 0.63 | Diagnosis, TID
Smoking_duration | 0.31 0.19 0.13 0.69 | Smoking_status
Drugs | 0.31 0.19 0.13 0.69 | Antacids, Steroids, Antibiotics
Weight_kg | 0.25 0.25 0.06 0.69 |
BP | 0.19 0.19 0.81 |
Height | 0.19 0.19 0.81 |
Disease | 0.19 0.06 0.13 0.81 |
Institution | 0.13 0.00 0.13 0.88 |
Dose | 0.13 0.06 0.06 0.88 |
Duration | 0.13 0.06 0.06 0.88 |
Start_date | 0.13 0.13 0.88 | TID
Finish_date | 0.13 0.13 0.88 | TID
Location | 0.13 0.13 0.06 0.81 | Other, Country
Drug_name | 0.06 0.06 0.06 0.88 |
HIV/AIDS | 0.00 1.00 |
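Fractions of this kind can be computed from a per-project status for each phenotype variable. The sketch below is illustrative only and rests on two assumptions not stated in the slides: that each fraction is the share of projects falling into a category, and that "Total" is the sum of the identical and mappable shares. The function name and the sample input are hypothetical.

```python
from collections import Counter

CATEGORIES = ("identical", "mappable", "not mappable", "not present")


def summarize(statuses):
    """Summarize one variable, given one category label per project."""
    if not statuses:
        raise ValueError("at least one project status is required")
    counts = Counter(statuses)
    n = len(statuses)
    fractions = {category: counts.get(category, 0) / n for category in CATEGORIES}
    # Assumption: "Total" counts projects whose variable is usable,
    # either as-is (identical) or after mapping (mappable).
    fractions["total"] = fractions["identical"] + fractions["mappable"]
    return fractions


# Hypothetical example: four projects reporting on one variable.
example = summarize(["identical", "identical", "mappable", "not present"])
```

Under these assumptions the total and the not-mappable and not-present shares add up to one for every variable, which gives a quick consistency check on a harmonization table.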

Dirk Gevers & Ashlee Earl, Broad Institute

CloVR - Cloud Virtual Resource

Virtual Machine (Local Computer or Compute Cloud)

Raw Sequence Data: 16S PCR or RT-PCR Products; Total Metagenomic DNA or RNA; Single-Genomic or Pan-Genomic DNA; Eukaryotic DNA or RNA; Eukaryotic DNA

Tracks: Metagenomics; Prokaryotes; Eukaryotes

Steps shown in the diagram: Trimming, filtering; Alignment; Classification; Tree Generation; Phylogenetic Diversity; Community Comparison; Assembly; ORF prediction; Gene prediction; CDS, tRNA, rRNA prediction; Auto-Annotation; Functional diversity; Sequence mapping; Reference; SNP identification; Quantitative Analysis

Annotated Sequence Data: standardized nomenclature suitable for publication

PI: Florian Fricke

Technical lead: Sam Angiuoli

Large-scale Amazon Deployment

Florian Fricke, Sam Angiuoli, Institute for Genome Sciences

Phase 2: More Access

Open access data

Annotated data sets, aggregated, searchable

Some pre-computes

Reference data sets

Research network

Processed files

Aggregated datasets

Metadata

We are surveying the community now! See: Heather Huot Creasy, Catherine Jordan

Phase 2

External users will:

Select data sets/results for download

Search for specific data

Access data archives (some may have controlled access)

See data reports, stats about data, validation process, etc.

See information about metadata

Phase 3: Analysis Tools

Annotation Pipelines

RAMMCAP - Rapid Analysis of Multiple Metagenomes with Clustering and Annotation Pipeline

ShotgunFunctionalizeR

Binning

SOrt-ITEMS - Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences

Community composition, comparative metagenomics

MEGAN (MEtaGenome ANalyzer)

CARMA

GAAS (Genome relative Abundance and Average Size)

Galaxy

GINKGO

Metarep - Suite of web based tools

Metastats - compare clinical metagenomic samples from two treatment populations

RAMMCAP - Statistical metagenome comparison

ShotgunFunctionalizeR - R-package for functional comparison

Visualization

Invue - API and software suite for large scale data visualization

Online resources

My IMG/M - tools for analyzing microbiome functional capability

MG-RAST - variety of comparative and visualization tools

HMP DACC Team

IGS: Jennifer Wortman, Michelle Gwinn Giglio, Heather Huot Creasy, Brandi Cantarel, Jonathan Crabtree, Joshua Orvis, Cesar Arze, Mark Mazaitis, Victor Felix, Catherine Jordan, Anup Mahurkar

LBL: Gary Andersen, Todd DeSantis, Navjeet Singh, Victor Markowitz, Amy Chen

JGI: Nikos Kyrpides, Konstantinos Liolios

Univ. of Colorado: Rob Knight, Dan Knights, Justin Kuczynski

Cornell University: Ruth Ley

San Diego State: Scott Kelley

Argonne National Lab: Folker Meyer

SRA SAMPLE

SRA EXPERIMENT

SRA RUN

SFF FILE

Cardinality labels in the diagram: 1, 1, 1, 1, 1, *

SRA STUDY

FASTQ File
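The SRA objects named in this diagram can be sketched as a small set of Python dataclasses. This is a rough illustration, not the SRA schema: the links follow the general SRA layout (an experiment belongs to a study and refers to one sample, and runs belong to an experiment), the exact cardinalities drawn on the slide cannot be recovered from the transcript, and the class names, attribute names, and accession strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    accession: str


@dataclass
class Run:
    accession: str
    # Data files attached to the run, e.g. SFF or FASTQ file names.
    files: list = field(default_factory=list)


@dataclass
class Experiment:
    accession: str
    sample: Sample
    runs: list = field(default_factory=list)


@dataclass
class Study:
    accession: str
    experiments: list = field(default_factory=list)

    def all_files(self):
        """Collect every data file reachable from this study."""
        return [
            name
            for experiment in self.experiments
            for run in experiment.runs
            for name in run.files
        ]


# Hypothetical identifiers, for illustration only.
sample = Sample("SAMPLE-1")
run = Run("RUN-1", files=["reads.sff", "reads.fastq"])
study = Study("STUDY-1", experiments=[Experiment("EXPERIMENT-1", sample, runs=[run])])
```

Walking from the study down to its files, as `all_files` does, mirrors how a download tool would enumerate the data for one study.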