Multi-omics infrastructure and data for R/Bioconductor
![Page 1: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/1.jpg)
Multi-omics infrastructure and data for R/Bioconductor
Levi Waldron
Sept 29, 2017
![Page 2: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/2.jpg)
Why Bioconductor?
1,400 packages on a backbone of data structures
The Genomic Ranges algebra
Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).
The integrative data container SummarizedExperiment
![Page 3: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/3.jpg)
Bioconductor core data classes
• Rectangular feature x sample data
– SummarizedExperiment::SummarizedExperiment()
– (RNAseq count matrix, microarray, …)
• Genomic coordinates
– GenomicRanges::GRanges() (1-based, closed interval)
• DNA / RNA / AA sequences
– Biostrings::*StringSet()
• Gene sets
– GSEABase::GeneSet(), GSEABase::GeneSetCollection()
• Single cell data
– SingleCellExperiment::SingleCellExperiment()
• Mass spec data
– MSnbase::MSnExp()
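As a rough illustration of the first two classes in the list above, the sketch below builds a tiny SummarizedExperiment whose rows carry GRanges coordinates. The gene names, counts, coordinates, and sample labels are all invented for the example and are not from the talk.

```r
library(SummarizedExperiment)  # also attaches GenomicRanges, IRanges, S4Vectors

# Toy count matrix: 3 features (rows) x 2 samples (columns); values are made up.
counts <- matrix(c(10L, 0L, 5L, 7L, 3L, 1L), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Genomic coordinates for the features. GRanges is 1-based with closed
# intervals, so the range 101-200 spans 200 - 101 + 1 = 100 bases.
rr <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = c(101, 501, 901), end = c(200, 650, 1000)),
              strand = c("+", "-", "+"))
names(rr) <- rownames(counts)

# Sample-level annotation, one row per column of the count matrix.
cd <- DataFrame(condition = c("control", "treated"),
                row.names = colnames(counts))

se <- SummarizedExperiment(assays = list(counts = counts),
                           rowRanges = rr, colData = cd)

dim(se)               # features x samples
width(rowRanges(se))  # widths follow the closed-interval convention

# Subsetting keeps assay values, ranges, and sample annotation in sync.
sub <- se[c("geneA", "geneC"), se$condition == "treated"]
assay(sub, "counts")
```

The point of the container is the last step: one subsetting call filters the matrix, the row ranges, and the sample table together.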
![Page 4: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/4.jpg)
Credit: Marcel Ramos
Diseases, platforms, and data types of TCGA
33 diseases
50 platforms
19 data types
Multi-assay experiments can be complex
![Page 5: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/5.jpg)
The need for MultiAssayExperiment
Need a core data structure to:
– harmonize single-assay data structures
– relate multiple assays & clinical data
– handle missing and replicate observations
– accommodate ID-based and range-based data
– support on-disk representations of big data
![Page 6: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/6.jpg)
MultiAssayExperiment design
Credit: Marcel Ramos
Ramos et al. Software for the Integration of Multi-omics Experiments in Bioconductor. Cancer Research (In Press).
![Page 7: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/7.jpg)
TCGA as MultiAssayExperiments
Access from www.github.com/waldronlab/MultiAssayExperiment
…... 33 cancer types
![Page 8: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/8.jpg)
TCGA as MultiAssayExperiments
> acc
A MultiAssayExperiment object of 9 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 9:
[1] RNASeq2GeneNorm: ExpressionSet with 20501 rows and 79 columns
[2] miRNASeqGene: ExpressionSet with 1046 rows and 80 columns
[3] CNASNP: RaggedExperiment with 79861 rows and 180 columns
[4] CNVSNP: RaggedExperiment with 21052 rows and 180 columns
[5] Methylation: SummarizedExperiment with 485577 rows and 80 columns
[6] RPPAArray: ExpressionSet with 192 rows and 46 columns
[7] Mutations: RaggedExperiment with 20166 rows and 90 columns
[8] gistica: SummarizedExperiment with 24776 rows and 90 columns
[9] gistict: SummarizedExperiment with 24776 rows and 90 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
>
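The accessors listed in the printout above can be tried on a toy object. The following is a minimal sketch, assuming two small made-up assays ("rna" and "cn") and invented patient IDs and ages; it is not the TCGA object shown on the slide.

```r
library(MultiAssayExperiment)

# Two toy assays measured on overlapping, but not identical, sets of patients.
rna <- matrix(1:8, nrow = 2,
              dimnames = list(c("g1", "g2"), c("p1", "p2", "p3", "p4")))
cn  <- matrix(c(-1, 0, 2, 1, 0, -2), nrow = 2,
              dimnames = list(c("g1", "g2"), c("p2", "p3", "p5")))

# One row of clinical data per patient; row names are the patient IDs.
clin <- DataFrame(age = c(61, 48, 55, 70, 66),
                  row.names = c("p1", "p2", "p3", "p4", "p5"))

# Assay column names match the colData row names here, so the sample map can
# be inferred; with differing IDs an explicit sampleMap would be needed.
toy <- MultiAssayExperiment(experiments = list(rna = rna, cn = cn),
                            colData = clin)

experiments(toy)        # the ExperimentList
colData(toy)            # patient-level DataFrame
sampleMap(toy)          # which assay column belongs to which patient
toy$age                 # `$` extracts a colData column
toy[["cn"]]             # `[[` extracts one experiment
toy[, c("p2", "p3"), ]  # `[` subsets by (features, patients, assays)
```

Note that patient p5 has copy-number data but no RNA data, and p1 and p4 the reverse; the sample map records this, so missing observations need no padding with NA.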
![Page 9: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/9.jpg)
The MultiAssayExperiment API
Credit: Marcel Ramos
![Page 10: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/10.jpg)
For building visualizations
UpSet Venn diagram for TCGA adrenocortical carcinoma
> data(miniACC)
> upsetSamples(miniACC)
![Page 11: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/11.jpg)
For multi-omics analysis
> mae <- mae[, , c("Mutations", "gistict")]
> mae <- intersectColumns(mae)
> mae$cnload <- colMeans(abs(assay(mae[["gistict"]])))
Davoli et al. Tumor aneuploidy correlates with markers of immune evasion and with reduced response to immunotherapy. Science 355, (2017).
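The three-line analysis above assumes an existing TCGA object named mae. A self-contained sketch of the same steps on invented data might look like the following; the gene names, patient IDs, stages, and scores are hypothetical stand-ins, and only the assay names "Mutations" and "gistict" are taken from the slide.

```r
library(MultiAssayExperiment)

# Toy stand-ins for the two TCGA assays; all values are invented.
mut <- matrix(c(1, 0, 0, 1, 1, 0), nrow = 2,
              dimnames = list(c("TP53", "KRAS"), c("p1", "p2", "p3")))
gis <- matrix(c(-2, 1, 0, 1, 2, -1, 1, 1), nrow = 2,
              dimnames = list(c("TP53", "KRAS"), c("p2", "p3", "p4", "p5")))
clin <- DataFrame(stage = c("I", "II", "II", "III", "IV"),
                  row.names = paste0("p", 1:5))

mae <- MultiAssayExperiment(
  list(Mutations = mut,
       gistict = SummarizedExperiment(list(scores = gis))),
  colData = clin)

# Select the two assays of interest (a no-op here, kept to mirror the slide).
mae <- mae[, , c("Mutations", "gistict")]

# Keep only patients that have data in every remaining assay (here p2 and p3).
mae <- intersectColumns(mae)

# Per-patient copy-number load: mean absolute GISTIC score across genes,
# stored as a new colData column next to the clinical variables.
mae$cnload <- colMeans(abs(assay(mae[["gistict"]])))
colData(mae)
```

Because the derived score lands in colData, it can be compared directly against mutation status or clinical variables for the same patients.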
![Page 12: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/12.jpg)
For integrating remotely stored data
> st <- ldblock::stack1kg() # Create a URL referencing 1000 genomes content in AWS S3
> multiban <- MultiAssayExperiment(
list(meth = banovichSE, snp = st),
colData = colData(banovichSE))
> multibanfocus <- multiban[rowRanges(banovichSE)["cg04793911"], , ]
> assoc <- cisAssoc(multibanfocus[["meth"]],
TabixFile(files(multibanfocus[["snp"]])))
Using tabix-indexed SNP VCFs from 1000 Genomes on Amazon S3
Credit: Vince Carey
![Page 13: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/13.jpg)
A big software engineering effort
![Page 14: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/14.jpg)
Past curated*Data Bioconductor packages
• curatedOvarianData
– 30 datasets, > 3K unique samples
– survival, surgical debulking, histology...
• curatedCRCData
– 34 datasets, ~4K unique samples
– many annotated for MSS, gender, stage, age, N, M
• curatedBladderData
– 12 datasets, ~1,200 unique samples
– many annotated for stage, grade, OS
![Page 15: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/15.jpg)
curatedMetagenomicData: motivation
• Increasing amount of public data
• Can be fast and free, but hard to use:
– fastq files from NCBI, EBI, ...
– bioinformatic expertise
– computational resources
– manual curation / standardization
• Wanted to make acquisition of curated, ready-to-use public data easy and reproducible
![Page 16: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/16.jpg)
curatedMetagenomicData: pipeline
• Inputs
– Raw fastq files: 13 datasets, 2,875 samples; download (~57TB)
– Study metadata: age, body site, disease, etc.
• Uniform processing (offline, high computational load pipeline, > 120 kH CPU)
– MetaPhlAn2: species abundance, marker presence, marker abundance
– HUMAnN2: gene family abundance, metabolic pathway abundance, metabolic pathway presence
• Manual curation
– standardized metadata
• Integration
– Integrated Bioconductor ExpressionSet objects: per-patient microbiome data, per-patient metadata, experiment-wide metadata
– Automatic documentation
• ExperimentHub product
– Amazon S3 cloud distribution, tag-based searching, dataset snapshot dates, automatic local caching
• User experience
– Convenience download functions, megabytes-sized datasets
– Differential abundance, diversity metrics, clustering, machine learning
https://waldronlab.github.io/curatedMetagenomicData/
![Page 17: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/17.jpg)
curatedMetagenomicData: use
One dataset from R:
> curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.stool")
, relab=FALSE)
Many datasets from R:
> curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.*")
Command-line:
$ curatedMetagenomicData -p "HMP_2012.metaphlan_bugs_list.*"
Pasolli/Schiffer/Manghi et al. Accessible, Curated Metagenomic Data Through ExperimentHub. Nature Methods (In Press).
![Page 18: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/18.jpg)
Supervised disease classification
Credit: Edoardo Pasolli
![Page 19: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/19.jpg)
Unsupervised clustering
Credit: Audrey Renson
![Page 20: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/20.jpg)
Unsupervised clustering
Credit: Audrey Renson
![Page 21: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/21.jpg)
Meta-analysis
(partial) validation of reported associations between genera and BMI
Credit: Lucas Schiffer
Beaumont M et al. Heritable components of the human fecal microbiome are associated with visceral fat. Genome Biol. 2016;17:189.
![Page 22: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/22.jpg)
Meta-analysis
“Protective” bacteria for CRC
• Lower in stool samples of CRC cases compared to healthy controls
![Page 23: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/23.jpg)
curatedMetagenomicData summary
• 25 datasets (5,716 samples) available
• Six data products per dataset
• Three taxonomy-based from MetaPhlAn2
• Three functional from HUMAnN2
• Reproduce all analyses in manuscript at:
– https://waldronlab.github.io/curatedMetagenomicData/analyses/
• Lowest barrier to entry, highest level of curation of any microbiome data resource
Pasolli/Schiffer/Manghi et al., bioRxiv 103085
![Page 24: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/24.jpg)
Future work
• Integrated databases as HDF5, indexed remote files
– fast remote slicing of ranges, genes, gene families...
• Distribute TCGA, cBioPortal through ExperimentHub
– omics and clinical data as MultiAssayExperiments
• Curated microbial signatures / BugSigDB
![Page 25: Multi-omics infrastructure and data for R/Bioconductor](https://reader030.fdocuments.in/reader030/viewer/2022021507/5a6667b17f8b9a47688b6025/html5/thumbnails/25.jpg)
Thank you
• Lab (www.waldronlab.org / www.waldronlab.github.io)
– Lucas Schiffer (curatedMetagenomicData), Marcel Ramos (MultiAssayExperiment)
– Audrey Renson, Andy Samedy, Rimsha Azar, Carmen Rodriguez, Tiffany Chan, Abzal Bacchus, Jaya Amatya, Ludwig Geistlinger
• Collaborators
– Nicola Segata lab: Francesco Beghini, Edoardo Pasolli, Paolo Manghi
– Heidi Jones, Jennifer Dowd, Sharon Perlman, Lorna Thorpe, Robert Burk Lab (NYC-HANES)
– Valerie Obenchain, Martin Morgan (Bioconductor core team)
• CUNY High-performance Computing Center