Katerina Kechris, PhD, Associate Professor, Biostatistics and Informatics

SAMSI 2014-2015 Program Beyond Bioinformatics: Statistical and Mathematical Challenges. Topic: Data Integration. Katerina Kechris, PhD, Associate Professor, Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver



Page 1: Katerina  Kechris , PhD Associate Professor Biostatistics and Informatics

SAMSI 2014-2015 Program
Beyond Bioinformatics: Statistical and Mathematical Challenges

Topic: Data Integration

Katerina Kechris, PhD
Associate Professor, Biostatistics and Informatics
Colorado School of Public Health
University of Colorado Denver

Page 2

Omics

• Large-scale analyses for studying a population of molecules or molecular mechanisms
• High-throughput data
• Examples
– Genomics (entire genome – DNA)
– Proteomics (study of protein repertoire)
– Epigenomics (study of DNA and histone modifications)

Page 3

Omics

[Figure: diagram of omics levels; recoverable labels: Epigenome, Phenome]

Adapted from:
http://www.sciencebasedmedicine.org
http://www.scientificpsychic.com/fitness/transcription.gif
http://themedicalbiochemistrypage.org/images/hemoglobin.jpg
http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png
http://creatia2013.files.wordpress.com/2013/03/dna.gif

Page 4

Large-scale Projects & Databases

NCI 60 Database

Page 5

Integration of Omics Data

• Each type of data gives a different snapshot of the biological or disease system
• Why integrate data?
• Reduce false positives/negatives
• Identify interactions between different molecules
• Explore functional mechanisms

Page 6

Challenges

1. When to integrate?
2. Dimensionality
3. Resolution
4. Heterogeneity
5. Interactions and Pathways

Page 7

Challenge 1: When to integrate?

• Early
– Merging data to increase sample size
• Intermediate
– Convert different data sources into common format (e.g., ranks, correlation matrices), kernel-based analysis
• Late
– Meta-analysis (combine effect size or p-value), aggregate voting for classifiers, genomic enrichment and overlap of significant results
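As a small illustration of the intermediate strategy, the sketch below converts two data sources measured on different scales into ranks, so that they share a common format. The function name `average_ranks` and the toy values are hypothetical and not taken from the slides.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


# Hypothetical measurements for the same four genes on two platforms.
microarray_intensity = [7.2, 9.1, 7.2, 11.5]  # log-scale intensities
sequencing_counts = [15, 220, 40, 1800]       # raw read counts

print(average_ranks(microarray_intensity))  # [1.5, 3.0, 1.5, 4.0]
print(average_ranks(sequencing_counts))     # [1.0, 3.0, 2.0, 4.0]
```

After ranking, both sources live on the same 1..n scale and can be compared or combined without platform-specific units.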

Page 8

Genomic Meta-analysis: Combining Multiple Transcriptomic Studies

Tseng Lab, U. of Pitt.
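One classical way to combine p-values across studies is Fisher's method. The sketch below is a generic illustration of that technique, assuming the studies are independent; it is not claimed to be the approach used by the lab cited on this slide, and the function name is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the upper tail has the closed form
        P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if not pvalues or any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    term, tail = 1.0, 0.0
    for i in range(k):
        tail += term                 # adds (X/2)^i / i!
        term *= half_stat / (i + 1)  # advance to the next series term
    return math.exp(-half_stat) * tail


# Hypothetical p-values for one gene from three transcriptomic studies.
print(fisher_combined_pvalue([0.04, 0.10, 0.03]))
```

A gene with moderately small p-values in every study can reach a much smaller combined p-value than in any single study, which is the motivation for late integration by meta-analysis.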

Page 9

Assessing Genomic Overlap: Permutation-based Strategies

Bickel Lab, Berkeley & ENCODE
Ann. Appl. Stat. (2010) 4:4 1660-1697.
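To make the idea of a permutation-based overlap test concrete, here is a deliberately simple sketch that uses random circular shifts of one feature set along a toy genome. This is an assumed, simplified null model chosen for illustration; it is not the procedure from the paper cited on this slide, and all names and coordinates are hypothetical.

```python
import random


def covered_positions(intervals, genome_length, shift=0):
    """Base positions covered by half-open [start, end) intervals,
    after circularly shifting every interval by `shift` bases."""
    covered = set()
    for start, end in intervals:
        for pos in range(start, end):
            covered.add((pos + shift) % genome_length)
    return covered


def overlap_permutation_test(features_a, features_b, genome_length,
                             n_permutations=1000, seed=0):
    """Return (observed overlap in bases, one-sided permutation p-value)."""
    rng = random.Random(seed)
    covered_a = covered_positions(features_a, genome_length)
    observed = len(covered_a & covered_positions(features_b, genome_length))
    at_least_as_large = 0
    for _ in range(n_permutations):
        shift = rng.randrange(genome_length)
        shifted_b = covered_positions(features_b, genome_length, shift)
        if len(covered_a & shifted_b) >= observed:
            at_least_as_large += 1
    # Add one to numerator and denominator so the p-value is never zero.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Toy genome of 1000 bases with two hypothetical feature sets.
binding_sites = [(100, 150), (400, 450), (700, 750)]
conserved_regions = [(110, 160), (410, 460), (710, 760)]
print(overlap_permutation_test(binding_sites, conserved_regions, 1000))
```

Circular shifting keeps the spacing between features in one set intact, which is one reason it is sometimes preferred over placing each interval independently at random.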

Page 10

Challenge 2: Dimensionality

• Most technologies produce tens of thousands to hundreds of thousands of measurements per sample
– Exponential increase with 2+ data types
• Dimension reduction
– Process data type separately (filtering)
– Combine with model fitting
– Multivariate analysis
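A minimal sketch of the filtering option: each data type is reduced on its own before integration by keeping only the most variable features. The variance criterion, the function name, and the toy matrix are assumptions made for illustration.

```python
from statistics import pvariance


def top_variance_features(matrix, feature_names, keep):
    """Keep the `keep` features with the largest variance across samples.

    `matrix` holds one row per feature and one column per sample.
    """
    scored = sorted(zip(feature_names, matrix),
                    key=lambda pair: pvariance(pair[1]), reverse=True)
    return [name for name, _ in scored[:keep]]


# Hypothetical expression values: three genes measured in three samples.
names = ["gene_a", "gene_b", "gene_c"]
expression = [[1, 1, 1], [1, 5, 9], [2, 3, 4]]
print(top_variance_features(expression, names, keep=2))  # ['gene_b', 'gene_c']
```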

Page 11

Sparse Multivariate Methods

• Variable Selection, Discriminant Analysis, Visualization
• Penalties (or regularization) to reduce parameter space, only a few entries are non-zero (sparsity)
• Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS)

Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford
Stat Appl Genet Mol Biol. 2009 January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008;7(1):Article 35
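The sketch below shows how an L1-type penalty produces sparsity: soft-thresholding sets small entries exactly to zero. It is used here inside alternating power iterations on a cross-product matrix between two data types, in the general spirit of sparse CCA. This is a simplified illustration with fixed, hand-picked penalties, not the exact algorithm from the cited papers; all names and numbers are hypothetical.

```python
import math


def soft_threshold(vector, penalty):
    """Shrink each entry toward zero by `penalty`; small entries become 0."""
    return [math.copysign(max(abs(x) - penalty, 0.0), x) for x in vector]


def normalize(vector):
    """Scale to unit Euclidean length (all-zero vectors are returned as is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else vector


def sparse_rank_one(cross_product, penalty_u, penalty_v, n_iter=50):
    """Alternating soft-thresholded power iterations on a p x q matrix."""
    p, q = len(cross_product), len(cross_product[0])
    v = normalize([1.0] * q)
    u = [0.0] * p
    for _ in range(n_iter):
        # Update weights for data type 1 given the weights for data type 2.
        u = normalize(soft_threshold(
            [sum(cross_product[i][j] * v[j] for j in range(q)) for i in range(p)],
            penalty_u))
        # Update weights for data type 2 given the weights for data type 1.
        v = normalize(soft_threshold(
            [sum(cross_product[i][j] * u[i] for i in range(p)) for j in range(q)],
            penalty_v))
    return u, v


# Hypothetical cross-product matrix: the first two features of each data
# type are strongly associated, the third ones are mostly noise.
cross = [[5.0, 4.0, 0.1],
         [4.0, 5.0, 0.2],
         [0.1, 0.2, 0.3]]
u, v = sparse_rank_one(cross, penalty_u=0.5, penalty_v=0.5)
print(u, v)  # the third weight in each vector is shrunk to zero
```

Larger penalties zero out more weights, which trades some fit for a smaller, more interpretable set of selected variables.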

Page 12

Challenge 3: Genomic Resolution

• Base level (conservation, motif scores)
• Regular intervals (expression/binding from tiling arrays)
• Irregular intervals
– Gene/ncRNA level data (expression)
– Individual positions (SNP, methylation sites)
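One common way to reconcile resolutions is to summarize a base-level track over gene-level intervals. The sketch below averages a toy conservation track within each gene; the coordinates, the scores, and the choice of the mean as the summary are assumptions made for illustration.

```python
def mean_score_per_interval(base_scores, intervals):
    """Average a base-level track over named half-open [start, end) intervals.

    Intervals that cover no positions map to None.
    """
    summary = {}
    for name, (start, end) in intervals.items():
        window = base_scores[start:end]
        summary[name] = sum(window) / len(window) if window else None
    return summary


# Hypothetical per-base conservation scores and gene coordinates.
conservation = [0.1, 0.9, 0.8, 1.0, 0.2, 0.3, 0.7, 0.5]
genes = {"gene_a": (1, 4), "gene_b": (5, 8)}
print(mean_score_per_interval(conservation, genes))
```

After this step the conservation data are on the same gene-level scale as expression data and can be matched feature by feature.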

Page 13

Challenge 4: Heterogeneity

• Technology-specific sources of error
• Different pre-processing, normalization
• Different amounts of missing values
• Data matching
– Different identifiers
– Not always one-to-one (microarrays)
– Imputation
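The data-matching issue can be sketched as follows: several microarray probes may map to one gene, some probes have no mapping, and some values are missing. This example collapses probes to genes with the median; the identifiers, the mapping, and the choice of the median are hypothetical.

```python
from collections import defaultdict
from statistics import median


def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Summarize probe-level values per gene using the median.

    Probes with no mapping or with a missing value (None) are skipped.
    """
    per_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None and value is not None:
            per_gene[gene].append(value)
    return {gene: median(values) for gene, values in per_gene.items()}


# Hypothetical probe intensities and a many-to-one probe-to-gene mapping.
probe_values = {"probe_1": 8.0, "probe_2": 9.0, "probe_3": 5.5,
                "probe_4": None, "probe_5": 7.0}
probe_to_gene = {"probe_1": "gene_a", "probe_2": "gene_a",
                 "probe_3": "gene_b", "probe_4": "gene_b"}
print(collapse_probes_to_genes(probe_values, probe_to_gene))
# {'gene_a': 8.5, 'gene_b': 5.5}
```

Skipping missing values is only one option; imputation, as listed on the slide, would fill them in instead.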

Page 14

Challenge 4: Heterogeneity

• Continuous – expression and binding data from microarrays, motif scores, protein/metabolite abundance
• Counts – expression data from sequencing
• 0-1 – conservation (UCSC), DNA methylation
• Binary/Categorical – thresholding (e.g., motif scores), genotype
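Two simple transformations are often used to bring heterogeneous data types closer together: a log transform for sequencing counts and a cutoff that turns continuous scores into binary calls. The sketch below shows both; the pseudocount, the cutoff, and the toy values are assumptions.

```python
import math


def log2_counts(counts, pseudocount=1.0):
    """Log-transform read counts; the pseudocount avoids log(0)."""
    return [math.log2(c + pseudocount) for c in counts]


def binarize(scores, cutoff):
    """Threshold continuous scores into 0/1 calls (1 if score >= cutoff)."""
    return [1 if s >= cutoff else 0 for s in scores]


# Hypothetical sequencing counts and motif scores.
print(log2_counts([0, 1, 3]))         # [0.0, 1.0, 2.0]
print(binarize([0.2, 0.8, 0.5], 0.5))  # [0, 1, 1]
```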

Page 15

Case Study: Development

Ci (Cubitus interruptus)
• important for differentiation of appendages during development
• transcription factor – binds to DNA near target genes

http://www.biology.ualberta.ca/locke.hp/research.htm
http://howardhughes.trinity.duke.edu

Kechris Lab, CU Denver

Page 16

Hierarchical Mixture Model

• Data
– Transcriptome: Ci pathway mutants (expr) – irregular interval
– Genome: DNA binding data of Ci (bind) – regular interval, DNA conservation across 14 insect species (cons) – base level
• Goal: Predict gene targets of Ci
• Hidden variable is gene target – hierarchical mixture model

Dvorkin et al., 2013 (under review)
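To illustrate what a hidden "gene target" variable means, the sketch below computes the posterior probability that a gene is a target under a flat two-component mixture. It assumes Gaussian distributions and conditional independence of the three data types given the hidden label. This is a simplified stand-in and not the hierarchical model of Dvorkin et al.; every parameter value is hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def posterior_target_probability(observations, target_params,
                                 background_params, prior_target):
    """P(gene is a target | data) for one gene.

    `observations` maps a data type name to its value; the two parameter
    dicts map the same names to (mean, sd) for each mixture component.
    """
    like_target = prior_target
    like_background = 1.0 - prior_target
    for data_type, value in observations.items():
        like_target *= normal_pdf(value, *target_params[data_type])
        like_background *= normal_pdf(value, *background_params[data_type])
    return like_target / (like_target + like_background)


# Hypothetical (mean, sd) per data type for targets and non-targets.
target = {"expr": (2.0, 1.0), "bind": (2.0, 1.0), "cons": (0.8, 0.1)}
background = {"expr": (0.0, 1.0), "bind": (0.0, 1.0), "cons": (0.5, 0.2)}
gene = {"expr": 2.0, "bind": 2.0, "cons": 0.9}
print(posterior_target_probability(gene, target, background, prior_target=0.1))
```

In practice the component parameters and the prior would be estimated from the data, for example with an EM algorithm, rather than fixed by hand as they are here.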

Page 17

Challenge 5: Interactions and Pathways

• Known Pathways
– Incorporate information in databases (curated but sparse)
– e.g., KEGG pathways have metabolite – protein interactions (directed graphs)
• De novo Pathways
– Discover novel interactions
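A known pathway with metabolite and protein nodes can be stored as a directed graph. The sketch below uses a plain adjacency dictionary and a breadth-first search to list everything downstream of a node. The toy pathway and node names are hypothetical and do not come from KEGG.

```python
from collections import deque


def downstream_nodes(pathway, source):
    """Nodes reachable from `source` by following directed edges."""
    seen, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in pathway.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Hypothetical pathway: metabolites are converted by enzymes (proteins).
toy_pathway = {
    "metabolite_1": ["enzyme_A"],
    "enzyme_A": ["metabolite_2"],
    "metabolite_2": ["enzyme_B"],
    "metabolite_4": ["enzyme_B"],
    "enzyme_B": ["metabolite_3"],
}
print(sorted(downstream_nodes(toy_pathway, "metabolite_2")))
# ['enzyme_B', 'metabolite_3']
```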

Page 18

Known Pathways

Jornsten, Chalmers & Michailidis, U. Michigan
Biostatistics (2012) 13:4 748-761

Joint modeling of metabolite and transcript data to identify active pathways

[Figure; recoverable labels: metabolite, gene]

Page 19

de novo Interactions

• Single data
INTEGRATION
• Pair-wise
– Correlations (e.g., eQTL)
– Bayesian networks
• Multiple
– Kernel-based methods
– Probabilistic graphical models
– Network analysis

[Figure; recoverable labels: gene, SNP, protein, metabolite, gene, methylation site, PHENOTYPE]
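The simplest pair-wise integration is a correlation between two data types measured on the same samples, as in an eQTL analysis. The sketch below correlates a SNP genotype (coded as the number of minor alleles) with the expression of one gene. The data are made up, and a real eQTL analysis would also involve covariates and multiple-testing correction.

```python
import math
from statistics import mean


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data for six samples: genotype at one SNP, expression of one gene.
genotype = [0, 0, 1, 1, 2, 2]
expression = [5.1, 4.9, 6.0, 6.2, 7.1, 6.9]
print(pearson_correlation(genotype, expression))
```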

Page 20

de novo Interactions

Shojaie Lab, U. Washington
Biometrika (2010) 97 (3): 519-538.

Page 21

Summary Methodology

1. Meta-analysis
2. Permutation-based Methods
3. Sparse Multivariate Methods
4. Graphical Models
5. Network Analysis