Big Data in Omics and Imaging · Biology, Second Edition D.S. Jones, M.J. Plank, and B.D. Sleeman...

Big Data in Omics and Imaging

Integrated Analysis and Causal Inference

CHAPMAN & HALL/CRC Mathematical and Computational Biology Series

Aims and scope: This series aims to capture new developments and summarize what is known over the entire spectrum of mathematical and computational biology and medicine. It seeks to encourage the integration of mathematical, statistical, and computational methods into biology by publishing a broad range of textbooks, reference works, and handbooks. The titles included in the series are meant to appeal to students, researchers, and professionals in the mathematical, statistical and computational sciences, fundamental biology and bioengineering, as well as interdisciplinary researchers involved in the field. The inclusion of concrete examples and applications, and programming techniques and examples, is highly encouraged.

Series Editors

N. F. Britton
Department of Mathematical Sciences
University of Bath

Xihong Lin
Department of Biostatistics
Harvard University

Nicola Mulder
University of Cape Town
South Africa

Maria Victoria Schneider
European Bioinformatics Institute

Mona Singh
Department of Computer Science
Princeton University

Proposals for the series should be submitted to one of the series editors above or directly to:
CRC Press, Taylor & Francis Group
3 Park Square, Milton Park
Abingdon, Oxfordshire OX14 4RN
UK

Published Titles

An Introduction to Systems Biology: Design Principles of Biological Circuits
Uri Alon

Glycome Informatics: Methods and Applications
Kiyoko F. Aoki-Kinoshita

Computational Systems Biology of Cancer
Emmanuel Barillot, Laurence Calzone, Philippe Hupé, Jean-Philippe Vert, and Andrei Zinovyev

Python for Bioinformatics, Second Edition
Sebastian Bassi

Quantitative Biology: From Molecular to Cellular Systems
Sebastian Bassi

Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby
Jules J. Berman

Chromatin: Structure, Dynamics, Regulation
Ralf Blossey

Computational Biology: A Statistical Mechanics Perspective
Ralf Blossey

Game-Theoretical Models in Biology
Mark Broom and Jan Rychtár

Computational and Visualization Techniques for Structural Bioinformatics Using Chimera
Forbes J. Burkowski

Structural Bioinformatics: An Algorithmic Approach
Forbes J. Burkowski

Spatial Ecology
Stephen Cantrell, Chris Cosner, and Shigui Ruan

Cell Mechanics: From Single Scale-Based Models to Multiscale Modeling
Arnaud Chauvière, Luigi Preziosi, and Claude Verdier

Bayesian Phylogenetics: Methods, Algorithms, and Applications
Ming-Hui Chen, Lynn Kuo, and Paul O. Lewis

Statistical Methods for QTL Mapping
Zehua Chen

An Introduction to Physical Oncology: How Mechanistic Mathematical Modeling Can Improve Cancer Therapy Outcomes
Vittorio Cristini, Eugene J. Koay, and Zhihui Wang

Normal Mode Analysis: Theory and Applications to Biological and Chemical Systems
Qiang Cui and Ivet Bahar

Kinetic Modelling in Systems Biology
Oleg Demin and Igor Goryanin

Data Analysis Tools for DNA Microarrays
Sorin Draghici

Statistics and Data Analysis for Microarrays Using R and Bioconductor, Second Edition
Sorin Draghici

Computational Neuroscience: A Comprehensive Approach
Jianfeng Feng

Mathematical Models of Plant-Herbivore Interactions
Zhilan Feng and Donald L. DeAngelis

Biological Sequence Analysis Using the SeqAn C++ Library
Andreas Gogol-Döring and Knut Reinert

Gene Expression Studies Using Affymetrix Microarrays
Hinrich Göhlmann and Willem Talloen

Handbook of Hidden Markov Models in Bioinformatics
Martin Gollery

Meta-Analysis and Combining Information in Genetics and Genomics
Rudy Guerra and Darlene R. Goldstein

Differential Equations and Mathematical Biology, Second Edition
D.S. Jones, M.J. Plank, and B.D. Sleeman

Knowledge Discovery in Proteomics
Igor Jurisica and Dennis Wigle

Introduction to Proteins: Structure, Function, and Motion
Amit Kessel and Nir Ben-Tal

RNA-seq Data Analysis: A Practical Approach
Eija Korpelainen, Jarno Tuimala, Panu Somervuo, Mikael Huss, and Garry Wong

Introduction to Mathematical Oncology
Yang Kuang, John D. Nagy, and Steffen E. Eikenberry

Biological Computation
Ehud Lamm and Ron Unger

Optimal Control Applied to Biological Models
Suzanne Lenhart and John T. Workman

Clustering in Bioinformatics and Drug Discovery
John D. MacCuish and Norah E. MacCuish

Spatiotemporal Patterns in Ecology and Epidemiology: Theory, Models, and Simulation
Horst Malchow, Sergei V. Petrovskii, and Ezio Venturino

Stochastic Dynamics for Systems Biology
Christian Mazza and Michel Benaïm

Statistical Modeling and Machine Learning for Molecular Biology
Alan M. Moses

Engineering Genetic Circuits
Chris J. Myers

Pattern Discovery in Bioinformatics: Theory & Algorithms
Laxmi Parida

Exactly Solvable Models of Biological Invasion
Sergei V. Petrovskii and Bai-Lian Li

Computational Hydrodynamics of Capsules and Biological Cells
C. Pozrikidis

Modeling and Simulation of Capsules and Biological Cells
C. Pozrikidis

Cancer Modelling and Simulation
Luigi Preziosi

Computational Exome and Genome Analysis
Peter N. Robinson, Rosario M. Piro, and Marten Jäger

Introduction to Bio-Ontologies
Peter N. Robinson and Sebastian Bauer

Dynamics of Biological Systems
Michael Small

Genome Annotation
Jung Soh, Paul M.K. Gordon, and Christoph W. Sensen

Niche Modeling: Predictions from Statistical Distributions
David Stockwell

Algorithms for Next-Generation Sequencing
Wing-Kin Sung

Algorithms in Bioinformatics: A Practical Introduction
Wing-Kin Sung

Introduction to Bioinformatics
Anna Tramontano

The Ten Most Wanted Solutions in Protein Bioinformatics
Anna Tramontano

Combinatorial Pattern Matching Algorithms in Computational Biology Using Perl and R
Gabriel Valiente

Managing Your Biological Data with Python
Allegra Via, Kristian Rother, and Anna Tramontano

Cancer Systems Biology
Edwin Wang

Stochastic Modelling for Systems Biology, Second Edition
Darren J. Wilkinson

Big Data in Omics and Imaging: Association Analysis
Momiao Xiong

Big Data Analysis for Bioinformatics and Biomedical Discoveries
Shui Qing Ye

Bioinformatics: A Practical Approach
Shui Qing Ye

Introduction to Computational Proteomics
Golan Yona

Big Data in Omics and Imaging: Integrated Analysis and Causal Inference
Momiao Xiong

http://taylorandfrancis.com

Big Data in Omics and Imaging

Integrated Analysis and Causal Inference

Momiao Xiong

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-0-8153-8710-7 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com

and the CRC Press Web site at http://www.crcpress.com

To Ping

Contents

Preface
Author

1. Genotype–Phenotype Network Analysis
  1.1 Undirected Graphs for Genotype Network
    1.1.1 Gaussian Graphical Model
    1.1.2 Alternating Direction Method of Multipliers for Estimation of Gaussian Graphical Model
    1.1.3 Coordinate Descent Algorithm and Graphical Lasso
    1.1.4 Multiple Graphical Models
      1.1.4.1 Edge-Based Joint Estimation of Multiple Graphical Models
      1.1.4.2 Node-Based Joint Estimation of Multiple Graphical Models
  1.2 Directed Graphs and Structural Equation Models for Networks
    1.2.1 Directed Acyclic Graphs
    1.2.2 Linear Structural Equation Models
    1.2.3 Estimation Methods
      1.2.3.1 Maximum Likelihood (ML) Estimation
      1.2.3.2 Two-Stage Least Squares Method
      1.2.3.3 Three-Stage Least Squares Method
  1.3 Sparse Linear Structural Equations
    1.3.1 L1-Penalized Maximum Likelihood Estimation
    1.3.2 L1-Penalized Two-Stage Least Square Estimation
    1.3.3 L1-Penalized Three-Stage Least Square Estimation
  1.4 Functional Structural Equation Models for Genotype–Phenotype Networks
    1.4.1 Functional Structural Equation Models
    1.4.2 Group Lasso and ADMM for Parameter Estimation in the Functional Structural Equation Models
  1.5 Causal Calculus
    1.5.1 Effect Decomposition and Estimation
    1.5.2 Graphical Tools for Causal Inference in Linear SEMs
      1.5.2.1 Basics
      1.5.2.2 Wright’s Rules of Tracing and Path Analysis

      1.5.2.3 Partial Correlation, Regression, and Path Analysis
      1.5.2.4 Conditional Independence and D-Separation
    1.5.3 Identification and Single-Door Criterion
    1.5.4 Instrument Variables
    1.5.5 Total Effects and Backdoor Criterion
    1.5.6 Counterfactuals and Linear SEMs
  1.6 Simulations and Real Data Analysis
    1.6.1 Simulations for Model Evaluation
    1.6.2 Application to Real Data Examples
  Appendix 1.A
  Appendix 1.B
  Exercises

2. Causal Analysis and Network Biology
  2.1 Bayesian Networks as a General Framework for Causal Inference
  2.2 Parameter Estimation and Bayesian Dirichlet Equivalent Uniform Score for Discrete Bayesian Networks
  2.3 Structural Equations and Score Metrics for Continuous Causal Networks
    2.3.1 Multivariate SEMs for Generating Node Core Metrics
    2.3.2 Mixed SEMs for Pedigree-Based Causal Inference
      2.3.2.1 Mixed SEMs
      2.3.2.2 Two-Stage Estimate for the Fixed Effects in the Mixed SEMs
      2.3.2.3 Three-Stage Estimate for the Fixed Effects in the Mixed SEMs
      2.3.2.4 The Full Information Maximum Likelihood Method
      2.3.2.5 Reduced Form Representation of the Mixed SEMs
  2.4 Bayesian Networks with Discrete and Continuous Variables
    2.4.1 Two-Class Network Penalized Logistic Regression for Learning Hybrid Bayesian Networks
    2.4.2 Multiple Network Penalized Functional Logistic Regression Models for NGS Data
    2.4.3 Multi-Class Network Penalized Logistic Regression for Learning Hybrid Bayesian Networks
  2.5 Other Statistical Models for Quantifying Node Score Function
    2.5.1 Nonlinear Structural Equation Models
      2.5.1.1 Nonlinear Additive Noise Models for Bivariate Causal Discovery
      2.5.1.2 Nonlinear Structural Equations for Causal Network Discovery

    2.5.2 Mixed Linear and Nonlinear Structural Equation Models
    2.5.3 Jointly Interventional and Observational Data for Causal Inference
      2.5.3.1 Structural Equation Model for Interventional and Observational Data
      2.5.3.2 Maximum Likelihood Estimation of Structural Equation Models from Interventional and Observational Data
      2.5.3.3 Sparse Structural Equation Models with Joint Interventional and Observational Data
  2.6 Integer Programming for Causal Structure Learning
    2.6.1 Introduction
    2.6.2 Integer Linear Programming Formulation of DAG Learning
    2.6.3 Cutting Plane for Integer Linear Programming
    2.6.4 Branch-and-Cut Algorithm for Integer Linear Programming
    2.6.5 Sink Finding Primal Heuristic Algorithm
  2.7 Simulations and Real Data Analysis
    2.7.1 Simulations
    2.7.2 Real Data Analysis
  Software Package
  Appendix 2.A Introduction to Smoothing Splines
  Appendix 2.B Penalized Likelihood Function for Jointly Observational and Interventional Data
  Exercises

3. Wearable Computing and Genetic Analysis of Function-Valued Traits
  3.1 Classification of Wearable Biosensor Data
    3.1.1 Introduction
    3.1.2 Functional Data Analysis for Classification of Time Course Wearable Biosensor Data
    3.1.3 Differential Equations for Extracting Features of the Dynamic Process and for Classification of Time Course Data
      3.1.3.1 Differential Equations with Constant and Time-Varying Parameters for Modeling a Dynamic System
      3.1.3.2 Principal Differential Analysis for Estimation of Parameters in Differential Equations
      3.1.3.3 QRS Complex Example

    3.1.4 Deep Learning for Physiological Time Series Data Analysis
      3.1.4.1 Procedures of Convolutional Neural Networks for Time Course Data Analysis
      3.1.4.2 Convolution is a Powerful Tool for Linear Filter and Signal Processing
      3.1.4.3 Architecture of CNNs
      3.1.4.4 Convolutional Layer
      3.1.4.5 Parameter Estimation
  3.2 Association Studies of Function-Valued Traits
    3.2.1 Introduction
    3.2.2 Functional Linear Models with Both Functional Response and Predictors for Association Analysis of Function-Valued Traits
    3.2.3 Test Statistics
    3.2.4 Null Distribution of Test Statistics
    3.2.5 Power
    3.2.6 Real Data Analysis
    3.2.7 Association Analysis of Multiple Function-Valued Traits
  3.3 Gene–Gene Interaction Analysis of Function-Valued Traits
    3.3.1 Introduction
    3.3.2 Functional Regression Models
    3.3.3 Estimation of Interaction Effect Function
    3.3.4 Test Statistics
    3.3.5 Simulations
      3.3.5.1 Type 1 Error Rates
      3.3.5.2 Power
    3.3.6 Real Data Analysis
  Appendix 3.A Gradient Methods for Parameter Estimation in the Convolutional Neural Networks
  Exercises

4. RNA-Seq Data Analysis
  4.1 Normalization Methods on RNA-Seq Data Analysis
    4.1.1 Gene Expression
    4.1.2 RNA Sequencing Expression Profiling
    4.1.3 Methods for Normalization
      4.1.3.1 Total Read Count Normalization
      4.1.3.2 Upper Quantile Normalization
      4.1.3.3 Relative Log Expression (RLE)
      4.1.3.4 Trimmed Mean of M-Values (TMM)
      4.1.3.5 RPKM, FPKM, and TPM

      4.1.3.6 Isoform Expression Quantification
      4.1.3.7 Allele-Specific Expression Estimation from RNA-Seq Data with Diploid Genomes
  4.2 Differential Expression Analysis for RNA-Seq Data
    4.2.1 Distribution-Based Approach to Differential Expression Analysis
      4.2.1.1 Poisson Distribution
      4.2.1.2 Negative Binomial Distribution
    4.2.2 Functional Expansion Approach to Differential Expression Analysis of RNA-Seq Data
      4.2.2.1 Functional Principal Component Expansion of RNA-Seq Data
    4.2.3 Differential Analysis of Allele Specific Expressions with RNA-Seq Data
      4.2.3.1 Single-Variate FPCA for Testing ASE or Differential Expression
      4.2.3.2 Allele-Specific Differential Expression by Bivariate Functional Principal Component Analysis
      4.2.3.3 Real Data Application
  4.3 eQTL and eQTL Epistasis Analysis with RNA-Seq Data
    4.3.1 Matrix Factorization
    4.3.2 Quadratically Regularized Matrix Factorization and Canonical Correlation Analysis
    4.3.3 QRFCCA for eQTL and eQTL Epistasis Analysis of RNA-Seq Data
      4.3.3.1 QRFCCA for eQTL Analysis
      4.3.3.2 Data Structure for Interaction Analysis
      4.3.3.3 Multivariate Regression
      4.3.3.4 CCA for Epistasis Analysis
    4.3.4 Real Data Analysis
      4.3.4.1 RNA-Seq Data and NGS Data
      4.3.4.2 Cis-Trans Interactions
  4.4 Gene Co-Expression Network and Gene Regulatory Networks
    4.4.1 Co-Expression Network Construction with RNA-Seq Data by CCA and FCCA
      4.4.1.1 CCA Methods for Construction of Gene Co-Expression Networks
      4.4.1.2 Bivariate CCA for Construction of Co-Expression Networks with ASE Data
    4.4.2 Graphical Gaussian Models
    4.4.3 Real Data Applications

  4.5 Directed Graph and Gene Regulatory Networks
    4.5.1 General Procedures for Inferring Genome-Wide Regulatory Networks
    4.5.2 Hierarchical Bayesian Networks for Whole Genome Regulatory Networks
      4.5.2.1 Summary Statistics for Representation of Groups of Gene Expressions
      4.5.2.2 Low Rank Presentation Induced Causal Network
    4.5.3 Linear Regulatory Networks
    4.5.4 Nonlinear Regulatory Networks
  4.6 Dynamic Bayesian Network and Longitudinal Expression Data Analysis
    4.6.1 Dynamic Structural Equation Models with Time-Varying Structures and Parameters
    4.6.2 Estimation and Inference for Dynamic Structural Equation Models with Time-Varying Structures and Parameters
      4.6.2.1 Maximum Likelihood (ML) Estimation
      4.6.2.2 Generalized Least Square Estimation
    4.6.3 Sparse Dynamic Structural Equation Models
      4.6.3.1 L1-Penalized Maximum Likelihood Estimation
      4.6.3.2 L1-Penalized Generalized Least Square Estimator
  4.7 Single Cell RNA-Seq Data Analysis, Gene Expression Deconvolution, and Genetic Screening
    4.7.1 Cell Type Identification
    4.7.2 Gene Expression Deconvolution and Cell Type-Specific Expression
      4.7.2.1 Gene Expression Deconvolution Formulation
      4.7.2.2 Loss Functions and Regularization
      4.7.2.3 Algorithms for Fitting Generalized Low Rank Models
  Software Package
  Appendix 4.A Variational Bayesian Theory for Parameter Estimation and RNA-Seq Normalization
  Appendix 4.B Log-linear Model for Differential Expression Analysis of the RNA-Seq Data with Negative Binomial Distribution
  Appendix 4.C Derivation of ADMM Algorithm
  Appendix 4.D Low Rank Representation Induced Sparse Structural Equation Models

  Appendix 4.E Maximum Likelihood (ML) Estimation of Parameters for Dynamic Structural Equation Models
  Appendix 4.F Generalized Least Squares Estimator of the Parameters in Dynamic Structural Equation Models
  Appendix 4.G Proximal Algorithm for L1-Penalized Maximum Likelihood Estimation of Dynamic Structural Equation Model
  Appendix 4.H Proximal Algorithm for L1-Penalized Generalized Least Square Estimation of Parameters in the Dynamic Structural Equation Models
  Appendix 4.I Multikernel Learning and Spectral Clustering for Cell Type Identification
  Exercises

5. Methylation Data Analysis
  5.1 DNA Methylation Analysis
  5.2 Epigenome-Wide Association Studies (EWAS)
    5.2.1 Single-Locus Test
    5.2.2 Set-Based Methods
      5.2.2.1 Logistic Regression Model
      5.2.2.2 Generalized T2 Test Statistic
      5.2.2.3 PCA
      5.2.2.4 Sequencing Kernel Association Test (SKAT)
      5.2.2.5 Canonical Correlation Analysis
  5.3 Epigenome-Wide Causal Studies
    5.3.1 Introduction
    5.3.2 Additive Functional Model for EWCS
      5.3.2.1 Mathematical Formulation of EWCS
      5.3.2.2 Parameter Estimation
      5.3.2.3 Test for Independence
      5.3.2.4 Test Statistics for Epigenome-Wide Causal Studies
  5.4 Genome-Wide DNA Methylation Quantitative Trait Locus (mQTL) Analysis
    5.4.1 Simple Regression Model
    5.4.2 Multiple Regression Model
    5.4.3 Multivariate Regression Model
    5.4.4 Multivariate Multiple Regression Model
    5.4.5 Functional Linear Models for mQTL Analysis with Whole Genome Sequencing (WGS) Data
    5.4.6 Functional Linear Models with Both Functional Response and Predictors for mQTL Analysis with Both WGBS and WGS Data


5.5 Causal Networks for Genetic-Methylation Analysis
5.5.1 Structural Equation Models with Scalar Endogenous Variables and Functional Exogenous Variables
5.5.1.1 Models
5.5.1.2 The Two-Stage Least Squares Estimator
5.5.1.3 Sparse FSEMs
5.5.2 Functional Structural Equation Models with Functional Endogenous Variables and Scalar Exogenous Variables (FSEMs)
5.5.2.1 Models
5.5.2.2 The Two-Stage Least Squares Estimator
5.5.2.3 Sparse FSEMs
5.5.3 Functional Structural Equation Models with Both Functional Endogenous Variables and Exogenous Variables (FSEMF)
5.5.3.1 Model
5.5.3.2 Sparse FSEMF for the Estimation of Genotype-Methylation Networks with Sequencing Data
Software Package
Appendix 5.A Biased and Unbiased Estimators of the HSIC
Appendix 5.B Asymptotic Null Distribution of Block-Based HSIC
Exercises

6. Imaging and Genomics
6.1 Introduction
6.2 Image Segmentation
6.2.1 Unsupervised Learning Methods for Image Segmentation
6.2.1.1 Nonnegative Matrix Factorization
6.2.1.2 Autoencoders
6.2.1.3 Parameter Estimation of Autoencoders
6.2.1.4 Convolutional Neural Networks
6.2.2 Supervised Deep Learning Methods for Image Segmentation
6.2.2.1 Pixel-Level Image Segmentation
6.2.2.2 Deconvolution Network for Semantic Segmentation
6.3 Two- or Three-Dimensional Functional Principal Component Analysis for Image Data Reduction
6.3.1 Formulation
6.3.2 Integral Equation and Eigenfunctions
6.3.3 Computations for the Function Principal Component Function and the Function Principal Component Score


6.4 Association Analysis of Imaging-Genomic Data
6.4.1 Multivariate Functional Regression Models for Imaging-Genomic Data Analysis
6.4.1.1 Model
6.4.1.2 Estimation of Additive Effects
6.4.1.3 Test Statistics
6.4.2 Multivariate Functional Regression Models for Longitudinal Imaging Genetics Analysis
6.4.3 Quadratically Regularized Functional Canonical Correlation Analysis for Gene–Gene Interaction Detection in Imaging Genetic Studies
6.4.3.1 Single Image Summary Measure
6.4.3.2 Multiple Image Summary Measures
6.4.3.3 CCA and Functional CCA for Interaction Analysis
6.5 Causal Analysis of Imaging-Genomic Data
6.5.1 Sparse SEMs for Joint Causal Analysis of Structural Imaging and Genomic Data
6.5.2 Sparse Functional Structural Equation Models for Phenotype and Genotype Networks
6.5.3 Conditional Gaussian Graphical Models (CGGMs) for Structural Imaging and Genomic Data Analysis
6.6 Time Series SEMs for Integrated Causal Analysis of fMRI and Genomic Data
6.6.1 Models
6.6.2 Reduced Form Equations
6.6.3 Single Equation and Generalized Least Square Estimator
6.6.4 Sparse SEMs and Alternating Direction Method of Multipliers
6.7 Causal Machine Learning
Software Package
Appendix 6.A Factor Graphs and Mean Field Methods for Prediction of Marginal Distribution
Exercises

7. From Association Analysis to Integrated Causal Inference
7.1 Genome-Wide Causal Studies
7.1.1 Mathematical Formulation of Causal Analysis
7.1.2 Basic Causal Assumptions
7.1.3 Linear Additive SEMs with Non-Gaussian Noise
7.1.4 Information Geometry Approach
7.1.4.1 Basics of Information Geometry
7.1.4.2 Formulation of Causal Inference in Information Geometry


7.1.4.3 Generalization
7.1.4.4 Information Geometry for Causal Inference
7.1.4.5 Information Geometry-Based Causal Inference Methods
7.1.5 Causal Inference on Discrete Data
7.1.5.1 Distance Correlation
7.1.5.2 Properties of Distance Correlation and Test Statistics
7.1.5.3 Distance Correlation for Causal Inference
7.1.5.4 Additive Noise Models for Causal Inference on Discrete Data
7.2 Multivariate Causal Inference and Causal Networks
7.2.1 Markov Condition, Markov Equivalence, Faithfulness, and Minimality
7.2.2 Multilevel Causal Networks for Integrative Omics and Imaging Data Analysis
7.2.2.1 Introduction
7.2.2.2 Additive Noise Models for Multiple Causal Networks
7.2.2.3 Integer Programming as a General Framework for Joint Estimation of Multiple Causal Networks
7.3 Causal Inference with Confounders
7.3.1 Causal Sufficiency
7.3.2 Instrumental Variables
7.3.3 Confounders with Additive Noise Models
7.3.3.1 Models
7.3.3.2 Methods for Searching Common Confounder
7.3.3.3 Gaussian Process Regression
7.3.3.4 Algorithm for Confounder Identification Using Additive Noise Models for Confounder
Software Package
Appendix 7.A Approximation of Log-Likelihood Ratio for the LiNGAM
Appendix 7.B Orthogonality Conditions and Covariance
Appendix 7.C Equivalent Formulations Orthogonality Conditions
Appendix 7.D M–L Distance in Backward Direction
Appendix 7.E Multiplicativity of Traces
Appendix 7.F Anisotropy and K–L Distance


Appendix 7.G Trace Method for Noise Linear Model
Appendix 7.H Characterization of Association
Appendix 7.I Algorithm for Sparse Trace Method
Appendix 7.J Derivation of the Distribution of the Prediction in the Bayesian Linear Models
Exercises

References
Index


Preface

Despite significant progress in dissecting the genetic architecture of complex diseases by association analysis, the etiology and mechanisms of complex diseases remain elusive. Significant findings from association analysis have often lacked consistency and proved controversial. The current approach to genomic analysis lacks breadth (the number of variables analyzed at a time) and depth (the number of steps taken by the genetic variants to reach the clinical outcomes across genomic and molecular levels), and its paradigm is association and correlation analysis. Next-generation genomic, epigenomic, sensing, and imaging technologies are producing ever deeper multiple omic, physiological, imaging, environmental, and phenotypic data. Causal inference on these data is a cornerstone of scientific discovery and an essential component in discovering the mechanisms of disease. It is time to shift the current paradigm of genetic analysis from shallow association analysis to deep causal inference, and from genetic analysis alone to integrated genomic, epigenomic, imaging, and phenotypic data analysis, for unraveling the mechanism of psychiatric disorders.

This book is a natural extension of the book Big Data in Omics and Imaging: Association Analysis. The focus of this book is integrated genomic, epigenomic, and imaging data analysis and causal inference. To make the paradigm shift feasible, this book will (1) develop novel or apply existing causal inference methods for genome-wide and epigenome-wide causal studies of complex diseases; (2) develop unified frameworks for systematic causal analysis of integrated genomic, epigenomic, image, and clinical phenotype data, and for inferring multilevel omic and image causal networks that lead to the discovery of paths from genetic variants to the disease via multiple omic and image causal networks; (3) develop novel and apply existing methods for gene expression and methylation deconvolution, and develop novel methods for inferring cell-specific multiple omic causal networks; and (4) introduce deep learning for genomic, epigenomic, and imaging data analysis and develop methods for combining deep learning with causal inference.

This book is organized into seven chapters. The following is a description of each chapter. Chapter 1, “Genotype–Phenotype Network Analysis,” studies directed and undirected genotype–phenotype networks, which are major topics of causal inference. Efficient genetic analysis consists of two major parts: (1) breadth (the number of phenotypes that the genetic variants affect) and (2) depth (the number of steps taken by the genetic variants to reach the clinical outcomes). Causal inference theory and chain graph models provide an innovative analytic platform for deep and precise multilevel hybrid causal genotype–disease network analysis. Very few


genetic and epigenetic textbooks cover causal inference theory in depth; therefore, Chapter 1 and Chapter 2 will provide solid knowledge and efficient tools for causal inference in genomic and epigenomic analysis. Chapter 1 includes (1) undirected graphs for genotype networks, (2) the alternating direction method of multipliers for estimation of Gaussian graphical models, (3) the coordinate descent algorithm and graphical Lasso, (4) multiple graphical models, (5) directed graphs and structural equation models for networks, (6) sparse linear structural equations, (7) functional structural equation models for genotype–phenotype networks with next-generation sequencing data, and (8) effect decomposition and estimation.

Chapter 2, “Causal Analysis and Network Biology,” covers (1) Bayesian networks as a general framework for causal inference, (2) structural equations and score metrics for continuous causal networks, (3) network penalized logistic regression for learning hybrid Bayesian networks, (4) statistical methods for pedigree-based causal inference, (5) nonlinear structural equation models, (6) mixed linear and nonlinear structural equation models, (7) jointly interventional and observational data for causal inference, and (8) integer programming for causal structure learning.

Chapter 3, “Wearable Computing and Genetic Analysis of Function-Valued Traits,” studies the genetics of function-valued traits. Early detection of diseases and health monitoring are primary goals of health care and disease management. Physiological traits such as ECG, EEG, SCG, EMG, MEG, and oxygen saturation levels provide important information on the health status of humans and can be used to monitor and diagnose diseases. Wearable sensors capable of noninvasive and continuous personal health monitoring will not only measure health parameters of individuals at rest, but also generate signals of transient events that may be of profound prognostic or therapeutic importance. These physiological traits are function-valued traits. Analysis of genomic and spatiotemporal physiological data can reveal the holistic genetic structure of disease, but also poses great methodological and computational challenges. There is a lack of statistical methods for genetic analysis of function-valued traits in the literature. In this chapter, we propose novel statistical methods for genetic analysis of physiological traits. Chapter 3 covers wearable computing for automated disease diagnosis and real-time health care monitoring, deep learning for physiological time series data analysis, functional linear models with both functional response and functional predictors for association analysis of physiological traits with next-generation sequencing data, mixed functional linear models with functional response for family-based genetic analysis of physiological traits, functional regression models with both functional response and functional predictors for gene–gene interaction analysis, and functional canonical correlation analysis for association studies of physiological traits.

Chapter 4, “RNA-Seq Data Analysis,” covers (1) data normalization and preprocessing, (2) functional principal component analysis test for differential expression analysis with RNA-seq or miRNA-seq data, (3) multivariate


functional principal component analysis for allele-specific expression analysis, (4) eQTL and eQTL epistasis analysis with RNA-seq data, (5) co-expression networks, (6) linear and nonlinear regulatory networks, (7) gene expression imputation, (8) genotype–expression regulatory networks, (9) dynamic Bayesian networks and longitudinal expression data analysis, and (10) single cell RNA-seq data analysis, gene expression deconvolution, and genetic screening.

Chapter 5, “Methylation Data Analysis,” discusses methylation data analysis. The statistical methods for differential gene expression, eQTL analysis, and genotype–expression regulatory networks can be easily extended to methylation data analysis. Epigenome-wide causal studies, a new concept for epigenetic analysis, are first introduced in this chapter. In addition to these analyses, Chapter 5 will put emphasis on inference on whole genome methylation and expression causal networks. Since both gene expression and methylation data involve more than 20,000 genes, it is impossible to construct a causal network with more than 40,000 nodes. Therefore, multiple-level methylation-expression networks should be designed. Chapter 5 addresses three essential issues in the estimation of multiple-level methylation-expression networks: (1) a low-rank model for representation of either gene expression or methylation in a pathway or a cluster, (2) construction of methylation and expression networks using the low-rank model representation of methylation and gene expression in the pathways or clusters, and (3) construction of methylation and gene expression causal networks using original methylation and gene expression values in the locally connected pathways or clusters. Chapter 5 also investigates which cell types' methylation regulates gene expression in which cell types. This chapter presents several novel approaches to methylation and gene expression analysis.

Chapter 6, “Imaging and Genomics,” focuses on imaging signal processing, automatic image diagnosis, and genetic-imaging data analysis. There is increasing interest in statistical methods and computational algorithms to analyze high-dimensional, spatially correlated, and complex imaging data, together with clinical and genetic data, for disease diagnosis, management, and disease mechanism research. This chapter covers (1) deep learning for medical image semantic segmentation, (2) three-dimensional functional principal component analysis for imaging signal extraction, (3) imaging network construction and connectivity analysis, (4) causal machine learning for automated imaging diagnosis of disease, (5) multiple functional linear models for imaging genetics analysis with next-generation sequencing data, (6) quadratically regularized functional canonical correlation analysis for imaging genetics or imaging RNA-seq data analysis, (7) causal analysis for imaging genetics and imaging RNA-seq data analysis, (8) time series structural equation models for integrated causal analysis of fMRI and genomic data, and (9) causal machine learning.

Chapter 7, “From Association Analysis to Integrated Causal Inference,” will develop novel statistical methods for genome-wide causal studies and investigate integrated genomic, epigenomic, imaging, and multiple phenotype


data analysis. Chapter 7 presents the mathematical formulation of causal analysis and discusses principles underlying causation. The criteria for distinguishing causation tests from association tests are also introduced in Chapter 7. In genomic and epigenomic data analysis, we usually consider four types of associations: association of discrete variables with continuous variables, continuous variables with continuous variables, discrete variables with a binary trait, and continuous variables with a binary trait (disease status). These four types of association analyses are extended to four types of causation analyses in this chapter. Chapter 7 also covers several powerful tools for causal inference, including additive noise models, information geometry, trace methods, Haar measure, and distance correlation. There are multiple steps between genes and phenotypes. Only a broad and deep search of the enormous path space connecting genetic variants to the clinical outcomes allows us to uncover the mechanism of diseases. Precision medicine demands deep, systematic, comprehensive, and precise analysis of genotype–phenotype relationships: “and the deeper you go, the more you know.” Chapter 7 proposes to use causal inference theory to develop an innovative analytic platform for deep and precise multilevel hybrid causal genotype–disease network analysis, which integrates gene association subnetworks, environment subnetworks, gene regulatory subnetworks, causal genetic-methylation subnetworks, methylation-gene expression networks, genotype–gene expression-imaging subnetworks, the intermediate phenotype subnetworks, and multiple disease subnetworks into a single connected multilevel genotype–disease network to reveal the deep causal chain of mechanisms underlying the disease. Chapter 7 also covers causal inference with confounders.

Overall, this book introduces state-of-the-art studies and practical achievements in causal inference, deep learning, and genomic, epigenomic, imaging, and multiple phenotype data analysis. This book lays the foundation and provides analytic platforms for further research in this challenging and rapidly changing field. The expectation is that the concepts, statistical methods, computational algorithms, and analytic platforms presented in the book will facilitate the training of next-generation statistical geneticists, bioinformaticians, and computational biologists.

I would like to thank Sara A. Barton for editing the book. I am deeply grateful to my colleagues and collaborators Li Jin, Eric Boerwinkle, and others whom I have worked with for many years. I would especially like to thank my former and current students and postdoctoral fellows for their strong dedication to the research and scientific contributions to the book: Jinying Zhao, Li Luo, Shenying Fang, Nan Lin, Rong Jiao, Zixin Hu, Panpan Wang, Kelin Xu, Dan Xie, Xiangzhong Fang, Jun Li, Shicheng Guo, Shengjun Hong, Pengfei Hu, Tao Xu, Wenjia Peng, Xuesen Wu, Yun Zhu, Dung-Yang Lee, Lerong Li, Getie A. Zewdie, Long Ma, Hua Dong, Futao Zhang, and Hoicheong Siu. Finally, I must thank my editor, David Grubbs, for his encouragement and patience during the process of creating this book.


MATLAB® is a registered trademark of The MathWorks, Inc. For product information, please contact:

The MathWorks, Inc.
3 Apple Hill Drive
Natick, MA 01760-2098 USA
Tel: 508-647-7000
Fax: 508-647-7001
Email: [email protected]
Web: www.mathworks.com


Author

Momiao Xiong is a professor in the Department of Biostatistics and Data Science, University of Texas School of Public Health; a regular member in the Genetics & Epigenetics (G&E) Graduate Program at The University of Texas MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences; and a distinguished professor in the School of Life Sciences, Fudan University, China.
