Using prior knowledge from cellular pathways and molecular...

13
Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification Enrico Glaab Corresponding author. Enrico Glaab, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 7 Avenue des Hauts Fourneaux, L-4362 Esch-sur-Alzette, Luxembourg. Tel.: þ352 4666 446186; Fax: þ352 4666 446949; E-mail: [email protected] Abstract For many complex diseases, an earlier and more reliable diagnosis is considered a key prerequisite for developing more effective therapies to prevent or delay disease progression. Classical statistical learning approaches for specimen classifica- tion using omics data, however, often cannot provide diagnostic models with sufficient accuracy and robustness for heterogeneous diseases like cancers or neurodegenerative disorders. In recent years, new approaches for building multivariate biomarker models on omics data have been proposed, which exploit prior biological knowledge from molecular networks and cellular pathways to address these limitations. This survey provides an overview of these recent developments and compares pathway- and network-based specimen classification approaches in terms of their utility for improving model robustness, accuracy and biological interpretability. Different routes to translate omics-based multifactor- ial biomarker models into clinical diagnostic tests are discussed, and a previous study is presented as example. Key words: biomarker modeling; pathway analysis; network analysis; machine learning; cross-study analysis Introduction In spite of the remarkable advances in biomedicine over recent decades, for a wide range of common, systemic and chronic dis- eases, precise molecular markers for early diagnosis are not yet available. In particular, for some of the most prevalent neuro- logic disorders (e.g. Alzheimer’s [1] and Parkinson’s disease [2]) and cancer types (e.g. colorectal cancer [3] and lung cancer [4]), diagnostic tools are often only applicable in the late symptom- atic stages of the disease, provide insufficient sensitivity or spe- cificity or fail to detect clinically relevant disease subtypes. As a consequence, inadequate therapies may be prescribed, treat- ment could often start too late to prevent irreparable damage and the development of new and more effective therapies for early intervention is hampered by misdiagnosed individuals in the study cohorts. While alternative diagnostic approaches can- not be expected to fully address all of these issues, improve- ments in only one of these aspects could already have significant benefits for patients. Even for infectious diseases with relatively unambiguous symptoms, new diagnostic tools may be required because during the incubation period a clear distinction between contagious and uninfected individuals is usually not possible (e.g. this has been a major issue during the recent Ebola virus disease outbreak in Africa [5]). Classical molecular diagnostic tests measure the abundance of only a single biomolecule, e.g. to diagnose prostate cancer, plasma levels of the prostate-specific antigen (PSA) are com- monly assessed. Although these approaches have advantages in terms of simplicity and costs, for complex and heterogeneous diseases, single-molecule biomarkers tend to provide insuffi- cient accuracy and robustness across different patient cohorts (even the widely accepted PSA test for prostate cancer has a lim- ited sensitivity of 72.1% according to a meta-analysis [6]). Building multivariate biomarker models derived from high- throughput omics measurements, e.g. using DNA or protein Enrico Glaab is a research associate in bioinformatics at the Luxembourg Centre for Systems Biomedicine. He develops and applies machine learning approaches to analyze disease-related omics data, with a focus on pathway- and network-based sample classification methods. Submitted: 31 March 2015; Received (in revised form): 11 June 2015 V C The Author 2015. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 1 Briefings in Bioinformatics, 2015, 1–13 doi: 10.1093/bib/bbv044 Paper Briefings in Bioinformatics Advance Access published July 2, 2015 at University of Luxembourg on August 28, 2015 http://bib.oxfordjournals.org/ Downloaded from

Transcript of Using prior knowledge from cellular pathways and molecular...

Using prior knowledge from cellular pathways and

molecular networks for diagnostic specimen

classificationEnrico GlaabCorresponding author Enrico Glaab Luxembourg Centre for Systems Biomedicine (LCSB) University of Luxembourg 7 Avenue des Hauts FourneauxL-4362 Esch-sur-Alzette Luxembourg Tel thorn352 4666 446186 Fax thorn352 4666 446949 E-mail enricoglaabunilu

Abstract

For many complex diseases an earlier and more reliable diagnosis is considered a key prerequisite for developing moreeffective therapies to prevent or delay disease progression Classical statistical learning approaches for specimen classifica-tion using omics data however often cannot provide diagnostic models with sufficient accuracy and robustness forheterogeneous diseases like cancers or neurodegenerative disorders In recent years new approaches for buildingmultivariate biomarker models on omics data have been proposed which exploit prior biological knowledge from molecularnetworks and cellular pathways to address these limitations This survey provides an overview of these recentdevelopments and compares pathway- and network-based specimen classification approaches in terms of their utility forimproving model robustness accuracy and biological interpretability Different routes to translate omics-based multifactor-ial biomarker models into clinical diagnostic tests are discussed and a previous study is presented as example

Key words biomarker modeling pathway analysis network analysis machine learning cross-study analysis

Introduction

In spite of the remarkable advances in biomedicine over recentdecades for a wide range of common systemic and chronic dis-eases precise molecular markers for early diagnosis are not yetavailable In particular for some of the most prevalent neuro-logic disorders (eg Alzheimerrsquos [1] and Parkinsonrsquos disease [2])and cancer types (eg colorectal cancer [3] and lung cancer [4])diagnostic tools are often only applicable in the late symptom-atic stages of the disease provide insufficient sensitivity or spe-cificity or fail to detect clinically relevant disease subtypes As aconsequence inadequate therapies may be prescribed treat-ment could often start too late to prevent irreparable damageand the development of new and more effective therapies forearly intervention is hampered by misdiagnosed individuals inthe study cohorts While alternative diagnostic approaches can-not be expected to fully address all of these issues improve-ments in only one of these aspects could already have

significant benefits for patients Even for infectious diseaseswith relatively unambiguous symptoms new diagnostic toolsmay be required because during the incubation period a cleardistinction between contagious and uninfected individuals isusually not possible (eg this has been a major issue during therecent Ebola virus disease outbreak in Africa [5])

Classical molecular diagnostic tests measure the abundanceof only a single biomolecule eg to diagnose prostate cancerplasma levels of the prostate-specific antigen (PSA) are com-monly assessed Although these approaches have advantagesin terms of simplicity and costs for complex and heterogeneousdiseases single-molecule biomarkers tend to provide insuffi-cient accuracy and robustness across different patient cohorts(even the widely accepted PSA test for prostate cancer has a lim-ited sensitivity of 721 according to a meta-analysis [6])Building multivariate biomarker models derived from high-throughput omics measurements eg using DNA or protein

Enrico Glaab is a research associate in bioinformatics at the Luxembourg Centre for Systems Biomedicine He develops and applies machine learningapproaches to analyze disease-related omics data with a focus on pathway- and network-based sample classification methodsSubmitted 31 March 2015 Received (in revised form) 11 June 2015

VC The Author 2015 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40)which permits unrestricted reuse distribution and reproduction in any medium provided the original work is properly cited

1

Briefings in Bioinformatics 2015 1ndash13

doi 101093bibbbv044Paper

Briefings in Bioinformatics Advance Access published July 2 2015 at U

niversity of Luxem

bourg on August 28 2015

httpbiboxfordjournalsorgD

ownloaded from

microarrays would in principle offer the potential to accountfor diverse subtypes or facets of diseases provided that thestudied disorder is characterized by heterogeneous molecularmanifestations However limitations of these high-throughputtechniques (eg random noise and systematic biases lack of ro-bustness and reproducibility) and statistical limitations in theanalysis of high-dimensional data (eg the multiple testingproblem [7] feature redundancy and interdependence [8 9] andissues summarized as the lsquocurse of dimensionalityrsquo [10 11])often prevent the discovery of genuine and reproducible pat-terns for diagnostic specimen classification

In recent years new bioinformatics approaches havetherefore been proposed which involve the integration of priorbiological knowledge from cellular pathways and molecular net-works into the model-building procedure These approachesmake use of prior observations according to which differencesin pathway activity between diverse biological conditions tend tobe more stable than differences in individual gene activity [12]and disease-associated biomolecular changes tend to be coordi-nated and display grouping patterns within molecular subnet-works [13] The main rationale is that by exploiting these patternsas prior information when building prediction models spuriousdiscriminative structures in the data can be recognized and morerobust and coherent patterns at the pathway- and network-levelare selected instead of predictive features Although a wide rangeof corresponding machine learning methods is already availablethe benefits and limitations of different approaches are not al-ways clear and an overview on which strategies are adequate forwhich specific goals or applications is lacking

Here a review of recent network- and pathway-based statis-tical learning approaches for high-throughput omics data setanalysis is provided including a critical discussion of limita-tions and the added value of different methodologies forimproving model robustness accuracy and biologicalinterpretability Because not all diseases are characterized byaberrant molecular signatures the application focus is on diag-nostic approaches for diseases with complex molecular mani-festations First methods using information from publiccellular pathway definitions will be discussed then approachesexploiting prior knowledge from genome-scale molecularnetworks Next common limitations of pathway- and network-guided predictive modeling are pointed out as well asshortcomings related to the biospecimen origin processing andstorage and possible techniques to address or alleviate limita-tions in data analysis Finally the potential for translatingomics-derived multivariate sample classification models intochip-based diagnostic devices is discussed and a previoussuccess story is presented

Pathway-based biomarker modeling

Cellular pathway definitions from manually curated databasesincluding KEGG [14] BioCarta [15] WikiPathways [16] Reactome[17] and PID [18] are frequently used knowledge sources for bio-logical data interpretation These pathways are typically de-signed in a subjective manner often using incompleteinformation and inconsistent criteria for judging the relevanceof putative pathway members (see lsquoLimitations and possiblesolution strategiesrsquo section) but this does not affect the meth-odological principles of pathway analysis and results providedby robust analysis techniques are not significantly altered bysmall variations in pathway definitions In particular enrich-ment analysis methods [19ndash23] and pathway activity visualiza-tion tools [24 25] are commonly applied to exploit information

from these databases and obtain a better understanding ofpathway-level molecular alterations in omics data The utilityof pathway information for biomarker discovery has also beenrecognized early [26] and more recently a variety of bioinfor-matics approaches have been proposed to directly integrateknowledge from pathways into predictive model building foromics sample classification

These methods mainly differ in terms of their pathway ac-tivity scoring approach (ie the approach used to summarizeand score biomolecular activity data for cellular pathways aspredictive features for classification tasks eg via different di-mension reduction methods see discussion below) and the pre-diction method used (ie the machine learning algorithmapplied to the summarized pathway activity scores to build asample classification model) Table 1 shows an overview of dif-ferent methodologies and their main characteristics describedin more detail as follows

Averaging and dimension reduction approachesto score pathway activity

One of the first machine learning approaches for high-through-put data analysis guided by pathway knowledge was proposedby Guo et al [27] Their method classified microarray cancersamples by computing mean or median expression levels of thegene members in biological process modules from the GeneOntology (GO) database as input for a decision tree classifier[36] This straightforward averaging approach was already re-ported to provide benefits in terms of robustness to measure-ment noise and comparable or better classification performanceon cancer microarray data as compared with conventionalgene-level analyses

Soon thereafter other research groups explored whether al-ternative ways to summarize molecular activity data in path-ways could exploit the information content more effectivelyTomfohr et al [28] showed that applying a dimension reductionapproach Singular Value Decomposition (SVD) to gene expres-sion levels of pathway members and using the first eigenvector(called lsquometa-genersquo) to derive a pathway activity representationenabled the discovery of statistically significant pathway alter-ations in disease-related microarray data sets

Because only a subset of pathway members may be involvedin disease-associated changes Lee et al [29] later proposed toscore pathway activity by focusing on condition-responsivegenes (CORGs) within pathways They applied a greedy searchalgorithm to find a gene subset that maximizes a t-statisticscore for sample class discrimination using a weighted sum ofnormalized gene expression levels for the selected input genesOn multiple microarray cancer data sets the CORG-based path-way predictors provided increased discriminative power incomparison with gene-based classification and the rankings ofpathway alterations were more reproducible when tested across100 alternative 2-fold splits of each data set

A first probabilistic method to infer pathway activity fromomics data was developed by Su et al [30] It estimates the loglikelihood ratio (LLR) between different phenotype hypotheses(eg disease or control) from the expression level of each path-way member gene i (LLRifrac14 log[fi

1(xij) fi

2(xij)] where fi

y(x) is theconditional probability density function of gene i under pheno-type y) and then uses these ratios for a weighted combinationof expression levels into a summarized pathway activity finger-print aj (ajfrac14Ri

n LLRi (xij)) assuming independent contributions

of the LLRs for different member genes The authors comparedtheir approach on two breast cancer data sets against gene-level

2 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

predictors the meanmedian pathway summarization theSVD- and CORG-based techniques described above and theirmethod achieved the highest predictive performance in bothwithin-data set cross-validation experiments and one out oftwo cross-data set prediction experiments (the mean summar-ization scored higher in one case)

While the techniques discussed above mostly summarizepathway activity via (weighted) averages or dimension reduc-tion approaches it was noted that disease-associated pathwayalterations may affect the variance of expression levels in path-ways more strongly than the mean Generally variance pat-terns in disease-related omics data can reflect inter-patientvariation (eg differences in the etiology or disease stage) orintra-patient fluctuations (eg onoff phases as those observedin Parkinsonrsquos disease or imbalances in the activity of specificcellular pathways) and the study of both of these sources ofvariation may be important to improve the understanding ofthe disease Glaab et al [31] therefore proposed a variance-basedapproach to identify discriminative pathway signaturesproviding a possibility to account for variation in the activity ofcellular processes for a single patient and for variation acrossmultiple individuals and combined it with diverse classificationapproaches as part of a public Web application Apart frommodifying the pathway activity representation they developeda rule-based learning method comparing activities in pairs ofpathways rather than comparing activities of single pathwaysagainst fitted threshold values to increase the robustness ofpathway-based classification models [37] Rule-based modelshave high stability with regard to small variations in the inputdata and facilitate the biological interpretation of predictionmodels by using only combinations of simple if-then-else rulesfor sample classification [38]

A further strategy to improve classifier robustness is to usediscretized pathway activity scores or counting statistics egSvenson et al [32] presented an approach that first determines

how many genes are up- or downregulated in a pathway using apredefined fold-change threshold This information was com-bined into a ratio score which equals 1 if all pathway membersare upregulated 0 if equal numbers of genes are up- anddownregulated and 1 if all genes are downregulated Aftercomputing this ratio score for all pathway gene sets and train-ing samples from a microarray case-control study they appliednearest centroids classification to the pathway ratio scores toclassify new test set samples In comparison with gene-levelclassification they report that the proposed classification ap-proach performed better for all tested training set sizes

Graph-based and hierarchical pathway activity scoring

Cellular pathway diagrams do not only capture information onthe functional relatedness of biomolecules involved in a path-way but also on the network topology of regulatory or molecu-lar interactions that link them together

Efroni et al [33] presented one of the first pathway analysisapproaches which not only scores pathway activities in omicsdata but also the consistency of molecular activities withinpathways given the topology of interactions between theirmembers Considering the input and output genes for a regula-tory interaction their probabilities of being in an under- oroverexpressed state in a condition of interest can be estimatedfrom gene expression measurements although noise in thedata may make the estimates highly uncertain A consistencyscore can then be obtained by comparing the state probabilitiesfor output genes against their inferred state probabilitiesderived from the estimated states of the interconnected inputgenes and the probability of the relevant interactions to be ac-tive or inactive Thus averaged pathway consistency scores canbe computed and used as predictive features in combinationwith pathway activity scores to classify biological samples rep-resenting different conditions Evaluating this approach on

Table 1 Overview of pathway-based machine learning approaches for supervised classification of omics samples

Publication Pathway activity scoring method Prediction method

Guo et al [27] Mean or median expression levels in GO modules Decision tree classificationTomfohr et al [28] Expression levels are summarized via the first eigenvector

from SVD analysisFocus on predictive feature selection via

t-statistics (any classifier is applicable)Lee et al [29] Normalized sum of expression levels for CORGs within

pathwaysLogistic regression

Su et al [30] Weighted sum of LLRs for pathway members Logistic regression or linear discriminantanalysis

Glaab et al [31] Variance across pathway member activity is quantified andcompared instead of averaged pathway activity

SVM random forest nearest shrunken centroidclassifier or ensemble learning

Svenson et al [32] For each gene set and patient a ratio score is defined de-pending on the number of up- and down-regulated mem-bers in the gene set

Nearest centroids classification

Efroni et al [33] Joint scoring of pathway activity and consistency usinginteractions within pathways

Bayesian linear discriminant classifier

Vaske et al [34] Probabilistic inference is used to predict pathway activitiesfrom a factor graph model of genes and their products

Cox proportional hazard regression (survivalprediction)

Breslin et al [35] Signal transduction pathways with directional interactionsare used to define the activity of a pathway based on theaverage expression of its downstream target genes

No classification is performed but contingencytables show associations between sample-wise pathway activity and clinical sampleclassifications

Kim et al [12] Hierarchical feature vectors are used assigning a two-levelhierarchical structure to the featuresgenes determinedby their pathway membership

SVM

Averaging and dimension reduction approaches are listed on top whereas graph-based and hierarchical pathway activity scoring methods are listed below the bold

black line

Diagnostic specimen classification | 3

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

microarray cancer data by using the new pathway-level featuresas input for a Bayesian Linear Discriminant Classifier the au-thors obtained similar or better predictive performance as com-pared with previously reported results for the test data

Pathway topology information is also exploited in the approachby Vaske et al [34] who converted cellular pathway diagrams intoa factor graph model describing the interactions between theinvolved biomolecules The method can integrate informationfrom different types of omics data for a biological sample and usesan Expectation Maximization (EM) algorithm to infer probabilisticpathway activity scores for each pathwaysample combinationWhen applying the approach to glioblastoma multiform gene ex-pression and copy number data and grouping patients based ontheir pathway alterations the identified subgroups displayed sig-nificant differences in quantitative survival in a KaplanndashMeier ana-lysis Robust patient stratification results were only obtained usingthe pathway-based scores and not via gene-level analyses

Apart from considering the topology of molecular interactionswithin a pathway studying alterations in all the downstream tar-gets of pathway members (ie including targets not belonging tothe pathway) may provide an additional source of information tofurther improve pathway activity scoring Breslin et al [35] pre-sented a corresponding downstream analysis approach whichdetects variation in pathway activity from the degree of correl-ation between downstream targets of a pathway Using two cor-relations scores one considering all pairs of downstream targetsand one considering only pairs without common transcriptionfactors they identified several differentially active pathways inmicroarray data from case-control studies Additionally theydefined an activity score for pathways based on the average ex-pression of their downstream target genes and showed that thesample-wise activities for several pathways are significantlyassociated with clinical sample classifications

Finally although all approaches considered so far use pathway-level sample classification as a replacement for gene-level classifi-cation these are not exclusive options Kim et al [12] proposed tobuild biomarker models from gene expression data at two levelsthe gene and pathway level via a hierarchical feature structureThis structure organizes the individual gene-level features into se-cond-level aggregate features (pathways) used by a support vectormachine (SVM) algorithm to classify samples The authors con-sidered three alternative approaches GLEG (GSEA-based LeadingEdge Gene feature method using predictor genes derived fromgene set enrichment analysis [19] which as a group maximally dif-ferentiate between the sample classes) GPF (GSEA Pathway Featuremethod using enriched pathways as features) and SPF (SVM-basedPathway Feature method using pathway features derived fromSVM) When comparing these methods against classical gene-leveland random gene set predictors using cross-validation and cross-study prediction on microarray cancer data in the majority ofcases the newly proposed methods achieved better results

In summary a wide range of pathway activity inferencemethods and machine learning approaches are available whichcan be combined effectively for pathway-based diagnosticclassification of biological samples While pathway-basedprediction does not necessarily provide superior predictive per-formance in within-data set cross-validation experimentsimproved robustness and accuracy was observed consistentlyin cross-study prediction tasks

Network-based biomarker modeling

Although manually curated pathways have many benefits forthe biological interpretation of large-scale omics data in living

cells metabolic and signaling pathways are not isolated butinterconnected within large and complex molecular and regula-tory networks These networks often include several genes pro-teins or metabolites that are not annotated for any pathwayand therefore ignored by pathway-based analysis methodsConsequently to identify disease-associated modules of inter-connected biomolecules in a more unbiased manner (ie with-out restricting the search space to biomolecules with knownpathway annotations) network-based analysis methods havebeen introduced While pathway-based approaches for bio-marker modeling may have advantages in terms of model inter-pretability the search space exploration in network-basedbiomarker discovery is not restricted by subjectively definedpathway boundaries and the genome-scale molecular networksused as input typically cover significantly larger numbers of bio-molecules than all combined pathways Nevertheless similar tosubjectively defined pathways networks assembled from publicdata sources suffer from various limitations eg missing mo-lecular interactions and lack of tissue-specific annotations andthese issues have to be addressed by dedicated methods (seesection on lsquolimitations and possible solution strategiesrsquo below)In the following two main types of network-based modelingapproaches will be discussed First two-step sequentialapproaches which score the activity in molecular subnetworksand afterward use these activities for predictive machine learn-ing and secondly one-step network analysis approacheswhich exploit network topology information directly within thepredictive model building

Two-step network activity scoring and predictionapproaches

Network activity over multiple interconnected biomoleculescan be summarized and scored using similar averaging or di-mension reduction approaches as in pathway activity scoringmethods However in contrast to the straightforward use ofpredefined pathway definitions first a molecular or regulatorynetwork has to be assembled or reconstructed using either pub-lic molecular interaction databases or applying network infer-ence methods to omics data (in Table 2 an overview of differentmethodologies is shown which are discussed in the following)

A first method to construct new sample-specific gene regu-latory networks for transcriptomics sample classification wasproposed by Tuck et al [39] The networks were generated bydetermining the graph-theoretic intersection between a staticconnectivity network (representing transcription factor bindingto gene promoter regions) obtained using data from theTRANSFAC database [50] with sample-specific coexpressionnetworks (representing transcription-factorndashtarget genecoexpression) derived from gene expression data To extractdiscriminative features for diagnostic specimen classificationfrom these networks they proposed a link-based classificationapproach comparing the activity status of gene regulatoryinteractions (called lsquolinksrsquo) across different sample groups anda degree-based classification method comparing topologicalcentrality measures [51] for the networks When testing theseapproaches on data from different cancer case-control studieshigh cross-validated accuracies were reported for both cell typeand patient sample classification Moreover the network-basedanalysis enabled the authors to identify key transcriptionalregulators altered under specific disease conditions

Instead of constructing new regulatory networks discrim-inative disease-associated network alterations can also be iden-tified by computationally mapping omics data onto in silico

4 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

representations of biochemical proteinndashprotein interaction (PPI)networks Ma et al [40] developed a corresponding approach toobtain more reliable disease association scores for genes by ex-ploiting neighborhood information from a PPI network Theyused a modified Pearson correlation coefficient to assess the as-sociation between microarray gene expression and numeric val-ues encoding the disease status of the samples (taking intoconsideration that these phenotype values may not have a nor-mal distribution) and assigned the Fisher-transformed gene-phenotype association scores to the corresponding proteins in aPPI Next they recalibrate these association scores by modelingthe underlying true scores for each gene using Markov RandomField theory [52] reestimating their values from weighted con-tributions of their network neighborsrsquo original associationscores (the weights are determined according to different net-work neighborhood definitions using either direct neighborsshortest path or diffusion kernel neighborhoods see [40] fordetails) When evaluating the utility of the recalibrated scoresfor disease gene prioritizations on microarray data using knownGene Ontology functional annotations conventional

prioritization approaches using only gene expression or PPI datawere outperformed (although the scoring approach could alsobe used for predictive model building this particular applicationwas not considered)

While the approach by Ma et al focuses on improving thedisease-association scores for individual genes Chuang et al[41] presented a method identifying and scoring entire disease-related subnetworks similar to their pathway association scor-ing approach discussed above (see Lee et al [29]) After comput-ing the mutual information (MI) between sample phenotypevalues (encoding the presence or absence of a disease) and dis-cretized expression values for each gene from a microarray dataset assigned to the proteins in a PPI they applied a greedysearch to expand the seed nodes in the network with locallymaximal MI scores Specifically each seed node was expandedsuch that the sum of scores for the expanded network moduleis maximized (the search stops when no extension increasesthe total score above a predefined improvement rate) Whentraining logistic regression classifiers on the normalized andaveraged activities of the resulting subnetworks for breast

Table 2 Overview of network-based methods for machine learning analysis of omics data sequential network activity scoring and predictionmethods are shown on top whereas machine learning approaches using embedded network-based feature selection are listed below the boldblack line

Methodologypublication

Network activityalteration scoring method Prediction method

Tuck et al [39] Sample-specific gene regulatory networks are constructed andsubnetwork activity is scored by summing over activeinteractions

Nearest neighbors decision tree NaıveBayes among others

Ma et al [40] Disease association is scored for genes based on gene expres-sion data and their neighborsrsquo association scores in a PPI net-work using Markov Random Field theory

The approach is evaluated for disease geneprioritization but is applicable for predict-ive feature selection in combination withany prediction method

Chuang et al [41] Normalized gene expression data is mapped onto a proteininteraction network and discriminative subnetworks areidentified via a greedy search procedure

Logistic regression

Taylor et al [42] Hub nodes in protein interaction networks are determined andthe relative gene expression of hubs with each of their inter-acting partners is computed to identify hubs with diverserelative expression across sample groups

Affinity propagation clustering is used to as-sign a probability of poor prognosis tobreast cancer patients

Petrochilos et al [43] A random walk community detection algorithm is applied todiscover modules in a molecular interaction network andgene expression data is used to identify disease-associatedmodules

The approach is used to identify cancer-asso-ciated network modules and validated byscoring the enrichment of known cancer-related genes extracted from the OMIMdatabase

Rapaport et al [44] Spectral decomposition of gene expression profiles is appliedwith respect to the eigenfunctions of a network graphattenuating the high-frequency components of the expres-sion profiles with respect to the graph topology

SVM

Li et al [45] A network-constrained regularization procedure for linear re-gression analysis is used to identify disease-related discrim-inative subnetworks

Penalized linear regression

Yang et al [46] Three machine learning methods for graph-guided feature se-lection and grouping are proposed including a convex func-tion and two non-convex formulations designed to reducethe estimation bias

Penalized least squares-based approach(GOSCAR Graph octagonal shrinkage andclustering algorithm for regression)

Lorbert et al [47 48] A sparse regression approach is proposed using the PEN pen-alty to favor the grouping of strongly correlated featuresbased on pairwise similarities (eg derived from a molecularinteraction graph)

Penalized regression (PEN penalty)

Vlassis et al [49] Penalized logistic regression is applied using a convex PEN pen-alty function (see approach by Lorbert et al) with absolutefeature weights to better reflect the relevance of discrimina-tive genes in the feature selection

Penalized logistic regression (PEN penaltywith absolute feature weights)

Diagnostic specimen classification | 5

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer data the authors found that the subnetwork markerswere more reproducible than single-gene markers andprovided higher accuracy in distinguishing metastatic fromnonmetastatic tumors

As an intermediate solution between focusing on individualbiomolecules and entire network modules Taylor et al [42] pro-posed a method that investigates network nodes with outstand-ing topological properties and their direct neighbors Aftercomputationally mapping breast cancer gene expression dataonto the in silico representation of a PPI network they deter-mined proteins with large numbers of biochemical interactionpartners (so-called lsquohub nodesrsquo) and computed their relative ex-pression compared with each of these interacting partnersThey then determined for which hubs the relative expressiondiffered significantly between long-term survivors and patientswho died from the disease and applied a clustering approach toassign a probability of poor prognosis to new patient samples(the specific method used is known as lsquoaffinity propagationclusteringrsquo in the literature) The approach was evaluated using5-fold cross-validation providing accuracy sensitivity and spe-cificity estimates that compared favorably with the reportedresults for commercially available genomic breast cancerdiagnostics

Instead of considering topological properties of individualnodes in a molecular network information from a networkgraph can also be extracted via algorithms for finding sub-graphs which stand out in terms of their high density ofmolecular interactions (using approaches referred to as lsquocom-munity identificationrsquo or lsquograph clusteringrsquo methods in the lit-erature) Petrochilos et al [43] proposed a correspondingapproach which first applies a graph-based random walk algo-rithm on a genome-scale molecular network Information fromcancer-related gene expression data was then integrated intothe network by setting the weight of each network node as themaximum fold change of probes corresponding to its gene sym-bol (weights for biochemical interactions are determined by thesquare of the mean of the absolute fold changes of the relevantinteraction partners) Finally the score of a network module ofconnected nodes was obtained by comparing its cumulative ac-tivity (ie the square of the average weighted expression for allits nodes) against a bootstrap distribution of cumulative activ-ities obtained via random sampling of a matched number offold change values When testing the enrichment of known can-cer genes in the top-scored network modules identified withthis approach a similar or better performance was reached incomparison with other widely used module-finding algorithms(potential alternative applications of the identified modules forbiomarker modeling were not assessed in this publication)

Apart from averaging molecular activities over networkneighborhoods or using community identification methods sig-nal processing techniques may provide a further means to gleanuseful information from a network for predictive model build-ing as shown in an approach by Rapaport et al [44] They madeuse of the observation that genes in close proximity to eachother in a network tend to have similar expression and pro-posed to denoise microarray measurements by removing theirhigh-frequency component over the network For this purposespectral decomposition of gene expression profiles with respectto a molecular network graph was applied followed by attenu-ating high-frequency signal components expected to representmeasurement noise The method was evaluated for supervisedanalysis of irradiated and nonirradiated yeast strains using aSVM providing similar classification performance as a modelbuilt without the network-based filtering but facilitating

biological data interpretation by grouping the selected biomol-ecules according to their participation in the network modules

One-step machine learning approaches for networkanalysis

In contrast to the network analysis approaches considered sofar which apply network feature extraction and predictive ma-chine learning analysis in separate steps more recently one-step network-based feature selection approaches have beenproposed integrating the attribute selection directly into thepredictive model building Most of these approaches formulatethe model building task as an optimization problem formula-tion in which the objective function for classification or regres-sion is extended by a penalty term promoting the selection ofgrouped features in a molecular network (this strategy is alsoreferred to as network-constraint regularization)

Li et al [45] proposed one of the first correspondingapproaches by adding a penalty term to linear regression incor-porating network information into the analysis via theLaplacian matrix of the network graph The approach penalizesthe L1-norm of the feature weights and encourages a smoothprofile of weights over neighboring nodes in the networkHowever Binder and Schumacher later reported that themethod has lower performance than a null model ie a modelnot using any covariate information [53] As possible explan-ations they note that Li et al discarded censored observationsand about 20 000 variables that could not be assigned to corres-ponding nodes in the molecular interaction network (see sec-tion on lsquolimitationsrsquo below) Yang et al [46] suggested that thepreviously used network grouping penalties can introduce add-itional estimation bias into the model when the coefficientsigns for two features connected in the graph are differentThey presented alternative penalties to achieve network group-ing and sparse feature selection in particular two non-convexpenalties which shrink only small differences in absolute val-ues of feature weights to reduce the estimation bias [46] In ex-periments on synthetic data and two real data sets the newapproaches outperformed previous feature grouping methods

However with non-convex penalties finding global optimalsolutions is often not feasible and even identifying good localoptima may require high computational effort Lorbert et al [4748] proposed an alternative generic convex penalty thePairwise Elastic Net (PEN) that provides sparse feature selectionand promotes the grouping of attributes according to auser-defined feature similarity measure (eg obtained from bio-chemical interaction weights in a molecular network) PEN is ageneralization of the Elastic Net a method providing a trade-offbetween L1- and L2-penalized regressions by an adjustable par-ameter In PEN this parameter can be replaced to determine thetrade-off using additional information from an attributesimilarity matrix (different instances of PEN can be defined aslong as the similarity matrix is positive semidefinite and non-negative) Comparing PEN against other popular machine learn-ing approaches on simulated data with a grouping structureamong the features PEN achieved a competitive mean squarederror (MSE) and provided sparser solutions than approacheswith similar MSE

More recently Vlassis et al introduced a new instance ofPEN which penalizes differences between the lsquoabsolutersquo valuesof weights of interlinked features in a network graph The mo-tivation behind this approach termed GenePEN is that themagnitude of a weight in a linear model reflects the predictivevalue of the corresponding variable so that the weights for

6 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

irrelevant features are driven to zero by the penalty By ensuringthe convexity of the penalty function global optimal solutionscan be identified efficiently with existing optimization frame-works When evaluating GenePEN on simulated data andreal-word microarray data sets in comparison with other classi-fication methods using feature grouping the method providedsimilar predictive power and gene selections sharing signifi-cantly more connections within a molecular interaction net-work Visualization of the corresponding subnetworks enableda biological interpretation of disease-affected network regionswhich were enriched in known disease-related genes obtainedfrom literature mining

Overall network-based sample classification methods pro-vide a new means to analyze complex omics data sets allowingresearchers to identify coherent molecular network alterationsunder different biological conditions Identifying such network-level patterns in omics data for diseases with complexmolecular manifestations can shed new light on the molecularmechanisms of the disease and facilitate the development ofrobust multifactorial biomarker signatures

In contrast to single-molecule-based biomarker modeling anetwork-level approach has the potential to capture diverse fac-ets of a heterogenic disease reflected via alteration patterns indifferent network regions In comparison with pathway-basedmachine learning approaches methods using genome-scalednetworks as prior knowledge may produce models that are moredifficult to interpret biologically but which can identify a muchwider range of alterations in cellular processes (covering manygenes proteins or metabolites without any known pathway an-notations) Finally network- and pathway-based classificationapproaches share the main benefit of improving model robust-ness in cross-study analyses as compared with using individualbiomolecules as features Among these new higher-level bio-marker signatures network-based signatures accounting formolecular activities over larger and algorithm-derived networkregions may often provide more robust multifactorial markersthan signatures for smaller pathways which are typicallydefined subjectively possibly overlooking relevant functionallyrelated molecules in the surrounding network However themodel robustness will also depend on other factors eg the oc-currence of protein complexes in the studied pathwaynetwork(members of these complexes tend to have highly coordinatedactivity providing more robust averages) and the reliability andcompleteness of the specific network or pathway data sourceused (see limitations discussed in the following section)

Limitations and possible solution strategies

While pathway- and network-based analyses of omics data canenrich the biological interpretation of complex molecular alter-ation patterns and facilitate robust biomarker modeling usersshould also be aware of common limitations behind thesemethods and possibilities to address them

Although detailed functional annotations have been col-lected for a large portion of genes in humans and commonmodel organisms many identified genes still lack any func-tional assignment Consequently a significant portion of genesand proteins analyzed via current high-throughput measure-ment techniques can often not be assigned to any pathway ormolecular network for systems-level analyses Similarly sev-eral gene regulatory interactions and PPIs are still unknownand experimental techniques to identify molecular interactionsmay also generate false-positive results [54] Thus public mo-lecular interaction databases only cover a subset of biochemical

interactions occurring in living cells and the reliability of re-ported interactions varies depending on the amount and type ofassociated evidence

A further limitation of current pathway databases andassembled molecular networks is that tissue- and cell-type spe-cificity of molecular interactions and protein functions is typic-ally not considered in these data resources This can beparticularly problematic when studying diseases showing se-lective vulnerability for specific tissues or cell types eg asfound in many neurodegenerative disorders [55] Similarly inmany biomarker profiling studies and public omics data sour-ces concomitant data from different tissue and cell types for adisease of interest is lacking as well as longitudinal datarequired to identify relevant dynamic changes during diseasepathogenesis and progression Moreover a variety of limita-tions related to the specific origin processing and storage ofspecimens collected for biomarker studies may negatively im-pact subsequent analyses and the comparability of data fromdifferent studies For example studies might differ in terms ofthe sample storage durations in terms of whether the referencetissue for tumor tissue is derived from the same patient or froman unaffected control individual whether a preinfarction speci-men is accepted as a reference for the specimen obtained on in-farct suspicion or whether laser capture microdissection isapplied to isolate specific cells of interest rather than using het-erogeneous cell populations

To account for the varying reliability of public biochemicalinteraction data when assembling a molecular network variousapproaches to determine confidence scores have been proposed[56ndash59] One of the most comprehensive resources is theSTRING PPI database which allows users to filter and extractbiochemical interactions for a wide range of organisms usingdifferent sources of evidence as well as a combined confidencescore [60 61]

A frequently encountered difficulty in pathway analyses isalso the conversion of the information content of public path-way diagrams into adequate data structures for computationalinvestigations Pathway structures usually contain more com-plex layers of information than genendashgene networks eg meta-nodes representing protein complexes or gene families andcatalysis reactions represented by edges pointing to edgesDedicated computational approaches are therefore required toconvert pathway diagrams into uniform network representa-tions eg using biology-driven rules as in the RBioconductorsoftware package lsquographitersquo [62]

Apart from this conversion task building tissue-specificpathwaynetwork representations or assigning probabilities fortissue specificity to the biological interactions in an existingnetwork is a further challenging problem because interactionsare rarely confirmed experimentally in multiple specific tissuesMost bioinformatics approaches for assigning tissue specificityscores to protein interactions use publicly available gene or pro-tein expression data from large-scale tissue profiling studies[63 64] eg predicting the potential for a molecular interactionto occur in a certain tissue depending on whether both inter-action partners are expressed in this tissue [65] Moreover dedi-cated pathway- and network-analysis methods have beendeveloped for tissue-specific disease gene prioritization andgene set enrichment analysis [23 66] which may also provideuseful filters for feature selection in machine learning methodsfor sample classification

While changes in the mean or variance of the overall mo-lecular activity in pathways can provide robust biomarkers forsome diseases pathways may also contain forks and only one

Diagnostic specimen classification | 7

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

of their branches or stems may be affected in a pathologicalcondition Approaches focusing on changes in the overall path-way activity may therefore fail in detecting significant alter-ations affecting only one pathway branch Although networkanalysis methods identifying altered subnetworks may be help-ful in this situation (using a pathway graph as input instead of agenome-scale molecular network) ideally available knowledgeon distinct pathway branches or outstanding regions in a com-plex pathway architecture should be exploited directly eg bypartitioning a pathway into corresponding lsquosubpathwaysrsquo andseparately analyzing their molecular alterations using existingpathway-level bioinformatics tools

Another challenging and still largely unsolved problem inthe analysis of alterations for pathways networks and singlebiomolecules is the distinction between omics changes repre-senting causes or effects (or lsquodriversrsquo or lsquopassengersrsquo) of a dis-ease Apart from philosophical disagreements on howcausation should be defined [67] the discrimination betweencorrelated events and causally linked events on high-through-put omics data sets is typically severely hampered by largenumbers of potential confounding variables and complexnonlinear relationships between the attributes Thus conclu-sive evidence for causal links can usually only be obtained fromin vitro and in vivo perturbation experiments rather than fromhigh-throughput data analysis even when using high-qualitytime-series data with large sample sizes and dedicated statisticsfor identifying causendasheffect relationships (eg the Granger caus-ality test [68]) Prior biological knowledge eg on transcriptionfactorndashtarget relationships can however enable the construc-tion of causal graphs to check whether omics measurementsare consistent with certain causal molecular hypotheses or toprioritize corresponding hypotheses for experimental validation[69] Bill Shipley has written a dedicated book on methods forcausal analysis of observational data in biology [67] Theseapproaches are not only relevant for identifying drug targetsand investigating disease etiology and progression but mayalso provide useful information for the biological interpretationof diagnostic biomarker models

Improving incomplete pathways and networks by identify-ing missing molecular interactions is arguably one of the mostchallenging problems in this domain Pathway definitions aretypically created in a subjective manner because no universalpredefined criteria for determining pathway membership andboundaries are available and the curators of the correspondingdatabases are required to make judgments on the relevance ofpotential pathway members These subjective definitions maybe prejudiced and erroneous and even networks assembledautomatically from public databases may contain biases eg re-sulting from prior subjective decisions on which putative PPIsare of sufficient interest for experimental validation Moreoverin diseases the pathway topology may change and genericpathway databases have recently been complemented by dedi-cated pathway maps for specific human disorders egAlzPathway [70] for Alzheimerrsquos disease and PDMap [71] forParkinsonrsquos disease Because these disease-related pathwaymaps typically cover the biomolecules involved in a specific dis-order and the potentially altered network topology more com-prehensively and accurately than generic pathway mapsdisease-specific pathways should be considered as the pre-ferred resource for pathway-based diagnostic biomarker model-ing (in the specific case of a known disease-associated topologyalteration ideally a corresponding pathway map with un-altered topology should be used additionally as a reference inthe modeling process)

Moreover dedicated bioinformatics techniques have beendeveloped to identify lsquonetwork rewiringrsquo events eg brokenchains or new interactions changing the pathwaynetwork top-ology which remain undetected by conventional pathway ana-lyses Instead of assuming that molecules close to each other ina network undergo similar changes in a disease theseapproaches identify alterations in the correlations between theactivities of these molecules as signatures for potential patho-logical changes in network topology [72ndash74] However the top-scoring results require experimental verification eg using invitro disease models and the success may strongly depend onwhether reliable prior pathway knowledge is available

To support relevant manual pathway curation efforts com-putational approaches for ab initio discovery of pathways withingenetic or molecular networks have been proposed [75ndash78] aswell as bioinformatics methods to extend existing pathway def-initions according to objective criteria for pathway compactnessand connectedness [79] Moreover various machine learningapproaches for predicting unknown PPIs using public structuraland functional information have been developed [80ndash84] egthe authors of the PrePPI method report validation results sug-gesting their prediction algorithm is comparable in accuracy tohigh-throughput experiments [85] For proteins with unknownstructure and few functional annotations these in silico predic-tion approaches are however still bound to provide limited reli-ability and only high-confidence biochemical interactionsshould be included in networks for omics data analysis

Finally to reduce the influence of false-positive predictionson network analyses molecular interactions should not only beprefiltered by confidence when assembling the network butalso weighted by confidence to apply analysis techniquesexploiting these weights to score the reliability of identified net-work alterations In subsequent biological validation experi-ments investigators can then focus on verifying only the mostreliable network alteration patterns derived from the in silicoanalyses

Translating multifactorial marker models intodiagnostic testsPathways from omics research discoveries to clinicaldiagnostic assays

High-throughput omics measurement techniques are typicallynot designed for diagnostic applications but for broad systems-level analyses hypothesis generation and the construction offirst tentative machine learning models for sample classifica-tion Such tentative models require subsequent refinement andvalidation using more sensitive and reproducible measurementtechniques to evaluate their potential for diagnostic applica-tions For example a sample classification model built andcross-validated using microarray gene expression data with anembedded feature selection to choose only the most inform-ative genes as predictors can be validated using more accuratequantitative reverse transcription polymerase chain reaction(qRT-PCR) measurements for the subset of chosen genes

Importantly to avoid wrong conclusions in the evaluation ofdiagnostic classification models adequate statistical methodshave to be chosen to assess a modelrsquos overall predictive per-formance (quantifying how close predictions are to the actualoutcome) its calibrationreliability (measuring how close to x of100 individuals with a risk prediction of x have the outcome)and its discriminative ability (determining whether individualswith the outcome have higher risk predictions than those

8 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

without) [86] Owing to the inherent uncertainty associated withdiagnostic tasks predictions should be provided in a probabilis-tic rather than deterministic form [87] and the overall per-formance should be quantified using so-called lsquoproper scoringrulesrsquo for which the expected score is optimized when the pre-dictive distribution agrees with the true distribution of thequantity to be estimated (a corresponding example is the Brierscore for binary and survival outcomes [88]) Conversely the op-timization of models with respect to conventional discontinu-ous non-error rates like the percentage correct classificationcan provide misleading results eg when the predicted proba-bilities are close to the chosen decision threshold required forthese measures [89] To assess the calibration of a model theHosmerndashLemeshow lsquogoodness-of-fitrsquo test can be used [90] andthe concordance statistic to quantify the discrimination ability[91] If a reference prediction system is already available dedi-cated measures of the relative improvement achieved with anew prediction method should additionally be computed(referred to as the lsquoskillrsquo eg quantified via the Brier Skill Score[92]) Moreover decision-analytic approaches like decision curveanalysis [93] designed to assess the net benefit achieved makingdecisions according to model predictions should be applied if themodel is to be used to direct clinical patient management [86]

For the study design initial power calculations are requiredto ensure that sufficient sample sizes are available for all statis-tical assessments [94] This also involves choosing an adequatesplitting of the measured data into training test and validationsets and selecting suitable cross-validation or resampling tech-niques for model optimization and evaluation (eg using two-level external cross-validation [95]) [96]

Importantly clinical validation does not only necessitatesignificantly larger sample sizes than most research studiesbut also independent replication tests on data from other pa-tient cohorts the clear specification of the biological rationalebehind the method and a demonstration of its clinical utility Incontrast to the regulatory framework for drugs there are mul-tiple pathways for the translation of omics-based tests intovalidated in vitro diagnostic test devices These tests can bedeveloped and validated either through review by the Food andDrug Administration (FDA) or through validation and perform-ance by a specific laboratory certified according to ClinicalLaboratory Improvement Amendments (CLIA) [97]

Because using established medical product developmentpipelines as in pharmaceutical companies is not common prac-tice in academia for many biomedical research institutions anearly collaboration with an experienced industrial partner isoften advisable Although currently no unique and widely rec-ognized standard process for translating omics researchfindings into clinical diagnostics is available common recom-mendations by widely recognized health organizations can befollowed In particular a committee by the US Institute ofMedicine has conducted a study on omics-based clinical test de-velopment and proposed a generic process for the developmentand evaluation of these tests as a recommended guideline [97]A corresponding example process which is briefly outlined forillustration purposes in Figure 1 and not meant to cover all im-portant variations starts with the discovery phase in which acandidate biomarker model is built on a training set lockeddown and evaluated on a test data set (this set of samplesshould be completely independent from the training set) In thefollowing test validation phase after institutional review boardapproval and consultation with the FDA a CLIA-certified labora-tory defines and optimizes the diagnostic test method clinicallyand biologically validates the test on a blinded sample set and

implements the test according to current clinical laboratorystandards

Interestingly the authors of the guideline highlight that afrequent shortcoming of omics-based tests is the lack of a biolo-gical rationale behind the testmdashwhile single-molecule markersare often known to play a role in the disease multifactorialomics models obtained from machine learning are often moredifficult to interpret and involve a greater risk of overfittingNew pathway- and network-based modeling techniques as dis-cussed in this review could therefore help to address some ofthese shortcomings and provide more interpretable and robustmodels as opposed to classical lsquoblack boxrsquo machine learningmodels

In the following stage of the clinical development processthe locked-down test is evaluated for clinical utility via one ofthe following approaches (i) A prospectivendashretrospective studyusing archived specimens from previous clinical trials (ii) a pro-spective clinical trial in which the test (a) directs patient man-agement or (b) does not direct patient management [97] Thecomplexity and duration of a corresponding clinical study ortrial will largely depend on the specific type of biomarker de-veloped and the proposed clinical benefit For diagnostic bio-markers focused on in this review procedures may varysignificantly depending on whether the test is designed to de-tect the presence severity or the subtype of a diseasePrognostic biomarkers which indicate the future clinical courseof a patient with regard to a specific outcome and predictivebiomarkers which predict the responders and the extent of sus-ceptibility to a particular drug effect will also require differentdevelopment and evaluation procedures than diagnosticmarkers Finally for each type of biomarker different clinicalbenefits may be envisaged and significantly influence the de-sign of a study eg the goal to choose more appropriate treat-ment options or the objective to diagnose the disease earlier toenable more effective therapies to prevent halt or slow downits progression

Previous success stories in omics-based development ofdiagnostic assays

A variety of multifactorial omics-based biomarker models havebeen translated successfully into diagnostic tests in recentyears in particular in the field of cancer sub-type stratificationA prominent example is the Oncotype DX test to assess a wom-anrsquos risk of recurrence of early-stage estrogen-receptor-positivebreast cancer and the likelihood of benefitting from chemother-apy after surgery This test measures the activity of 21 genes intumor samples and then determines a recurrence score numberbetween 0 and 100 (higher scores reflect greater risk of recur-rence within 10 years) As opposed to other diagnostic testsusing frozen samples the Oncotype DX assay uses tumor tissuesamples that are chemically preserved and sealed in paraffinwax (see [98 99] for details on the sample collection andanalytics)

The development of Oncotype DX involved typical steps ofan omics biomarker profiling and top-down filtering approachfirst by analyzing the entire transcriptome on high-throughputmicroarray data and using knowledge from the literature andgenomic databases 250 candidate marker genes were selected[98] The relation between the expression of these candidatesand the recurrence of breast cancer was then assessed in datafrom three independent clinical studies on 447 patients The re-sults were used for a final filtering providing a panel of 16 can-cer-related genes and 5 reference genes whose expression

Diagnostic specimen classification | 9

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

levels enabled the calculation of recurrence scores for tumorsamples via a machine learning model This diagnostic ap-proach was successfully validated in multiple clinical studiesand has been included in the treatment guidelines for breastcancer by the National Comprehensive Cancer Center Networkand the American Society of Clinical Oncology

While in the case of the Oncotype DX test the set of requiredmarkers could be narrowed down to a small number of geneswith prior knowledge on their relationship to the disease forother complex and more heterogeneous diseases significantlylarger numbers of molecular predictors may be needed for ac-curate diagnosis In such cases pathway- and network-basedmodeling approaches may facilitate the generation of robustand biologically interpretable models which could afterwardundergo similar diagnostic test development and validationprocedures as the initial model behind the Oncotype DX assayImportantly the success of the Oncotype DX approach is not anisolated case but other commercial diagnostic assays havebeen developed and validated using similar strategies includingMammaPrint [100] Prosigna (PAM50) [101] Mammostrat [102]Tissue of Origin [103] AlloMap [104] Corus CAD [105] and OVA1[106] among others

In summary successful translation of omics-based bio-marker models into clinically accepted commercial diagnostictests has been achieved in multiple cases in the past Given alarge number of complex diseases for which more reliable ear-lier and cheaper diagnostic tests are still needed there is a

significant potential to develop improved approaches usingomics-based biomarker modeling and exploiting prior biologicalknowledge from pathways and molecular networks

Conclusions

For diseases with complex molecular manifestations omics-based biomarker models have the potential to better reflectclinically relevant aspects of their heterogeneity than classicalsingle-molecule diagnostic markers However various statis-tical limitations in high-dimensional data analysis often result-ing in a high risk of overfitting and restricted biologicalinterpretability of standard lsquoblack boxrsquo machine learning mod-els can hamper the progress in developing new clinically usefulbiomarker signatures

Among possible strategies to reduce model complexity andintegrate prior biological knowledge into the model buildingprocedure cellular pathway- and molecular network-based pre-diction methods provide promising new routes toward more re-liable omics-based biomarker signatures A wide variety ofbioinformatics approaches have already been developed in re-cent years to identify disease-associated pathway and networkalterations in omics data and use them to classify new samplesWhile public molecular networks and pathway definitions alsohave specific limitations that need to be addressed by the usereg by assigning confidence weights to molecular interactionspathway- and network-level predictive analyses often lead to

Figure 1 Example illustration of common stages during the development of omics-based diagnostic tests (simplified version of the process presented in a study by the

US Institute of Medicine [97] focusing on the major steps in the pipeline) After the transition from the second to the third phase (highlighted by the lock symbol) the

diagnostic test must be fully defined validated and locked down Many important variants and alternatives to the outlined example process exist as well as different

realizations of generic steps in the process (eg cases in which a test directs patient management may cover different situations depending on whether clinicians are

free to use the test result as they see fit or whether predefined procedures have to be followed subject to contraindications andor subject to the test results) The setup

may also vary depending on whether it is known exactly how patients would have been treated had they been randomized to the opposite arm depending on whether

the test entails a treatment delay and whether the adequate cutoff threshold for the test is uncertain

10 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

classification models with improved interpretability and robust-ness in cross-study prediction tasks These new systems-levelapproaches will therefore provide a valuable addition to theexisting toolset for omics-based biomarker modeling

Key Points

bull Omics-based machine learning models for diagnosticspecimen classification often lack biological interpret-ability and have limited cross-study robustness andaccuracy Integrating prior knowledge from cellularpathway and molecular network into these modelscan help to address these shortcomings

bull Public pathway and molecular interaction databasescover only a subset of the biochemical interactionsoccurring in living cells typically lack annotations ontissue-specific reactions and may also contain errorsWhen applying pathway or network analysis methodsthe user should be aware of these limitations and thedifferent options to address or alleviate them eg byusing confidence weighted interactions and estimatingtheir tissue specificity using public gene or protein ex-pression data

bull Pathway- and network-based machine learning meth-ods for diagnostic specimen classification are charac-terized by different strengths and weaknesses Whilepathway-based approaches have the benefit of provid-ing models with high biological interpretability thattake advantage of human expert knowledge on dis-ease-relevant cellular processes their coverage is lim-ited to biomolecules with known annotations On thecontrary network-based models are typically more dif-ficult to interpret but can exploit information on amuch wider range of biomolecules

Funding

This work was supported by the Fonds Nationale de laRecherche Luxembourg (grant no C13BM5782168)

References1 Dubois B Feldman HH Jacova C et al Advancing research

diagnostic criteria for Alzheimerrsquos disease the IWG-2 crite-ria Lancet Neurol 201413614ndash29

2 Chahine LM Stern MB Diagnostic markers for Parkinsonrsquosdisease Curr Opin Neurol 201124309ndash17

3 Tanaka T Tanaka M Tanaka T et al Biomarkers for colorec-tal cancer Int J Mol Sci 2010113209ndash25

4 Collins LG Haines C Perkel R et al Lung cancer diagnosisand management Am Fam Physician 20077556ndash63

5 Martin P Laupland KB Frost EH et al Laboratory diagnosisof Ebola virus disease Intensive Care Med 201541895ndash8

6 Mistry K Cable G Meta-analysis of prostate-specific antigenand digital rectal examination as screening tests for pros-tate carcinoma J Am Board Fam Pract 20031695ndash101

7 Bender R Lange S Adjusting for multiple testing - when andhow J Clin Epidemiol 200154343ndash9

8 Yu L Liu H Redundancy based feature selection for micro-array data In Proceedings of the 10th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining New York NY ACM 2004 737ndash42

9 Ding C Peng H Minimum redundancy feature selectionfrom microarray gene expression data J Bioinform ComputBiol 20053185ndash205

10 Bellman R Bellman RE Bellman RE et al Adaptive control proc-esses a guided tour Princeton Princeton University Press 1961

11 Verleysen M Francois D The curse of dimensionality indata mining Analysis 20053512758ndash70

12 Kim S Kon M DeLisi C Pathway-based classification of can-cer subtypes Biol Direct 2012721

13 Ideker T Sharan R Protein networks in disease Genome Res200818644ndash52

14 Ogata H Goto S Sato K et al KEGG Kyoto encyclopedia ofgenes and genomes Nucleic Acids Res 19992729ndash34

15 Nishimura D BioCarta Biotech Softw Internet Rep 20012117ndash20

16 Pico AR Kelder T Van Iersel MP et al WikiPathways path-way editing for the people PLoS Biol 200861403ndash7

17 Joshi-Tope G Gillespie M Vastrik I et al Reactome a knowl-edgebase of biological pathways Nucleic Acids Res 200533D428ndash32

18 Schaefer CF Anthony K Krupa S et al PID thepathway interaction database Nucleic Acids Res 200937D674ndash9

19 Subramanian A Tamayo P Mootha VK et al Gene setenrichment analysis a knowledge-based approach for inter-preting genome-wide expression profiles Proc Natl Acad SciUSA 200510215545ndash50

20 Kim S-Y Volsky DJ PAGE parametric analysis of gene setenrichment BMC Bioinformatics 20056144

21 Luo W Friedman MS Shedden K et al GAGE generally ap-plicable gene set enrichment for pathway analysis BMCBioinformatics 200910161

22 Oron AP Jiang Z Gentleman R Gene set enrichment ana-lysis using linear models and diagnostics Bioinformatics2008242586ndash91

23 Glaab E Baudot A Krasnogor N et al EnrichNet network-based gene set enrichment analysis Bioinformatics 201228i451ndash7

24 Shannon P Markiel A Ozier O et al Cytoscape a softwareenvironment for integrated models of biomolecular inter-action networks Genome Res 2003132498ndash504

25 Van Iersel MP Kelder T Pico AR et al Presenting and explor-ing biological pathways with PathVisio BMC Bioinformatics20089399

26 Ilyin SE Belkowski SM Plata-Salaman CR Biomarker dis-covery and validation Technologies and integrativeapproaches Trends Biotechnol 200422411ndash16

27 Guo Z Zhang T Li X et al Towards precise classification ofcancers based on robust gene functional expression profilesBMC Bioinformatics 2005658

28 Tomfohr J Lu J Kepler TB Pathway level analysis of geneexpression using singular value decomposition BMCBioinformatics 20056225

29 Lee E Chuang H-Y Kim J-W et al Inferring pathway activitytoward precise disease classification PLoS Comput Biol 20084e1000217

30 Su J Yoon BJ Dougherty ER Accurate and reliable cancerclassification based on probabilistic inference of pathwayactivity PLoS One 20094e8161

31 Glaab E Schneider R PathVar analysis of gene and proteinexpression variance in cellular pathways using microarraydata Bioinformatics 201228446ndash7

32 Svensson JP Stalpers LJA Esveldt-van Lange REE et alAnalysis of gene expression using gene sets discriminates

Diagnostic specimen classification | 11

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer patients with and without late radiation toxicityPLoS Med 200631904ndash14

33 Efroni S Schaefer CF Buetow KH Identification of key proc-esses underlying cancer phenotypes using biologic pathwayanalysis PLoS One 20072e425

34 Vaske CJ Benz SC Sanborn JZ et al Inference of patient-specific pathway activities from multi-dimensional cancer gen-omics data using PARADIGM Bioinformatics 201026i237ndash45

35 Breslin T Krogh M Peterson C et al Signal transductionpathway profiling of individual tumor samples BMCBioinformatics 20056163

36 Ashburner M Ball CA Blake JA et al Gene ontology tool forthe unification of biology The Gene Ontology ConsortiumNat Genet 20002525ndash9

37 Glaab E Garibaldi JM Krasnogor N Learning pathway-baseddecision rules to classify microarray cancer samplesGerman Conference on Bioinformatics 2010173123ndash34

38 Glaab E Bacardit J Garibaldi JM et al Using rule-based ma-chine learning for candidate disease gene prioritization andsample classification of cancer gene expression data PLoSOne 20127e39932

39 Tuck DP Kluger HM Kluger Y Characterizing disease statesfrom topological properties of transcriptional regulatorynetworks BMC Bioinformatics 20067236

40 Ma X Lee H Wang L et al CGI a new approach for prioritiz-ing genes by combining gene expression and protein-protein interaction data Bioinformatics 200723215ndash21

41 Chuang H-Y Lee E Liu Y-T et al Network-based classifica-tion of breast cancer metastasis Mol Syst Biol 20073140

42 Taylor IW Linding R Warde-Farley D et al Dynamic modu-larity in protein interaction networks predicts breast canceroutcome Nat Biotechnol 200927199ndash204

43 Petrochilos D Shojaie A Gennari J et al Using random walksto identify cancer-associated modules in expression dataBioData Min 2013617

44 Rapaport F Zinovyev A Dutreix M et al Classification ofmicroarray data using gene networks BMC Bioinformatics2007835

45 Li C Li H Network-constrained regularization and variableselection for analysis of genomic data Bioinformatics 2008241175ndash82

46 Yang S Yuan L Lai Y-C et al Feature grouping and selectionover an undirected graph KDD 2012922ndash30

47 Lorbert A Eis D Kostina V et al Exploiting covariate similar-ity in sparse regression via the pairwise elastic net Int ConfArtif Intell Stat 20109477ndash84

48 Lorbert A Ramadge PJ The pairwise elastic net support vec-tor machine for automatic fMRI feature selection In ProcIEEE Int Conf Acoust Speech Signal Process Vancouver BC USA20131036ndash40 IEEE Press Washington DC

49 Vlassis N Glaab E GenePEN analysis of network activity al-terations in complex diseases via the pairwise elastic netStat Appl Genet Mol Biol 201514221ndash4

50 Wingender E Chen X Hehl R et al TRANSFAC an integratedsystem for gene expression regulation Nucleic Acids Res200028316ndash19

51 Newman MEJ The structure and function of complex net-works SIAM Rev 200345167ndash56

52 Geman S Geman D Stochastic relaxation Gibbs distribu-tions and the Bayesian restoration of images IEEE TransPattern Anal Mach Intell 19846721ndash41

53 Binder H Schumacher M Comment on ldquoNetwork-constrained regularization and variable selection for ana-lysis of genomic datardquo Bioinformatics 2008242566ndash8

54 Sprinzak E Sattath S Margalit H How reliable are experi-mental protein-protein interaction data J Mol Biol 2003327919ndash23

55 Saxena S Caroni P Selective neuronal vulnerability inneurodegenerative diseases from stressor thresholds todegeneration Neuron 20117135ndash48

56 Suthram S Shlomi T Ruppin E et al A direct comparison ofprotein interaction confidence assignment schemes BMCBioinformatics 20067360

57 Kamburov A Grossmann A Herwig R et al Cluster-basedassessment of protein-protein interaction confidence BMCBioinformatics 201213262

58 Li D Liu W Liu Z et al PRINCESS a protein interaction confi-dence evaluation system with multiple data sources MolCell Proteomics 200871043ndash52

59 Bader JS Chaudhuri A Rothberg JM et al Gaining confidencein high-throughput protein interaction networks NatBiotechnol 20042278ndash85

60 Von Mering C Huynen M Jaeggi D et al STRING a databaseof predicted functional associations between proteinsNucleic Acids Res 200331258ndash61

61 Franceschini A Szklarczyk D Frankild S et al STRING v91protein-protein interaction networks with increased cover-age and integration Nucleic Acids Res 201341D808ndash15

62 Sales G Calura E Cavalieri D et al Graphite - a Bioconductorpackage to convert pathway topology to gene network BMCBioinformatics 20121320

63 Su AI Wiltshire T Batalov S et al A gene atlas of the mouseand human protein-encoding transcriptomes Proc Natl AcadSci USA 20041016062ndash7

64 Ponten FK Schwenk JM Asplund A et al The human proteinatlas as a proteomic resource for biomarker discovery JIntern Med 2011270428ndash46

65 Bossi A Lehner B Tissue specificity and the human proteininteraction network Mol Syst Biol 20095260

66 Magger O Waldman YY Ruppin E et al Enhancing the pri-oritization of disease-causing genes through tissue specificprotein interaction networks PLoS Comput Biol 20128e1002690

67 Shipley B Cause and Correlation in Biology A Userrsquos Guide toPath Analysis Structural Equations and Causal Inference Vol 2Cambridge Cambridge University Press 2002 p 332

68 Society TE Granger CWJ Investigating causal relations byeconometric models and cross-spectral methodsEconometrica 196937424ndash38

69 Chindelevitch L Ziemek D Enayetallah A et al Causal rea-soning on biological networks interpreting transcriptionalchanges Bioinformatics 2012281114ndash21

70 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 2012652

71 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 20144988

72 Hudson NJ Reverter A Dalrymple BP A differential wiringanalysis of expression data correctly identifies the gene con-taining the causal mutation PLoS Comput Biol 20095e1000382

73 Reverter A Hudson NJ Nagaraj SH et al Regulatory impactfactors unraveling the transcriptional regulation ofcomplex traits from expression data Bioinformatics 201026896ndash904

74 Watson M CoXpress differential co-expression in gene ex-pression data BMC Bioinformatics 20067509

12 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1

microarrays would in principle offer the potential to accountfor diverse subtypes or facets of diseases provided that thestudied disorder is characterized by heterogeneous molecularmanifestations However limitations of these high-throughputtechniques (eg random noise and systematic biases lack of ro-bustness and reproducibility) and statistical limitations in theanalysis of high-dimensional data (eg the multiple testingproblem [7] feature redundancy and interdependence [8 9] andissues summarized as the lsquocurse of dimensionalityrsquo [10 11])often prevent the discovery of genuine and reproducible pat-terns for diagnostic specimen classification

In recent years new bioinformatics approaches havetherefore been proposed which involve the integration of priorbiological knowledge from cellular pathways and molecular net-works into the model-building procedure These approachesmake use of prior observations according to which differencesin pathway activity between diverse biological conditions tend tobe more stable than differences in individual gene activity [12]and disease-associated biomolecular changes tend to be coordi-nated and display grouping patterns within molecular subnet-works [13] The main rationale is that by exploiting these patternsas prior information when building prediction models spuriousdiscriminative structures in the data can be recognized and morerobust and coherent patterns at the pathway- and network-levelare selected instead of predictive features Although a wide rangeof corresponding machine learning methods is already availablethe benefits and limitations of different approaches are not al-ways clear and an overview on which strategies are adequate forwhich specific goals or applications is lacking

Here a review of recent network- and pathway-based statis-tical learning approaches for high-throughput omics data setanalysis is provided including a critical discussion of limita-tions and the added value of different methodologies forimproving model robustness accuracy and biologicalinterpretability Because not all diseases are characterized byaberrant molecular signatures the application focus is on diag-nostic approaches for diseases with complex molecular mani-festations First methods using information from publiccellular pathway definitions will be discussed then approachesexploiting prior knowledge from genome-scale molecularnetworks Next common limitations of pathway- and network-guided predictive modeling are pointed out as well asshortcomings related to the biospecimen origin processing andstorage and possible techniques to address or alleviate limita-tions in data analysis Finally the potential for translatingomics-derived multivariate sample classification models intochip-based diagnostic devices is discussed and a previoussuccess story is presented

Pathway-based biomarker modeling

Cellular pathway definitions from manually curated databasesincluding KEGG [14] BioCarta [15] WikiPathways [16] Reactome[17] and PID [18] are frequently used knowledge sources for bio-logical data interpretation These pathways are typically de-signed in a subjective manner often using incompleteinformation and inconsistent criteria for judging the relevanceof putative pathway members (see lsquoLimitations and possiblesolution strategiesrsquo section) but this does not affect the meth-odological principles of pathway analysis and results providedby robust analysis techniques are not significantly altered bysmall variations in pathway definitions In particular enrich-ment analysis methods [19ndash23] and pathway activity visualiza-tion tools [24 25] are commonly applied to exploit information

from these databases and obtain a better understanding ofpathway-level molecular alterations in omics data The utilityof pathway information for biomarker discovery has also beenrecognized early [26] and more recently a variety of bioinfor-matics approaches have been proposed to directly integrateknowledge from pathways into predictive model building foromics sample classification

These methods mainly differ in terms of their pathway ac-tivity scoring approach (ie the approach used to summarizeand score biomolecular activity data for cellular pathways aspredictive features for classification tasks eg via different di-mension reduction methods see discussion below) and the pre-diction method used (ie the machine learning algorithmapplied to the summarized pathway activity scores to build asample classification model) Table 1 shows an overview of dif-ferent methodologies and their main characteristics describedin more detail as follows

Averaging and dimension reduction approachesto score pathway activity

One of the first machine learning approaches for high-through-put data analysis guided by pathway knowledge was proposedby Guo et al [27] Their method classified microarray cancersamples by computing mean or median expression levels of thegene members in biological process modules from the GeneOntology (GO) database as input for a decision tree classifier[36] This straightforward averaging approach was already re-ported to provide benefits in terms of robustness to measure-ment noise and comparable or better classification performanceon cancer microarray data as compared with conventionalgene-level analyses

Soon thereafter other research groups explored whether al-ternative ways to summarize molecular activity data in path-ways could exploit the information content more effectivelyTomfohr et al [28] showed that applying a dimension reductionapproach Singular Value Decomposition (SVD) to gene expres-sion levels of pathway members and using the first eigenvector(called lsquometa-genersquo) to derive a pathway activity representationenabled the discovery of statistically significant pathway alter-ations in disease-related microarray data sets

Because only a subset of pathway members may be involvedin disease-associated changes Lee et al [29] later proposed toscore pathway activity by focusing on condition-responsivegenes (CORGs) within pathways They applied a greedy searchalgorithm to find a gene subset that maximizes a t-statisticscore for sample class discrimination using a weighted sum ofnormalized gene expression levels for the selected input genesOn multiple microarray cancer data sets the CORG-based path-way predictors provided increased discriminative power incomparison with gene-based classification and the rankings ofpathway alterations were more reproducible when tested across100 alternative 2-fold splits of each data set

A first probabilistic method to infer pathway activity fromomics data was developed by Su et al [30] It estimates the loglikelihood ratio (LLR) between different phenotype hypotheses(eg disease or control) from the expression level of each path-way member gene i (LLRifrac14 log[fi

1(xij) fi

2(xij)] where fi

y(x) is theconditional probability density function of gene i under pheno-type y) and then uses these ratios for a weighted combinationof expression levels into a summarized pathway activity finger-print aj (ajfrac14Ri

n LLRi (xij)) assuming independent contributions

of the LLRs for different member genes The authors comparedtheir approach on two breast cancer data sets against gene-level

2 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

predictors the meanmedian pathway summarization theSVD- and CORG-based techniques described above and theirmethod achieved the highest predictive performance in bothwithin-data set cross-validation experiments and one out oftwo cross-data set prediction experiments (the mean summar-ization scored higher in one case)

While the techniques discussed above mostly summarizepathway activity via (weighted) averages or dimension reduc-tion approaches it was noted that disease-associated pathwayalterations may affect the variance of expression levels in path-ways more strongly than the mean Generally variance pat-terns in disease-related omics data can reflect inter-patientvariation (eg differences in the etiology or disease stage) orintra-patient fluctuations (eg onoff phases as those observedin Parkinsonrsquos disease or imbalances in the activity of specificcellular pathways) and the study of both of these sources ofvariation may be important to improve the understanding ofthe disease Glaab et al [31] therefore proposed a variance-basedapproach to identify discriminative pathway signaturesproviding a possibility to account for variation in the activity ofcellular processes for a single patient and for variation acrossmultiple individuals and combined it with diverse classificationapproaches as part of a public Web application Apart frommodifying the pathway activity representation they developeda rule-based learning method comparing activities in pairs ofpathways rather than comparing activities of single pathwaysagainst fitted threshold values to increase the robustness ofpathway-based classification models [37] Rule-based modelshave high stability with regard to small variations in the inputdata and facilitate the biological interpretation of predictionmodels by using only combinations of simple if-then-else rulesfor sample classification [38]

A further strategy to improve classifier robustness is to usediscretized pathway activity scores or counting statistics egSvenson et al [32] presented an approach that first determines

how many genes are up- or downregulated in a pathway using apredefined fold-change threshold This information was com-bined into a ratio score which equals 1 if all pathway membersare upregulated 0 if equal numbers of genes are up- anddownregulated and 1 if all genes are downregulated Aftercomputing this ratio score for all pathway gene sets and train-ing samples from a microarray case-control study they appliednearest centroids classification to the pathway ratio scores toclassify new test set samples In comparison with gene-levelclassification they report that the proposed classification ap-proach performed better for all tested training set sizes

Graph-based and hierarchical pathway activity scoring

Cellular pathway diagrams do not only capture information onthe functional relatedness of biomolecules involved in a path-way but also on the network topology of regulatory or molecu-lar interactions that link them together

Efroni et al [33] presented one of the first pathway analysisapproaches which not only scores pathway activities in omicsdata but also the consistency of molecular activities withinpathways given the topology of interactions between theirmembers Considering the input and output genes for a regula-tory interaction their probabilities of being in an under- oroverexpressed state in a condition of interest can be estimatedfrom gene expression measurements although noise in thedata may make the estimates highly uncertain A consistencyscore can then be obtained by comparing the state probabilitiesfor output genes against their inferred state probabilitiesderived from the estimated states of the interconnected inputgenes and the probability of the relevant interactions to be ac-tive or inactive Thus averaged pathway consistency scores canbe computed and used as predictive features in combinationwith pathway activity scores to classify biological samples rep-resenting different conditions Evaluating this approach on

Table 1 Overview of pathway-based machine learning approaches for supervised classification of omics samples

Publication Pathway activity scoring method Prediction method

Guo et al [27] Mean or median expression levels in GO modules Decision tree classificationTomfohr et al [28] Expression levels are summarized via the first eigenvector

from SVD analysisFocus on predictive feature selection via

t-statistics (any classifier is applicable)Lee et al [29] Normalized sum of expression levels for CORGs within

pathwaysLogistic regression

Su et al [30] Weighted sum of LLRs for pathway members Logistic regression or linear discriminantanalysis

Glaab et al [31] Variance across pathway member activity is quantified andcompared instead of averaged pathway activity

SVM random forest nearest shrunken centroidclassifier or ensemble learning

Svenson et al [32] For each gene set and patient a ratio score is defined de-pending on the number of up- and down-regulated mem-bers in the gene set

Nearest centroids classification

Efroni et al [33] Joint scoring of pathway activity and consistency usinginteractions within pathways

Bayesian linear discriminant classifier

Vaske et al [34] Probabilistic inference is used to predict pathway activitiesfrom a factor graph model of genes and their products

Cox proportional hazard regression (survivalprediction)

Breslin et al [35] Signal transduction pathways with directional interactionsare used to define the activity of a pathway based on theaverage expression of its downstream target genes

No classification is performed but contingencytables show associations between sample-wise pathway activity and clinical sampleclassifications

Kim et al [12] Hierarchical feature vectors are used assigning a two-levelhierarchical structure to the featuresgenes determinedby their pathway membership

SVM

Averaging and dimension reduction approaches are listed on top whereas graph-based and hierarchical pathway activity scoring methods are listed below the bold

black line

Diagnostic specimen classification | 3

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

microarray cancer data by using the new pathway-level featuresas input for a Bayesian Linear Discriminant Classifier the au-thors obtained similar or better predictive performance as com-pared with previously reported results for the test data

Pathway topology information is also exploited in the approachby Vaske et al [34] who converted cellular pathway diagrams intoa factor graph model describing the interactions between theinvolved biomolecules The method can integrate informationfrom different types of omics data for a biological sample and usesan Expectation Maximization (EM) algorithm to infer probabilisticpathway activity scores for each pathwaysample combinationWhen applying the approach to glioblastoma multiform gene ex-pression and copy number data and grouping patients based ontheir pathway alterations the identified subgroups displayed sig-nificant differences in quantitative survival in a KaplanndashMeier ana-lysis Robust patient stratification results were only obtained usingthe pathway-based scores and not via gene-level analyses

Apart from considering the topology of molecular interactionswithin a pathway studying alterations in all the downstream tar-gets of pathway members (ie including targets not belonging tothe pathway) may provide an additional source of information tofurther improve pathway activity scoring Breslin et al [35] pre-sented a corresponding downstream analysis approach whichdetects variation in pathway activity from the degree of correl-ation between downstream targets of a pathway Using two cor-relations scores one considering all pairs of downstream targetsand one considering only pairs without common transcriptionfactors they identified several differentially active pathways inmicroarray data from case-control studies Additionally theydefined an activity score for pathways based on the average ex-pression of their downstream target genes and showed that thesample-wise activities for several pathways are significantlyassociated with clinical sample classifications

Finally although all approaches considered so far use pathway-level sample classification as a replacement for gene-level classifi-cation these are not exclusive options Kim et al [12] proposed tobuild biomarker models from gene expression data at two levelsthe gene and pathway level via a hierarchical feature structureThis structure organizes the individual gene-level features into se-cond-level aggregate features (pathways) used by a support vectormachine (SVM) algorithm to classify samples The authors con-sidered three alternative approaches GLEG (GSEA-based LeadingEdge Gene feature method using predictor genes derived fromgene set enrichment analysis [19] which as a group maximally dif-ferentiate between the sample classes) GPF (GSEA Pathway Featuremethod using enriched pathways as features) and SPF (SVM-basedPathway Feature method using pathway features derived fromSVM) When comparing these methods against classical gene-leveland random gene set predictors using cross-validation and cross-study prediction on microarray cancer data in the majority ofcases the newly proposed methods achieved better results

In summary a wide range of pathway activity inferencemethods and machine learning approaches are available whichcan be combined effectively for pathway-based diagnosticclassification of biological samples While pathway-basedprediction does not necessarily provide superior predictive per-formance in within-data set cross-validation experimentsimproved robustness and accuracy was observed consistentlyin cross-study prediction tasks

Network-based biomarker modeling

Although manually curated pathways have many benefits forthe biological interpretation of large-scale omics data in living

cells metabolic and signaling pathways are not isolated butinterconnected within large and complex molecular and regula-tory networks These networks often include several genes pro-teins or metabolites that are not annotated for any pathwayand therefore ignored by pathway-based analysis methodsConsequently to identify disease-associated modules of inter-connected biomolecules in a more unbiased manner (ie with-out restricting the search space to biomolecules with knownpathway annotations) network-based analysis methods havebeen introduced While pathway-based approaches for bio-marker modeling may have advantages in terms of model inter-pretability the search space exploration in network-basedbiomarker discovery is not restricted by subjectively definedpathway boundaries and the genome-scale molecular networksused as input typically cover significantly larger numbers of bio-molecules than all combined pathways Nevertheless similar tosubjectively defined pathways networks assembled from publicdata sources suffer from various limitations eg missing mo-lecular interactions and lack of tissue-specific annotations andthese issues have to be addressed by dedicated methods (seesection on lsquolimitations and possible solution strategiesrsquo below)In the following two main types of network-based modelingapproaches will be discussed First two-step sequentialapproaches which score the activity in molecular subnetworksand afterward use these activities for predictive machine learn-ing and secondly one-step network analysis approacheswhich exploit network topology information directly within thepredictive model building

Two-step network activity scoring and predictionapproaches

Network activity over multiple interconnected biomoleculescan be summarized and scored using similar averaging or di-mension reduction approaches as in pathway activity scoringmethods However in contrast to the straightforward use ofpredefined pathway definitions first a molecular or regulatorynetwork has to be assembled or reconstructed using either pub-lic molecular interaction databases or applying network infer-ence methods to omics data (in Table 2 an overview of differentmethodologies is shown which are discussed in the following)

A first method to construct new sample-specific gene regu-latory networks for transcriptomics sample classification wasproposed by Tuck et al [39] The networks were generated bydetermining the graph-theoretic intersection between a staticconnectivity network (representing transcription factor bindingto gene promoter regions) obtained using data from theTRANSFAC database [50] with sample-specific coexpressionnetworks (representing transcription-factorndashtarget genecoexpression) derived from gene expression data To extractdiscriminative features for diagnostic specimen classificationfrom these networks they proposed a link-based classificationapproach comparing the activity status of gene regulatoryinteractions (called lsquolinksrsquo) across different sample groups anda degree-based classification method comparing topologicalcentrality measures [51] for the networks When testing theseapproaches on data from different cancer case-control studieshigh cross-validated accuracies were reported for both cell typeand patient sample classification Moreover the network-basedanalysis enabled the authors to identify key transcriptionalregulators altered under specific disease conditions

Instead of constructing new regulatory networks discrim-inative disease-associated network alterations can also be iden-tified by computationally mapping omics data onto in silico

4 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

representations of biochemical proteinndashprotein interaction (PPI)networks Ma et al [40] developed a corresponding approach toobtain more reliable disease association scores for genes by ex-ploiting neighborhood information from a PPI network Theyused a modified Pearson correlation coefficient to assess the as-sociation between microarray gene expression and numeric val-ues encoding the disease status of the samples (taking intoconsideration that these phenotype values may not have a nor-mal distribution) and assigned the Fisher-transformed gene-phenotype association scores to the corresponding proteins in aPPI Next they recalibrate these association scores by modelingthe underlying true scores for each gene using Markov RandomField theory [52] reestimating their values from weighted con-tributions of their network neighborsrsquo original associationscores (the weights are determined according to different net-work neighborhood definitions using either direct neighborsshortest path or diffusion kernel neighborhoods see [40] fordetails) When evaluating the utility of the recalibrated scoresfor disease gene prioritizations on microarray data using knownGene Ontology functional annotations conventional

prioritization approaches using only gene expression or PPI datawere outperformed (although the scoring approach could alsobe used for predictive model building this particular applicationwas not considered)

While the approach by Ma et al focuses on improving thedisease-association scores for individual genes Chuang et al[41] presented a method identifying and scoring entire disease-related subnetworks similar to their pathway association scor-ing approach discussed above (see Lee et al [29]) After comput-ing the mutual information (MI) between sample phenotypevalues (encoding the presence or absence of a disease) and dis-cretized expression values for each gene from a microarray dataset assigned to the proteins in a PPI they applied a greedysearch to expand the seed nodes in the network with locallymaximal MI scores Specifically each seed node was expandedsuch that the sum of scores for the expanded network moduleis maximized (the search stops when no extension increasesthe total score above a predefined improvement rate) Whentraining logistic regression classifiers on the normalized andaveraged activities of the resulting subnetworks for breast

Table 2 Overview of network-based methods for machine learning analysis of omics data sequential network activity scoring and predictionmethods are shown on top whereas machine learning approaches using embedded network-based feature selection are listed below the boldblack line

Methodologypublication

Network activityalteration scoring method Prediction method

Tuck et al [39] Sample-specific gene regulatory networks are constructed andsubnetwork activity is scored by summing over activeinteractions

Nearest neighbors decision tree NaıveBayes among others

Ma et al [40] Disease association is scored for genes based on gene expres-sion data and their neighborsrsquo association scores in a PPI net-work using Markov Random Field theory

The approach is evaluated for disease geneprioritization but is applicable for predict-ive feature selection in combination withany prediction method

Chuang et al [41] Normalized gene expression data is mapped onto a proteininteraction network and discriminative subnetworks areidentified via a greedy search procedure

Logistic regression

Taylor et al [42] Hub nodes in protein interaction networks are determined andthe relative gene expression of hubs with each of their inter-acting partners is computed to identify hubs with diverserelative expression across sample groups

Affinity propagation clustering is used to as-sign a probability of poor prognosis tobreast cancer patients

Petrochilos et al [43] A random walk community detection algorithm is applied todiscover modules in a molecular interaction network andgene expression data is used to identify disease-associatedmodules

The approach is used to identify cancer-asso-ciated network modules and validated byscoring the enrichment of known cancer-related genes extracted from the OMIMdatabase

Rapaport et al [44] Spectral decomposition of gene expression profiles is appliedwith respect to the eigenfunctions of a network graphattenuating the high-frequency components of the expres-sion profiles with respect to the graph topology

SVM

Li et al [45] A network-constrained regularization procedure for linear re-gression analysis is used to identify disease-related discrim-inative subnetworks

Penalized linear regression

Yang et al [46] Three machine learning methods for graph-guided feature se-lection and grouping are proposed including a convex func-tion and two non-convex formulations designed to reducethe estimation bias

Penalized least squares-based approach(GOSCAR Graph octagonal shrinkage andclustering algorithm for regression)

Lorbert et al [47 48] A sparse regression approach is proposed using the PEN pen-alty to favor the grouping of strongly correlated featuresbased on pairwise similarities (eg derived from a molecularinteraction graph)

Penalized regression (PEN penalty)

Vlassis et al [49] Penalized logistic regression is applied using a convex PEN pen-alty function (see approach by Lorbert et al) with absolutefeature weights to better reflect the relevance of discrimina-tive genes in the feature selection

Penalized logistic regression (PEN penaltywith absolute feature weights)

Diagnostic specimen classification | 5

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer data the authors found that the subnetwork markerswere more reproducible than single-gene markers andprovided higher accuracy in distinguishing metastatic fromnonmetastatic tumors

As an intermediate solution between focusing on individualbiomolecules and entire network modules Taylor et al [42] pro-posed a method that investigates network nodes with outstand-ing topological properties and their direct neighbors Aftercomputationally mapping breast cancer gene expression dataonto the in silico representation of a PPI network they deter-mined proteins with large numbers of biochemical interactionpartners (so-called lsquohub nodesrsquo) and computed their relative ex-pression compared with each of these interacting partnersThey then determined for which hubs the relative expressiondiffered significantly between long-term survivors and patientswho died from the disease and applied a clustering approach toassign a probability of poor prognosis to new patient samples(the specific method used is known as lsquoaffinity propagationclusteringrsquo in the literature) The approach was evaluated using5-fold cross-validation providing accuracy sensitivity and spe-cificity estimates that compared favorably with the reportedresults for commercially available genomic breast cancerdiagnostics

Instead of considering topological properties of individualnodes in a molecular network information from a networkgraph can also be extracted via algorithms for finding sub-graphs which stand out in terms of their high density ofmolecular interactions (using approaches referred to as lsquocom-munity identificationrsquo or lsquograph clusteringrsquo methods in the lit-erature) Petrochilos et al [43] proposed a correspondingapproach which first applies a graph-based random walk algo-rithm on a genome-scale molecular network Information fromcancer-related gene expression data was then integrated intothe network by setting the weight of each network node as themaximum fold change of probes corresponding to its gene sym-bol (weights for biochemical interactions are determined by thesquare of the mean of the absolute fold changes of the relevantinteraction partners) Finally the score of a network module ofconnected nodes was obtained by comparing its cumulative ac-tivity (ie the square of the average weighted expression for allits nodes) against a bootstrap distribution of cumulative activ-ities obtained via random sampling of a matched number offold change values When testing the enrichment of known can-cer genes in the top-scored network modules identified withthis approach a similar or better performance was reached incomparison with other widely used module-finding algorithms(potential alternative applications of the identified modules forbiomarker modeling were not assessed in this publication)

Apart from averaging molecular activities over networkneighborhoods or using community identification methods sig-nal processing techniques may provide a further means to gleanuseful information from a network for predictive model build-ing as shown in an approach by Rapaport et al [44] They madeuse of the observation that genes in close proximity to eachother in a network tend to have similar expression and pro-posed to denoise microarray measurements by removing theirhigh-frequency component over the network For this purposespectral decomposition of gene expression profiles with respectto a molecular network graph was applied followed by attenu-ating high-frequency signal components expected to representmeasurement noise The method was evaluated for supervisedanalysis of irradiated and nonirradiated yeast strains using aSVM providing similar classification performance as a modelbuilt without the network-based filtering but facilitating

biological data interpretation by grouping the selected biomol-ecules according to their participation in the network modules

One-step machine learning approaches for networkanalysis

In contrast to the network analysis approaches considered sofar which apply network feature extraction and predictive ma-chine learning analysis in separate steps more recently one-step network-based feature selection approaches have beenproposed integrating the attribute selection directly into thepredictive model building Most of these approaches formulatethe model building task as an optimization problem formula-tion in which the objective function for classification or regres-sion is extended by a penalty term promoting the selection ofgrouped features in a molecular network (this strategy is alsoreferred to as network-constraint regularization)

Li et al [45] proposed one of the first correspondingapproaches by adding a penalty term to linear regression incor-porating network information into the analysis via theLaplacian matrix of the network graph The approach penalizesthe L1-norm of the feature weights and encourages a smoothprofile of weights over neighboring nodes in the networkHowever Binder and Schumacher later reported that themethod has lower performance than a null model ie a modelnot using any covariate information [53] As possible explan-ations they note that Li et al discarded censored observationsand about 20 000 variables that could not be assigned to corres-ponding nodes in the molecular interaction network (see sec-tion on lsquolimitationsrsquo below) Yang et al [46] suggested that thepreviously used network grouping penalties can introduce add-itional estimation bias into the model when the coefficientsigns for two features connected in the graph are differentThey presented alternative penalties to achieve network group-ing and sparse feature selection in particular two non-convexpenalties which shrink only small differences in absolute val-ues of feature weights to reduce the estimation bias [46] In ex-periments on synthetic data and two real data sets the newapproaches outperformed previous feature grouping methods

However with non-convex penalties finding global optimalsolutions is often not feasible and even identifying good localoptima may require high computational effort Lorbert et al [4748] proposed an alternative generic convex penalty thePairwise Elastic Net (PEN) that provides sparse feature selectionand promotes the grouping of attributes according to auser-defined feature similarity measure (eg obtained from bio-chemical interaction weights in a molecular network) PEN is ageneralization of the Elastic Net a method providing a trade-offbetween L1- and L2-penalized regressions by an adjustable par-ameter In PEN this parameter can be replaced to determine thetrade-off using additional information from an attributesimilarity matrix (different instances of PEN can be defined aslong as the similarity matrix is positive semidefinite and non-negative) Comparing PEN against other popular machine learn-ing approaches on simulated data with a grouping structureamong the features PEN achieved a competitive mean squarederror (MSE) and provided sparser solutions than approacheswith similar MSE

More recently Vlassis et al introduced a new instance ofPEN which penalizes differences between the lsquoabsolutersquo valuesof weights of interlinked features in a network graph The mo-tivation behind this approach termed GenePEN is that themagnitude of a weight in a linear model reflects the predictivevalue of the corresponding variable so that the weights for

6 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

irrelevant features are driven to zero by the penalty By ensuringthe convexity of the penalty function global optimal solutionscan be identified efficiently with existing optimization frame-works When evaluating GenePEN on simulated data andreal-word microarray data sets in comparison with other classi-fication methods using feature grouping the method providedsimilar predictive power and gene selections sharing signifi-cantly more connections within a molecular interaction net-work Visualization of the corresponding subnetworks enableda biological interpretation of disease-affected network regionswhich were enriched in known disease-related genes obtainedfrom literature mining

Overall network-based sample classification methods pro-vide a new means to analyze complex omics data sets allowingresearchers to identify coherent molecular network alterationsunder different biological conditions Identifying such network-level patterns in omics data for diseases with complexmolecular manifestations can shed new light on the molecularmechanisms of the disease and facilitate the development ofrobust multifactorial biomarker signatures

In contrast to single-molecule-based biomarker modeling anetwork-level approach has the potential to capture diverse fac-ets of a heterogenic disease reflected via alteration patterns indifferent network regions In comparison with pathway-basedmachine learning approaches methods using genome-scalednetworks as prior knowledge may produce models that are moredifficult to interpret biologically but which can identify a muchwider range of alterations in cellular processes (covering manygenes proteins or metabolites without any known pathway an-notations) Finally network- and pathway-based classificationapproaches share the main benefit of improving model robust-ness in cross-study analyses as compared with using individualbiomolecules as features Among these new higher-level bio-marker signatures network-based signatures accounting formolecular activities over larger and algorithm-derived networkregions may often provide more robust multifactorial markersthan signatures for smaller pathways which are typicallydefined subjectively possibly overlooking relevant functionallyrelated molecules in the surrounding network However themodel robustness will also depend on other factors eg the oc-currence of protein complexes in the studied pathwaynetwork(members of these complexes tend to have highly coordinatedactivity providing more robust averages) and the reliability andcompleteness of the specific network or pathway data sourceused (see limitations discussed in the following section)

Limitations and possible solution strategies

While pathway- and network-based analyses of omics data canenrich the biological interpretation of complex molecular alter-ation patterns and facilitate robust biomarker modeling usersshould also be aware of common limitations behind thesemethods and possibilities to address them

Although detailed functional annotations have been col-lected for a large portion of genes in humans and commonmodel organisms many identified genes still lack any func-tional assignment Consequently a significant portion of genesand proteins analyzed via current high-throughput measure-ment techniques can often not be assigned to any pathway ormolecular network for systems-level analyses Similarly sev-eral gene regulatory interactions and PPIs are still unknownand experimental techniques to identify molecular interactionsmay also generate false-positive results [54] Thus public mo-lecular interaction databases only cover a subset of biochemical

interactions occurring in living cells and the reliability of re-ported interactions varies depending on the amount and type ofassociated evidence

A further limitation of current pathway databases andassembled molecular networks is that tissue- and cell-type spe-cificity of molecular interactions and protein functions is typic-ally not considered in these data resources This can beparticularly problematic when studying diseases showing se-lective vulnerability for specific tissues or cell types eg asfound in many neurodegenerative disorders [55] Similarly inmany biomarker profiling studies and public omics data sour-ces concomitant data from different tissue and cell types for adisease of interest is lacking as well as longitudinal datarequired to identify relevant dynamic changes during diseasepathogenesis and progression Moreover a variety of limita-tions related to the specific origin processing and storage ofspecimens collected for biomarker studies may negatively im-pact subsequent analyses and the comparability of data fromdifferent studies For example studies might differ in terms ofthe sample storage durations in terms of whether the referencetissue for tumor tissue is derived from the same patient or froman unaffected control individual whether a preinfarction speci-men is accepted as a reference for the specimen obtained on in-farct suspicion or whether laser capture microdissection isapplied to isolate specific cells of interest rather than using het-erogeneous cell populations

To account for the varying reliability of public biochemicalinteraction data when assembling a molecular network variousapproaches to determine confidence scores have been proposed[56ndash59] One of the most comprehensive resources is theSTRING PPI database which allows users to filter and extractbiochemical interactions for a wide range of organisms usingdifferent sources of evidence as well as a combined confidencescore [60 61]

A frequently encountered difficulty in pathway analyses isalso the conversion of the information content of public path-way diagrams into adequate data structures for computationalinvestigations Pathway structures usually contain more com-plex layers of information than genendashgene networks eg meta-nodes representing protein complexes or gene families andcatalysis reactions represented by edges pointing to edgesDedicated computational approaches are therefore required toconvert pathway diagrams into uniform network representa-tions eg using biology-driven rules as in the RBioconductorsoftware package lsquographitersquo [62]

Apart from this conversion task building tissue-specificpathwaynetwork representations or assigning probabilities fortissue specificity to the biological interactions in an existingnetwork is a further challenging problem because interactionsare rarely confirmed experimentally in multiple specific tissuesMost bioinformatics approaches for assigning tissue specificityscores to protein interactions use publicly available gene or pro-tein expression data from large-scale tissue profiling studies[63 64] eg predicting the potential for a molecular interactionto occur in a certain tissue depending on whether both inter-action partners are expressed in this tissue [65] Moreover dedi-cated pathway- and network-analysis methods have beendeveloped for tissue-specific disease gene prioritization andgene set enrichment analysis [23 66] which may also provideuseful filters for feature selection in machine learning methodsfor sample classification

While changes in the mean or variance of the overall mo-lecular activity in pathways can provide robust biomarkers forsome diseases pathways may also contain forks and only one

Diagnostic specimen classification | 7

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

of their branches or stems may be affected in a pathologicalcondition Approaches focusing on changes in the overall path-way activity may therefore fail in detecting significant alter-ations affecting only one pathway branch Although networkanalysis methods identifying altered subnetworks may be help-ful in this situation (using a pathway graph as input instead of agenome-scale molecular network) ideally available knowledgeon distinct pathway branches or outstanding regions in a com-plex pathway architecture should be exploited directly eg bypartitioning a pathway into corresponding lsquosubpathwaysrsquo andseparately analyzing their molecular alterations using existingpathway-level bioinformatics tools

Another challenging and still largely unsolved problem inthe analysis of alterations for pathways networks and singlebiomolecules is the distinction between omics changes repre-senting causes or effects (or lsquodriversrsquo or lsquopassengersrsquo) of a dis-ease Apart from philosophical disagreements on howcausation should be defined [67] the discrimination betweencorrelated events and causally linked events on high-through-put omics data sets is typically severely hampered by largenumbers of potential confounding variables and complexnonlinear relationships between the attributes Thus conclu-sive evidence for causal links can usually only be obtained fromin vitro and in vivo perturbation experiments rather than fromhigh-throughput data analysis even when using high-qualitytime-series data with large sample sizes and dedicated statisticsfor identifying causendasheffect relationships (eg the Granger caus-ality test [68]) Prior biological knowledge eg on transcriptionfactorndashtarget relationships can however enable the construc-tion of causal graphs to check whether omics measurementsare consistent with certain causal molecular hypotheses or toprioritize corresponding hypotheses for experimental validation[69] Bill Shipley has written a dedicated book on methods forcausal analysis of observational data in biology [67] Theseapproaches are not only relevant for identifying drug targetsand investigating disease etiology and progression but mayalso provide useful information for the biological interpretationof diagnostic biomarker models

Improving incomplete pathways and networks by identify-ing missing molecular interactions is arguably one of the mostchallenging problems in this domain Pathway definitions aretypically created in a subjective manner because no universalpredefined criteria for determining pathway membership andboundaries are available and the curators of the correspondingdatabases are required to make judgments on the relevance ofpotential pathway members These subjective definitions maybe prejudiced and erroneous and even networks assembledautomatically from public databases may contain biases eg re-sulting from prior subjective decisions on which putative PPIsare of sufficient interest for experimental validation Moreoverin diseases the pathway topology may change and genericpathway databases have recently been complemented by dedi-cated pathway maps for specific human disorders egAlzPathway [70] for Alzheimerrsquos disease and PDMap [71] forParkinsonrsquos disease Because these disease-related pathwaymaps typically cover the biomolecules involved in a specific dis-order and the potentially altered network topology more com-prehensively and accurately than generic pathway mapsdisease-specific pathways should be considered as the pre-ferred resource for pathway-based diagnostic biomarker model-ing (in the specific case of a known disease-associated topologyalteration ideally a corresponding pathway map with un-altered topology should be used additionally as a reference inthe modeling process)

Moreover dedicated bioinformatics techniques have beendeveloped to identify lsquonetwork rewiringrsquo events eg brokenchains or new interactions changing the pathwaynetwork top-ology which remain undetected by conventional pathway ana-lyses Instead of assuming that molecules close to each other ina network undergo similar changes in a disease theseapproaches identify alterations in the correlations between theactivities of these molecules as signatures for potential patho-logical changes in network topology [72ndash74] However the top-scoring results require experimental verification eg using invitro disease models and the success may strongly depend onwhether reliable prior pathway knowledge is available

To support relevant manual pathway curation efforts com-putational approaches for ab initio discovery of pathways withingenetic or molecular networks have been proposed [75ndash78] aswell as bioinformatics methods to extend existing pathway def-initions according to objective criteria for pathway compactnessand connectedness [79] Moreover various machine learningapproaches for predicting unknown PPIs using public structuraland functional information have been developed [80ndash84] egthe authors of the PrePPI method report validation results sug-gesting their prediction algorithm is comparable in accuracy tohigh-throughput experiments [85] For proteins with unknownstructure and few functional annotations these in silico predic-tion approaches are however still bound to provide limited reli-ability and only high-confidence biochemical interactionsshould be included in networks for omics data analysis

Finally to reduce the influence of false-positive predictionson network analyses molecular interactions should not only beprefiltered by confidence when assembling the network butalso weighted by confidence to apply analysis techniquesexploiting these weights to score the reliability of identified net-work alterations In subsequent biological validation experi-ments investigators can then focus on verifying only the mostreliable network alteration patterns derived from the in silicoanalyses

Translating multifactorial marker models intodiagnostic testsPathways from omics research discoveries to clinicaldiagnostic assays

High-throughput omics measurement techniques are typicallynot designed for diagnostic applications but for broad systems-level analyses hypothesis generation and the construction offirst tentative machine learning models for sample classifica-tion Such tentative models require subsequent refinement andvalidation using more sensitive and reproducible measurementtechniques to evaluate their potential for diagnostic applica-tions For example a sample classification model built andcross-validated using microarray gene expression data with anembedded feature selection to choose only the most inform-ative genes as predictors can be validated using more accuratequantitative reverse transcription polymerase chain reaction(qRT-PCR) measurements for the subset of chosen genes

Importantly to avoid wrong conclusions in the evaluation ofdiagnostic classification models adequate statistical methodshave to be chosen to assess a modelrsquos overall predictive per-formance (quantifying how close predictions are to the actualoutcome) its calibrationreliability (measuring how close to x of100 individuals with a risk prediction of x have the outcome)and its discriminative ability (determining whether individualswith the outcome have higher risk predictions than those

8 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

without) [86] Owing to the inherent uncertainty associated withdiagnostic tasks predictions should be provided in a probabilis-tic rather than deterministic form [87] and the overall per-formance should be quantified using so-called lsquoproper scoringrulesrsquo for which the expected score is optimized when the pre-dictive distribution agrees with the true distribution of thequantity to be estimated (a corresponding example is the Brierscore for binary and survival outcomes [88]) Conversely the op-timization of models with respect to conventional discontinu-ous non-error rates like the percentage correct classificationcan provide misleading results eg when the predicted proba-bilities are close to the chosen decision threshold required forthese measures [89] To assess the calibration of a model theHosmerndashLemeshow lsquogoodness-of-fitrsquo test can be used [90] andthe concordance statistic to quantify the discrimination ability[91] If a reference prediction system is already available dedi-cated measures of the relative improvement achieved with anew prediction method should additionally be computed(referred to as the lsquoskillrsquo eg quantified via the Brier Skill Score[92]) Moreover decision-analytic approaches like decision curveanalysis [93] designed to assess the net benefit achieved makingdecisions according to model predictions should be applied if themodel is to be used to direct clinical patient management [86]

For the study design initial power calculations are requiredto ensure that sufficient sample sizes are available for all statis-tical assessments [94] This also involves choosing an adequatesplitting of the measured data into training test and validationsets and selecting suitable cross-validation or resampling tech-niques for model optimization and evaluation (eg using two-level external cross-validation [95]) [96]

Importantly clinical validation does not only necessitatesignificantly larger sample sizes than most research studiesbut also independent replication tests on data from other pa-tient cohorts the clear specification of the biological rationalebehind the method and a demonstration of its clinical utility Incontrast to the regulatory framework for drugs there are mul-tiple pathways for the translation of omics-based tests intovalidated in vitro diagnostic test devices These tests can bedeveloped and validated either through review by the Food andDrug Administration (FDA) or through validation and perform-ance by a specific laboratory certified according to ClinicalLaboratory Improvement Amendments (CLIA) [97]

Because using established medical product developmentpipelines as in pharmaceutical companies is not common prac-tice in academia for many biomedical research institutions anearly collaboration with an experienced industrial partner isoften advisable Although currently no unique and widely rec-ognized standard process for translating omics researchfindings into clinical diagnostics is available common recom-mendations by widely recognized health organizations can befollowed In particular a committee by the US Institute ofMedicine has conducted a study on omics-based clinical test de-velopment and proposed a generic process for the developmentand evaluation of these tests as a recommended guideline [97]A corresponding example process which is briefly outlined forillustration purposes in Figure 1 and not meant to cover all im-portant variations starts with the discovery phase in which acandidate biomarker model is built on a training set lockeddown and evaluated on a test data set (this set of samplesshould be completely independent from the training set) In thefollowing test validation phase after institutional review boardapproval and consultation with the FDA a CLIA-certified labora-tory defines and optimizes the diagnostic test method clinicallyand biologically validates the test on a blinded sample set and

implements the test according to current clinical laboratorystandards

Interestingly the authors of the guideline highlight that afrequent shortcoming of omics-based tests is the lack of a biolo-gical rationale behind the testmdashwhile single-molecule markersare often known to play a role in the disease multifactorialomics models obtained from machine learning are often moredifficult to interpret and involve a greater risk of overfittingNew pathway- and network-based modeling techniques as dis-cussed in this review could therefore help to address some ofthese shortcomings and provide more interpretable and robustmodels as opposed to classical lsquoblack boxrsquo machine learningmodels

In the following stage of the clinical development processthe locked-down test is evaluated for clinical utility via one ofthe following approaches (i) A prospectivendashretrospective studyusing archived specimens from previous clinical trials (ii) a pro-spective clinical trial in which the test (a) directs patient man-agement or (b) does not direct patient management [97] Thecomplexity and duration of a corresponding clinical study ortrial will largely depend on the specific type of biomarker de-veloped and the proposed clinical benefit For diagnostic bio-markers focused on in this review procedures may varysignificantly depending on whether the test is designed to de-tect the presence severity or the subtype of a diseasePrognostic biomarkers which indicate the future clinical courseof a patient with regard to a specific outcome and predictivebiomarkers which predict the responders and the extent of sus-ceptibility to a particular drug effect will also require differentdevelopment and evaluation procedures than diagnosticmarkers Finally for each type of biomarker different clinicalbenefits may be envisaged and significantly influence the de-sign of a study eg the goal to choose more appropriate treat-ment options or the objective to diagnose the disease earlier toenable more effective therapies to prevent halt or slow downits progression

Previous success stories in omics-based development ofdiagnostic assays

A variety of multifactorial omics-based biomarker models havebeen translated successfully into diagnostic tests in recentyears in particular in the field of cancer sub-type stratificationA prominent example is the Oncotype DX test to assess a wom-anrsquos risk of recurrence of early-stage estrogen-receptor-positivebreast cancer and the likelihood of benefitting from chemother-apy after surgery This test measures the activity of 21 genes intumor samples and then determines a recurrence score numberbetween 0 and 100 (higher scores reflect greater risk of recur-rence within 10 years) As opposed to other diagnostic testsusing frozen samples the Oncotype DX assay uses tumor tissuesamples that are chemically preserved and sealed in paraffinwax (see [98 99] for details on the sample collection andanalytics)

The development of Oncotype DX involved typical steps ofan omics biomarker profiling and top-down filtering approachfirst by analyzing the entire transcriptome on high-throughputmicroarray data and using knowledge from the literature andgenomic databases 250 candidate marker genes were selected[98] The relation between the expression of these candidatesand the recurrence of breast cancer was then assessed in datafrom three independent clinical studies on 447 patients The re-sults were used for a final filtering providing a panel of 16 can-cer-related genes and 5 reference genes whose expression

Diagnostic specimen classification | 9

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

levels enabled the calculation of recurrence scores for tumorsamples via a machine learning model This diagnostic ap-proach was successfully validated in multiple clinical studiesand has been included in the treatment guidelines for breastcancer by the National Comprehensive Cancer Center Networkand the American Society of Clinical Oncology

While in the case of the Oncotype DX test the set of requiredmarkers could be narrowed down to a small number of geneswith prior knowledge on their relationship to the disease forother complex and more heterogeneous diseases significantlylarger numbers of molecular predictors may be needed for ac-curate diagnosis In such cases pathway- and network-basedmodeling approaches may facilitate the generation of robustand biologically interpretable models which could afterwardundergo similar diagnostic test development and validationprocedures as the initial model behind the Oncotype DX assayImportantly the success of the Oncotype DX approach is not anisolated case but other commercial diagnostic assays havebeen developed and validated using similar strategies includingMammaPrint [100] Prosigna (PAM50) [101] Mammostrat [102]Tissue of Origin [103] AlloMap [104] Corus CAD [105] and OVA1[106] among others

In summary successful translation of omics-based bio-marker models into clinically accepted commercial diagnostictests has been achieved in multiple cases in the past Given alarge number of complex diseases for which more reliable ear-lier and cheaper diagnostic tests are still needed there is a

significant potential to develop improved approaches usingomics-based biomarker modeling and exploiting prior biologicalknowledge from pathways and molecular networks

Conclusions

For diseases with complex molecular manifestations omics-based biomarker models have the potential to better reflectclinically relevant aspects of their heterogeneity than classicalsingle-molecule diagnostic markers However various statis-tical limitations in high-dimensional data analysis often result-ing in a high risk of overfitting and restricted biologicalinterpretability of standard lsquoblack boxrsquo machine learning mod-els can hamper the progress in developing new clinically usefulbiomarker signatures

Among possible strategies to reduce model complexity andintegrate prior biological knowledge into the model buildingprocedure cellular pathway- and molecular network-based pre-diction methods provide promising new routes toward more re-liable omics-based biomarker signatures A wide variety ofbioinformatics approaches have already been developed in re-cent years to identify disease-associated pathway and networkalterations in omics data and use them to classify new samplesWhile public molecular networks and pathway definitions alsohave specific limitations that need to be addressed by the usereg by assigning confidence weights to molecular interactionspathway- and network-level predictive analyses often lead to

Figure 1 Example illustration of common stages during the development of omics-based diagnostic tests (simplified version of the process presented in a study by the

US Institute of Medicine [97] focusing on the major steps in the pipeline) After the transition from the second to the third phase (highlighted by the lock symbol) the

diagnostic test must be fully defined validated and locked down Many important variants and alternatives to the outlined example process exist as well as different

realizations of generic steps in the process (eg cases in which a test directs patient management may cover different situations depending on whether clinicians are

free to use the test result as they see fit or whether predefined procedures have to be followed subject to contraindications andor subject to the test results) The setup

may also vary depending on whether it is known exactly how patients would have been treated had they been randomized to the opposite arm depending on whether

the test entails a treatment delay and whether the adequate cutoff threshold for the test is uncertain

10 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

classification models with improved interpretability and robust-ness in cross-study prediction tasks These new systems-levelapproaches will therefore provide a valuable addition to theexisting toolset for omics-based biomarker modeling

Key Points

bull Omics-based machine learning models for diagnosticspecimen classification often lack biological interpret-ability and have limited cross-study robustness andaccuracy Integrating prior knowledge from cellularpathway and molecular network into these modelscan help to address these shortcomings

bull Public pathway and molecular interaction databasescover only a subset of the biochemical interactionsoccurring in living cells typically lack annotations ontissue-specific reactions and may also contain errorsWhen applying pathway or network analysis methodsthe user should be aware of these limitations and thedifferent options to address or alleviate them eg byusing confidence weighted interactions and estimatingtheir tissue specificity using public gene or protein ex-pression data

bull Pathway- and network-based machine learning meth-ods for diagnostic specimen classification are charac-terized by different strengths and weaknesses Whilepathway-based approaches have the benefit of provid-ing models with high biological interpretability thattake advantage of human expert knowledge on dis-ease-relevant cellular processes their coverage is lim-ited to biomolecules with known annotations On thecontrary network-based models are typically more dif-ficult to interpret but can exploit information on amuch wider range of biomolecules

Funding

This work was supported by the Fonds Nationale de laRecherche Luxembourg (grant no C13BM5782168)

References1 Dubois B Feldman HH Jacova C et al Advancing research

diagnostic criteria for Alzheimerrsquos disease the IWG-2 crite-ria Lancet Neurol 201413614ndash29

2 Chahine LM Stern MB Diagnostic markers for Parkinsonrsquosdisease Curr Opin Neurol 201124309ndash17

3 Tanaka T Tanaka M Tanaka T et al Biomarkers for colorec-tal cancer Int J Mol Sci 2010113209ndash25

4 Collins LG Haines C Perkel R et al Lung cancer diagnosisand management Am Fam Physician 20077556ndash63

5 Martin P Laupland KB Frost EH et al Laboratory diagnosisof Ebola virus disease Intensive Care Med 201541895ndash8

6 Mistry K Cable G Meta-analysis of prostate-specific antigenand digital rectal examination as screening tests for pros-tate carcinoma J Am Board Fam Pract 20031695ndash101

7 Bender R Lange S Adjusting for multiple testing - when andhow J Clin Epidemiol 200154343ndash9

8 Yu L Liu H Redundancy based feature selection for micro-array data In Proceedings of the 10th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining New York NY ACM 2004 737ndash42

9 Ding C Peng H Minimum redundancy feature selectionfrom microarray gene expression data J Bioinform ComputBiol 20053185ndash205

10 Bellman R Bellman RE Bellman RE et al Adaptive control proc-esses a guided tour Princeton Princeton University Press 1961

11 Verleysen M Francois D The curse of dimensionality indata mining Analysis 20053512758ndash70

12 Kim S Kon M DeLisi C Pathway-based classification of can-cer subtypes Biol Direct 2012721

13 Ideker T Sharan R Protein networks in disease Genome Res200818644ndash52

14 Ogata H Goto S Sato K et al KEGG Kyoto encyclopedia ofgenes and genomes Nucleic Acids Res 19992729ndash34

15 Nishimura D BioCarta Biotech Softw Internet Rep 20012117ndash20

16 Pico AR Kelder T Van Iersel MP et al WikiPathways path-way editing for the people PLoS Biol 200861403ndash7

17 Joshi-Tope G Gillespie M Vastrik I et al Reactome a knowl-edgebase of biological pathways Nucleic Acids Res 200533D428ndash32

18 Schaefer CF Anthony K Krupa S et al PID thepathway interaction database Nucleic Acids Res 200937D674ndash9

19 Subramanian A Tamayo P Mootha VK et al Gene setenrichment analysis a knowledge-based approach for inter-preting genome-wide expression profiles Proc Natl Acad SciUSA 200510215545ndash50

20 Kim S-Y Volsky DJ PAGE parametric analysis of gene setenrichment BMC Bioinformatics 20056144

21 Luo W Friedman MS Shedden K et al GAGE generally ap-plicable gene set enrichment for pathway analysis BMCBioinformatics 200910161

22 Oron AP Jiang Z Gentleman R Gene set enrichment ana-lysis using linear models and diagnostics Bioinformatics2008242586ndash91

23 Glaab E Baudot A Krasnogor N et al EnrichNet network-based gene set enrichment analysis Bioinformatics 201228i451ndash7

24 Shannon P Markiel A Ozier O et al Cytoscape a softwareenvironment for integrated models of biomolecular inter-action networks Genome Res 2003132498ndash504

25 Van Iersel MP Kelder T Pico AR et al Presenting and explor-ing biological pathways with PathVisio BMC Bioinformatics20089399

26 Ilyin SE Belkowski SM Plata-Salaman CR Biomarker dis-covery and validation Technologies and integrativeapproaches Trends Biotechnol 200422411ndash16

27 Guo Z Zhang T Li X et al Towards precise classification ofcancers based on robust gene functional expression profilesBMC Bioinformatics 2005658

28 Tomfohr J Lu J Kepler TB Pathway level analysis of geneexpression using singular value decomposition BMCBioinformatics 20056225

29 Lee E Chuang H-Y Kim J-W et al Inferring pathway activitytoward precise disease classification PLoS Comput Biol 20084e1000217

30 Su J Yoon BJ Dougherty ER Accurate and reliable cancerclassification based on probabilistic inference of pathwayactivity PLoS One 20094e8161

31 Glaab E Schneider R PathVar analysis of gene and proteinexpression variance in cellular pathways using microarraydata Bioinformatics 201228446ndash7

32 Svensson JP Stalpers LJA Esveldt-van Lange REE et alAnalysis of gene expression using gene sets discriminates

Diagnostic specimen classification | 11

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer patients with and without late radiation toxicityPLoS Med 200631904ndash14

33 Efroni S Schaefer CF Buetow KH Identification of key proc-esses underlying cancer phenotypes using biologic pathwayanalysis PLoS One 20072e425

34 Vaske CJ Benz SC Sanborn JZ et al Inference of patient-specific pathway activities from multi-dimensional cancer gen-omics data using PARADIGM Bioinformatics 201026i237ndash45

35 Breslin T Krogh M Peterson C et al Signal transductionpathway profiling of individual tumor samples BMCBioinformatics 20056163

36 Ashburner M Ball CA Blake JA et al Gene ontology tool forthe unification of biology The Gene Ontology ConsortiumNat Genet 20002525ndash9

37 Glaab E Garibaldi JM Krasnogor N Learning pathway-baseddecision rules to classify microarray cancer samplesGerman Conference on Bioinformatics 2010173123ndash34

38 Glaab E Bacardit J Garibaldi JM et al Using rule-based ma-chine learning for candidate disease gene prioritization andsample classification of cancer gene expression data PLoSOne 20127e39932

39 Tuck DP Kluger HM Kluger Y Characterizing disease statesfrom topological properties of transcriptional regulatorynetworks BMC Bioinformatics 20067236

40 Ma X Lee H Wang L et al CGI a new approach for prioritiz-ing genes by combining gene expression and protein-protein interaction data Bioinformatics 200723215ndash21

41 Chuang H-Y Lee E Liu Y-T et al Network-based classifica-tion of breast cancer metastasis Mol Syst Biol 20073140

42 Taylor IW Linding R Warde-Farley D et al Dynamic modu-larity in protein interaction networks predicts breast canceroutcome Nat Biotechnol 200927199ndash204

43 Petrochilos D Shojaie A Gennari J et al Using random walksto identify cancer-associated modules in expression dataBioData Min 2013617

44 Rapaport F Zinovyev A Dutreix M et al Classification ofmicroarray data using gene networks BMC Bioinformatics2007835

45 Li C Li H Network-constrained regularization and variableselection for analysis of genomic data Bioinformatics 2008241175ndash82

46 Yang S Yuan L Lai Y-C et al Feature grouping and selectionover an undirected graph KDD 2012922ndash30

47 Lorbert A Eis D Kostina V et al Exploiting covariate similar-ity in sparse regression via the pairwise elastic net Int ConfArtif Intell Stat 20109477ndash84

48 Lorbert A Ramadge PJ The pairwise elastic net support vec-tor machine for automatic fMRI feature selection In ProcIEEE Int Conf Acoust Speech Signal Process Vancouver BC USA20131036ndash40 IEEE Press Washington DC

49 Vlassis N Glaab E GenePEN analysis of network activity al-terations in complex diseases via the pairwise elastic netStat Appl Genet Mol Biol 201514221ndash4

50 Wingender E Chen X Hehl R et al TRANSFAC an integratedsystem for gene expression regulation Nucleic Acids Res200028316ndash19

51 Newman MEJ The structure and function of complex net-works SIAM Rev 200345167ndash56

52 Geman S Geman D Stochastic relaxation Gibbs distribu-tions and the Bayesian restoration of images IEEE TransPattern Anal Mach Intell 19846721ndash41

53 Binder H Schumacher M Comment on ldquoNetwork-constrained regularization and variable selection for ana-lysis of genomic datardquo Bioinformatics 2008242566ndash8

54 Sprinzak E Sattath S Margalit H How reliable are experi-mental protein-protein interaction data J Mol Biol 2003327919ndash23

55 Saxena S Caroni P Selective neuronal vulnerability inneurodegenerative diseases from stressor thresholds todegeneration Neuron 20117135ndash48

56 Suthram S Shlomi T Ruppin E et al A direct comparison ofprotein interaction confidence assignment schemes BMCBioinformatics 20067360

57 Kamburov A Grossmann A Herwig R et al Cluster-basedassessment of protein-protein interaction confidence BMCBioinformatics 201213262

58 Li D Liu W Liu Z et al PRINCESS a protein interaction confi-dence evaluation system with multiple data sources MolCell Proteomics 200871043ndash52

59 Bader JS Chaudhuri A Rothberg JM et al Gaining confidencein high-throughput protein interaction networks NatBiotechnol 20042278ndash85

60 Von Mering C Huynen M Jaeggi D et al STRING a databaseof predicted functional associations between proteinsNucleic Acids Res 200331258ndash61

61 Franceschini A Szklarczyk D Frankild S et al STRING v91protein-protein interaction networks with increased cover-age and integration Nucleic Acids Res 201341D808ndash15

62 Sales G Calura E Cavalieri D et al Graphite - a Bioconductorpackage to convert pathway topology to gene network BMCBioinformatics 20121320

63 Su AI Wiltshire T Batalov S et al A gene atlas of the mouseand human protein-encoding transcriptomes Proc Natl AcadSci USA 20041016062ndash7

64 Ponten FK Schwenk JM Asplund A et al The human proteinatlas as a proteomic resource for biomarker discovery JIntern Med 2011270428ndash46

65 Bossi A Lehner B Tissue specificity and the human proteininteraction network Mol Syst Biol 20095260

66 Magger O Waldman YY Ruppin E et al Enhancing the pri-oritization of disease-causing genes through tissue specificprotein interaction networks PLoS Comput Biol 20128e1002690

67 Shipley B Cause and Correlation in Biology A Userrsquos Guide toPath Analysis Structural Equations and Causal Inference Vol 2Cambridge Cambridge University Press 2002 p 332

68 Society TE Granger CWJ Investigating causal relations byeconometric models and cross-spectral methodsEconometrica 196937424ndash38

69 Chindelevitch L Ziemek D Enayetallah A et al Causal rea-soning on biological networks interpreting transcriptionalchanges Bioinformatics 2012281114ndash21

70 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 2012652

71 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 20144988

72 Hudson NJ Reverter A Dalrymple BP A differential wiringanalysis of expression data correctly identifies the gene con-taining the causal mutation PLoS Comput Biol 20095e1000382

73 Reverter A Hudson NJ Nagaraj SH et al Regulatory impactfactors unraveling the transcriptional regulation ofcomplex traits from expression data Bioinformatics 201026896ndash904

74 Watson M CoXpress differential co-expression in gene ex-pression data BMC Bioinformatics 20067509

12 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1

predictors the meanmedian pathway summarization theSVD- and CORG-based techniques described above and theirmethod achieved the highest predictive performance in bothwithin-data set cross-validation experiments and one out oftwo cross-data set prediction experiments (the mean summar-ization scored higher in one case)

While the techniques discussed above mostly summarizepathway activity via (weighted) averages or dimension reduc-tion approaches it was noted that disease-associated pathwayalterations may affect the variance of expression levels in path-ways more strongly than the mean Generally variance pat-terns in disease-related omics data can reflect inter-patientvariation (eg differences in the etiology or disease stage) orintra-patient fluctuations (eg onoff phases as those observedin Parkinsonrsquos disease or imbalances in the activity of specificcellular pathways) and the study of both of these sources ofvariation may be important to improve the understanding ofthe disease Glaab et al [31] therefore proposed a variance-basedapproach to identify discriminative pathway signaturesproviding a possibility to account for variation in the activity ofcellular processes for a single patient and for variation acrossmultiple individuals and combined it with diverse classificationapproaches as part of a public Web application Apart frommodifying the pathway activity representation they developeda rule-based learning method comparing activities in pairs ofpathways rather than comparing activities of single pathwaysagainst fitted threshold values to increase the robustness ofpathway-based classification models [37] Rule-based modelshave high stability with regard to small variations in the inputdata and facilitate the biological interpretation of predictionmodels by using only combinations of simple if-then-else rulesfor sample classification [38]

A further strategy to improve classifier robustness is to usediscretized pathway activity scores or counting statistics egSvenson et al [32] presented an approach that first determines

how many genes are up- or downregulated in a pathway using apredefined fold-change threshold This information was com-bined into a ratio score which equals 1 if all pathway membersare upregulated 0 if equal numbers of genes are up- anddownregulated and 1 if all genes are downregulated Aftercomputing this ratio score for all pathway gene sets and train-ing samples from a microarray case-control study they appliednearest centroids classification to the pathway ratio scores toclassify new test set samples In comparison with gene-levelclassification they report that the proposed classification ap-proach performed better for all tested training set sizes

Graph-based and hierarchical pathway activity scoring

Cellular pathway diagrams do not only capture information onthe functional relatedness of biomolecules involved in a path-way but also on the network topology of regulatory or molecu-lar interactions that link them together

Efroni et al [33] presented one of the first pathway analysisapproaches which not only scores pathway activities in omicsdata but also the consistency of molecular activities withinpathways given the topology of interactions between theirmembers Considering the input and output genes for a regula-tory interaction their probabilities of being in an under- oroverexpressed state in a condition of interest can be estimatedfrom gene expression measurements although noise in thedata may make the estimates highly uncertain A consistencyscore can then be obtained by comparing the state probabilitiesfor output genes against their inferred state probabilitiesderived from the estimated states of the interconnected inputgenes and the probability of the relevant interactions to be ac-tive or inactive Thus averaged pathway consistency scores canbe computed and used as predictive features in combinationwith pathway activity scores to classify biological samples rep-resenting different conditions Evaluating this approach on

Table 1 Overview of pathway-based machine learning approaches for supervised classification of omics samples

Publication Pathway activity scoring method Prediction method

Guo et al [27] Mean or median expression levels in GO modules Decision tree classificationTomfohr et al [28] Expression levels are summarized via the first eigenvector

from SVD analysisFocus on predictive feature selection via

t-statistics (any classifier is applicable)Lee et al [29] Normalized sum of expression levels for CORGs within

pathwaysLogistic regression

Su et al [30] Weighted sum of LLRs for pathway members Logistic regression or linear discriminantanalysis

Glaab et al [31] Variance across pathway member activity is quantified andcompared instead of averaged pathway activity

SVM random forest nearest shrunken centroidclassifier or ensemble learning

Svenson et al [32] For each gene set and patient a ratio score is defined de-pending on the number of up- and down-regulated mem-bers in the gene set

Nearest centroids classification

Efroni et al [33] Joint scoring of pathway activity and consistency usinginteractions within pathways

Bayesian linear discriminant classifier

Vaske et al [34] Probabilistic inference is used to predict pathway activitiesfrom a factor graph model of genes and their products

Cox proportional hazard regression (survivalprediction)

Breslin et al [35] Signal transduction pathways with directional interactionsare used to define the activity of a pathway based on theaverage expression of its downstream target genes

No classification is performed but contingencytables show associations between sample-wise pathway activity and clinical sampleclassifications

Kim et al [12] Hierarchical feature vectors are used assigning a two-levelhierarchical structure to the featuresgenes determinedby their pathway membership

SVM

Averaging and dimension reduction approaches are listed on top whereas graph-based and hierarchical pathway activity scoring methods are listed below the bold

black line

Diagnostic specimen classification | 3

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

microarray cancer data by using the new pathway-level featuresas input for a Bayesian Linear Discriminant Classifier the au-thors obtained similar or better predictive performance as com-pared with previously reported results for the test data

Pathway topology information is also exploited in the approachby Vaske et al [34] who converted cellular pathway diagrams intoa factor graph model describing the interactions between theinvolved biomolecules The method can integrate informationfrom different types of omics data for a biological sample and usesan Expectation Maximization (EM) algorithm to infer probabilisticpathway activity scores for each pathwaysample combinationWhen applying the approach to glioblastoma multiform gene ex-pression and copy number data and grouping patients based ontheir pathway alterations the identified subgroups displayed sig-nificant differences in quantitative survival in a KaplanndashMeier ana-lysis Robust patient stratification results were only obtained usingthe pathway-based scores and not via gene-level analyses

Apart from considering the topology of molecular interactionswithin a pathway studying alterations in all the downstream tar-gets of pathway members (ie including targets not belonging tothe pathway) may provide an additional source of information tofurther improve pathway activity scoring Breslin et al [35] pre-sented a corresponding downstream analysis approach whichdetects variation in pathway activity from the degree of correl-ation between downstream targets of a pathway Using two cor-relations scores one considering all pairs of downstream targetsand one considering only pairs without common transcriptionfactors they identified several differentially active pathways inmicroarray data from case-control studies Additionally theydefined an activity score for pathways based on the average ex-pression of their downstream target genes and showed that thesample-wise activities for several pathways are significantlyassociated with clinical sample classifications

Finally although all approaches considered so far use pathway-level sample classification as a replacement for gene-level classifi-cation these are not exclusive options Kim et al [12] proposed tobuild biomarker models from gene expression data at two levelsthe gene and pathway level via a hierarchical feature structureThis structure organizes the individual gene-level features into se-cond-level aggregate features (pathways) used by a support vectormachine (SVM) algorithm to classify samples The authors con-sidered three alternative approaches GLEG (GSEA-based LeadingEdge Gene feature method using predictor genes derived fromgene set enrichment analysis [19] which as a group maximally dif-ferentiate between the sample classes) GPF (GSEA Pathway Featuremethod using enriched pathways as features) and SPF (SVM-basedPathway Feature method using pathway features derived fromSVM) When comparing these methods against classical gene-leveland random gene set predictors using cross-validation and cross-study prediction on microarray cancer data in the majority ofcases the newly proposed methods achieved better results

In summary a wide range of pathway activity inferencemethods and machine learning approaches are available whichcan be combined effectively for pathway-based diagnosticclassification of biological samples While pathway-basedprediction does not necessarily provide superior predictive per-formance in within-data set cross-validation experimentsimproved robustness and accuracy was observed consistentlyin cross-study prediction tasks

Network-based biomarker modeling

Although manually curated pathways have many benefits forthe biological interpretation of large-scale omics data in living

cells metabolic and signaling pathways are not isolated butinterconnected within large and complex molecular and regula-tory networks These networks often include several genes pro-teins or metabolites that are not annotated for any pathwayand therefore ignored by pathway-based analysis methodsConsequently to identify disease-associated modules of inter-connected biomolecules in a more unbiased manner (ie with-out restricting the search space to biomolecules with knownpathway annotations) network-based analysis methods havebeen introduced While pathway-based approaches for bio-marker modeling may have advantages in terms of model inter-pretability the search space exploration in network-basedbiomarker discovery is not restricted by subjectively definedpathway boundaries and the genome-scale molecular networksused as input typically cover significantly larger numbers of bio-molecules than all combined pathways Nevertheless similar tosubjectively defined pathways networks assembled from publicdata sources suffer from various limitations eg missing mo-lecular interactions and lack of tissue-specific annotations andthese issues have to be addressed by dedicated methods (seesection on lsquolimitations and possible solution strategiesrsquo below)In the following two main types of network-based modelingapproaches will be discussed First two-step sequentialapproaches which score the activity in molecular subnetworksand afterward use these activities for predictive machine learn-ing and secondly one-step network analysis approacheswhich exploit network topology information directly within thepredictive model building

Two-step network activity scoring and predictionapproaches

Network activity over multiple interconnected biomoleculescan be summarized and scored using similar averaging or di-mension reduction approaches as in pathway activity scoringmethods However in contrast to the straightforward use ofpredefined pathway definitions first a molecular or regulatorynetwork has to be assembled or reconstructed using either pub-lic molecular interaction databases or applying network infer-ence methods to omics data (in Table 2 an overview of differentmethodologies is shown which are discussed in the following)

A first method to construct new sample-specific gene regu-latory networks for transcriptomics sample classification wasproposed by Tuck et al [39] The networks were generated bydetermining the graph-theoretic intersection between a staticconnectivity network (representing transcription factor bindingto gene promoter regions) obtained using data from theTRANSFAC database [50] with sample-specific coexpressionnetworks (representing transcription-factorndashtarget genecoexpression) derived from gene expression data To extractdiscriminative features for diagnostic specimen classificationfrom these networks they proposed a link-based classificationapproach comparing the activity status of gene regulatoryinteractions (called lsquolinksrsquo) across different sample groups anda degree-based classification method comparing topologicalcentrality measures [51] for the networks When testing theseapproaches on data from different cancer case-control studieshigh cross-validated accuracies were reported for both cell typeand patient sample classification Moreover the network-basedanalysis enabled the authors to identify key transcriptionalregulators altered under specific disease conditions

Instead of constructing new regulatory networks discrim-inative disease-associated network alterations can also be iden-tified by computationally mapping omics data onto in silico

4 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

representations of biochemical proteinndashprotein interaction (PPI)networks Ma et al [40] developed a corresponding approach toobtain more reliable disease association scores for genes by ex-ploiting neighborhood information from a PPI network Theyused a modified Pearson correlation coefficient to assess the as-sociation between microarray gene expression and numeric val-ues encoding the disease status of the samples (taking intoconsideration that these phenotype values may not have a nor-mal distribution) and assigned the Fisher-transformed gene-phenotype association scores to the corresponding proteins in aPPI Next they recalibrate these association scores by modelingthe underlying true scores for each gene using Markov RandomField theory [52] reestimating their values from weighted con-tributions of their network neighborsrsquo original associationscores (the weights are determined according to different net-work neighborhood definitions using either direct neighborsshortest path or diffusion kernel neighborhoods see [40] fordetails) When evaluating the utility of the recalibrated scoresfor disease gene prioritizations on microarray data using knownGene Ontology functional annotations conventional

prioritization approaches using only gene expression or PPI datawere outperformed (although the scoring approach could alsobe used for predictive model building this particular applicationwas not considered)

While the approach by Ma et al focuses on improving thedisease-association scores for individual genes Chuang et al[41] presented a method identifying and scoring entire disease-related subnetworks similar to their pathway association scor-ing approach discussed above (see Lee et al [29]) After comput-ing the mutual information (MI) between sample phenotypevalues (encoding the presence or absence of a disease) and dis-cretized expression values for each gene from a microarray dataset assigned to the proteins in a PPI they applied a greedysearch to expand the seed nodes in the network with locallymaximal MI scores Specifically each seed node was expandedsuch that the sum of scores for the expanded network moduleis maximized (the search stops when no extension increasesthe total score above a predefined improvement rate) Whentraining logistic regression classifiers on the normalized andaveraged activities of the resulting subnetworks for breast

Table 2 Overview of network-based methods for machine learning analysis of omics data sequential network activity scoring and predictionmethods are shown on top whereas machine learning approaches using embedded network-based feature selection are listed below the boldblack line

Methodologypublication

Network activityalteration scoring method Prediction method

Tuck et al [39] Sample-specific gene regulatory networks are constructed andsubnetwork activity is scored by summing over activeinteractions

Nearest neighbors decision tree NaıveBayes among others

Ma et al [40] Disease association is scored for genes based on gene expres-sion data and their neighborsrsquo association scores in a PPI net-work using Markov Random Field theory

The approach is evaluated for disease geneprioritization but is applicable for predict-ive feature selection in combination withany prediction method

Chuang et al [41] Normalized gene expression data is mapped onto a proteininteraction network and discriminative subnetworks areidentified via a greedy search procedure

Logistic regression

Taylor et al [42] Hub nodes in protein interaction networks are determined andthe relative gene expression of hubs with each of their inter-acting partners is computed to identify hubs with diverserelative expression across sample groups

Affinity propagation clustering is used to as-sign a probability of poor prognosis tobreast cancer patients

Petrochilos et al [43] A random walk community detection algorithm is applied todiscover modules in a molecular interaction network andgene expression data is used to identify disease-associatedmodules

The approach is used to identify cancer-asso-ciated network modules and validated byscoring the enrichment of known cancer-related genes extracted from the OMIMdatabase

Rapaport et al [44] Spectral decomposition of gene expression profiles is appliedwith respect to the eigenfunctions of a network graphattenuating the high-frequency components of the expres-sion profiles with respect to the graph topology

SVM

Li et al [45] A network-constrained regularization procedure for linear re-gression analysis is used to identify disease-related discrim-inative subnetworks

Penalized linear regression

Yang et al [46] Three machine learning methods for graph-guided feature se-lection and grouping are proposed including a convex func-tion and two non-convex formulations designed to reducethe estimation bias

Penalized least squares-based approach(GOSCAR Graph octagonal shrinkage andclustering algorithm for regression)

Lorbert et al [47 48] A sparse regression approach is proposed using the PEN pen-alty to favor the grouping of strongly correlated featuresbased on pairwise similarities (eg derived from a molecularinteraction graph)

Penalized regression (PEN penalty)

Vlassis et al [49] Penalized logistic regression is applied using a convex PEN pen-alty function (see approach by Lorbert et al) with absolutefeature weights to better reflect the relevance of discrimina-tive genes in the feature selection

Penalized logistic regression (PEN penaltywith absolute feature weights)

Diagnostic specimen classification | 5

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer data the authors found that the subnetwork markerswere more reproducible than single-gene markers andprovided higher accuracy in distinguishing metastatic fromnonmetastatic tumors

As an intermediate solution between focusing on individualbiomolecules and entire network modules Taylor et al [42] pro-posed a method that investigates network nodes with outstand-ing topological properties and their direct neighbors Aftercomputationally mapping breast cancer gene expression dataonto the in silico representation of a PPI network they deter-mined proteins with large numbers of biochemical interactionpartners (so-called lsquohub nodesrsquo) and computed their relative ex-pression compared with each of these interacting partnersThey then determined for which hubs the relative expressiondiffered significantly between long-term survivors and patientswho died from the disease and applied a clustering approach toassign a probability of poor prognosis to new patient samples(the specific method used is known as lsquoaffinity propagationclusteringrsquo in the literature) The approach was evaluated using5-fold cross-validation providing accuracy sensitivity and spe-cificity estimates that compared favorably with the reportedresults for commercially available genomic breast cancerdiagnostics

Instead of considering topological properties of individualnodes in a molecular network information from a networkgraph can also be extracted via algorithms for finding sub-graphs which stand out in terms of their high density ofmolecular interactions (using approaches referred to as lsquocom-munity identificationrsquo or lsquograph clusteringrsquo methods in the lit-erature) Petrochilos et al [43] proposed a correspondingapproach which first applies a graph-based random walk algo-rithm on a genome-scale molecular network Information fromcancer-related gene expression data was then integrated intothe network by setting the weight of each network node as themaximum fold change of probes corresponding to its gene sym-bol (weights for biochemical interactions are determined by thesquare of the mean of the absolute fold changes of the relevantinteraction partners) Finally the score of a network module ofconnected nodes was obtained by comparing its cumulative ac-tivity (ie the square of the average weighted expression for allits nodes) against a bootstrap distribution of cumulative activ-ities obtained via random sampling of a matched number offold change values When testing the enrichment of known can-cer genes in the top-scored network modules identified withthis approach a similar or better performance was reached incomparison with other widely used module-finding algorithms(potential alternative applications of the identified modules forbiomarker modeling were not assessed in this publication)

Apart from averaging molecular activities over networkneighborhoods or using community identification methods sig-nal processing techniques may provide a further means to gleanuseful information from a network for predictive model build-ing as shown in an approach by Rapaport et al [44] They madeuse of the observation that genes in close proximity to eachother in a network tend to have similar expression and pro-posed to denoise microarray measurements by removing theirhigh-frequency component over the network For this purposespectral decomposition of gene expression profiles with respectto a molecular network graph was applied followed by attenu-ating high-frequency signal components expected to representmeasurement noise The method was evaluated for supervisedanalysis of irradiated and nonirradiated yeast strains using aSVM providing similar classification performance as a modelbuilt without the network-based filtering but facilitating

biological data interpretation by grouping the selected biomol-ecules according to their participation in the network modules

One-step machine learning approaches for networkanalysis

In contrast to the network analysis approaches considered sofar which apply network feature extraction and predictive ma-chine learning analysis in separate steps more recently one-step network-based feature selection approaches have beenproposed integrating the attribute selection directly into thepredictive model building Most of these approaches formulatethe model building task as an optimization problem formula-tion in which the objective function for classification or regres-sion is extended by a penalty term promoting the selection ofgrouped features in a molecular network (this strategy is alsoreferred to as network-constraint regularization)

Li et al [45] proposed one of the first correspondingapproaches by adding a penalty term to linear regression incor-porating network information into the analysis via theLaplacian matrix of the network graph The approach penalizesthe L1-norm of the feature weights and encourages a smoothprofile of weights over neighboring nodes in the networkHowever Binder and Schumacher later reported that themethod has lower performance than a null model ie a modelnot using any covariate information [53] As possible explan-ations they note that Li et al discarded censored observationsand about 20 000 variables that could not be assigned to corres-ponding nodes in the molecular interaction network (see sec-tion on lsquolimitationsrsquo below) Yang et al [46] suggested that thepreviously used network grouping penalties can introduce add-itional estimation bias into the model when the coefficientsigns for two features connected in the graph are differentThey presented alternative penalties to achieve network group-ing and sparse feature selection in particular two non-convexpenalties which shrink only small differences in absolute val-ues of feature weights to reduce the estimation bias [46] In ex-periments on synthetic data and two real data sets the newapproaches outperformed previous feature grouping methods

However with non-convex penalties finding global optimalsolutions is often not feasible and even identifying good localoptima may require high computational effort Lorbert et al [4748] proposed an alternative generic convex penalty thePairwise Elastic Net (PEN) that provides sparse feature selectionand promotes the grouping of attributes according to auser-defined feature similarity measure (eg obtained from bio-chemical interaction weights in a molecular network) PEN is ageneralization of the Elastic Net a method providing a trade-offbetween L1- and L2-penalized regressions by an adjustable par-ameter In PEN this parameter can be replaced to determine thetrade-off using additional information from an attributesimilarity matrix (different instances of PEN can be defined aslong as the similarity matrix is positive semidefinite and non-negative) Comparing PEN against other popular machine learn-ing approaches on simulated data with a grouping structureamong the features PEN achieved a competitive mean squarederror (MSE) and provided sparser solutions than approacheswith similar MSE

More recently Vlassis et al introduced a new instance ofPEN which penalizes differences between the lsquoabsolutersquo valuesof weights of interlinked features in a network graph The mo-tivation behind this approach termed GenePEN is that themagnitude of a weight in a linear model reflects the predictivevalue of the corresponding variable so that the weights for

6 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

irrelevant features are driven to zero by the penalty By ensuringthe convexity of the penalty function global optimal solutionscan be identified efficiently with existing optimization frame-works When evaluating GenePEN on simulated data andreal-word microarray data sets in comparison with other classi-fication methods using feature grouping the method providedsimilar predictive power and gene selections sharing signifi-cantly more connections within a molecular interaction net-work Visualization of the corresponding subnetworks enableda biological interpretation of disease-affected network regionswhich were enriched in known disease-related genes obtainedfrom literature mining

Overall network-based sample classification methods pro-vide a new means to analyze complex omics data sets allowingresearchers to identify coherent molecular network alterationsunder different biological conditions Identifying such network-level patterns in omics data for diseases with complexmolecular manifestations can shed new light on the molecularmechanisms of the disease and facilitate the development ofrobust multifactorial biomarker signatures

In contrast to single-molecule-based biomarker modeling anetwork-level approach has the potential to capture diverse fac-ets of a heterogenic disease reflected via alteration patterns indifferent network regions In comparison with pathway-basedmachine learning approaches methods using genome-scalednetworks as prior knowledge may produce models that are moredifficult to interpret biologically but which can identify a muchwider range of alterations in cellular processes (covering manygenes proteins or metabolites without any known pathway an-notations) Finally network- and pathway-based classificationapproaches share the main benefit of improving model robust-ness in cross-study analyses as compared with using individualbiomolecules as features Among these new higher-level bio-marker signatures network-based signatures accounting formolecular activities over larger and algorithm-derived networkregions may often provide more robust multifactorial markersthan signatures for smaller pathways which are typicallydefined subjectively possibly overlooking relevant functionallyrelated molecules in the surrounding network However themodel robustness will also depend on other factors eg the oc-currence of protein complexes in the studied pathwaynetwork(members of these complexes tend to have highly coordinatedactivity providing more robust averages) and the reliability andcompleteness of the specific network or pathway data sourceused (see limitations discussed in the following section)

Limitations and possible solution strategies

While pathway- and network-based analyses of omics data canenrich the biological interpretation of complex molecular alter-ation patterns and facilitate robust biomarker modeling usersshould also be aware of common limitations behind thesemethods and possibilities to address them

Although detailed functional annotations have been col-lected for a large portion of genes in humans and commonmodel organisms many identified genes still lack any func-tional assignment Consequently a significant portion of genesand proteins analyzed via current high-throughput measure-ment techniques can often not be assigned to any pathway ormolecular network for systems-level analyses Similarly sev-eral gene regulatory interactions and PPIs are still unknownand experimental techniques to identify molecular interactionsmay also generate false-positive results [54] Thus public mo-lecular interaction databases only cover a subset of biochemical

interactions occurring in living cells and the reliability of re-ported interactions varies depending on the amount and type ofassociated evidence

A further limitation of current pathway databases andassembled molecular networks is that tissue- and cell-type spe-cificity of molecular interactions and protein functions is typic-ally not considered in these data resources This can beparticularly problematic when studying diseases showing se-lective vulnerability for specific tissues or cell types eg asfound in many neurodegenerative disorders [55] Similarly inmany biomarker profiling studies and public omics data sour-ces concomitant data from different tissue and cell types for adisease of interest is lacking as well as longitudinal datarequired to identify relevant dynamic changes during diseasepathogenesis and progression Moreover a variety of limita-tions related to the specific origin processing and storage ofspecimens collected for biomarker studies may negatively im-pact subsequent analyses and the comparability of data fromdifferent studies For example studies might differ in terms ofthe sample storage durations in terms of whether the referencetissue for tumor tissue is derived from the same patient or froman unaffected control individual whether a preinfarction speci-men is accepted as a reference for the specimen obtained on in-farct suspicion or whether laser capture microdissection isapplied to isolate specific cells of interest rather than using het-erogeneous cell populations

To account for the varying reliability of public biochemicalinteraction data when assembling a molecular network variousapproaches to determine confidence scores have been proposed[56ndash59] One of the most comprehensive resources is theSTRING PPI database which allows users to filter and extractbiochemical interactions for a wide range of organisms usingdifferent sources of evidence as well as a combined confidencescore [60 61]

A frequently encountered difficulty in pathway analyses isalso the conversion of the information content of public path-way diagrams into adequate data structures for computationalinvestigations Pathway structures usually contain more com-plex layers of information than genendashgene networks eg meta-nodes representing protein complexes or gene families andcatalysis reactions represented by edges pointing to edgesDedicated computational approaches are therefore required toconvert pathway diagrams into uniform network representa-tions eg using biology-driven rules as in the RBioconductorsoftware package lsquographitersquo [62]

Apart from this conversion task building tissue-specificpathwaynetwork representations or assigning probabilities fortissue specificity to the biological interactions in an existingnetwork is a further challenging problem because interactionsare rarely confirmed experimentally in multiple specific tissuesMost bioinformatics approaches for assigning tissue specificityscores to protein interactions use publicly available gene or pro-tein expression data from large-scale tissue profiling studies[63 64] eg predicting the potential for a molecular interactionto occur in a certain tissue depending on whether both inter-action partners are expressed in this tissue [65] Moreover dedi-cated pathway- and network-analysis methods have beendeveloped for tissue-specific disease gene prioritization andgene set enrichment analysis [23 66] which may also provideuseful filters for feature selection in machine learning methodsfor sample classification

While changes in the mean or variance of the overall mo-lecular activity in pathways can provide robust biomarkers forsome diseases pathways may also contain forks and only one

Diagnostic specimen classification | 7

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

of their branches or stems may be affected in a pathologicalcondition Approaches focusing on changes in the overall path-way activity may therefore fail in detecting significant alter-ations affecting only one pathway branch Although networkanalysis methods identifying altered subnetworks may be help-ful in this situation (using a pathway graph as input instead of agenome-scale molecular network) ideally available knowledgeon distinct pathway branches or outstanding regions in a com-plex pathway architecture should be exploited directly eg bypartitioning a pathway into corresponding lsquosubpathwaysrsquo andseparately analyzing their molecular alterations using existingpathway-level bioinformatics tools

Another challenging and still largely unsolved problem inthe analysis of alterations for pathways networks and singlebiomolecules is the distinction between omics changes repre-senting causes or effects (or lsquodriversrsquo or lsquopassengersrsquo) of a dis-ease Apart from philosophical disagreements on howcausation should be defined [67] the discrimination betweencorrelated events and causally linked events on high-through-put omics data sets is typically severely hampered by largenumbers of potential confounding variables and complexnonlinear relationships between the attributes Thus conclu-sive evidence for causal links can usually only be obtained fromin vitro and in vivo perturbation experiments rather than fromhigh-throughput data analysis even when using high-qualitytime-series data with large sample sizes and dedicated statisticsfor identifying causendasheffect relationships (eg the Granger caus-ality test [68]) Prior biological knowledge eg on transcriptionfactorndashtarget relationships can however enable the construc-tion of causal graphs to check whether omics measurementsare consistent with certain causal molecular hypotheses or toprioritize corresponding hypotheses for experimental validation[69] Bill Shipley has written a dedicated book on methods forcausal analysis of observational data in biology [67] Theseapproaches are not only relevant for identifying drug targetsand investigating disease etiology and progression but mayalso provide useful information for the biological interpretationof diagnostic biomarker models

Improving incomplete pathways and networks by identify-ing missing molecular interactions is arguably one of the mostchallenging problems in this domain Pathway definitions aretypically created in a subjective manner because no universalpredefined criteria for determining pathway membership andboundaries are available and the curators of the correspondingdatabases are required to make judgments on the relevance ofpotential pathway members These subjective definitions maybe prejudiced and erroneous and even networks assembledautomatically from public databases may contain biases eg re-sulting from prior subjective decisions on which putative PPIsare of sufficient interest for experimental validation Moreoverin diseases the pathway topology may change and genericpathway databases have recently been complemented by dedi-cated pathway maps for specific human disorders egAlzPathway [70] for Alzheimerrsquos disease and PDMap [71] forParkinsonrsquos disease Because these disease-related pathwaymaps typically cover the biomolecules involved in a specific dis-order and the potentially altered network topology more com-prehensively and accurately than generic pathway mapsdisease-specific pathways should be considered as the pre-ferred resource for pathway-based diagnostic biomarker model-ing (in the specific case of a known disease-associated topologyalteration ideally a corresponding pathway map with un-altered topology should be used additionally as a reference inthe modeling process)

Moreover dedicated bioinformatics techniques have beendeveloped to identify lsquonetwork rewiringrsquo events eg brokenchains or new interactions changing the pathwaynetwork top-ology which remain undetected by conventional pathway ana-lyses Instead of assuming that molecules close to each other ina network undergo similar changes in a disease theseapproaches identify alterations in the correlations between theactivities of these molecules as signatures for potential patho-logical changes in network topology [72ndash74] However the top-scoring results require experimental verification eg using invitro disease models and the success may strongly depend onwhether reliable prior pathway knowledge is available

To support relevant manual pathway curation efforts com-putational approaches for ab initio discovery of pathways withingenetic or molecular networks have been proposed [75ndash78] aswell as bioinformatics methods to extend existing pathway def-initions according to objective criteria for pathway compactnessand connectedness [79] Moreover various machine learningapproaches for predicting unknown PPIs using public structuraland functional information have been developed [80ndash84] egthe authors of the PrePPI method report validation results sug-gesting their prediction algorithm is comparable in accuracy tohigh-throughput experiments [85] For proteins with unknownstructure and few functional annotations these in silico predic-tion approaches are however still bound to provide limited reli-ability and only high-confidence biochemical interactionsshould be included in networks for omics data analysis

Finally to reduce the influence of false-positive predictionson network analyses molecular interactions should not only beprefiltered by confidence when assembling the network butalso weighted by confidence to apply analysis techniquesexploiting these weights to score the reliability of identified net-work alterations In subsequent biological validation experi-ments investigators can then focus on verifying only the mostreliable network alteration patterns derived from the in silicoanalyses

Translating multifactorial marker models intodiagnostic testsPathways from omics research discoveries to clinicaldiagnostic assays

High-throughput omics measurement techniques are typicallynot designed for diagnostic applications but for broad systems-level analyses hypothesis generation and the construction offirst tentative machine learning models for sample classifica-tion Such tentative models require subsequent refinement andvalidation using more sensitive and reproducible measurementtechniques to evaluate their potential for diagnostic applica-tions For example a sample classification model built andcross-validated using microarray gene expression data with anembedded feature selection to choose only the most inform-ative genes as predictors can be validated using more accuratequantitative reverse transcription polymerase chain reaction(qRT-PCR) measurements for the subset of chosen genes

Importantly to avoid wrong conclusions in the evaluation ofdiagnostic classification models adequate statistical methodshave to be chosen to assess a modelrsquos overall predictive per-formance (quantifying how close predictions are to the actualoutcome) its calibrationreliability (measuring how close to x of100 individuals with a risk prediction of x have the outcome)and its discriminative ability (determining whether individualswith the outcome have higher risk predictions than those

8 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

without) [86] Owing to the inherent uncertainty associated withdiagnostic tasks predictions should be provided in a probabilis-tic rather than deterministic form [87] and the overall per-formance should be quantified using so-called lsquoproper scoringrulesrsquo for which the expected score is optimized when the pre-dictive distribution agrees with the true distribution of thequantity to be estimated (a corresponding example is the Brierscore for binary and survival outcomes [88]) Conversely the op-timization of models with respect to conventional discontinu-ous non-error rates like the percentage correct classificationcan provide misleading results eg when the predicted proba-bilities are close to the chosen decision threshold required forthese measures [89] To assess the calibration of a model theHosmerndashLemeshow lsquogoodness-of-fitrsquo test can be used [90] andthe concordance statistic to quantify the discrimination ability[91] If a reference prediction system is already available dedi-cated measures of the relative improvement achieved with anew prediction method should additionally be computed(referred to as the lsquoskillrsquo eg quantified via the Brier Skill Score[92]) Moreover decision-analytic approaches like decision curveanalysis [93] designed to assess the net benefit achieved makingdecisions according to model predictions should be applied if themodel is to be used to direct clinical patient management [86]

For the study design initial power calculations are requiredto ensure that sufficient sample sizes are available for all statis-tical assessments [94] This also involves choosing an adequatesplitting of the measured data into training test and validationsets and selecting suitable cross-validation or resampling tech-niques for model optimization and evaluation (eg using two-level external cross-validation [95]) [96]

Importantly clinical validation does not only necessitatesignificantly larger sample sizes than most research studiesbut also independent replication tests on data from other pa-tient cohorts the clear specification of the biological rationalebehind the method and a demonstration of its clinical utility Incontrast to the regulatory framework for drugs there are mul-tiple pathways for the translation of omics-based tests intovalidated in vitro diagnostic test devices These tests can bedeveloped and validated either through review by the Food andDrug Administration (FDA) or through validation and perform-ance by a specific laboratory certified according to ClinicalLaboratory Improvement Amendments (CLIA) [97]

Because using established medical product developmentpipelines as in pharmaceutical companies is not common prac-tice in academia for many biomedical research institutions anearly collaboration with an experienced industrial partner isoften advisable Although currently no unique and widely rec-ognized standard process for translating omics researchfindings into clinical diagnostics is available common recom-mendations by widely recognized health organizations can befollowed In particular a committee by the US Institute ofMedicine has conducted a study on omics-based clinical test de-velopment and proposed a generic process for the developmentand evaluation of these tests as a recommended guideline [97]A corresponding example process which is briefly outlined forillustration purposes in Figure 1 and not meant to cover all im-portant variations starts with the discovery phase in which acandidate biomarker model is built on a training set lockeddown and evaluated on a test data set (this set of samplesshould be completely independent from the training set) In thefollowing test validation phase after institutional review boardapproval and consultation with the FDA a CLIA-certified labora-tory defines and optimizes the diagnostic test method clinicallyand biologically validates the test on a blinded sample set and

implements the test according to current clinical laboratorystandards

Interestingly the authors of the guideline highlight that afrequent shortcoming of omics-based tests is the lack of a biolo-gical rationale behind the testmdashwhile single-molecule markersare often known to play a role in the disease multifactorialomics models obtained from machine learning are often moredifficult to interpret and involve a greater risk of overfittingNew pathway- and network-based modeling techniques as dis-cussed in this review could therefore help to address some ofthese shortcomings and provide more interpretable and robustmodels as opposed to classical lsquoblack boxrsquo machine learningmodels

In the following stage of the clinical development processthe locked-down test is evaluated for clinical utility via one ofthe following approaches (i) A prospectivendashretrospective studyusing archived specimens from previous clinical trials (ii) a pro-spective clinical trial in which the test (a) directs patient man-agement or (b) does not direct patient management [97] Thecomplexity and duration of a corresponding clinical study ortrial will largely depend on the specific type of biomarker de-veloped and the proposed clinical benefit For diagnostic bio-markers focused on in this review procedures may varysignificantly depending on whether the test is designed to de-tect the presence severity or the subtype of a diseasePrognostic biomarkers which indicate the future clinical courseof a patient with regard to a specific outcome and predictivebiomarkers which predict the responders and the extent of sus-ceptibility to a particular drug effect will also require differentdevelopment and evaluation procedures than diagnosticmarkers Finally for each type of biomarker different clinicalbenefits may be envisaged and significantly influence the de-sign of a study eg the goal to choose more appropriate treat-ment options or the objective to diagnose the disease earlier toenable more effective therapies to prevent halt or slow downits progression

Previous success stories in omics-based development ofdiagnostic assays

A variety of multifactorial omics-based biomarker models havebeen translated successfully into diagnostic tests in recentyears in particular in the field of cancer sub-type stratificationA prominent example is the Oncotype DX test to assess a wom-anrsquos risk of recurrence of early-stage estrogen-receptor-positivebreast cancer and the likelihood of benefitting from chemother-apy after surgery This test measures the activity of 21 genes intumor samples and then determines a recurrence score numberbetween 0 and 100 (higher scores reflect greater risk of recur-rence within 10 years) As opposed to other diagnostic testsusing frozen samples the Oncotype DX assay uses tumor tissuesamples that are chemically preserved and sealed in paraffinwax (see [98 99] for details on the sample collection andanalytics)

The development of Oncotype DX involved typical steps ofan omics biomarker profiling and top-down filtering approachfirst by analyzing the entire transcriptome on high-throughputmicroarray data and using knowledge from the literature andgenomic databases 250 candidate marker genes were selected[98] The relation between the expression of these candidatesand the recurrence of breast cancer was then assessed in datafrom three independent clinical studies on 447 patients The re-sults were used for a final filtering providing a panel of 16 can-cer-related genes and 5 reference genes whose expression

Diagnostic specimen classification | 9

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

levels enabled the calculation of recurrence scores for tumorsamples via a machine learning model This diagnostic ap-proach was successfully validated in multiple clinical studiesand has been included in the treatment guidelines for breastcancer by the National Comprehensive Cancer Center Networkand the American Society of Clinical Oncology

While in the case of the Oncotype DX test the set of requiredmarkers could be narrowed down to a small number of geneswith prior knowledge on their relationship to the disease forother complex and more heterogeneous diseases significantlylarger numbers of molecular predictors may be needed for ac-curate diagnosis In such cases pathway- and network-basedmodeling approaches may facilitate the generation of robustand biologically interpretable models which could afterwardundergo similar diagnostic test development and validationprocedures as the initial model behind the Oncotype DX assayImportantly the success of the Oncotype DX approach is not anisolated case but other commercial diagnostic assays havebeen developed and validated using similar strategies includingMammaPrint [100] Prosigna (PAM50) [101] Mammostrat [102]Tissue of Origin [103] AlloMap [104] Corus CAD [105] and OVA1[106] among others

In summary successful translation of omics-based bio-marker models into clinically accepted commercial diagnostictests has been achieved in multiple cases in the past Given alarge number of complex diseases for which more reliable ear-lier and cheaper diagnostic tests are still needed there is a

significant potential to develop improved approaches usingomics-based biomarker modeling and exploiting prior biologicalknowledge from pathways and molecular networks

Conclusions

For diseases with complex molecular manifestations omics-based biomarker models have the potential to better reflectclinically relevant aspects of their heterogeneity than classicalsingle-molecule diagnostic markers However various statis-tical limitations in high-dimensional data analysis often result-ing in a high risk of overfitting and restricted biologicalinterpretability of standard lsquoblack boxrsquo machine learning mod-els can hamper the progress in developing new clinically usefulbiomarker signatures

Among possible strategies to reduce model complexity andintegrate prior biological knowledge into the model buildingprocedure cellular pathway- and molecular network-based pre-diction methods provide promising new routes toward more re-liable omics-based biomarker signatures A wide variety ofbioinformatics approaches have already been developed in re-cent years to identify disease-associated pathway and networkalterations in omics data and use them to classify new samplesWhile public molecular networks and pathway definitions alsohave specific limitations that need to be addressed by the usereg by assigning confidence weights to molecular interactionspathway- and network-level predictive analyses often lead to

Figure 1 Example illustration of common stages during the development of omics-based diagnostic tests (simplified version of the process presented in a study by the

US Institute of Medicine [97] focusing on the major steps in the pipeline) After the transition from the second to the third phase (highlighted by the lock symbol) the

diagnostic test must be fully defined validated and locked down Many important variants and alternatives to the outlined example process exist as well as different

realizations of generic steps in the process (eg cases in which a test directs patient management may cover different situations depending on whether clinicians are

free to use the test result as they see fit or whether predefined procedures have to be followed subject to contraindications andor subject to the test results) The setup

may also vary depending on whether it is known exactly how patients would have been treated had they been randomized to the opposite arm depending on whether

the test entails a treatment delay and whether the adequate cutoff threshold for the test is uncertain

10 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

classification models with improved interpretability and robust-ness in cross-study prediction tasks These new systems-levelapproaches will therefore provide a valuable addition to theexisting toolset for omics-based biomarker modeling

Key Points

bull Omics-based machine learning models for diagnosticspecimen classification often lack biological interpret-ability and have limited cross-study robustness andaccuracy Integrating prior knowledge from cellularpathway and molecular network into these modelscan help to address these shortcomings

bull Public pathway and molecular interaction databasescover only a subset of the biochemical interactionsoccurring in living cells typically lack annotations ontissue-specific reactions and may also contain errorsWhen applying pathway or network analysis methodsthe user should be aware of these limitations and thedifferent options to address or alleviate them eg byusing confidence weighted interactions and estimatingtheir tissue specificity using public gene or protein ex-pression data

bull Pathway- and network-based machine learning meth-ods for diagnostic specimen classification are charac-terized by different strengths and weaknesses Whilepathway-based approaches have the benefit of provid-ing models with high biological interpretability thattake advantage of human expert knowledge on dis-ease-relevant cellular processes their coverage is lim-ited to biomolecules with known annotations On thecontrary network-based models are typically more dif-ficult to interpret but can exploit information on amuch wider range of biomolecules

Funding

This work was supported by the Fonds Nationale de laRecherche Luxembourg (grant no C13BM5782168)

References1 Dubois B Feldman HH Jacova C et al Advancing research

diagnostic criteria for Alzheimerrsquos disease the IWG-2 crite-ria Lancet Neurol 201413614ndash29

2 Chahine LM Stern MB Diagnostic markers for Parkinsonrsquosdisease Curr Opin Neurol 201124309ndash17

3 Tanaka T Tanaka M Tanaka T et al Biomarkers for colorec-tal cancer Int J Mol Sci 2010113209ndash25

4 Collins LG Haines C Perkel R et al Lung cancer diagnosisand management Am Fam Physician 20077556ndash63

5 Martin P Laupland KB Frost EH et al Laboratory diagnosisof Ebola virus disease Intensive Care Med 201541895ndash8

6 Mistry K Cable G Meta-analysis of prostate-specific antigenand digital rectal examination as screening tests for pros-tate carcinoma J Am Board Fam Pract 20031695ndash101

7 Bender R Lange S Adjusting for multiple testing - when andhow J Clin Epidemiol 200154343ndash9

8 Yu L Liu H Redundancy based feature selection for micro-array data In Proceedings of the 10th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining New York NY ACM 2004 737ndash42

9 Ding C Peng H Minimum redundancy feature selectionfrom microarray gene expression data J Bioinform ComputBiol 20053185ndash205

10 Bellman R Bellman RE Bellman RE et al Adaptive control proc-esses a guided tour Princeton Princeton University Press 1961

11 Verleysen M Francois D The curse of dimensionality indata mining Analysis 20053512758ndash70

12 Kim S Kon M DeLisi C Pathway-based classification of can-cer subtypes Biol Direct 2012721

13 Ideker T Sharan R Protein networks in disease Genome Res200818644ndash52

14 Ogata H Goto S Sato K et al KEGG Kyoto encyclopedia ofgenes and genomes Nucleic Acids Res 19992729ndash34

15 Nishimura D BioCarta Biotech Softw Internet Rep 20012117ndash20

16 Pico AR Kelder T Van Iersel MP et al WikiPathways path-way editing for the people PLoS Biol 200861403ndash7

17 Joshi-Tope G Gillespie M Vastrik I et al Reactome a knowl-edgebase of biological pathways Nucleic Acids Res 200533D428ndash32

18 Schaefer CF Anthony K Krupa S et al PID thepathway interaction database Nucleic Acids Res 200937D674ndash9

19 Subramanian A Tamayo P Mootha VK et al Gene setenrichment analysis a knowledge-based approach for inter-preting genome-wide expression profiles Proc Natl Acad SciUSA 200510215545ndash50

20 Kim S-Y Volsky DJ PAGE parametric analysis of gene setenrichment BMC Bioinformatics 20056144

21 Luo W Friedman MS Shedden K et al GAGE generally ap-plicable gene set enrichment for pathway analysis BMCBioinformatics 200910161

22 Oron AP Jiang Z Gentleman R Gene set enrichment ana-lysis using linear models and diagnostics Bioinformatics2008242586ndash91

23 Glaab E Baudot A Krasnogor N et al EnrichNet network-based gene set enrichment analysis Bioinformatics 201228i451ndash7

24 Shannon P Markiel A Ozier O et al Cytoscape a softwareenvironment for integrated models of biomolecular inter-action networks Genome Res 2003132498ndash504

25 Van Iersel MP Kelder T Pico AR et al Presenting and explor-ing biological pathways with PathVisio BMC Bioinformatics20089399

26 Ilyin SE Belkowski SM Plata-Salaman CR Biomarker dis-covery and validation Technologies and integrativeapproaches Trends Biotechnol 200422411ndash16

27 Guo Z Zhang T Li X et al Towards precise classification ofcancers based on robust gene functional expression profilesBMC Bioinformatics 2005658

28 Tomfohr J Lu J Kepler TB Pathway level analysis of geneexpression using singular value decomposition BMCBioinformatics 20056225

29 Lee E Chuang H-Y Kim J-W et al Inferring pathway activitytoward precise disease classification PLoS Comput Biol 20084e1000217

30 Su J Yoon BJ Dougherty ER Accurate and reliable cancerclassification based on probabilistic inference of pathwayactivity PLoS One 20094e8161

31 Glaab E Schneider R PathVar analysis of gene and proteinexpression variance in cellular pathways using microarraydata Bioinformatics 201228446ndash7

32 Svensson JP Stalpers LJA Esveldt-van Lange REE et alAnalysis of gene expression using gene sets discriminates

Diagnostic specimen classification | 11

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer patients with and without late radiation toxicityPLoS Med 200631904ndash14

33 Efroni S Schaefer CF Buetow KH Identification of key proc-esses underlying cancer phenotypes using biologic pathwayanalysis PLoS One 20072e425

34 Vaske CJ Benz SC Sanborn JZ et al Inference of patient-specific pathway activities from multi-dimensional cancer gen-omics data using PARADIGM Bioinformatics 201026i237ndash45

35 Breslin T Krogh M Peterson C et al Signal transductionpathway profiling of individual tumor samples BMCBioinformatics 20056163

36 Ashburner M Ball CA Blake JA et al Gene ontology tool forthe unification of biology The Gene Ontology ConsortiumNat Genet 20002525ndash9

37 Glaab E Garibaldi JM Krasnogor N Learning pathway-baseddecision rules to classify microarray cancer samplesGerman Conference on Bioinformatics 2010173123ndash34

38 Glaab E Bacardit J Garibaldi JM et al Using rule-based ma-chine learning for candidate disease gene prioritization andsample classification of cancer gene expression data PLoSOne 20127e39932

39 Tuck DP Kluger HM Kluger Y Characterizing disease statesfrom topological properties of transcriptional regulatorynetworks BMC Bioinformatics 20067236

40 Ma X Lee H Wang L et al CGI a new approach for prioritiz-ing genes by combining gene expression and protein-protein interaction data Bioinformatics 200723215ndash21

41 Chuang H-Y Lee E Liu Y-T et al Network-based classifica-tion of breast cancer metastasis Mol Syst Biol 20073140

42 Taylor IW Linding R Warde-Farley D et al Dynamic modu-larity in protein interaction networks predicts breast canceroutcome Nat Biotechnol 200927199ndash204

43 Petrochilos D Shojaie A Gennari J et al Using random walksto identify cancer-associated modules in expression dataBioData Min 2013617

44 Rapaport F Zinovyev A Dutreix M et al Classification ofmicroarray data using gene networks BMC Bioinformatics2007835

45 Li C Li H Network-constrained regularization and variableselection for analysis of genomic data Bioinformatics 2008241175ndash82

46 Yang S Yuan L Lai Y-C et al Feature grouping and selectionover an undirected graph KDD 2012922ndash30

47 Lorbert A Eis D Kostina V et al Exploiting covariate similar-ity in sparse regression via the pairwise elastic net Int ConfArtif Intell Stat 20109477ndash84

48 Lorbert A Ramadge PJ The pairwise elastic net support vec-tor machine for automatic fMRI feature selection In ProcIEEE Int Conf Acoust Speech Signal Process Vancouver BC USA20131036ndash40 IEEE Press Washington DC

49 Vlassis N Glaab E GenePEN analysis of network activity al-terations in complex diseases via the pairwise elastic netStat Appl Genet Mol Biol 201514221ndash4

50 Wingender E Chen X Hehl R et al TRANSFAC an integratedsystem for gene expression regulation Nucleic Acids Res200028316ndash19

51 Newman MEJ The structure and function of complex net-works SIAM Rev 200345167ndash56

52 Geman S Geman D Stochastic relaxation Gibbs distribu-tions and the Bayesian restoration of images IEEE TransPattern Anal Mach Intell 19846721ndash41

53 Binder H Schumacher M Comment on ldquoNetwork-constrained regularization and variable selection for ana-lysis of genomic datardquo Bioinformatics 2008242566ndash8

54 Sprinzak E Sattath S Margalit H How reliable are experi-mental protein-protein interaction data J Mol Biol 2003327919ndash23

55 Saxena S Caroni P Selective neuronal vulnerability inneurodegenerative diseases from stressor thresholds todegeneration Neuron 20117135ndash48

56 Suthram S Shlomi T Ruppin E et al A direct comparison ofprotein interaction confidence assignment schemes BMCBioinformatics 20067360

57 Kamburov A Grossmann A Herwig R et al Cluster-basedassessment of protein-protein interaction confidence BMCBioinformatics 201213262

58 Li D Liu W Liu Z et al PRINCESS a protein interaction confi-dence evaluation system with multiple data sources MolCell Proteomics 200871043ndash52

59 Bader JS Chaudhuri A Rothberg JM et al Gaining confidencein high-throughput protein interaction networks NatBiotechnol 20042278ndash85

60 Von Mering C Huynen M Jaeggi D et al STRING a databaseof predicted functional associations between proteinsNucleic Acids Res 200331258ndash61

61 Franceschini A Szklarczyk D Frankild S et al STRING v91protein-protein interaction networks with increased cover-age and integration Nucleic Acids Res 201341D808ndash15

62 Sales G Calura E Cavalieri D et al Graphite - a Bioconductorpackage to convert pathway topology to gene network BMCBioinformatics 20121320

63 Su AI Wiltshire T Batalov S et al A gene atlas of the mouseand human protein-encoding transcriptomes Proc Natl AcadSci USA 20041016062ndash7

64 Ponten FK Schwenk JM Asplund A et al The human proteinatlas as a proteomic resource for biomarker discovery JIntern Med 2011270428ndash46

65 Bossi A Lehner B Tissue specificity and the human proteininteraction network Mol Syst Biol 20095260

66 Magger O Waldman YY Ruppin E et al Enhancing the pri-oritization of disease-causing genes through tissue specificprotein interaction networks PLoS Comput Biol 20128e1002690

67 Shipley B Cause and Correlation in Biology A Userrsquos Guide toPath Analysis Structural Equations and Causal Inference Vol 2Cambridge Cambridge University Press 2002 p 332

68 Society TE Granger CWJ Investigating causal relations byeconometric models and cross-spectral methodsEconometrica 196937424ndash38

69 Chindelevitch L Ziemek D Enayetallah A et al Causal rea-soning on biological networks interpreting transcriptionalchanges Bioinformatics 2012281114ndash21

70 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 2012652

71 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 20144988

72 Hudson NJ Reverter A Dalrymple BP A differential wiringanalysis of expression data correctly identifies the gene con-taining the causal mutation PLoS Comput Biol 20095e1000382

73 Reverter A Hudson NJ Nagaraj SH et al Regulatory impactfactors unraveling the transcriptional regulation ofcomplex traits from expression data Bioinformatics 201026896ndash904

74 Watson M CoXpress differential co-expression in gene ex-pression data BMC Bioinformatics 20067509

12 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1

microarray cancer data by using the new pathway-level featuresas input for a Bayesian Linear Discriminant Classifier the au-thors obtained similar or better predictive performance as com-pared with previously reported results for the test data

Pathway topology information is also exploited in the approachby Vaske et al [34] who converted cellular pathway diagrams intoa factor graph model describing the interactions between theinvolved biomolecules The method can integrate informationfrom different types of omics data for a biological sample and usesan Expectation Maximization (EM) algorithm to infer probabilisticpathway activity scores for each pathwaysample combinationWhen applying the approach to glioblastoma multiform gene ex-pression and copy number data and grouping patients based ontheir pathway alterations the identified subgroups displayed sig-nificant differences in quantitative survival in a KaplanndashMeier ana-lysis Robust patient stratification results were only obtained usingthe pathway-based scores and not via gene-level analyses

Apart from considering the topology of molecular interactionswithin a pathway studying alterations in all the downstream tar-gets of pathway members (ie including targets not belonging tothe pathway) may provide an additional source of information tofurther improve pathway activity scoring Breslin et al [35] pre-sented a corresponding downstream analysis approach whichdetects variation in pathway activity from the degree of correl-ation between downstream targets of a pathway Using two cor-relations scores one considering all pairs of downstream targetsand one considering only pairs without common transcriptionfactors they identified several differentially active pathways inmicroarray data from case-control studies Additionally theydefined an activity score for pathways based on the average ex-pression of their downstream target genes and showed that thesample-wise activities for several pathways are significantlyassociated with clinical sample classifications

Finally although all approaches considered so far use pathway-level sample classification as a replacement for gene-level classifi-cation these are not exclusive options Kim et al [12] proposed tobuild biomarker models from gene expression data at two levelsthe gene and pathway level via a hierarchical feature structureThis structure organizes the individual gene-level features into se-cond-level aggregate features (pathways) used by a support vectormachine (SVM) algorithm to classify samples The authors con-sidered three alternative approaches GLEG (GSEA-based LeadingEdge Gene feature method using predictor genes derived fromgene set enrichment analysis [19] which as a group maximally dif-ferentiate between the sample classes) GPF (GSEA Pathway Featuremethod using enriched pathways as features) and SPF (SVM-basedPathway Feature method using pathway features derived fromSVM) When comparing these methods against classical gene-leveland random gene set predictors using cross-validation and cross-study prediction on microarray cancer data in the majority ofcases the newly proposed methods achieved better results

In summary a wide range of pathway activity inferencemethods and machine learning approaches are available whichcan be combined effectively for pathway-based diagnosticclassification of biological samples While pathway-basedprediction does not necessarily provide superior predictive per-formance in within-data set cross-validation experimentsimproved robustness and accuracy was observed consistentlyin cross-study prediction tasks

Network-based biomarker modeling

Although manually curated pathways have many benefits forthe biological interpretation of large-scale omics data in living

cells metabolic and signaling pathways are not isolated butinterconnected within large and complex molecular and regula-tory networks These networks often include several genes pro-teins or metabolites that are not annotated for any pathwayand therefore ignored by pathway-based analysis methodsConsequently to identify disease-associated modules of inter-connected biomolecules in a more unbiased manner (ie with-out restricting the search space to biomolecules with knownpathway annotations) network-based analysis methods havebeen introduced While pathway-based approaches for bio-marker modeling may have advantages in terms of model inter-pretability the search space exploration in network-basedbiomarker discovery is not restricted by subjectively definedpathway boundaries and the genome-scale molecular networksused as input typically cover significantly larger numbers of bio-molecules than all combined pathways Nevertheless similar tosubjectively defined pathways networks assembled from publicdata sources suffer from various limitations eg missing mo-lecular interactions and lack of tissue-specific annotations andthese issues have to be addressed by dedicated methods (seesection on lsquolimitations and possible solution strategiesrsquo below)In the following two main types of network-based modelingapproaches will be discussed First two-step sequentialapproaches which score the activity in molecular subnetworksand afterward use these activities for predictive machine learn-ing and secondly one-step network analysis approacheswhich exploit network topology information directly within thepredictive model building

Two-step network activity scoring and predictionapproaches

Network activity over multiple interconnected biomoleculescan be summarized and scored using similar averaging or di-mension reduction approaches as in pathway activity scoringmethods However in contrast to the straightforward use ofpredefined pathway definitions first a molecular or regulatorynetwork has to be assembled or reconstructed using either pub-lic molecular interaction databases or applying network infer-ence methods to omics data (in Table 2 an overview of differentmethodologies is shown which are discussed in the following)

A first method to construct new sample-specific gene regu-latory networks for transcriptomics sample classification wasproposed by Tuck et al [39] The networks were generated bydetermining the graph-theoretic intersection between a staticconnectivity network (representing transcription factor bindingto gene promoter regions) obtained using data from theTRANSFAC database [50] with sample-specific coexpressionnetworks (representing transcription-factorndashtarget genecoexpression) derived from gene expression data To extractdiscriminative features for diagnostic specimen classificationfrom these networks they proposed a link-based classificationapproach comparing the activity status of gene regulatoryinteractions (called lsquolinksrsquo) across different sample groups anda degree-based classification method comparing topologicalcentrality measures [51] for the networks When testing theseapproaches on data from different cancer case-control studieshigh cross-validated accuracies were reported for both cell typeand patient sample classification Moreover the network-basedanalysis enabled the authors to identify key transcriptionalregulators altered under specific disease conditions

Instead of constructing new regulatory networks discrim-inative disease-associated network alterations can also be iden-tified by computationally mapping omics data onto in silico

4 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

representations of biochemical proteinndashprotein interaction (PPI)networks Ma et al [40] developed a corresponding approach toobtain more reliable disease association scores for genes by ex-ploiting neighborhood information from a PPI network Theyused a modified Pearson correlation coefficient to assess the as-sociation between microarray gene expression and numeric val-ues encoding the disease status of the samples (taking intoconsideration that these phenotype values may not have a nor-mal distribution) and assigned the Fisher-transformed gene-phenotype association scores to the corresponding proteins in aPPI Next they recalibrate these association scores by modelingthe underlying true scores for each gene using Markov RandomField theory [52] reestimating their values from weighted con-tributions of their network neighborsrsquo original associationscores (the weights are determined according to different net-work neighborhood definitions using either direct neighborsshortest path or diffusion kernel neighborhoods see [40] fordetails) When evaluating the utility of the recalibrated scoresfor disease gene prioritizations on microarray data using knownGene Ontology functional annotations conventional

prioritization approaches using only gene expression or PPI datawere outperformed (although the scoring approach could alsobe used for predictive model building this particular applicationwas not considered)

While the approach by Ma et al focuses on improving thedisease-association scores for individual genes Chuang et al[41] presented a method identifying and scoring entire disease-related subnetworks similar to their pathway association scor-ing approach discussed above (see Lee et al [29]) After comput-ing the mutual information (MI) between sample phenotypevalues (encoding the presence or absence of a disease) and dis-cretized expression values for each gene from a microarray dataset assigned to the proteins in a PPI they applied a greedysearch to expand the seed nodes in the network with locallymaximal MI scores Specifically each seed node was expandedsuch that the sum of scores for the expanded network moduleis maximized (the search stops when no extension increasesthe total score above a predefined improvement rate) Whentraining logistic regression classifiers on the normalized andaveraged activities of the resulting subnetworks for breast

Table 2 Overview of network-based methods for machine learning analysis of omics data sequential network activity scoring and predictionmethods are shown on top whereas machine learning approaches using embedded network-based feature selection are listed below the boldblack line

Methodologypublication

Network activityalteration scoring method Prediction method

Tuck et al [39] Sample-specific gene regulatory networks are constructed andsubnetwork activity is scored by summing over activeinteractions

Nearest neighbors decision tree NaıveBayes among others

Ma et al [40] Disease association is scored for genes based on gene expres-sion data and their neighborsrsquo association scores in a PPI net-work using Markov Random Field theory

The approach is evaluated for disease geneprioritization but is applicable for predict-ive feature selection in combination withany prediction method

Chuang et al [41] Normalized gene expression data is mapped onto a proteininteraction network and discriminative subnetworks areidentified via a greedy search procedure

Logistic regression

Taylor et al [42] Hub nodes in protein interaction networks are determined andthe relative gene expression of hubs with each of their inter-acting partners is computed to identify hubs with diverserelative expression across sample groups

Affinity propagation clustering is used to as-sign a probability of poor prognosis tobreast cancer patients

Petrochilos et al [43] A random walk community detection algorithm is applied todiscover modules in a molecular interaction network andgene expression data is used to identify disease-associatedmodules

The approach is used to identify cancer-asso-ciated network modules and validated byscoring the enrichment of known cancer-related genes extracted from the OMIMdatabase

Rapaport et al [44] Spectral decomposition of gene expression profiles is appliedwith respect to the eigenfunctions of a network graphattenuating the high-frequency components of the expres-sion profiles with respect to the graph topology

SVM

Li et al [45] A network-constrained regularization procedure for linear re-gression analysis is used to identify disease-related discrim-inative subnetworks

Penalized linear regression

Yang et al [46] Three machine learning methods for graph-guided feature se-lection and grouping are proposed including a convex func-tion and two non-convex formulations designed to reducethe estimation bias

Penalized least squares-based approach(GOSCAR Graph octagonal shrinkage andclustering algorithm for regression)

Lorbert et al [47 48] A sparse regression approach is proposed using the PEN pen-alty to favor the grouping of strongly correlated featuresbased on pairwise similarities (eg derived from a molecularinteraction graph)

Penalized regression (PEN penalty)

Vlassis et al [49] Penalized logistic regression is applied using a convex PEN pen-alty function (see approach by Lorbert et al) with absolutefeature weights to better reflect the relevance of discrimina-tive genes in the feature selection

Penalized logistic regression (PEN penaltywith absolute feature weights)

Diagnostic specimen classification | 5

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer data the authors found that the subnetwork markerswere more reproducible than single-gene markers andprovided higher accuracy in distinguishing metastatic fromnonmetastatic tumors

As an intermediate solution between focusing on individualbiomolecules and entire network modules Taylor et al [42] pro-posed a method that investigates network nodes with outstand-ing topological properties and their direct neighbors Aftercomputationally mapping breast cancer gene expression dataonto the in silico representation of a PPI network they deter-mined proteins with large numbers of biochemical interactionpartners (so-called lsquohub nodesrsquo) and computed their relative ex-pression compared with each of these interacting partnersThey then determined for which hubs the relative expressiondiffered significantly between long-term survivors and patientswho died from the disease and applied a clustering approach toassign a probability of poor prognosis to new patient samples(the specific method used is known as lsquoaffinity propagationclusteringrsquo in the literature) The approach was evaluated using5-fold cross-validation providing accuracy sensitivity and spe-cificity estimates that compared favorably with the reportedresults for commercially available genomic breast cancerdiagnostics

Instead of considering topological properties of individualnodes in a molecular network information from a networkgraph can also be extracted via algorithms for finding sub-graphs which stand out in terms of their high density ofmolecular interactions (using approaches referred to as lsquocom-munity identificationrsquo or lsquograph clusteringrsquo methods in the lit-erature) Petrochilos et al [43] proposed a correspondingapproach which first applies a graph-based random walk algo-rithm on a genome-scale molecular network Information fromcancer-related gene expression data was then integrated intothe network by setting the weight of each network node as themaximum fold change of probes corresponding to its gene sym-bol (weights for biochemical interactions are determined by thesquare of the mean of the absolute fold changes of the relevantinteraction partners) Finally the score of a network module ofconnected nodes was obtained by comparing its cumulative ac-tivity (ie the square of the average weighted expression for allits nodes) against a bootstrap distribution of cumulative activ-ities obtained via random sampling of a matched number offold change values When testing the enrichment of known can-cer genes in the top-scored network modules identified withthis approach a similar or better performance was reached incomparison with other widely used module-finding algorithms(potential alternative applications of the identified modules forbiomarker modeling were not assessed in this publication)

Apart from averaging molecular activities over networkneighborhoods or using community identification methods sig-nal processing techniques may provide a further means to gleanuseful information from a network for predictive model build-ing as shown in an approach by Rapaport et al [44] They madeuse of the observation that genes in close proximity to eachother in a network tend to have similar expression and pro-posed to denoise microarray measurements by removing theirhigh-frequency component over the network For this purposespectral decomposition of gene expression profiles with respectto a molecular network graph was applied followed by attenu-ating high-frequency signal components expected to representmeasurement noise The method was evaluated for supervisedanalysis of irradiated and nonirradiated yeast strains using aSVM providing similar classification performance as a modelbuilt without the network-based filtering but facilitating

biological data interpretation by grouping the selected biomol-ecules according to their participation in the network modules

One-step machine learning approaches for networkanalysis

In contrast to the network analysis approaches considered sofar which apply network feature extraction and predictive ma-chine learning analysis in separate steps more recently one-step network-based feature selection approaches have beenproposed integrating the attribute selection directly into thepredictive model building Most of these approaches formulatethe model building task as an optimization problem formula-tion in which the objective function for classification or regres-sion is extended by a penalty term promoting the selection ofgrouped features in a molecular network (this strategy is alsoreferred to as network-constraint regularization)

Li et al [45] proposed one of the first correspondingapproaches by adding a penalty term to linear regression incor-porating network information into the analysis via theLaplacian matrix of the network graph The approach penalizesthe L1-norm of the feature weights and encourages a smoothprofile of weights over neighboring nodes in the networkHowever Binder and Schumacher later reported that themethod has lower performance than a null model ie a modelnot using any covariate information [53] As possible explan-ations they note that Li et al discarded censored observationsand about 20 000 variables that could not be assigned to corres-ponding nodes in the molecular interaction network (see sec-tion on lsquolimitationsrsquo below) Yang et al [46] suggested that thepreviously used network grouping penalties can introduce add-itional estimation bias into the model when the coefficientsigns for two features connected in the graph are differentThey presented alternative penalties to achieve network group-ing and sparse feature selection in particular two non-convexpenalties which shrink only small differences in absolute val-ues of feature weights to reduce the estimation bias [46] In ex-periments on synthetic data and two real data sets the newapproaches outperformed previous feature grouping methods

However with non-convex penalties finding global optimalsolutions is often not feasible and even identifying good localoptima may require high computational effort Lorbert et al [4748] proposed an alternative generic convex penalty thePairwise Elastic Net (PEN) that provides sparse feature selectionand promotes the grouping of attributes according to auser-defined feature similarity measure (eg obtained from bio-chemical interaction weights in a molecular network) PEN is ageneralization of the Elastic Net a method providing a trade-offbetween L1- and L2-penalized regressions by an adjustable par-ameter In PEN this parameter can be replaced to determine thetrade-off using additional information from an attributesimilarity matrix (different instances of PEN can be defined aslong as the similarity matrix is positive semidefinite and non-negative) Comparing PEN against other popular machine learn-ing approaches on simulated data with a grouping structureamong the features PEN achieved a competitive mean squarederror (MSE) and provided sparser solutions than approacheswith similar MSE

More recently Vlassis et al introduced a new instance ofPEN which penalizes differences between the lsquoabsolutersquo valuesof weights of interlinked features in a network graph The mo-tivation behind this approach termed GenePEN is that themagnitude of a weight in a linear model reflects the predictivevalue of the corresponding variable so that the weights for

6 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

irrelevant features are driven to zero by the penalty By ensuringthe convexity of the penalty function global optimal solutionscan be identified efficiently with existing optimization frame-works When evaluating GenePEN on simulated data andreal-word microarray data sets in comparison with other classi-fication methods using feature grouping the method providedsimilar predictive power and gene selections sharing signifi-cantly more connections within a molecular interaction net-work Visualization of the corresponding subnetworks enableda biological interpretation of disease-affected network regionswhich were enriched in known disease-related genes obtainedfrom literature mining

Overall network-based sample classification methods pro-vide a new means to analyze complex omics data sets allowingresearchers to identify coherent molecular network alterationsunder different biological conditions Identifying such network-level patterns in omics data for diseases with complexmolecular manifestations can shed new light on the molecularmechanisms of the disease and facilitate the development ofrobust multifactorial biomarker signatures

In contrast to single-molecule-based biomarker modeling anetwork-level approach has the potential to capture diverse fac-ets of a heterogenic disease reflected via alteration patterns indifferent network regions In comparison with pathway-basedmachine learning approaches methods using genome-scalednetworks as prior knowledge may produce models that are moredifficult to interpret biologically but which can identify a muchwider range of alterations in cellular processes (covering manygenes proteins or metabolites without any known pathway an-notations) Finally network- and pathway-based classificationapproaches share the main benefit of improving model robust-ness in cross-study analyses as compared with using individualbiomolecules as features Among these new higher-level bio-marker signatures network-based signatures accounting formolecular activities over larger and algorithm-derived networkregions may often provide more robust multifactorial markersthan signatures for smaller pathways which are typicallydefined subjectively possibly overlooking relevant functionallyrelated molecules in the surrounding network However themodel robustness will also depend on other factors eg the oc-currence of protein complexes in the studied pathwaynetwork(members of these complexes tend to have highly coordinatedactivity providing more robust averages) and the reliability andcompleteness of the specific network or pathway data sourceused (see limitations discussed in the following section)

Limitations and possible solution strategies

While pathway- and network-based analyses of omics data canenrich the biological interpretation of complex molecular alter-ation patterns and facilitate robust biomarker modeling usersshould also be aware of common limitations behind thesemethods and possibilities to address them

Although detailed functional annotations have been col-lected for a large portion of genes in humans and commonmodel organisms many identified genes still lack any func-tional assignment Consequently a significant portion of genesand proteins analyzed via current high-throughput measure-ment techniques can often not be assigned to any pathway ormolecular network for systems-level analyses Similarly sev-eral gene regulatory interactions and PPIs are still unknownand experimental techniques to identify molecular interactionsmay also generate false-positive results [54] Thus public mo-lecular interaction databases only cover a subset of biochemical

interactions occurring in living cells and the reliability of re-ported interactions varies depending on the amount and type ofassociated evidence

A further limitation of current pathway databases andassembled molecular networks is that tissue- and cell-type spe-cificity of molecular interactions and protein functions is typic-ally not considered in these data resources This can beparticularly problematic when studying diseases showing se-lective vulnerability for specific tissues or cell types eg asfound in many neurodegenerative disorders [55] Similarly inmany biomarker profiling studies and public omics data sour-ces concomitant data from different tissue and cell types for adisease of interest is lacking as well as longitudinal datarequired to identify relevant dynamic changes during diseasepathogenesis and progression Moreover a variety of limita-tions related to the specific origin processing and storage ofspecimens collected for biomarker studies may negatively im-pact subsequent analyses and the comparability of data fromdifferent studies For example studies might differ in terms ofthe sample storage durations in terms of whether the referencetissue for tumor tissue is derived from the same patient or froman unaffected control individual whether a preinfarction speci-men is accepted as a reference for the specimen obtained on in-farct suspicion or whether laser capture microdissection isapplied to isolate specific cells of interest rather than using het-erogeneous cell populations

To account for the varying reliability of public biochemicalinteraction data when assembling a molecular network variousapproaches to determine confidence scores have been proposed[56ndash59] One of the most comprehensive resources is theSTRING PPI database which allows users to filter and extractbiochemical interactions for a wide range of organisms usingdifferent sources of evidence as well as a combined confidencescore [60 61]

A frequently encountered difficulty in pathway analyses isalso the conversion of the information content of public path-way diagrams into adequate data structures for computationalinvestigations Pathway structures usually contain more com-plex layers of information than genendashgene networks eg meta-nodes representing protein complexes or gene families andcatalysis reactions represented by edges pointing to edgesDedicated computational approaches are therefore required toconvert pathway diagrams into uniform network representa-tions eg using biology-driven rules as in the RBioconductorsoftware package lsquographitersquo [62]

Apart from this conversion task building tissue-specificpathwaynetwork representations or assigning probabilities fortissue specificity to the biological interactions in an existingnetwork is a further challenging problem because interactionsare rarely confirmed experimentally in multiple specific tissuesMost bioinformatics approaches for assigning tissue specificityscores to protein interactions use publicly available gene or pro-tein expression data from large-scale tissue profiling studies[63 64] eg predicting the potential for a molecular interactionto occur in a certain tissue depending on whether both inter-action partners are expressed in this tissue [65] Moreover dedi-cated pathway- and network-analysis methods have beendeveloped for tissue-specific disease gene prioritization andgene set enrichment analysis [23 66] which may also provideuseful filters for feature selection in machine learning methodsfor sample classification

While changes in the mean or variance of the overall mo-lecular activity in pathways can provide robust biomarkers forsome diseases pathways may also contain forks and only one

Diagnostic specimen classification | 7

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

of their branches or stems may be affected in a pathologicalcondition Approaches focusing on changes in the overall path-way activity may therefore fail in detecting significant alter-ations affecting only one pathway branch Although networkanalysis methods identifying altered subnetworks may be help-ful in this situation (using a pathway graph as input instead of agenome-scale molecular network) ideally available knowledgeon distinct pathway branches or outstanding regions in a com-plex pathway architecture should be exploited directly eg bypartitioning a pathway into corresponding lsquosubpathwaysrsquo andseparately analyzing their molecular alterations using existingpathway-level bioinformatics tools

Another challenging and still largely unsolved problem inthe analysis of alterations for pathways networks and singlebiomolecules is the distinction between omics changes repre-senting causes or effects (or lsquodriversrsquo or lsquopassengersrsquo) of a dis-ease Apart from philosophical disagreements on howcausation should be defined [67] the discrimination betweencorrelated events and causally linked events on high-through-put omics data sets is typically severely hampered by largenumbers of potential confounding variables and complexnonlinear relationships between the attributes Thus conclu-sive evidence for causal links can usually only be obtained fromin vitro and in vivo perturbation experiments rather than fromhigh-throughput data analysis even when using high-qualitytime-series data with large sample sizes and dedicated statisticsfor identifying causendasheffect relationships (eg the Granger caus-ality test [68]) Prior biological knowledge eg on transcriptionfactorndashtarget relationships can however enable the construc-tion of causal graphs to check whether omics measurementsare consistent with certain causal molecular hypotheses or toprioritize corresponding hypotheses for experimental validation[69] Bill Shipley has written a dedicated book on methods forcausal analysis of observational data in biology [67] Theseapproaches are not only relevant for identifying drug targetsand investigating disease etiology and progression but mayalso provide useful information for the biological interpretationof diagnostic biomarker models

Improving incomplete pathways and networks by identify-ing missing molecular interactions is arguably one of the mostchallenging problems in this domain Pathway definitions aretypically created in a subjective manner because no universalpredefined criteria for determining pathway membership andboundaries are available and the curators of the correspondingdatabases are required to make judgments on the relevance ofpotential pathway members These subjective definitions maybe prejudiced and erroneous and even networks assembledautomatically from public databases may contain biases eg re-sulting from prior subjective decisions on which putative PPIsare of sufficient interest for experimental validation Moreoverin diseases the pathway topology may change and genericpathway databases have recently been complemented by dedi-cated pathway maps for specific human disorders egAlzPathway [70] for Alzheimerrsquos disease and PDMap [71] forParkinsonrsquos disease Because these disease-related pathwaymaps typically cover the biomolecules involved in a specific dis-order and the potentially altered network topology more com-prehensively and accurately than generic pathway mapsdisease-specific pathways should be considered as the pre-ferred resource for pathway-based diagnostic biomarker model-ing (in the specific case of a known disease-associated topologyalteration ideally a corresponding pathway map with un-altered topology should be used additionally as a reference inthe modeling process)

Moreover dedicated bioinformatics techniques have beendeveloped to identify lsquonetwork rewiringrsquo events eg brokenchains or new interactions changing the pathwaynetwork top-ology which remain undetected by conventional pathway ana-lyses Instead of assuming that molecules close to each other ina network undergo similar changes in a disease theseapproaches identify alterations in the correlations between theactivities of these molecules as signatures for potential patho-logical changes in network topology [72ndash74] However the top-scoring results require experimental verification eg using invitro disease models and the success may strongly depend onwhether reliable prior pathway knowledge is available

To support relevant manual pathway curation efforts com-putational approaches for ab initio discovery of pathways withingenetic or molecular networks have been proposed [75ndash78] aswell as bioinformatics methods to extend existing pathway def-initions according to objective criteria for pathway compactnessand connectedness [79] Moreover various machine learningapproaches for predicting unknown PPIs using public structuraland functional information have been developed [80ndash84] egthe authors of the PrePPI method report validation results sug-gesting their prediction algorithm is comparable in accuracy tohigh-throughput experiments [85] For proteins with unknownstructure and few functional annotations these in silico predic-tion approaches are however still bound to provide limited reli-ability and only high-confidence biochemical interactionsshould be included in networks for omics data analysis

Finally to reduce the influence of false-positive predictionson network analyses molecular interactions should not only beprefiltered by confidence when assembling the network butalso weighted by confidence to apply analysis techniquesexploiting these weights to score the reliability of identified net-work alterations In subsequent biological validation experi-ments investigators can then focus on verifying only the mostreliable network alteration patterns derived from the in silicoanalyses

Translating multifactorial marker models intodiagnostic testsPathways from omics research discoveries to clinicaldiagnostic assays

High-throughput omics measurement techniques are typicallynot designed for diagnostic applications but for broad systems-level analyses hypothesis generation and the construction offirst tentative machine learning models for sample classifica-tion Such tentative models require subsequent refinement andvalidation using more sensitive and reproducible measurementtechniques to evaluate their potential for diagnostic applica-tions For example a sample classification model built andcross-validated using microarray gene expression data with anembedded feature selection to choose only the most inform-ative genes as predictors can be validated using more accuratequantitative reverse transcription polymerase chain reaction(qRT-PCR) measurements for the subset of chosen genes

Importantly to avoid wrong conclusions in the evaluation ofdiagnostic classification models adequate statistical methodshave to be chosen to assess a modelrsquos overall predictive per-formance (quantifying how close predictions are to the actualoutcome) its calibrationreliability (measuring how close to x of100 individuals with a risk prediction of x have the outcome)and its discriminative ability (determining whether individualswith the outcome have higher risk predictions than those

8 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

without) [86] Owing to the inherent uncertainty associated withdiagnostic tasks predictions should be provided in a probabilis-tic rather than deterministic form [87] and the overall per-formance should be quantified using so-called lsquoproper scoringrulesrsquo for which the expected score is optimized when the pre-dictive distribution agrees with the true distribution of thequantity to be estimated (a corresponding example is the Brierscore for binary and survival outcomes [88]) Conversely the op-timization of models with respect to conventional discontinu-ous non-error rates like the percentage correct classificationcan provide misleading results eg when the predicted proba-bilities are close to the chosen decision threshold required forthese measures [89] To assess the calibration of a model theHosmerndashLemeshow lsquogoodness-of-fitrsquo test can be used [90] andthe concordance statistic to quantify the discrimination ability[91] If a reference prediction system is already available dedi-cated measures of the relative improvement achieved with anew prediction method should additionally be computed(referred to as the lsquoskillrsquo eg quantified via the Brier Skill Score[92]) Moreover decision-analytic approaches like decision curveanalysis [93] designed to assess the net benefit achieved makingdecisions according to model predictions should be applied if themodel is to be used to direct clinical patient management [86]

For the study design initial power calculations are requiredto ensure that sufficient sample sizes are available for all statis-tical assessments [94] This also involves choosing an adequatesplitting of the measured data into training test and validationsets and selecting suitable cross-validation or resampling tech-niques for model optimization and evaluation (eg using two-level external cross-validation [95]) [96]

Importantly clinical validation does not only necessitatesignificantly larger sample sizes than most research studiesbut also independent replication tests on data from other pa-tient cohorts the clear specification of the biological rationalebehind the method and a demonstration of its clinical utility Incontrast to the regulatory framework for drugs there are mul-tiple pathways for the translation of omics-based tests intovalidated in vitro diagnostic test devices These tests can bedeveloped and validated either through review by the Food andDrug Administration (FDA) or through validation and perform-ance by a specific laboratory certified according to ClinicalLaboratory Improvement Amendments (CLIA) [97]

Because using established medical product developmentpipelines as in pharmaceutical companies is not common prac-tice in academia for many biomedical research institutions anearly collaboration with an experienced industrial partner isoften advisable Although currently no unique and widely rec-ognized standard process for translating omics researchfindings into clinical diagnostics is available common recom-mendations by widely recognized health organizations can befollowed In particular a committee by the US Institute ofMedicine has conducted a study on omics-based clinical test de-velopment and proposed a generic process for the developmentand evaluation of these tests as a recommended guideline [97]A corresponding example process which is briefly outlined forillustration purposes in Figure 1 and not meant to cover all im-portant variations starts with the discovery phase in which acandidate biomarker model is built on a training set lockeddown and evaluated on a test data set (this set of samplesshould be completely independent from the training set) In thefollowing test validation phase after institutional review boardapproval and consultation with the FDA a CLIA-certified labora-tory defines and optimizes the diagnostic test method clinicallyand biologically validates the test on a blinded sample set and

implements the test according to current clinical laboratorystandards

Interestingly the authors of the guideline highlight that afrequent shortcoming of omics-based tests is the lack of a biolo-gical rationale behind the testmdashwhile single-molecule markersare often known to play a role in the disease multifactorialomics models obtained from machine learning are often moredifficult to interpret and involve a greater risk of overfittingNew pathway- and network-based modeling techniques as dis-cussed in this review could therefore help to address some ofthese shortcomings and provide more interpretable and robustmodels as opposed to classical lsquoblack boxrsquo machine learningmodels

In the following stage of the clinical development processthe locked-down test is evaluated for clinical utility via one ofthe following approaches (i) A prospectivendashretrospective studyusing archived specimens from previous clinical trials (ii) a pro-spective clinical trial in which the test (a) directs patient man-agement or (b) does not direct patient management [97] Thecomplexity and duration of a corresponding clinical study ortrial will largely depend on the specific type of biomarker de-veloped and the proposed clinical benefit For diagnostic bio-markers focused on in this review procedures may varysignificantly depending on whether the test is designed to de-tect the presence severity or the subtype of a diseasePrognostic biomarkers which indicate the future clinical courseof a patient with regard to a specific outcome and predictivebiomarkers which predict the responders and the extent of sus-ceptibility to a particular drug effect will also require differentdevelopment and evaluation procedures than diagnosticmarkers Finally for each type of biomarker different clinicalbenefits may be envisaged and significantly influence the de-sign of a study eg the goal to choose more appropriate treat-ment options or the objective to diagnose the disease earlier toenable more effective therapies to prevent halt or slow downits progression

Previous success stories in omics-based development ofdiagnostic assays

A variety of multifactorial omics-based biomarker models havebeen translated successfully into diagnostic tests in recentyears in particular in the field of cancer sub-type stratificationA prominent example is the Oncotype DX test to assess a wom-anrsquos risk of recurrence of early-stage estrogen-receptor-positivebreast cancer and the likelihood of benefitting from chemother-apy after surgery This test measures the activity of 21 genes intumor samples and then determines a recurrence score numberbetween 0 and 100 (higher scores reflect greater risk of recur-rence within 10 years) As opposed to other diagnostic testsusing frozen samples the Oncotype DX assay uses tumor tissuesamples that are chemically preserved and sealed in paraffinwax (see [98 99] for details on the sample collection andanalytics)

The development of Oncotype DX involved typical steps ofan omics biomarker profiling and top-down filtering approachfirst by analyzing the entire transcriptome on high-throughputmicroarray data and using knowledge from the literature andgenomic databases 250 candidate marker genes were selected[98] The relation between the expression of these candidatesand the recurrence of breast cancer was then assessed in datafrom three independent clinical studies on 447 patients The re-sults were used for a final filtering providing a panel of 16 can-cer-related genes and 5 reference genes whose expression

Diagnostic specimen classification | 9

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

levels enabled the calculation of recurrence scores for tumorsamples via a machine learning model This diagnostic ap-proach was successfully validated in multiple clinical studiesand has been included in the treatment guidelines for breastcancer by the National Comprehensive Cancer Center Networkand the American Society of Clinical Oncology

While in the case of the Oncotype DX test the set of requiredmarkers could be narrowed down to a small number of geneswith prior knowledge on their relationship to the disease forother complex and more heterogeneous diseases significantlylarger numbers of molecular predictors may be needed for ac-curate diagnosis In such cases pathway- and network-basedmodeling approaches may facilitate the generation of robustand biologically interpretable models which could afterwardundergo similar diagnostic test development and validationprocedures as the initial model behind the Oncotype DX assayImportantly the success of the Oncotype DX approach is not anisolated case but other commercial diagnostic assays havebeen developed and validated using similar strategies includingMammaPrint [100] Prosigna (PAM50) [101] Mammostrat [102]Tissue of Origin [103] AlloMap [104] Corus CAD [105] and OVA1[106] among others

In summary successful translation of omics-based bio-marker models into clinically accepted commercial diagnostictests has been achieved in multiple cases in the past Given alarge number of complex diseases for which more reliable ear-lier and cheaper diagnostic tests are still needed there is a

significant potential to develop improved approaches usingomics-based biomarker modeling and exploiting prior biologicalknowledge from pathways and molecular networks

Conclusions

For diseases with complex molecular manifestations omics-based biomarker models have the potential to better reflectclinically relevant aspects of their heterogeneity than classicalsingle-molecule diagnostic markers However various statis-tical limitations in high-dimensional data analysis often result-ing in a high risk of overfitting and restricted biologicalinterpretability of standard lsquoblack boxrsquo machine learning mod-els can hamper the progress in developing new clinically usefulbiomarker signatures

Among possible strategies to reduce model complexity andintegrate prior biological knowledge into the model buildingprocedure cellular pathway- and molecular network-based pre-diction methods provide promising new routes toward more re-liable omics-based biomarker signatures A wide variety ofbioinformatics approaches have already been developed in re-cent years to identify disease-associated pathway and networkalterations in omics data and use them to classify new samplesWhile public molecular networks and pathway definitions alsohave specific limitations that need to be addressed by the usereg by assigning confidence weights to molecular interactionspathway- and network-level predictive analyses often lead to

Figure 1 Example illustration of common stages during the development of omics-based diagnostic tests (simplified version of the process presented in a study by the

US Institute of Medicine [97] focusing on the major steps in the pipeline) After the transition from the second to the third phase (highlighted by the lock symbol) the

diagnostic test must be fully defined validated and locked down Many important variants and alternatives to the outlined example process exist as well as different

realizations of generic steps in the process (eg cases in which a test directs patient management may cover different situations depending on whether clinicians are

free to use the test result as they see fit or whether predefined procedures have to be followed subject to contraindications andor subject to the test results) The setup

may also vary depending on whether it is known exactly how patients would have been treated had they been randomized to the opposite arm depending on whether

the test entails a treatment delay and whether the adequate cutoff threshold for the test is uncertain

10 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

classification models with improved interpretability and robust-ness in cross-study prediction tasks These new systems-levelapproaches will therefore provide a valuable addition to theexisting toolset for omics-based biomarker modeling

Key Points

bull Omics-based machine learning models for diagnosticspecimen classification often lack biological interpret-ability and have limited cross-study robustness andaccuracy Integrating prior knowledge from cellularpathway and molecular network into these modelscan help to address these shortcomings

bull Public pathway and molecular interaction databasescover only a subset of the biochemical interactionsoccurring in living cells typically lack annotations ontissue-specific reactions and may also contain errorsWhen applying pathway or network analysis methodsthe user should be aware of these limitations and thedifferent options to address or alleviate them eg byusing confidence weighted interactions and estimatingtheir tissue specificity using public gene or protein ex-pression data

bull Pathway- and network-based machine learning meth-ods for diagnostic specimen classification are charac-terized by different strengths and weaknesses Whilepathway-based approaches have the benefit of provid-ing models with high biological interpretability thattake advantage of human expert knowledge on dis-ease-relevant cellular processes their coverage is lim-ited to biomolecules with known annotations On thecontrary network-based models are typically more dif-ficult to interpret but can exploit information on amuch wider range of biomolecules

Funding

This work was supported by the Fonds Nationale de laRecherche Luxembourg (grant no C13BM5782168)

References1 Dubois B Feldman HH Jacova C et al Advancing research

diagnostic criteria for Alzheimerrsquos disease the IWG-2 crite-ria Lancet Neurol 201413614ndash29

2 Chahine LM Stern MB Diagnostic markers for Parkinsonrsquosdisease Curr Opin Neurol 201124309ndash17

3 Tanaka T Tanaka M Tanaka T et al Biomarkers for colorec-tal cancer Int J Mol Sci 2010113209ndash25

4 Collins LG Haines C Perkel R et al Lung cancer diagnosisand management Am Fam Physician 20077556ndash63

5 Martin P Laupland KB Frost EH et al Laboratory diagnosisof Ebola virus disease Intensive Care Med 201541895ndash8

6 Mistry K Cable G Meta-analysis of prostate-specific antigenand digital rectal examination as screening tests for pros-tate carcinoma J Am Board Fam Pract 20031695ndash101

7 Bender R Lange S Adjusting for multiple testing - when andhow J Clin Epidemiol 200154343ndash9

8 Yu L Liu H Redundancy based feature selection for micro-array data In Proceedings of the 10th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining New York NY ACM 2004 737ndash42

9 Ding C Peng H Minimum redundancy feature selectionfrom microarray gene expression data J Bioinform ComputBiol 20053185ndash205

10 Bellman R Bellman RE Bellman RE et al Adaptive control proc-esses a guided tour Princeton Princeton University Press 1961

11 Verleysen M Francois D The curse of dimensionality indata mining Analysis 20053512758ndash70

12 Kim S Kon M DeLisi C Pathway-based classification of can-cer subtypes Biol Direct 2012721

13 Ideker T Sharan R Protein networks in disease Genome Res200818644ndash52

14 Ogata H Goto S Sato K et al KEGG Kyoto encyclopedia ofgenes and genomes Nucleic Acids Res 19992729ndash34

15 Nishimura D BioCarta Biotech Softw Internet Rep 20012117ndash20

16 Pico AR Kelder T Van Iersel MP et al WikiPathways path-way editing for the people PLoS Biol 200861403ndash7

17 Joshi-Tope G Gillespie M Vastrik I et al Reactome a knowl-edgebase of biological pathways Nucleic Acids Res 200533D428ndash32

18 Schaefer CF Anthony K Krupa S et al PID thepathway interaction database Nucleic Acids Res 200937D674ndash9

19 Subramanian A Tamayo P Mootha VK et al Gene setenrichment analysis a knowledge-based approach for inter-preting genome-wide expression profiles Proc Natl Acad SciUSA 200510215545ndash50

20 Kim S-Y Volsky DJ PAGE parametric analysis of gene setenrichment BMC Bioinformatics 20056144

21 Luo W Friedman MS Shedden K et al GAGE generally ap-plicable gene set enrichment for pathway analysis BMCBioinformatics 200910161

22 Oron AP Jiang Z Gentleman R Gene set enrichment ana-lysis using linear models and diagnostics Bioinformatics2008242586ndash91

23 Glaab E Baudot A Krasnogor N et al EnrichNet network-based gene set enrichment analysis Bioinformatics 201228i451ndash7

24 Shannon P Markiel A Ozier O et al Cytoscape a softwareenvironment for integrated models of biomolecular inter-action networks Genome Res 2003132498ndash504

25 Van Iersel MP Kelder T Pico AR et al Presenting and explor-ing biological pathways with PathVisio BMC Bioinformatics20089399

26 Ilyin SE Belkowski SM Plata-Salaman CR Biomarker dis-covery and validation Technologies and integrativeapproaches Trends Biotechnol 200422411ndash16

27 Guo Z Zhang T Li X et al Towards precise classification ofcancers based on robust gene functional expression profilesBMC Bioinformatics 2005658

28 Tomfohr J Lu J Kepler TB Pathway level analysis of geneexpression using singular value decomposition BMCBioinformatics 20056225

29 Lee E Chuang H-Y Kim J-W et al Inferring pathway activitytoward precise disease classification PLoS Comput Biol 20084e1000217

30 Su J Yoon BJ Dougherty ER Accurate and reliable cancerclassification based on probabilistic inference of pathwayactivity PLoS One 20094e8161

31 Glaab E Schneider R PathVar analysis of gene and proteinexpression variance in cellular pathways using microarraydata Bioinformatics 201228446ndash7

32 Svensson JP Stalpers LJA Esveldt-van Lange REE et alAnalysis of gene expression using gene sets discriminates

Diagnostic specimen classification | 11

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer patients with and without late radiation toxicityPLoS Med 200631904ndash14

33 Efroni S Schaefer CF Buetow KH Identification of key proc-esses underlying cancer phenotypes using biologic pathwayanalysis PLoS One 20072e425

34 Vaske CJ Benz SC Sanborn JZ et al Inference of patient-specific pathway activities from multi-dimensional cancer gen-omics data using PARADIGM Bioinformatics 201026i237ndash45

35 Breslin T Krogh M Peterson C et al Signal transductionpathway profiling of individual tumor samples BMCBioinformatics 20056163

36 Ashburner M Ball CA Blake JA et al Gene ontology tool forthe unification of biology The Gene Ontology ConsortiumNat Genet 20002525ndash9

37 Glaab E Garibaldi JM Krasnogor N Learning pathway-baseddecision rules to classify microarray cancer samplesGerman Conference on Bioinformatics 2010173123ndash34

38 Glaab E Bacardit J Garibaldi JM et al Using rule-based ma-chine learning for candidate disease gene prioritization andsample classification of cancer gene expression data PLoSOne 20127e39932

39 Tuck DP Kluger HM Kluger Y Characterizing disease statesfrom topological properties of transcriptional regulatorynetworks BMC Bioinformatics 20067236

40 Ma X Lee H Wang L et al CGI a new approach for prioritiz-ing genes by combining gene expression and protein-protein interaction data Bioinformatics 200723215ndash21

41 Chuang H-Y Lee E Liu Y-T et al Network-based classifica-tion of breast cancer metastasis Mol Syst Biol 20073140

42 Taylor IW Linding R Warde-Farley D et al Dynamic modu-larity in protein interaction networks predicts breast canceroutcome Nat Biotechnol 200927199ndash204

43 Petrochilos D Shojaie A Gennari J et al Using random walksto identify cancer-associated modules in expression dataBioData Min 2013617

44 Rapaport F Zinovyev A Dutreix M et al Classification ofmicroarray data using gene networks BMC Bioinformatics2007835

45 Li C Li H Network-constrained regularization and variableselection for analysis of genomic data Bioinformatics 2008241175ndash82

46 Yang S Yuan L Lai Y-C et al Feature grouping and selectionover an undirected graph KDD 2012922ndash30

47 Lorbert A Eis D Kostina V et al Exploiting covariate similar-ity in sparse regression via the pairwise elastic net Int ConfArtif Intell Stat 20109477ndash84

48 Lorbert A Ramadge PJ The pairwise elastic net support vec-tor machine for automatic fMRI feature selection In ProcIEEE Int Conf Acoust Speech Signal Process Vancouver BC USA20131036ndash40 IEEE Press Washington DC

49 Vlassis N Glaab E GenePEN analysis of network activity al-terations in complex diseases via the pairwise elastic netStat Appl Genet Mol Biol 201514221ndash4

50 Wingender E Chen X Hehl R et al TRANSFAC an integratedsystem for gene expression regulation Nucleic Acids Res200028316ndash19

51 Newman MEJ The structure and function of complex net-works SIAM Rev 200345167ndash56

52 Geman S Geman D Stochastic relaxation Gibbs distribu-tions and the Bayesian restoration of images IEEE TransPattern Anal Mach Intell 19846721ndash41

53 Binder H Schumacher M Comment on ldquoNetwork-constrained regularization and variable selection for ana-lysis of genomic datardquo Bioinformatics 2008242566ndash8

54 Sprinzak E Sattath S Margalit H How reliable are experi-mental protein-protein interaction data J Mol Biol 2003327919ndash23

55 Saxena S Caroni P Selective neuronal vulnerability inneurodegenerative diseases from stressor thresholds todegeneration Neuron 20117135ndash48

56 Suthram S Shlomi T Ruppin E et al A direct comparison ofprotein interaction confidence assignment schemes BMCBioinformatics 20067360

57 Kamburov A Grossmann A Herwig R et al Cluster-basedassessment of protein-protein interaction confidence BMCBioinformatics 201213262

58 Li D Liu W Liu Z et al PRINCESS a protein interaction confi-dence evaluation system with multiple data sources MolCell Proteomics 200871043ndash52

59 Bader JS Chaudhuri A Rothberg JM et al Gaining confidencein high-throughput protein interaction networks NatBiotechnol 20042278ndash85

60 Von Mering C Huynen M Jaeggi D et al STRING a databaseof predicted functional associations between proteinsNucleic Acids Res 200331258ndash61

61 Franceschini A Szklarczyk D Frankild S et al STRING v91protein-protein interaction networks with increased cover-age and integration Nucleic Acids Res 201341D808ndash15

62 Sales G Calura E Cavalieri D et al Graphite - a Bioconductorpackage to convert pathway topology to gene network BMCBioinformatics 20121320

63 Su AI Wiltshire T Batalov S et al A gene atlas of the mouseand human protein-encoding transcriptomes Proc Natl AcadSci USA 20041016062ndash7

64 Ponten FK Schwenk JM Asplund A et al The human proteinatlas as a proteomic resource for biomarker discovery JIntern Med 2011270428ndash46

65 Bossi A Lehner B Tissue specificity and the human proteininteraction network Mol Syst Biol 20095260

66 Magger O Waldman YY Ruppin E et al Enhancing the pri-oritization of disease-causing genes through tissue specificprotein interaction networks PLoS Comput Biol 20128e1002690

67 Shipley B Cause and Correlation in Biology A Userrsquos Guide toPath Analysis Structural Equations and Causal Inference Vol 2Cambridge Cambridge University Press 2002 p 332

68 Society TE Granger CWJ Investigating causal relations byeconometric models and cross-spectral methodsEconometrica 196937424ndash38

69 Chindelevitch L Ziemek D Enayetallah A et al Causal rea-soning on biological networks interpreting transcriptionalchanges Bioinformatics 2012281114ndash21

70 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 2012652

71 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 20144988

72 Hudson NJ Reverter A Dalrymple BP A differential wiringanalysis of expression data correctly identifies the gene con-taining the causal mutation PLoS Comput Biol 20095e1000382

73 Reverter A Hudson NJ Nagaraj SH et al Regulatory impactfactors unraveling the transcriptional regulation ofcomplex traits from expression data Bioinformatics 201026896ndash904

74 Watson M CoXpress differential co-expression in gene ex-pression data BMC Bioinformatics 20067509

12 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1

representations of biochemical proteinndashprotein interaction (PPI)networks Ma et al [40] developed a corresponding approach toobtain more reliable disease association scores for genes by ex-ploiting neighborhood information from a PPI network Theyused a modified Pearson correlation coefficient to assess the as-sociation between microarray gene expression and numeric val-ues encoding the disease status of the samples (taking intoconsideration that these phenotype values may not have a nor-mal distribution) and assigned the Fisher-transformed gene-phenotype association scores to the corresponding proteins in aPPI Next they recalibrate these association scores by modelingthe underlying true scores for each gene using Markov RandomField theory [52] reestimating their values from weighted con-tributions of their network neighborsrsquo original associationscores (the weights are determined according to different net-work neighborhood definitions using either direct neighborsshortest path or diffusion kernel neighborhoods see [40] fordetails) When evaluating the utility of the recalibrated scoresfor disease gene prioritizations on microarray data using knownGene Ontology functional annotations conventional

prioritization approaches using only gene expression or PPI datawere outperformed (although the scoring approach could alsobe used for predictive model building this particular applicationwas not considered)

While the approach by Ma et al focuses on improving thedisease-association scores for individual genes Chuang et al[41] presented a method identifying and scoring entire disease-related subnetworks similar to their pathway association scor-ing approach discussed above (see Lee et al [29]) After comput-ing the mutual information (MI) between sample phenotypevalues (encoding the presence or absence of a disease) and dis-cretized expression values for each gene from a microarray dataset assigned to the proteins in a PPI they applied a greedysearch to expand the seed nodes in the network with locallymaximal MI scores Specifically each seed node was expandedsuch that the sum of scores for the expanded network moduleis maximized (the search stops when no extension increasesthe total score above a predefined improvement rate) Whentraining logistic regression classifiers on the normalized andaveraged activities of the resulting subnetworks for breast

Table 2 Overview of network-based methods for machine learning analysis of omics data sequential network activity scoring and predictionmethods are shown on top whereas machine learning approaches using embedded network-based feature selection are listed below the boldblack line

Methodologypublication

Network activityalteration scoring method Prediction method

Tuck et al [39] Sample-specific gene regulatory networks are constructed andsubnetwork activity is scored by summing over activeinteractions

Nearest neighbors decision tree NaıveBayes among others

Ma et al [40] Disease association is scored for genes based on gene expres-sion data and their neighborsrsquo association scores in a PPI net-work using Markov Random Field theory

The approach is evaluated for disease geneprioritization but is applicable for predict-ive feature selection in combination withany prediction method

Chuang et al [41] Normalized gene expression data is mapped onto a proteininteraction network and discriminative subnetworks areidentified via a greedy search procedure

Logistic regression

Taylor et al [42] Hub nodes in protein interaction networks are determined andthe relative gene expression of hubs with each of their inter-acting partners is computed to identify hubs with diverserelative expression across sample groups

Affinity propagation clustering is used to as-sign a probability of poor prognosis tobreast cancer patients

Petrochilos et al [43] A random walk community detection algorithm is applied todiscover modules in a molecular interaction network andgene expression data is used to identify disease-associatedmodules

The approach is used to identify cancer-asso-ciated network modules and validated byscoring the enrichment of known cancer-related genes extracted from the OMIMdatabase

Rapaport et al [44] Spectral decomposition of gene expression profiles is appliedwith respect to the eigenfunctions of a network graphattenuating the high-frequency components of the expres-sion profiles with respect to the graph topology

SVM

Li et al [45] A network-constrained regularization procedure for linear re-gression analysis is used to identify disease-related discrim-inative subnetworks

Penalized linear regression

Yang et al [46] Three machine learning methods for graph-guided feature se-lection and grouping are proposed including a convex func-tion and two non-convex formulations designed to reducethe estimation bias

Penalized least squares-based approach(GOSCAR Graph octagonal shrinkage andclustering algorithm for regression)

Lorbert et al [47 48] A sparse regression approach is proposed using the PEN pen-alty to favor the grouping of strongly correlated featuresbased on pairwise similarities (eg derived from a molecularinteraction graph)

Penalized regression (PEN penalty)

Vlassis et al [49] Penalized logistic regression is applied using a convex PEN pen-alty function (see approach by Lorbert et al) with absolutefeature weights to better reflect the relevance of discrimina-tive genes in the feature selection

Penalized logistic regression (PEN penaltywith absolute feature weights)

Diagnostic specimen classification | 5

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer data the authors found that the subnetwork markerswere more reproducible than single-gene markers andprovided higher accuracy in distinguishing metastatic fromnonmetastatic tumors

As an intermediate solution between focusing on individualbiomolecules and entire network modules Taylor et al [42] pro-posed a method that investigates network nodes with outstand-ing topological properties and their direct neighbors Aftercomputationally mapping breast cancer gene expression dataonto the in silico representation of a PPI network they deter-mined proteins with large numbers of biochemical interactionpartners (so-called lsquohub nodesrsquo) and computed their relative ex-pression compared with each of these interacting partnersThey then determined for which hubs the relative expressiondiffered significantly between long-term survivors and patientswho died from the disease and applied a clustering approach toassign a probability of poor prognosis to new patient samples(the specific method used is known as lsquoaffinity propagationclusteringrsquo in the literature) The approach was evaluated using5-fold cross-validation providing accuracy sensitivity and spe-cificity estimates that compared favorably with the reportedresults for commercially available genomic breast cancerdiagnostics

Instead of considering topological properties of individualnodes in a molecular network information from a networkgraph can also be extracted via algorithms for finding sub-graphs which stand out in terms of their high density ofmolecular interactions (using approaches referred to as lsquocom-munity identificationrsquo or lsquograph clusteringrsquo methods in the lit-erature) Petrochilos et al [43] proposed a correspondingapproach which first applies a graph-based random walk algo-rithm on a genome-scale molecular network Information fromcancer-related gene expression data was then integrated intothe network by setting the weight of each network node as themaximum fold change of probes corresponding to its gene sym-bol (weights for biochemical interactions are determined by thesquare of the mean of the absolute fold changes of the relevantinteraction partners) Finally the score of a network module ofconnected nodes was obtained by comparing its cumulative ac-tivity (ie the square of the average weighted expression for allits nodes) against a bootstrap distribution of cumulative activ-ities obtained via random sampling of a matched number offold change values When testing the enrichment of known can-cer genes in the top-scored network modules identified withthis approach a similar or better performance was reached incomparison with other widely used module-finding algorithms(potential alternative applications of the identified modules forbiomarker modeling were not assessed in this publication)

Apart from averaging molecular activities over networkneighborhoods or using community identification methods sig-nal processing techniques may provide a further means to gleanuseful information from a network for predictive model build-ing as shown in an approach by Rapaport et al [44] They madeuse of the observation that genes in close proximity to eachother in a network tend to have similar expression and pro-posed to denoise microarray measurements by removing theirhigh-frequency component over the network For this purposespectral decomposition of gene expression profiles with respectto a molecular network graph was applied followed by attenu-ating high-frequency signal components expected to representmeasurement noise The method was evaluated for supervisedanalysis of irradiated and nonirradiated yeast strains using aSVM providing similar classification performance as a modelbuilt without the network-based filtering but facilitating

biological data interpretation by grouping the selected biomol-ecules according to their participation in the network modules

One-step machine learning approaches for networkanalysis

In contrast to the network analysis approaches considered sofar which apply network feature extraction and predictive ma-chine learning analysis in separate steps more recently one-step network-based feature selection approaches have beenproposed integrating the attribute selection directly into thepredictive model building Most of these approaches formulatethe model building task as an optimization problem formula-tion in which the objective function for classification or regres-sion is extended by a penalty term promoting the selection ofgrouped features in a molecular network (this strategy is alsoreferred to as network-constraint regularization)

Li et al [45] proposed one of the first correspondingapproaches by adding a penalty term to linear regression incor-porating network information into the analysis via theLaplacian matrix of the network graph The approach penalizesthe L1-norm of the feature weights and encourages a smoothprofile of weights over neighboring nodes in the networkHowever Binder and Schumacher later reported that themethod has lower performance than a null model ie a modelnot using any covariate information [53] As possible explan-ations they note that Li et al discarded censored observationsand about 20 000 variables that could not be assigned to corres-ponding nodes in the molecular interaction network (see sec-tion on lsquolimitationsrsquo below) Yang et al [46] suggested that thepreviously used network grouping penalties can introduce add-itional estimation bias into the model when the coefficientsigns for two features connected in the graph are differentThey presented alternative penalties to achieve network group-ing and sparse feature selection in particular two non-convexpenalties which shrink only small differences in absolute val-ues of feature weights to reduce the estimation bias [46] In ex-periments on synthetic data and two real data sets the newapproaches outperformed previous feature grouping methods

However with non-convex penalties finding global optimalsolutions is often not feasible and even identifying good localoptima may require high computational effort Lorbert et al [4748] proposed an alternative generic convex penalty thePairwise Elastic Net (PEN) that provides sparse feature selectionand promotes the grouping of attributes according to auser-defined feature similarity measure (eg obtained from bio-chemical interaction weights in a molecular network) PEN is ageneralization of the Elastic Net a method providing a trade-offbetween L1- and L2-penalized regressions by an adjustable par-ameter In PEN this parameter can be replaced to determine thetrade-off using additional information from an attributesimilarity matrix (different instances of PEN can be defined aslong as the similarity matrix is positive semidefinite and non-negative) Comparing PEN against other popular machine learn-ing approaches on simulated data with a grouping structureamong the features PEN achieved a competitive mean squarederror (MSE) and provided sparser solutions than approacheswith similar MSE

More recently Vlassis et al introduced a new instance ofPEN which penalizes differences between the lsquoabsolutersquo valuesof weights of interlinked features in a network graph The mo-tivation behind this approach termed GenePEN is that themagnitude of a weight in a linear model reflects the predictivevalue of the corresponding variable so that the weights for

6 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

irrelevant features are driven to zero by the penalty By ensuringthe convexity of the penalty function global optimal solutionscan be identified efficiently with existing optimization frame-works When evaluating GenePEN on simulated data andreal-word microarray data sets in comparison with other classi-fication methods using feature grouping the method providedsimilar predictive power and gene selections sharing signifi-cantly more connections within a molecular interaction net-work Visualization of the corresponding subnetworks enableda biological interpretation of disease-affected network regionswhich were enriched in known disease-related genes obtainedfrom literature mining

Overall network-based sample classification methods pro-vide a new means to analyze complex omics data sets allowingresearchers to identify coherent molecular network alterationsunder different biological conditions Identifying such network-level patterns in omics data for diseases with complexmolecular manifestations can shed new light on the molecularmechanisms of the disease and facilitate the development ofrobust multifactorial biomarker signatures

In contrast to single-molecule-based biomarker modeling anetwork-level approach has the potential to capture diverse fac-ets of a heterogenic disease reflected via alteration patterns indifferent network regions In comparison with pathway-basedmachine learning approaches methods using genome-scalednetworks as prior knowledge may produce models that are moredifficult to interpret biologically but which can identify a muchwider range of alterations in cellular processes (covering manygenes proteins or metabolites without any known pathway an-notations) Finally network- and pathway-based classificationapproaches share the main benefit of improving model robust-ness in cross-study analyses as compared with using individualbiomolecules as features Among these new higher-level bio-marker signatures network-based signatures accounting formolecular activities over larger and algorithm-derived networkregions may often provide more robust multifactorial markersthan signatures for smaller pathways which are typicallydefined subjectively possibly overlooking relevant functionallyrelated molecules in the surrounding network However themodel robustness will also depend on other factors eg the oc-currence of protein complexes in the studied pathwaynetwork(members of these complexes tend to have highly coordinatedactivity providing more robust averages) and the reliability andcompleteness of the specific network or pathway data sourceused (see limitations discussed in the following section)

Limitations and possible solution strategies

While pathway- and network-based analyses of omics data canenrich the biological interpretation of complex molecular alter-ation patterns and facilitate robust biomarker modeling usersshould also be aware of common limitations behind thesemethods and possibilities to address them

Although detailed functional annotations have been col-lected for a large portion of genes in humans and commonmodel organisms many identified genes still lack any func-tional assignment Consequently a significant portion of genesand proteins analyzed via current high-throughput measure-ment techniques can often not be assigned to any pathway ormolecular network for systems-level analyses Similarly sev-eral gene regulatory interactions and PPIs are still unknownand experimental techniques to identify molecular interactionsmay also generate false-positive results [54] Thus public mo-lecular interaction databases only cover a subset of biochemical

interactions occurring in living cells and the reliability of re-ported interactions varies depending on the amount and type ofassociated evidence

A further limitation of current pathway databases andassembled molecular networks is that tissue- and cell-type spe-cificity of molecular interactions and protein functions is typic-ally not considered in these data resources This can beparticularly problematic when studying diseases showing se-lective vulnerability for specific tissues or cell types eg asfound in many neurodegenerative disorders [55] Similarly inmany biomarker profiling studies and public omics data sour-ces concomitant data from different tissue and cell types for adisease of interest is lacking as well as longitudinal datarequired to identify relevant dynamic changes during diseasepathogenesis and progression Moreover a variety of limita-tions related to the specific origin processing and storage ofspecimens collected for biomarker studies may negatively im-pact subsequent analyses and the comparability of data fromdifferent studies For example studies might differ in terms ofthe sample storage durations in terms of whether the referencetissue for tumor tissue is derived from the same patient or froman unaffected control individual whether a preinfarction speci-men is accepted as a reference for the specimen obtained on in-farct suspicion or whether laser capture microdissection isapplied to isolate specific cells of interest rather than using het-erogeneous cell populations

To account for the varying reliability of public biochemicalinteraction data when assembling a molecular network variousapproaches to determine confidence scores have been proposed[56ndash59] One of the most comprehensive resources is theSTRING PPI database which allows users to filter and extractbiochemical interactions for a wide range of organisms usingdifferent sources of evidence as well as a combined confidencescore [60 61]

A frequently encountered difficulty in pathway analyses isalso the conversion of the information content of public path-way diagrams into adequate data structures for computationalinvestigations Pathway structures usually contain more com-plex layers of information than genendashgene networks eg meta-nodes representing protein complexes or gene families andcatalysis reactions represented by edges pointing to edgesDedicated computational approaches are therefore required toconvert pathway diagrams into uniform network representa-tions eg using biology-driven rules as in the RBioconductorsoftware package lsquographitersquo [62]

Apart from this conversion task building tissue-specificpathwaynetwork representations or assigning probabilities fortissue specificity to the biological interactions in an existingnetwork is a further challenging problem because interactionsare rarely confirmed experimentally in multiple specific tissuesMost bioinformatics approaches for assigning tissue specificityscores to protein interactions use publicly available gene or pro-tein expression data from large-scale tissue profiling studies[63 64] eg predicting the potential for a molecular interactionto occur in a certain tissue depending on whether both inter-action partners are expressed in this tissue [65] Moreover dedi-cated pathway- and network-analysis methods have beendeveloped for tissue-specific disease gene prioritization andgene set enrichment analysis [23 66] which may also provideuseful filters for feature selection in machine learning methodsfor sample classification

While changes in the mean or variance of the overall mo-lecular activity in pathways can provide robust biomarkers forsome diseases pathways may also contain forks and only one

Diagnostic specimen classification | 7

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

of their branches or stems may be affected in a pathologicalcondition Approaches focusing on changes in the overall path-way activity may therefore fail in detecting significant alter-ations affecting only one pathway branch Although networkanalysis methods identifying altered subnetworks may be help-ful in this situation (using a pathway graph as input instead of agenome-scale molecular network) ideally available knowledgeon distinct pathway branches or outstanding regions in a com-plex pathway architecture should be exploited directly eg bypartitioning a pathway into corresponding lsquosubpathwaysrsquo andseparately analyzing their molecular alterations using existingpathway-level bioinformatics tools

Another challenging and still largely unsolved problem inthe analysis of alterations for pathways networks and singlebiomolecules is the distinction between omics changes repre-senting causes or effects (or lsquodriversrsquo or lsquopassengersrsquo) of a dis-ease Apart from philosophical disagreements on howcausation should be defined [67] the discrimination betweencorrelated events and causally linked events on high-through-put omics data sets is typically severely hampered by largenumbers of potential confounding variables and complexnonlinear relationships between the attributes Thus conclu-sive evidence for causal links can usually only be obtained fromin vitro and in vivo perturbation experiments rather than fromhigh-throughput data analysis even when using high-qualitytime-series data with large sample sizes and dedicated statisticsfor identifying causendasheffect relationships (eg the Granger caus-ality test [68]) Prior biological knowledge eg on transcriptionfactorndashtarget relationships can however enable the construc-tion of causal graphs to check whether omics measurementsare consistent with certain causal molecular hypotheses or toprioritize corresponding hypotheses for experimental validation[69] Bill Shipley has written a dedicated book on methods forcausal analysis of observational data in biology [67] Theseapproaches are not only relevant for identifying drug targetsand investigating disease etiology and progression but mayalso provide useful information for the biological interpretationof diagnostic biomarker models

Improving incomplete pathways and networks by identify-ing missing molecular interactions is arguably one of the mostchallenging problems in this domain Pathway definitions aretypically created in a subjective manner because no universalpredefined criteria for determining pathway membership andboundaries are available and the curators of the correspondingdatabases are required to make judgments on the relevance ofpotential pathway members These subjective definitions maybe prejudiced and erroneous and even networks assembledautomatically from public databases may contain biases eg re-sulting from prior subjective decisions on which putative PPIsare of sufficient interest for experimental validation Moreoverin diseases the pathway topology may change and genericpathway databases have recently been complemented by dedi-cated pathway maps for specific human disorders egAlzPathway [70] for Alzheimerrsquos disease and PDMap [71] forParkinsonrsquos disease Because these disease-related pathwaymaps typically cover the biomolecules involved in a specific dis-order and the potentially altered network topology more com-prehensively and accurately than generic pathway mapsdisease-specific pathways should be considered as the pre-ferred resource for pathway-based diagnostic biomarker model-ing (in the specific case of a known disease-associated topologyalteration ideally a corresponding pathway map with un-altered topology should be used additionally as a reference inthe modeling process)

Moreover dedicated bioinformatics techniques have beendeveloped to identify lsquonetwork rewiringrsquo events eg brokenchains or new interactions changing the pathwaynetwork top-ology which remain undetected by conventional pathway ana-lyses Instead of assuming that molecules close to each other ina network undergo similar changes in a disease theseapproaches identify alterations in the correlations between theactivities of these molecules as signatures for potential patho-logical changes in network topology [72ndash74] However the top-scoring results require experimental verification eg using invitro disease models and the success may strongly depend onwhether reliable prior pathway knowledge is available

To support relevant manual pathway curation efforts com-putational approaches for ab initio discovery of pathways withingenetic or molecular networks have been proposed [75ndash78] aswell as bioinformatics methods to extend existing pathway def-initions according to objective criteria for pathway compactnessand connectedness [79] Moreover various machine learningapproaches for predicting unknown PPIs using public structuraland functional information have been developed [80ndash84] egthe authors of the PrePPI method report validation results sug-gesting their prediction algorithm is comparable in accuracy tohigh-throughput experiments [85] For proteins with unknownstructure and few functional annotations these in silico predic-tion approaches are however still bound to provide limited reli-ability and only high-confidence biochemical interactionsshould be included in networks for omics data analysis

Finally to reduce the influence of false-positive predictionson network analyses molecular interactions should not only beprefiltered by confidence when assembling the network butalso weighted by confidence to apply analysis techniquesexploiting these weights to score the reliability of identified net-work alterations In subsequent biological validation experi-ments investigators can then focus on verifying only the mostreliable network alteration patterns derived from the in silicoanalyses

Translating multifactorial marker models intodiagnostic testsPathways from omics research discoveries to clinicaldiagnostic assays

High-throughput omics measurement techniques are typicallynot designed for diagnostic applications but for broad systems-level analyses hypothesis generation and the construction offirst tentative machine learning models for sample classifica-tion Such tentative models require subsequent refinement andvalidation using more sensitive and reproducible measurementtechniques to evaluate their potential for diagnostic applica-tions For example a sample classification model built andcross-validated using microarray gene expression data with anembedded feature selection to choose only the most inform-ative genes as predictors can be validated using more accuratequantitative reverse transcription polymerase chain reaction(qRT-PCR) measurements for the subset of chosen genes

Importantly to avoid wrong conclusions in the evaluation ofdiagnostic classification models adequate statistical methodshave to be chosen to assess a modelrsquos overall predictive per-formance (quantifying how close predictions are to the actualoutcome) its calibrationreliability (measuring how close to x of100 individuals with a risk prediction of x have the outcome)and its discriminative ability (determining whether individualswith the outcome have higher risk predictions than those

8 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

without) [86] Owing to the inherent uncertainty associated withdiagnostic tasks predictions should be provided in a probabilis-tic rather than deterministic form [87] and the overall per-formance should be quantified using so-called lsquoproper scoringrulesrsquo for which the expected score is optimized when the pre-dictive distribution agrees with the true distribution of thequantity to be estimated (a corresponding example is the Brierscore for binary and survival outcomes [88]) Conversely the op-timization of models with respect to conventional discontinu-ous non-error rates like the percentage correct classificationcan provide misleading results eg when the predicted proba-bilities are close to the chosen decision threshold required forthese measures [89] To assess the calibration of a model theHosmerndashLemeshow lsquogoodness-of-fitrsquo test can be used [90] andthe concordance statistic to quantify the discrimination ability[91] If a reference prediction system is already available dedi-cated measures of the relative improvement achieved with anew prediction method should additionally be computed(referred to as the lsquoskillrsquo eg quantified via the Brier Skill Score[92]) Moreover decision-analytic approaches like decision curveanalysis [93] designed to assess the net benefit achieved makingdecisions according to model predictions should be applied if themodel is to be used to direct clinical patient management [86]

For the study design initial power calculations are requiredto ensure that sufficient sample sizes are available for all statis-tical assessments [94] This also involves choosing an adequatesplitting of the measured data into training test and validationsets and selecting suitable cross-validation or resampling tech-niques for model optimization and evaluation (eg using two-level external cross-validation [95]) [96]

Importantly clinical validation does not only necessitatesignificantly larger sample sizes than most research studiesbut also independent replication tests on data from other pa-tient cohorts the clear specification of the biological rationalebehind the method and a demonstration of its clinical utility Incontrast to the regulatory framework for drugs there are mul-tiple pathways for the translation of omics-based tests intovalidated in vitro diagnostic test devices These tests can bedeveloped and validated either through review by the Food andDrug Administration (FDA) or through validation and perform-ance by a specific laboratory certified according to ClinicalLaboratory Improvement Amendments (CLIA) [97]

Because using established medical product developmentpipelines as in pharmaceutical companies is not common prac-tice in academia for many biomedical research institutions anearly collaboration with an experienced industrial partner isoften advisable Although currently no unique and widely rec-ognized standard process for translating omics researchfindings into clinical diagnostics is available common recom-mendations by widely recognized health organizations can befollowed In particular a committee by the US Institute ofMedicine has conducted a study on omics-based clinical test de-velopment and proposed a generic process for the developmentand evaluation of these tests as a recommended guideline [97]A corresponding example process which is briefly outlined forillustration purposes in Figure 1 and not meant to cover all im-portant variations starts with the discovery phase in which acandidate biomarker model is built on a training set lockeddown and evaluated on a test data set (this set of samplesshould be completely independent from the training set) In thefollowing test validation phase after institutional review boardapproval and consultation with the FDA a CLIA-certified labora-tory defines and optimizes the diagnostic test method clinicallyand biologically validates the test on a blinded sample set and

implements the test according to current clinical laboratorystandards

Interestingly the authors of the guideline highlight that afrequent shortcoming of omics-based tests is the lack of a biolo-gical rationale behind the testmdashwhile single-molecule markersare often known to play a role in the disease multifactorialomics models obtained from machine learning are often moredifficult to interpret and involve a greater risk of overfittingNew pathway- and network-based modeling techniques as dis-cussed in this review could therefore help to address some ofthese shortcomings and provide more interpretable and robustmodels as opposed to classical lsquoblack boxrsquo machine learningmodels

In the following stage of the clinical development processthe locked-down test is evaluated for clinical utility via one ofthe following approaches (i) A prospectivendashretrospective studyusing archived specimens from previous clinical trials (ii) a pro-spective clinical trial in which the test (a) directs patient man-agement or (b) does not direct patient management [97] Thecomplexity and duration of a corresponding clinical study ortrial will largely depend on the specific type of biomarker de-veloped and the proposed clinical benefit For diagnostic bio-markers focused on in this review procedures may varysignificantly depending on whether the test is designed to de-tect the presence severity or the subtype of a diseasePrognostic biomarkers which indicate the future clinical courseof a patient with regard to a specific outcome and predictivebiomarkers which predict the responders and the extent of sus-ceptibility to a particular drug effect will also require differentdevelopment and evaluation procedures than diagnosticmarkers Finally for each type of biomarker different clinicalbenefits may be envisaged and significantly influence the de-sign of a study eg the goal to choose more appropriate treat-ment options or the objective to diagnose the disease earlier toenable more effective therapies to prevent halt or slow downits progression

Previous success stories in omics-based development ofdiagnostic assays

A variety of multifactorial omics-based biomarker models havebeen translated successfully into diagnostic tests in recentyears in particular in the field of cancer sub-type stratificationA prominent example is the Oncotype DX test to assess a wom-anrsquos risk of recurrence of early-stage estrogen-receptor-positivebreast cancer and the likelihood of benefitting from chemother-apy after surgery This test measures the activity of 21 genes intumor samples and then determines a recurrence score numberbetween 0 and 100 (higher scores reflect greater risk of recur-rence within 10 years) As opposed to other diagnostic testsusing frozen samples the Oncotype DX assay uses tumor tissuesamples that are chemically preserved and sealed in paraffinwax (see [98 99] for details on the sample collection andanalytics)

The development of Oncotype DX involved typical steps ofan omics biomarker profiling and top-down filtering approachfirst by analyzing the entire transcriptome on high-throughputmicroarray data and using knowledge from the literature andgenomic databases 250 candidate marker genes were selected[98] The relation between the expression of these candidatesand the recurrence of breast cancer was then assessed in datafrom three independent clinical studies on 447 patients The re-sults were used for a final filtering providing a panel of 16 can-cer-related genes and 5 reference genes whose expression

Diagnostic specimen classification | 9

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

levels enabled the calculation of recurrence scores for tumorsamples via a machine learning model This diagnostic ap-proach was successfully validated in multiple clinical studiesand has been included in the treatment guidelines for breastcancer by the National Comprehensive Cancer Center Networkand the American Society of Clinical Oncology

While in the case of the Oncotype DX test the set of requiredmarkers could be narrowed down to a small number of geneswith prior knowledge on their relationship to the disease forother complex and more heterogeneous diseases significantlylarger numbers of molecular predictors may be needed for ac-curate diagnosis In such cases pathway- and network-basedmodeling approaches may facilitate the generation of robustand biologically interpretable models which could afterwardundergo similar diagnostic test development and validationprocedures as the initial model behind the Oncotype DX assayImportantly the success of the Oncotype DX approach is not anisolated case but other commercial diagnostic assays havebeen developed and validated using similar strategies includingMammaPrint [100] Prosigna (PAM50) [101] Mammostrat [102]Tissue of Origin [103] AlloMap [104] Corus CAD [105] and OVA1[106] among others

In summary successful translation of omics-based bio-marker models into clinically accepted commercial diagnostictests has been achieved in multiple cases in the past Given alarge number of complex diseases for which more reliable ear-lier and cheaper diagnostic tests are still needed there is a

significant potential to develop improved approaches usingomics-based biomarker modeling and exploiting prior biologicalknowledge from pathways and molecular networks

Conclusions

For diseases with complex molecular manifestations omics-based biomarker models have the potential to better reflectclinically relevant aspects of their heterogeneity than classicalsingle-molecule diagnostic markers However various statis-tical limitations in high-dimensional data analysis often result-ing in a high risk of overfitting and restricted biologicalinterpretability of standard lsquoblack boxrsquo machine learning mod-els can hamper the progress in developing new clinically usefulbiomarker signatures

Among possible strategies to reduce model complexity andintegrate prior biological knowledge into the model buildingprocedure cellular pathway- and molecular network-based pre-diction methods provide promising new routes toward more re-liable omics-based biomarker signatures A wide variety ofbioinformatics approaches have already been developed in re-cent years to identify disease-associated pathway and networkalterations in omics data and use them to classify new samplesWhile public molecular networks and pathway definitions alsohave specific limitations that need to be addressed by the usereg by assigning confidence weights to molecular interactionspathway- and network-level predictive analyses often lead to

Figure 1 Example illustration of common stages during the development of omics-based diagnostic tests (simplified version of the process presented in a study by the

US Institute of Medicine [97] focusing on the major steps in the pipeline) After the transition from the second to the third phase (highlighted by the lock symbol) the

diagnostic test must be fully defined validated and locked down Many important variants and alternatives to the outlined example process exist as well as different

realizations of generic steps in the process (eg cases in which a test directs patient management may cover different situations depending on whether clinicians are

free to use the test result as they see fit or whether predefined procedures have to be followed subject to contraindications andor subject to the test results) The setup

may also vary depending on whether it is known exactly how patients would have been treated had they been randomized to the opposite arm depending on whether

the test entails a treatment delay and whether the adequate cutoff threshold for the test is uncertain

10 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

classification models with improved interpretability and robust-ness in cross-study prediction tasks These new systems-levelapproaches will therefore provide a valuable addition to theexisting toolset for omics-based biomarker modeling

Key Points

bull Omics-based machine learning models for diagnosticspecimen classification often lack biological interpret-ability and have limited cross-study robustness andaccuracy Integrating prior knowledge from cellularpathway and molecular network into these modelscan help to address these shortcomings

bull Public pathway and molecular interaction databasescover only a subset of the biochemical interactionsoccurring in living cells typically lack annotations ontissue-specific reactions and may also contain errorsWhen applying pathway or network analysis methodsthe user should be aware of these limitations and thedifferent options to address or alleviate them eg byusing confidence weighted interactions and estimatingtheir tissue specificity using public gene or protein ex-pression data

bull Pathway- and network-based machine learning meth-ods for diagnostic specimen classification are charac-terized by different strengths and weaknesses Whilepathway-based approaches have the benefit of provid-ing models with high biological interpretability thattake advantage of human expert knowledge on dis-ease-relevant cellular processes their coverage is lim-ited to biomolecules with known annotations On thecontrary network-based models are typically more dif-ficult to interpret but can exploit information on amuch wider range of biomolecules

Funding

This work was supported by the Fonds Nationale de laRecherche Luxembourg (grant no C13BM5782168)

References1 Dubois B Feldman HH Jacova C et al Advancing research

diagnostic criteria for Alzheimerrsquos disease the IWG-2 crite-ria Lancet Neurol 201413614ndash29

2 Chahine LM Stern MB Diagnostic markers for Parkinsonrsquosdisease Curr Opin Neurol 201124309ndash17

3 Tanaka T Tanaka M Tanaka T et al Biomarkers for colorec-tal cancer Int J Mol Sci 2010113209ndash25

4 Collins LG Haines C Perkel R et al Lung cancer diagnosisand management Am Fam Physician 20077556ndash63

5 Martin P Laupland KB Frost EH et al Laboratory diagnosisof Ebola virus disease Intensive Care Med 201541895ndash8

6 Mistry K Cable G Meta-analysis of prostate-specific antigenand digital rectal examination as screening tests for pros-tate carcinoma J Am Board Fam Pract 20031695ndash101

7 Bender R Lange S Adjusting for multiple testing - when andhow J Clin Epidemiol 200154343ndash9

8 Yu L Liu H Redundancy based feature selection for micro-array data In Proceedings of the 10th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining New York NY ACM 2004 737ndash42

9 Ding C Peng H Minimum redundancy feature selectionfrom microarray gene expression data J Bioinform ComputBiol 20053185ndash205

10 Bellman R Bellman RE Bellman RE et al Adaptive control proc-esses a guided tour Princeton Princeton University Press 1961

11 Verleysen M Francois D The curse of dimensionality indata mining Analysis 20053512758ndash70

12 Kim S Kon M DeLisi C Pathway-based classification of can-cer subtypes Biol Direct 2012721

13 Ideker T Sharan R Protein networks in disease Genome Res200818644ndash52

14 Ogata H Goto S Sato K et al KEGG Kyoto encyclopedia ofgenes and genomes Nucleic Acids Res 19992729ndash34

15 Nishimura D BioCarta Biotech Softw Internet Rep 20012117ndash20

16 Pico AR Kelder T Van Iersel MP et al WikiPathways path-way editing for the people PLoS Biol 200861403ndash7

17 Joshi-Tope G Gillespie M Vastrik I et al Reactome a knowl-edgebase of biological pathways Nucleic Acids Res 200533D428ndash32

18 Schaefer CF Anthony K Krupa S et al PID thepathway interaction database Nucleic Acids Res 200937D674ndash9

19 Subramanian A Tamayo P Mootha VK et al Gene setenrichment analysis a knowledge-based approach for inter-preting genome-wide expression profiles Proc Natl Acad SciUSA 200510215545ndash50

20 Kim S-Y Volsky DJ PAGE parametric analysis of gene setenrichment BMC Bioinformatics 20056144

21 Luo W Friedman MS Shedden K et al GAGE generally ap-plicable gene set enrichment for pathway analysis BMCBioinformatics 200910161

22 Oron AP Jiang Z Gentleman R Gene set enrichment ana-lysis using linear models and diagnostics Bioinformatics2008242586ndash91

23 Glaab E Baudot A Krasnogor N et al EnrichNet network-based gene set enrichment analysis Bioinformatics 201228i451ndash7

24 Shannon P Markiel A Ozier O et al Cytoscape a softwareenvironment for integrated models of biomolecular inter-action networks Genome Res 2003132498ndash504

25 Van Iersel MP Kelder T Pico AR et al Presenting and explor-ing biological pathways with PathVisio BMC Bioinformatics20089399

26 Ilyin SE Belkowski SM Plata-Salaman CR Biomarker dis-covery and validation Technologies and integrativeapproaches Trends Biotechnol 200422411ndash16

27 Guo Z Zhang T Li X et al Towards precise classification ofcancers based on robust gene functional expression profilesBMC Bioinformatics 2005658

28 Tomfohr J Lu J Kepler TB Pathway level analysis of geneexpression using singular value decomposition BMCBioinformatics 20056225

29 Lee E Chuang H-Y Kim J-W et al Inferring pathway activitytoward precise disease classification PLoS Comput Biol 20084e1000217

30 Su J Yoon BJ Dougherty ER Accurate and reliable cancerclassification based on probabilistic inference of pathwayactivity PLoS One 20094e8161

31 Glaab E Schneider R PathVar analysis of gene and proteinexpression variance in cellular pathways using microarraydata Bioinformatics 201228446ndash7

32 Svensson JP Stalpers LJA Esveldt-van Lange REE et alAnalysis of gene expression using gene sets discriminates

Diagnostic specimen classification | 11

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer patients with and without late radiation toxicityPLoS Med 200631904ndash14

33 Efroni S Schaefer CF Buetow KH Identification of key proc-esses underlying cancer phenotypes using biologic pathwayanalysis PLoS One 20072e425

34 Vaske CJ Benz SC Sanborn JZ et al Inference of patient-specific pathway activities from multi-dimensional cancer gen-omics data using PARADIGM Bioinformatics 201026i237ndash45

35 Breslin T Krogh M Peterson C et al Signal transductionpathway profiling of individual tumor samples BMCBioinformatics 20056163

36 Ashburner M Ball CA Blake JA et al Gene ontology tool forthe unification of biology The Gene Ontology ConsortiumNat Genet 20002525ndash9

37 Glaab E Garibaldi JM Krasnogor N Learning pathway-baseddecision rules to classify microarray cancer samplesGerman Conference on Bioinformatics 2010173123ndash34

38 Glaab E Bacardit J Garibaldi JM et al Using rule-based ma-chine learning for candidate disease gene prioritization andsample classification of cancer gene expression data PLoSOne 20127e39932

39 Tuck DP Kluger HM Kluger Y Characterizing disease statesfrom topological properties of transcriptional regulatorynetworks BMC Bioinformatics 20067236

40 Ma X Lee H Wang L et al CGI a new approach for prioritiz-ing genes by combining gene expression and protein-protein interaction data Bioinformatics 200723215ndash21

41 Chuang H-Y Lee E Liu Y-T et al Network-based classifica-tion of breast cancer metastasis Mol Syst Biol 20073140

42 Taylor IW Linding R Warde-Farley D et al Dynamic modu-larity in protein interaction networks predicts breast canceroutcome Nat Biotechnol 200927199ndash204

43 Petrochilos D Shojaie A Gennari J et al Using random walksto identify cancer-associated modules in expression dataBioData Min 2013617

44 Rapaport F Zinovyev A Dutreix M et al Classification ofmicroarray data using gene networks BMC Bioinformatics2007835

45 Li C Li H Network-constrained regularization and variableselection for analysis of genomic data Bioinformatics 2008241175ndash82

46 Yang S Yuan L Lai Y-C et al Feature grouping and selectionover an undirected graph KDD 2012922ndash30

47 Lorbert A Eis D Kostina V et al Exploiting covariate similar-ity in sparse regression via the pairwise elastic net Int ConfArtif Intell Stat 20109477ndash84

48 Lorbert A Ramadge PJ The pairwise elastic net support vec-tor machine for automatic fMRI feature selection In ProcIEEE Int Conf Acoust Speech Signal Process Vancouver BC USA20131036ndash40 IEEE Press Washington DC

49 Vlassis N Glaab E GenePEN analysis of network activity al-terations in complex diseases via the pairwise elastic netStat Appl Genet Mol Biol 201514221ndash4

50 Wingender E Chen X Hehl R et al TRANSFAC an integratedsystem for gene expression regulation Nucleic Acids Res200028316ndash19

51 Newman MEJ The structure and function of complex net-works SIAM Rev 200345167ndash56

52 Geman S Geman D Stochastic relaxation Gibbs distribu-tions and the Bayesian restoration of images IEEE TransPattern Anal Mach Intell 19846721ndash41

53 Binder H Schumacher M Comment on ldquoNetwork-constrained regularization and variable selection for ana-lysis of genomic datardquo Bioinformatics 2008242566ndash8

54 Sprinzak E Sattath S Margalit H How reliable are experi-mental protein-protein interaction data J Mol Biol 2003327919ndash23

55 Saxena S Caroni P Selective neuronal vulnerability inneurodegenerative diseases from stressor thresholds todegeneration Neuron 20117135ndash48

56 Suthram S Shlomi T Ruppin E et al A direct comparison ofprotein interaction confidence assignment schemes BMCBioinformatics 20067360

57 Kamburov A Grossmann A Herwig R et al Cluster-basedassessment of protein-protein interaction confidence BMCBioinformatics 201213262

58 Li D Liu W Liu Z et al PRINCESS a protein interaction confi-dence evaluation system with multiple data sources MolCell Proteomics 200871043ndash52

59 Bader JS Chaudhuri A Rothberg JM et al Gaining confidencein high-throughput protein interaction networks NatBiotechnol 20042278ndash85

60 Von Mering C Huynen M Jaeggi D et al STRING a databaseof predicted functional associations between proteinsNucleic Acids Res 200331258ndash61

61 Franceschini A Szklarczyk D Frankild S et al STRING v91protein-protein interaction networks with increased cover-age and integration Nucleic Acids Res 201341D808ndash15

62 Sales G Calura E Cavalieri D et al Graphite - a Bioconductorpackage to convert pathway topology to gene network BMCBioinformatics 20121320

63 Su AI Wiltshire T Batalov S et al A gene atlas of the mouseand human protein-encoding transcriptomes Proc Natl AcadSci USA 20041016062ndash7

64 Ponten FK Schwenk JM Asplund A et al The human proteinatlas as a proteomic resource for biomarker discovery JIntern Med 2011270428ndash46

65 Bossi A Lehner B Tissue specificity and the human proteininteraction network Mol Syst Biol 20095260

66 Magger O Waldman YY Ruppin E et al Enhancing the pri-oritization of disease-causing genes through tissue specificprotein interaction networks PLoS Comput Biol 20128e1002690

67 Shipley B Cause and Correlation in Biology A Userrsquos Guide toPath Analysis Structural Equations and Causal Inference Vol 2Cambridge Cambridge University Press 2002 p 332

68 Society TE Granger CWJ Investigating causal relations byeconometric models and cross-spectral methodsEconometrica 196937424ndash38

69 Chindelevitch L Ziemek D Enayetallah A et al Causal rea-soning on biological networks interpreting transcriptionalchanges Bioinformatics 2012281114ndash21

70 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 2012652

71 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 20144988

72 Hudson NJ Reverter A Dalrymple BP A differential wiringanalysis of expression data correctly identifies the gene con-taining the causal mutation PLoS Comput Biol 20095e1000382

73 Reverter A Hudson NJ Nagaraj SH et al Regulatory impactfactors unraveling the transcriptional regulation ofcomplex traits from expression data Bioinformatics 201026896ndash904

74 Watson M CoXpress differential co-expression in gene ex-pression data BMC Bioinformatics 20067509

12 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1

cancer data the authors found that the subnetwork markerswere more reproducible than single-gene markers andprovided higher accuracy in distinguishing metastatic fromnonmetastatic tumors

As an intermediate solution between focusing on individualbiomolecules and entire network modules Taylor et al [42] pro-posed a method that investigates network nodes with outstand-ing topological properties and their direct neighbors Aftercomputationally mapping breast cancer gene expression dataonto the in silico representation of a PPI network they deter-mined proteins with large numbers of biochemical interactionpartners (so-called lsquohub nodesrsquo) and computed their relative ex-pression compared with each of these interacting partnersThey then determined for which hubs the relative expressiondiffered significantly between long-term survivors and patientswho died from the disease and applied a clustering approach toassign a probability of poor prognosis to new patient samples(the specific method used is known as lsquoaffinity propagationclusteringrsquo in the literature) The approach was evaluated using5-fold cross-validation providing accuracy sensitivity and spe-cificity estimates that compared favorably with the reportedresults for commercially available genomic breast cancerdiagnostics

Instead of considering topological properties of individualnodes in a molecular network information from a networkgraph can also be extracted via algorithms for finding sub-graphs which stand out in terms of their high density ofmolecular interactions (using approaches referred to as lsquocom-munity identificationrsquo or lsquograph clusteringrsquo methods in the lit-erature) Petrochilos et al [43] proposed a correspondingapproach which first applies a graph-based random walk algo-rithm on a genome-scale molecular network Information fromcancer-related gene expression data was then integrated intothe network by setting the weight of each network node as themaximum fold change of probes corresponding to its gene sym-bol (weights for biochemical interactions are determined by thesquare of the mean of the absolute fold changes of the relevantinteraction partners) Finally the score of a network module ofconnected nodes was obtained by comparing its cumulative ac-tivity (ie the square of the average weighted expression for allits nodes) against a bootstrap distribution of cumulative activ-ities obtained via random sampling of a matched number offold change values When testing the enrichment of known can-cer genes in the top-scored network modules identified withthis approach a similar or better performance was reached incomparison with other widely used module-finding algorithms(potential alternative applications of the identified modules forbiomarker modeling were not assessed in this publication)

Apart from averaging molecular activities over networkneighborhoods or using community identification methods sig-nal processing techniques may provide a further means to gleanuseful information from a network for predictive model build-ing as shown in an approach by Rapaport et al [44] They madeuse of the observation that genes in close proximity to eachother in a network tend to have similar expression and pro-posed to denoise microarray measurements by removing theirhigh-frequency component over the network For this purposespectral decomposition of gene expression profiles with respectto a molecular network graph was applied followed by attenu-ating high-frequency signal components expected to representmeasurement noise The method was evaluated for supervisedanalysis of irradiated and nonirradiated yeast strains using aSVM providing similar classification performance as a modelbuilt without the network-based filtering but facilitating

biological data interpretation by grouping the selected biomol-ecules according to their participation in the network modules

One-step machine learning approaches for networkanalysis

In contrast to the network analysis approaches considered sofar which apply network feature extraction and predictive ma-chine learning analysis in separate steps more recently one-step network-based feature selection approaches have beenproposed integrating the attribute selection directly into thepredictive model building Most of these approaches formulatethe model building task as an optimization problem formula-tion in which the objective function for classification or regres-sion is extended by a penalty term promoting the selection ofgrouped features in a molecular network (this strategy is alsoreferred to as network-constraint regularization)

Li et al [45] proposed one of the first correspondingapproaches by adding a penalty term to linear regression incor-porating network information into the analysis via theLaplacian matrix of the network graph The approach penalizesthe L1-norm of the feature weights and encourages a smoothprofile of weights over neighboring nodes in the networkHowever Binder and Schumacher later reported that themethod has lower performance than a null model ie a modelnot using any covariate information [53] As possible explan-ations they note that Li et al discarded censored observationsand about 20 000 variables that could not be assigned to corres-ponding nodes in the molecular interaction network (see sec-tion on lsquolimitationsrsquo below) Yang et al [46] suggested that thepreviously used network grouping penalties can introduce add-itional estimation bias into the model when the coefficientsigns for two features connected in the graph are differentThey presented alternative penalties to achieve network group-ing and sparse feature selection in particular two non-convexpenalties which shrink only small differences in absolute val-ues of feature weights to reduce the estimation bias [46] In ex-periments on synthetic data and two real data sets the newapproaches outperformed previous feature grouping methods

However with non-convex penalties finding global optimalsolutions is often not feasible and even identifying good localoptima may require high computational effort Lorbert et al [4748] proposed an alternative generic convex penalty thePairwise Elastic Net (PEN) that provides sparse feature selectionand promotes the grouping of attributes according to auser-defined feature similarity measure (eg obtained from bio-chemical interaction weights in a molecular network) PEN is ageneralization of the Elastic Net a method providing a trade-offbetween L1- and L2-penalized regressions by an adjustable par-ameter In PEN this parameter can be replaced to determine thetrade-off using additional information from an attributesimilarity matrix (different instances of PEN can be defined aslong as the similarity matrix is positive semidefinite and non-negative) Comparing PEN against other popular machine learn-ing approaches on simulated data with a grouping structureamong the features PEN achieved a competitive mean squarederror (MSE) and provided sparser solutions than approacheswith similar MSE

More recently Vlassis et al introduced a new instance ofPEN which penalizes differences between the lsquoabsolutersquo valuesof weights of interlinked features in a network graph The mo-tivation behind this approach termed GenePEN is that themagnitude of a weight in a linear model reflects the predictivevalue of the corresponding variable so that the weights for

6 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

irrelevant features are driven to zero by the penalty By ensuringthe convexity of the penalty function global optimal solutionscan be identified efficiently with existing optimization frame-works When evaluating GenePEN on simulated data andreal-word microarray data sets in comparison with other classi-fication methods using feature grouping the method providedsimilar predictive power and gene selections sharing signifi-cantly more connections within a molecular interaction net-work Visualization of the corresponding subnetworks enableda biological interpretation of disease-affected network regionswhich were enriched in known disease-related genes obtainedfrom literature mining

Overall network-based sample classification methods pro-vide a new means to analyze complex omics data sets allowingresearchers to identify coherent molecular network alterationsunder different biological conditions Identifying such network-level patterns in omics data for diseases with complexmolecular manifestations can shed new light on the molecularmechanisms of the disease and facilitate the development ofrobust multifactorial biomarker signatures

In contrast to single-molecule-based biomarker modeling anetwork-level approach has the potential to capture diverse fac-ets of a heterogenic disease reflected via alteration patterns indifferent network regions In comparison with pathway-basedmachine learning approaches methods using genome-scalednetworks as prior knowledge may produce models that are moredifficult to interpret biologically but which can identify a muchwider range of alterations in cellular processes (covering manygenes proteins or metabolites without any known pathway an-notations) Finally network- and pathway-based classificationapproaches share the main benefit of improving model robust-ness in cross-study analyses as compared with using individualbiomolecules as features Among these new higher-level bio-marker signatures network-based signatures accounting formolecular activities over larger and algorithm-derived networkregions may often provide more robust multifactorial markersthan signatures for smaller pathways which are typicallydefined subjectively possibly overlooking relevant functionallyrelated molecules in the surrounding network However themodel robustness will also depend on other factors eg the oc-currence of protein complexes in the studied pathwaynetwork(members of these complexes tend to have highly coordinatedactivity providing more robust averages) and the reliability andcompleteness of the specific network or pathway data sourceused (see limitations discussed in the following section)

Limitations and possible solution strategies

While pathway- and network-based analyses of omics data canenrich the biological interpretation of complex molecular alter-ation patterns and facilitate robust biomarker modeling usersshould also be aware of common limitations behind thesemethods and possibilities to address them

Although detailed functional annotations have been col-lected for a large portion of genes in humans and commonmodel organisms many identified genes still lack any func-tional assignment Consequently a significant portion of genesand proteins analyzed via current high-throughput measure-ment techniques can often not be assigned to any pathway ormolecular network for systems-level analyses Similarly sev-eral gene regulatory interactions and PPIs are still unknownand experimental techniques to identify molecular interactionsmay also generate false-positive results [54] Thus public mo-lecular interaction databases only cover a subset of biochemical

interactions occurring in living cells and the reliability of re-ported interactions varies depending on the amount and type ofassociated evidence

A further limitation of current pathway databases andassembled molecular networks is that tissue- and cell-type spe-cificity of molecular interactions and protein functions is typic-ally not considered in these data resources This can beparticularly problematic when studying diseases showing se-lective vulnerability for specific tissues or cell types eg asfound in many neurodegenerative disorders [55] Similarly inmany biomarker profiling studies and public omics data sour-ces concomitant data from different tissue and cell types for adisease of interest is lacking as well as longitudinal datarequired to identify relevant dynamic changes during diseasepathogenesis and progression Moreover a variety of limita-tions related to the specific origin processing and storage ofspecimens collected for biomarker studies may negatively im-pact subsequent analyses and the comparability of data fromdifferent studies For example studies might differ in terms ofthe sample storage durations in terms of whether the referencetissue for tumor tissue is derived from the same patient or froman unaffected control individual whether a preinfarction speci-men is accepted as a reference for the specimen obtained on in-farct suspicion or whether laser capture microdissection isapplied to isolate specific cells of interest rather than using het-erogeneous cell populations

To account for the varying reliability of public biochemicalinteraction data when assembling a molecular network variousapproaches to determine confidence scores have been proposed[56ndash59] One of the most comprehensive resources is theSTRING PPI database which allows users to filter and extractbiochemical interactions for a wide range of organisms usingdifferent sources of evidence as well as a combined confidencescore [60 61]

A frequently encountered difficulty in pathway analyses isalso the conversion of the information content of public path-way diagrams into adequate data structures for computationalinvestigations Pathway structures usually contain more com-plex layers of information than genendashgene networks eg meta-nodes representing protein complexes or gene families andcatalysis reactions represented by edges pointing to edgesDedicated computational approaches are therefore required toconvert pathway diagrams into uniform network representa-tions eg using biology-driven rules as in the RBioconductorsoftware package lsquographitersquo [62]

Apart from this conversion task building tissue-specificpathwaynetwork representations or assigning probabilities fortissue specificity to the biological interactions in an existingnetwork is a further challenging problem because interactionsare rarely confirmed experimentally in multiple specific tissuesMost bioinformatics approaches for assigning tissue specificityscores to protein interactions use publicly available gene or pro-tein expression data from large-scale tissue profiling studies[63 64] eg predicting the potential for a molecular interactionto occur in a certain tissue depending on whether both inter-action partners are expressed in this tissue [65] Moreover dedi-cated pathway- and network-analysis methods have beendeveloped for tissue-specific disease gene prioritization andgene set enrichment analysis [23 66] which may also provideuseful filters for feature selection in machine learning methodsfor sample classification

While changes in the mean or variance of the overall mo-lecular activity in pathways can provide robust biomarkers forsome diseases pathways may also contain forks and only one

Diagnostic specimen classification | 7

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

of their branches or stems may be affected in a pathologicalcondition Approaches focusing on changes in the overall path-way activity may therefore fail in detecting significant alter-ations affecting only one pathway branch Although networkanalysis methods identifying altered subnetworks may be help-ful in this situation (using a pathway graph as input instead of agenome-scale molecular network) ideally available knowledgeon distinct pathway branches or outstanding regions in a com-plex pathway architecture should be exploited directly eg bypartitioning a pathway into corresponding lsquosubpathwaysrsquo andseparately analyzing their molecular alterations using existingpathway-level bioinformatics tools

Another challenging and still largely unsolved problem inthe analysis of alterations for pathways networks and singlebiomolecules is the distinction between omics changes repre-senting causes or effects (or lsquodriversrsquo or lsquopassengersrsquo) of a dis-ease Apart from philosophical disagreements on howcausation should be defined [67] the discrimination betweencorrelated events and causally linked events on high-through-put omics data sets is typically severely hampered by largenumbers of potential confounding variables and complexnonlinear relationships between the attributes Thus conclu-sive evidence for causal links can usually only be obtained fromin vitro and in vivo perturbation experiments rather than fromhigh-throughput data analysis even when using high-qualitytime-series data with large sample sizes and dedicated statisticsfor identifying causendasheffect relationships (eg the Granger caus-ality test [68]) Prior biological knowledge eg on transcriptionfactorndashtarget relationships can however enable the construc-tion of causal graphs to check whether omics measurementsare consistent with certain causal molecular hypotheses or toprioritize corresponding hypotheses for experimental validation[69] Bill Shipley has written a dedicated book on methods forcausal analysis of observational data in biology [67] Theseapproaches are not only relevant for identifying drug targetsand investigating disease etiology and progression but mayalso provide useful information for the biological interpretationof diagnostic biomarker models

Improving incomplete pathways and networks by identify-ing missing molecular interactions is arguably one of the mostchallenging problems in this domain Pathway definitions aretypically created in a subjective manner because no universalpredefined criteria for determining pathway membership andboundaries are available and the curators of the correspondingdatabases are required to make judgments on the relevance ofpotential pathway members These subjective definitions maybe prejudiced and erroneous and even networks assembledautomatically from public databases may contain biases eg re-sulting from prior subjective decisions on which putative PPIsare of sufficient interest for experimental validation Moreoverin diseases the pathway topology may change and genericpathway databases have recently been complemented by dedi-cated pathway maps for specific human disorders egAlzPathway [70] for Alzheimerrsquos disease and PDMap [71] forParkinsonrsquos disease Because these disease-related pathwaymaps typically cover the biomolecules involved in a specific dis-order and the potentially altered network topology more com-prehensively and accurately than generic pathway mapsdisease-specific pathways should be considered as the pre-ferred resource for pathway-based diagnostic biomarker model-ing (in the specific case of a known disease-associated topologyalteration ideally a corresponding pathway map with un-altered topology should be used additionally as a reference inthe modeling process)

Moreover dedicated bioinformatics techniques have beendeveloped to identify lsquonetwork rewiringrsquo events eg brokenchains or new interactions changing the pathwaynetwork top-ology which remain undetected by conventional pathway ana-lyses Instead of assuming that molecules close to each other ina network undergo similar changes in a disease theseapproaches identify alterations in the correlations between theactivities of these molecules as signatures for potential patho-logical changes in network topology [72ndash74] However the top-scoring results require experimental verification eg using invitro disease models and the success may strongly depend onwhether reliable prior pathway knowledge is available

To support relevant manual pathway curation efforts com-putational approaches for ab initio discovery of pathways withingenetic or molecular networks have been proposed [75ndash78] aswell as bioinformatics methods to extend existing pathway def-initions according to objective criteria for pathway compactnessand connectedness [79] Moreover various machine learningapproaches for predicting unknown PPIs using public structuraland functional information have been developed [80ndash84] egthe authors of the PrePPI method report validation results sug-gesting their prediction algorithm is comparable in accuracy tohigh-throughput experiments [85] For proteins with unknownstructure and few functional annotations these in silico predic-tion approaches are however still bound to provide limited reli-ability and only high-confidence biochemical interactionsshould be included in networks for omics data analysis

Finally to reduce the influence of false-positive predictionson network analyses molecular interactions should not only beprefiltered by confidence when assembling the network butalso weighted by confidence to apply analysis techniquesexploiting these weights to score the reliability of identified net-work alterations In subsequent biological validation experi-ments investigators can then focus on verifying only the mostreliable network alteration patterns derived from the in silicoanalyses

Translating multifactorial marker models intodiagnostic testsPathways from omics research discoveries to clinicaldiagnostic assays

High-throughput omics measurement techniques are typicallynot designed for diagnostic applications but for broad systems-level analyses hypothesis generation and the construction offirst tentative machine learning models for sample classifica-tion Such tentative models require subsequent refinement andvalidation using more sensitive and reproducible measurementtechniques to evaluate their potential for diagnostic applica-tions For example a sample classification model built andcross-validated using microarray gene expression data with anembedded feature selection to choose only the most inform-ative genes as predictors can be validated using more accuratequantitative reverse transcription polymerase chain reaction(qRT-PCR) measurements for the subset of chosen genes

Importantly to avoid wrong conclusions in the evaluation ofdiagnostic classification models adequate statistical methodshave to be chosen to assess a modelrsquos overall predictive per-formance (quantifying how close predictions are to the actualoutcome) its calibrationreliability (measuring how close to x of100 individuals with a risk prediction of x have the outcome)and its discriminative ability (determining whether individualswith the outcome have higher risk predictions than those

8 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

without) [86] Owing to the inherent uncertainty associated withdiagnostic tasks predictions should be provided in a probabilis-tic rather than deterministic form [87] and the overall per-formance should be quantified using so-called lsquoproper scoringrulesrsquo for which the expected score is optimized when the pre-dictive distribution agrees with the true distribution of thequantity to be estimated (a corresponding example is the Brierscore for binary and survival outcomes [88]) Conversely the op-timization of models with respect to conventional discontinu-ous non-error rates like the percentage correct classificationcan provide misleading results eg when the predicted proba-bilities are close to the chosen decision threshold required forthese measures [89] To assess the calibration of a model theHosmerndashLemeshow lsquogoodness-of-fitrsquo test can be used [90] andthe concordance statistic to quantify the discrimination ability[91] If a reference prediction system is already available dedi-cated measures of the relative improvement achieved with anew prediction method should additionally be computed(referred to as the lsquoskillrsquo eg quantified via the Brier Skill Score[92]) Moreover decision-analytic approaches like decision curveanalysis [93] designed to assess the net benefit achieved makingdecisions according to model predictions should be applied if themodel is to be used to direct clinical patient management [86]

For the study design initial power calculations are requiredto ensure that sufficient sample sizes are available for all statis-tical assessments [94] This also involves choosing an adequatesplitting of the measured data into training test and validationsets and selecting suitable cross-validation or resampling tech-niques for model optimization and evaluation (eg using two-level external cross-validation [95]) [96]

Importantly clinical validation does not only necessitatesignificantly larger sample sizes than most research studiesbut also independent replication tests on data from other pa-tient cohorts the clear specification of the biological rationalebehind the method and a demonstration of its clinical utility Incontrast to the regulatory framework for drugs there are mul-tiple pathways for the translation of omics-based tests intovalidated in vitro diagnostic test devices These tests can bedeveloped and validated either through review by the Food andDrug Administration (FDA) or through validation and perform-ance by a specific laboratory certified according to ClinicalLaboratory Improvement Amendments (CLIA) [97]

Because using established medical product developmentpipelines as in pharmaceutical companies is not common prac-tice in academia for many biomedical research institutions anearly collaboration with an experienced industrial partner isoften advisable Although currently no unique and widely rec-ognized standard process for translating omics researchfindings into clinical diagnostics is available common recom-mendations by widely recognized health organizations can befollowed In particular a committee by the US Institute ofMedicine has conducted a study on omics-based clinical test de-velopment and proposed a generic process for the developmentand evaluation of these tests as a recommended guideline [97]A corresponding example process which is briefly outlined forillustration purposes in Figure 1 and not meant to cover all im-portant variations starts with the discovery phase in which acandidate biomarker model is built on a training set lockeddown and evaluated on a test data set (this set of samplesshould be completely independent from the training set) In thefollowing test validation phase after institutional review boardapproval and consultation with the FDA a CLIA-certified labora-tory defines and optimizes the diagnostic test method clinicallyand biologically validates the test on a blinded sample set and

implements the test according to current clinical laboratorystandards

Interestingly the authors of the guideline highlight that afrequent shortcoming of omics-based tests is the lack of a biolo-gical rationale behind the testmdashwhile single-molecule markersare often known to play a role in the disease multifactorialomics models obtained from machine learning are often moredifficult to interpret and involve a greater risk of overfittingNew pathway- and network-based modeling techniques as dis-cussed in this review could therefore help to address some ofthese shortcomings and provide more interpretable and robustmodels as opposed to classical lsquoblack boxrsquo machine learningmodels

In the following stage of the clinical development processthe locked-down test is evaluated for clinical utility via one ofthe following approaches (i) A prospectivendashretrospective studyusing archived specimens from previous clinical trials (ii) a pro-spective clinical trial in which the test (a) directs patient man-agement or (b) does not direct patient management [97] Thecomplexity and duration of a corresponding clinical study ortrial will largely depend on the specific type of biomarker de-veloped and the proposed clinical benefit For diagnostic bio-markers focused on in this review procedures may varysignificantly depending on whether the test is designed to de-tect the presence severity or the subtype of a diseasePrognostic biomarkers which indicate the future clinical courseof a patient with regard to a specific outcome and predictivebiomarkers which predict the responders and the extent of sus-ceptibility to a particular drug effect will also require differentdevelopment and evaluation procedures than diagnosticmarkers Finally for each type of biomarker different clinicalbenefits may be envisaged and significantly influence the de-sign of a study eg the goal to choose more appropriate treat-ment options or the objective to diagnose the disease earlier toenable more effective therapies to prevent halt or slow downits progression

Previous success stories in omics-based development ofdiagnostic assays

A variety of multifactorial omics-based biomarker models havebeen translated successfully into diagnostic tests in recentyears in particular in the field of cancer sub-type stratificationA prominent example is the Oncotype DX test to assess a wom-anrsquos risk of recurrence of early-stage estrogen-receptor-positivebreast cancer and the likelihood of benefitting from chemother-apy after surgery This test measures the activity of 21 genes intumor samples and then determines a recurrence score numberbetween 0 and 100 (higher scores reflect greater risk of recur-rence within 10 years) As opposed to other diagnostic testsusing frozen samples the Oncotype DX assay uses tumor tissuesamples that are chemically preserved and sealed in paraffinwax (see [98 99] for details on the sample collection andanalytics)

The development of Oncotype DX involved typical steps ofan omics biomarker profiling and top-down filtering approachfirst by analyzing the entire transcriptome on high-throughputmicroarray data and using knowledge from the literature andgenomic databases 250 candidate marker genes were selected[98] The relation between the expression of these candidatesand the recurrence of breast cancer was then assessed in datafrom three independent clinical studies on 447 patients The re-sults were used for a final filtering providing a panel of 16 can-cer-related genes and 5 reference genes whose expression

Diagnostic specimen classification | 9

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

levels enabled the calculation of recurrence scores for tumorsamples via a machine learning model This diagnostic ap-proach was successfully validated in multiple clinical studiesand has been included in the treatment guidelines for breastcancer by the National Comprehensive Cancer Center Networkand the American Society of Clinical Oncology

While in the case of the Oncotype DX test the set of requiredmarkers could be narrowed down to a small number of geneswith prior knowledge on their relationship to the disease forother complex and more heterogeneous diseases significantlylarger numbers of molecular predictors may be needed for ac-curate diagnosis In such cases pathway- and network-basedmodeling approaches may facilitate the generation of robustand biologically interpretable models which could afterwardundergo similar diagnostic test development and validationprocedures as the initial model behind the Oncotype DX assayImportantly the success of the Oncotype DX approach is not anisolated case but other commercial diagnostic assays havebeen developed and validated using similar strategies includingMammaPrint [100] Prosigna (PAM50) [101] Mammostrat [102]Tissue of Origin [103] AlloMap [104] Corus CAD [105] and OVA1[106] among others

In summary successful translation of omics-based bio-marker models into clinically accepted commercial diagnostictests has been achieved in multiple cases in the past Given alarge number of complex diseases for which more reliable ear-lier and cheaper diagnostic tests are still needed there is a

significant potential to develop improved approaches usingomics-based biomarker modeling and exploiting prior biologicalknowledge from pathways and molecular networks

Conclusions

For diseases with complex molecular manifestations omics-based biomarker models have the potential to better reflectclinically relevant aspects of their heterogeneity than classicalsingle-molecule diagnostic markers However various statis-tical limitations in high-dimensional data analysis often result-ing in a high risk of overfitting and restricted biologicalinterpretability of standard lsquoblack boxrsquo machine learning mod-els can hamper the progress in developing new clinically usefulbiomarker signatures

Among possible strategies to reduce model complexity andintegrate prior biological knowledge into the model buildingprocedure cellular pathway- and molecular network-based pre-diction methods provide promising new routes toward more re-liable omics-based biomarker signatures A wide variety ofbioinformatics approaches have already been developed in re-cent years to identify disease-associated pathway and networkalterations in omics data and use them to classify new samplesWhile public molecular networks and pathway definitions alsohave specific limitations that need to be addressed by the usereg by assigning confidence weights to molecular interactionspathway- and network-level predictive analyses often lead to

Figure 1 Example illustration of common stages during the development of omics-based diagnostic tests (simplified version of the process presented in a study by the

US Institute of Medicine [97] focusing on the major steps in the pipeline) After the transition from the second to the third phase (highlighted by the lock symbol) the

diagnostic test must be fully defined validated and locked down Many important variants and alternatives to the outlined example process exist as well as different

realizations of generic steps in the process (eg cases in which a test directs patient management may cover different situations depending on whether clinicians are

free to use the test result as they see fit or whether predefined procedures have to be followed subject to contraindications andor subject to the test results) The setup

may also vary depending on whether it is known exactly how patients would have been treated had they been randomized to the opposite arm depending on whether

the test entails a treatment delay and whether the adequate cutoff threshold for the test is uncertain

10 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

classification models with improved interpretability and robust-ness in cross-study prediction tasks These new systems-levelapproaches will therefore provide a valuable addition to theexisting toolset for omics-based biomarker modeling

Key Points

bull Omics-based machine learning models for diagnosticspecimen classification often lack biological interpret-ability and have limited cross-study robustness andaccuracy Integrating prior knowledge from cellularpathway and molecular network into these modelscan help to address these shortcomings

bull Public pathway and molecular interaction databasescover only a subset of the biochemical interactionsoccurring in living cells typically lack annotations ontissue-specific reactions and may also contain errorsWhen applying pathway or network analysis methodsthe user should be aware of these limitations and thedifferent options to address or alleviate them eg byusing confidence weighted interactions and estimatingtheir tissue specificity using public gene or protein ex-pression data

bull Pathway- and network-based machine learning meth-ods for diagnostic specimen classification are charac-terized by different strengths and weaknesses Whilepathway-based approaches have the benefit of provid-ing models with high biological interpretability thattake advantage of human expert knowledge on dis-ease-relevant cellular processes their coverage is lim-ited to biomolecules with known annotations On thecontrary network-based models are typically more dif-ficult to interpret but can exploit information on amuch wider range of biomolecules

Funding

This work was supported by the Fonds Nationale de laRecherche Luxembourg (grant no C13BM5782168)

References1 Dubois B Feldman HH Jacova C et al Advancing research

diagnostic criteria for Alzheimerrsquos disease the IWG-2 crite-ria Lancet Neurol 201413614ndash29

2 Chahine LM Stern MB Diagnostic markers for Parkinsonrsquosdisease Curr Opin Neurol 201124309ndash17

3 Tanaka T Tanaka M Tanaka T et al Biomarkers for colorec-tal cancer Int J Mol Sci 2010113209ndash25

4 Collins LG Haines C Perkel R et al Lung cancer diagnosisand management Am Fam Physician 20077556ndash63

5 Martin P Laupland KB Frost EH et al Laboratory diagnosisof Ebola virus disease Intensive Care Med 201541895ndash8

6 Mistry K Cable G Meta-analysis of prostate-specific antigenand digital rectal examination as screening tests for pros-tate carcinoma J Am Board Fam Pract 20031695ndash101

7 Bender R Lange S Adjusting for multiple testing - when andhow J Clin Epidemiol 200154343ndash9

8 Yu L Liu H Redundancy based feature selection for micro-array data In Proceedings of the 10th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining New York NY ACM 2004 737ndash42

9 Ding C Peng H Minimum redundancy feature selectionfrom microarray gene expression data J Bioinform ComputBiol 20053185ndash205

10 Bellman R Bellman RE Bellman RE et al Adaptive control proc-esses a guided tour Princeton Princeton University Press 1961

11 Verleysen M Francois D The curse of dimensionality indata mining Analysis 20053512758ndash70

12 Kim S Kon M DeLisi C Pathway-based classification of can-cer subtypes Biol Direct 2012721

13 Ideker T Sharan R Protein networks in disease Genome Res200818644ndash52

14 Ogata H Goto S Sato K et al KEGG Kyoto encyclopedia ofgenes and genomes Nucleic Acids Res 19992729ndash34

15 Nishimura D BioCarta Biotech Softw Internet Rep 20012117ndash20

16 Pico AR Kelder T Van Iersel MP et al WikiPathways path-way editing for the people PLoS Biol 200861403ndash7

17 Joshi-Tope G Gillespie M Vastrik I et al Reactome a knowl-edgebase of biological pathways Nucleic Acids Res 200533D428ndash32

18 Schaefer CF Anthony K Krupa S et al PID thepathway interaction database Nucleic Acids Res 200937D674ndash9

19 Subramanian A Tamayo P Mootha VK et al Gene setenrichment analysis a knowledge-based approach for inter-preting genome-wide expression profiles Proc Natl Acad SciUSA 200510215545ndash50

20 Kim S-Y Volsky DJ PAGE parametric analysis of gene setenrichment BMC Bioinformatics 20056144

21 Luo W Friedman MS Shedden K et al GAGE generally ap-plicable gene set enrichment for pathway analysis BMCBioinformatics 200910161

22 Oron AP Jiang Z Gentleman R Gene set enrichment ana-lysis using linear models and diagnostics Bioinformatics2008242586ndash91

23 Glaab E Baudot A Krasnogor N et al EnrichNet network-based gene set enrichment analysis Bioinformatics 201228i451ndash7

24 Shannon P Markiel A Ozier O et al Cytoscape a softwareenvironment for integrated models of biomolecular inter-action networks Genome Res 2003132498ndash504

25 Van Iersel MP Kelder T Pico AR et al Presenting and explor-ing biological pathways with PathVisio BMC Bioinformatics20089399

26 Ilyin SE Belkowski SM Plata-Salaman CR Biomarker dis-covery and validation Technologies and integrativeapproaches Trends Biotechnol 200422411ndash16

27 Guo Z Zhang T Li X et al Towards precise classification ofcancers based on robust gene functional expression profilesBMC Bioinformatics 2005658

28 Tomfohr J Lu J Kepler TB Pathway level analysis of geneexpression using singular value decomposition BMCBioinformatics 20056225

29 Lee E Chuang H-Y Kim J-W et al Inferring pathway activitytoward precise disease classification PLoS Comput Biol 20084e1000217

30 Su J Yoon BJ Dougherty ER Accurate and reliable cancerclassification based on probabilistic inference of pathwayactivity PLoS One 20094e8161

31 Glaab E Schneider R PathVar analysis of gene and proteinexpression variance in cellular pathways using microarraydata Bioinformatics 201228446ndash7

32 Svensson JP Stalpers LJA Esveldt-van Lange REE et alAnalysis of gene expression using gene sets discriminates

Diagnostic specimen classification | 11

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer patients with and without late radiation toxicityPLoS Med 200631904ndash14

33 Efroni S Schaefer CF Buetow KH Identification of key proc-esses underlying cancer phenotypes using biologic pathwayanalysis PLoS One 20072e425

34 Vaske CJ Benz SC Sanborn JZ et al Inference of patient-specific pathway activities from multi-dimensional cancer gen-omics data using PARADIGM Bioinformatics 201026i237ndash45

35 Breslin T Krogh M Peterson C et al Signal transductionpathway profiling of individual tumor samples BMCBioinformatics 20056163

36 Ashburner M Ball CA Blake JA et al Gene ontology tool forthe unification of biology The Gene Ontology ConsortiumNat Genet 20002525ndash9

37 Glaab E Garibaldi JM Krasnogor N Learning pathway-baseddecision rules to classify microarray cancer samplesGerman Conference on Bioinformatics 2010173123ndash34

38 Glaab E Bacardit J Garibaldi JM et al Using rule-based ma-chine learning for candidate disease gene prioritization andsample classification of cancer gene expression data PLoSOne 20127e39932

39 Tuck DP Kluger HM Kluger Y Characterizing disease statesfrom topological properties of transcriptional regulatorynetworks BMC Bioinformatics 20067236

40 Ma X Lee H Wang L et al CGI a new approach for prioritiz-ing genes by combining gene expression and protein-protein interaction data Bioinformatics 200723215ndash21

41 Chuang H-Y Lee E Liu Y-T et al Network-based classifica-tion of breast cancer metastasis Mol Syst Biol 20073140

42 Taylor IW Linding R Warde-Farley D et al Dynamic modu-larity in protein interaction networks predicts breast canceroutcome Nat Biotechnol 200927199ndash204

43 Petrochilos D Shojaie A Gennari J et al Using random walksto identify cancer-associated modules in expression dataBioData Min 2013617

44 Rapaport F Zinovyev A Dutreix M et al Classification ofmicroarray data using gene networks BMC Bioinformatics2007835

45 Li C Li H Network-constrained regularization and variableselection for analysis of genomic data Bioinformatics 2008241175ndash82

46 Yang S Yuan L Lai Y-C et al Feature grouping and selectionover an undirected graph KDD 2012922ndash30

47 Lorbert A Eis D Kostina V et al Exploiting covariate similar-ity in sparse regression via the pairwise elastic net Int ConfArtif Intell Stat 20109477ndash84

48 Lorbert A Ramadge PJ The pairwise elastic net support vec-tor machine for automatic fMRI feature selection In ProcIEEE Int Conf Acoust Speech Signal Process Vancouver BC USA20131036ndash40 IEEE Press Washington DC

49 Vlassis N Glaab E GenePEN analysis of network activity al-terations in complex diseases via the pairwise elastic netStat Appl Genet Mol Biol 201514221ndash4

50 Wingender E Chen X Hehl R et al TRANSFAC an integratedsystem for gene expression regulation Nucleic Acids Res200028316ndash19

51 Newman MEJ The structure and function of complex net-works SIAM Rev 200345167ndash56

52 Geman S Geman D Stochastic relaxation Gibbs distribu-tions and the Bayesian restoration of images IEEE TransPattern Anal Mach Intell 19846721ndash41

53 Binder H Schumacher M Comment on ldquoNetwork-constrained regularization and variable selection for ana-lysis of genomic datardquo Bioinformatics 2008242566ndash8

54 Sprinzak E Sattath S Margalit H How reliable are experi-mental protein-protein interaction data J Mol Biol 2003327919ndash23

55 Saxena S Caroni P Selective neuronal vulnerability inneurodegenerative diseases from stressor thresholds todegeneration Neuron 20117135ndash48

56 Suthram S Shlomi T Ruppin E et al A direct comparison ofprotein interaction confidence assignment schemes BMCBioinformatics 20067360

57 Kamburov A Grossmann A Herwig R et al Cluster-basedassessment of protein-protein interaction confidence BMCBioinformatics 201213262

58 Li D Liu W Liu Z et al PRINCESS a protein interaction confi-dence evaluation system with multiple data sources MolCell Proteomics 200871043ndash52

59 Bader JS Chaudhuri A Rothberg JM et al Gaining confidencein high-throughput protein interaction networks NatBiotechnol 20042278ndash85

60 Von Mering C Huynen M Jaeggi D et al STRING a databaseof predicted functional associations between proteinsNucleic Acids Res 200331258ndash61

61 Franceschini A Szklarczyk D Frankild S et al STRING v91protein-protein interaction networks with increased cover-age and integration Nucleic Acids Res 201341D808ndash15

62 Sales G Calura E Cavalieri D et al Graphite - a Bioconductorpackage to convert pathway topology to gene network BMCBioinformatics 20121320

63 Su AI Wiltshire T Batalov S et al A gene atlas of the mouseand human protein-encoding transcriptomes Proc Natl AcadSci USA 20041016062ndash7

64 Ponten FK Schwenk JM Asplund A et al The human proteinatlas as a proteomic resource for biomarker discovery JIntern Med 2011270428ndash46

65 Bossi A Lehner B Tissue specificity and the human proteininteraction network Mol Syst Biol 20095260

66 Magger O Waldman YY Ruppin E et al Enhancing the pri-oritization of disease-causing genes through tissue specificprotein interaction networks PLoS Comput Biol 20128e1002690

67 Shipley B Cause and Correlation in Biology A Userrsquos Guide toPath Analysis Structural Equations and Causal Inference Vol 2Cambridge Cambridge University Press 2002 p 332

68 Society TE Granger CWJ Investigating causal relations byeconometric models and cross-spectral methodsEconometrica 196937424ndash38

69 Chindelevitch L Ziemek D Enayetallah A et al Causal rea-soning on biological networks interpreting transcriptionalchanges Bioinformatics 2012281114ndash21

70 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 2012652

71 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 20144988

72 Hudson NJ Reverter A Dalrymple BP A differential wiringanalysis of expression data correctly identifies the gene con-taining the causal mutation PLoS Comput Biol 20095e1000382

73 Reverter A Hudson NJ Nagaraj SH et al Regulatory impactfactors unraveling the transcriptional regulation ofcomplex traits from expression data Bioinformatics 201026896ndash904

74 Watson M CoXpress differential co-expression in gene ex-pression data BMC Bioinformatics 20067509

12 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1

irrelevant features are driven to zero by the penalty By ensuringthe convexity of the penalty function global optimal solutionscan be identified efficiently with existing optimization frame-works When evaluating GenePEN on simulated data andreal-word microarray data sets in comparison with other classi-fication methods using feature grouping the method providedsimilar predictive power and gene selections sharing signifi-cantly more connections within a molecular interaction net-work Visualization of the corresponding subnetworks enableda biological interpretation of disease-affected network regionswhich were enriched in known disease-related genes obtainedfrom literature mining

Overall network-based sample classification methods pro-vide a new means to analyze complex omics data sets allowingresearchers to identify coherent molecular network alterationsunder different biological conditions Identifying such network-level patterns in omics data for diseases with complexmolecular manifestations can shed new light on the molecularmechanisms of the disease and facilitate the development ofrobust multifactorial biomarker signatures

In contrast to single-molecule-based biomarker modeling anetwork-level approach has the potential to capture diverse fac-ets of a heterogenic disease reflected via alteration patterns indifferent network regions In comparison with pathway-basedmachine learning approaches methods using genome-scalednetworks as prior knowledge may produce models that are moredifficult to interpret biologically but which can identify a muchwider range of alterations in cellular processes (covering manygenes proteins or metabolites without any known pathway an-notations) Finally network- and pathway-based classificationapproaches share the main benefit of improving model robust-ness in cross-study analyses as compared with using individualbiomolecules as features Among these new higher-level bio-marker signatures network-based signatures accounting formolecular activities over larger and algorithm-derived networkregions may often provide more robust multifactorial markersthan signatures for smaller pathways which are typicallydefined subjectively possibly overlooking relevant functionallyrelated molecules in the surrounding network However themodel robustness will also depend on other factors eg the oc-currence of protein complexes in the studied pathwaynetwork(members of these complexes tend to have highly coordinatedactivity providing more robust averages) and the reliability andcompleteness of the specific network or pathway data sourceused (see limitations discussed in the following section)

Limitations and possible solution strategies

While pathway- and network-based analyses of omics data canenrich the biological interpretation of complex molecular alter-ation patterns and facilitate robust biomarker modeling usersshould also be aware of common limitations behind thesemethods and possibilities to address them

Although detailed functional annotations have been col-lected for a large portion of genes in humans and commonmodel organisms many identified genes still lack any func-tional assignment Consequently a significant portion of genesand proteins analyzed via current high-throughput measure-ment techniques can often not be assigned to any pathway ormolecular network for systems-level analyses Similarly sev-eral gene regulatory interactions and PPIs are still unknownand experimental techniques to identify molecular interactionsmay also generate false-positive results [54] Thus public mo-lecular interaction databases only cover a subset of biochemical

interactions occurring in living cells and the reliability of re-ported interactions varies depending on the amount and type ofassociated evidence

A further limitation of current pathway databases andassembled molecular networks is that tissue- and cell-type spe-cificity of molecular interactions and protein functions is typic-ally not considered in these data resources This can beparticularly problematic when studying diseases showing se-lective vulnerability for specific tissues or cell types eg asfound in many neurodegenerative disorders [55] Similarly inmany biomarker profiling studies and public omics data sour-ces concomitant data from different tissue and cell types for adisease of interest is lacking as well as longitudinal datarequired to identify relevant dynamic changes during diseasepathogenesis and progression Moreover a variety of limita-tions related to the specific origin processing and storage ofspecimens collected for biomarker studies may negatively im-pact subsequent analyses and the comparability of data fromdifferent studies For example studies might differ in terms ofthe sample storage durations in terms of whether the referencetissue for tumor tissue is derived from the same patient or froman unaffected control individual whether a preinfarction speci-men is accepted as a reference for the specimen obtained on in-farct suspicion or whether laser capture microdissection isapplied to isolate specific cells of interest rather than using het-erogeneous cell populations

To account for the varying reliability of public biochemicalinteraction data when assembling a molecular network variousapproaches to determine confidence scores have been proposed[56ndash59] One of the most comprehensive resources is theSTRING PPI database which allows users to filter and extractbiochemical interactions for a wide range of organisms usingdifferent sources of evidence as well as a combined confidencescore [60 61]

A frequently encountered difficulty in pathway analyses isalso the conversion of the information content of public path-way diagrams into adequate data structures for computationalinvestigations Pathway structures usually contain more com-plex layers of information than genendashgene networks eg meta-nodes representing protein complexes or gene families andcatalysis reactions represented by edges pointing to edgesDedicated computational approaches are therefore required toconvert pathway diagrams into uniform network representa-tions eg using biology-driven rules as in the RBioconductorsoftware package lsquographitersquo [62]

Apart from this conversion task building tissue-specificpathwaynetwork representations or assigning probabilities fortissue specificity to the biological interactions in an existingnetwork is a further challenging problem because interactionsare rarely confirmed experimentally in multiple specific tissuesMost bioinformatics approaches for assigning tissue specificityscores to protein interactions use publicly available gene or pro-tein expression data from large-scale tissue profiling studies[63 64] eg predicting the potential for a molecular interactionto occur in a certain tissue depending on whether both inter-action partners are expressed in this tissue [65] Moreover dedi-cated pathway- and network-analysis methods have beendeveloped for tissue-specific disease gene prioritization andgene set enrichment analysis [23 66] which may also provideuseful filters for feature selection in machine learning methodsfor sample classification

While changes in the mean or variance of the overall mo-lecular activity in pathways can provide robust biomarkers forsome diseases pathways may also contain forks and only one

Diagnostic specimen classification | 7

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

of their branches or stems may be affected in a pathologicalcondition Approaches focusing on changes in the overall path-way activity may therefore fail in detecting significant alter-ations affecting only one pathway branch Although networkanalysis methods identifying altered subnetworks may be help-ful in this situation (using a pathway graph as input instead of agenome-scale molecular network) ideally available knowledgeon distinct pathway branches or outstanding regions in a com-plex pathway architecture should be exploited directly eg bypartitioning a pathway into corresponding lsquosubpathwaysrsquo andseparately analyzing their molecular alterations using existingpathway-level bioinformatics tools

Another challenging and still largely unsolved problem inthe analysis of alterations for pathways networks and singlebiomolecules is the distinction between omics changes repre-senting causes or effects (or lsquodriversrsquo or lsquopassengersrsquo) of a dis-ease Apart from philosophical disagreements on howcausation should be defined [67] the discrimination betweencorrelated events and causally linked events on high-through-put omics data sets is typically severely hampered by largenumbers of potential confounding variables and complexnonlinear relationships between the attributes Thus conclu-sive evidence for causal links can usually only be obtained fromin vitro and in vivo perturbation experiments rather than fromhigh-throughput data analysis even when using high-qualitytime-series data with large sample sizes and dedicated statisticsfor identifying causendasheffect relationships (eg the Granger caus-ality test [68]) Prior biological knowledge eg on transcriptionfactorndashtarget relationships can however enable the construc-tion of causal graphs to check whether omics measurementsare consistent with certain causal molecular hypotheses or toprioritize corresponding hypotheses for experimental validation[69] Bill Shipley has written a dedicated book on methods forcausal analysis of observational data in biology [67] Theseapproaches are not only relevant for identifying drug targetsand investigating disease etiology and progression but mayalso provide useful information for the biological interpretationof diagnostic biomarker models

Improving incomplete pathways and networks by identify-ing missing molecular interactions is arguably one of the mostchallenging problems in this domain Pathway definitions aretypically created in a subjective manner because no universalpredefined criteria for determining pathway membership andboundaries are available and the curators of the correspondingdatabases are required to make judgments on the relevance ofpotential pathway members These subjective definitions maybe prejudiced and erroneous and even networks assembledautomatically from public databases may contain biases eg re-sulting from prior subjective decisions on which putative PPIsare of sufficient interest for experimental validation Moreoverin diseases the pathway topology may change and genericpathway databases have recently been complemented by dedi-cated pathway maps for specific human disorders egAlzPathway [70] for Alzheimerrsquos disease and PDMap [71] forParkinsonrsquos disease Because these disease-related pathwaymaps typically cover the biomolecules involved in a specific dis-order and the potentially altered network topology more com-prehensively and accurately than generic pathway mapsdisease-specific pathways should be considered as the pre-ferred resource for pathway-based diagnostic biomarker model-ing (in the specific case of a known disease-associated topologyalteration ideally a corresponding pathway map with un-altered topology should be used additionally as a reference inthe modeling process)

Moreover dedicated bioinformatics techniques have beendeveloped to identify lsquonetwork rewiringrsquo events eg brokenchains or new interactions changing the pathwaynetwork top-ology which remain undetected by conventional pathway ana-lyses Instead of assuming that molecules close to each other ina network undergo similar changes in a disease theseapproaches identify alterations in the correlations between theactivities of these molecules as signatures for potential patho-logical changes in network topology [72ndash74] However the top-scoring results require experimental verification eg using invitro disease models and the success may strongly depend onwhether reliable prior pathway knowledge is available

To support relevant manual pathway curation efforts com-putational approaches for ab initio discovery of pathways withingenetic or molecular networks have been proposed [75ndash78] aswell as bioinformatics methods to extend existing pathway def-initions according to objective criteria for pathway compactnessand connectedness [79] Moreover various machine learningapproaches for predicting unknown PPIs using public structuraland functional information have been developed [80ndash84] egthe authors of the PrePPI method report validation results sug-gesting their prediction algorithm is comparable in accuracy tohigh-throughput experiments [85] For proteins with unknownstructure and few functional annotations these in silico predic-tion approaches are however still bound to provide limited reli-ability and only high-confidence biochemical interactionsshould be included in networks for omics data analysis

Finally to reduce the influence of false-positive predictionson network analyses molecular interactions should not only beprefiltered by confidence when assembling the network butalso weighted by confidence to apply analysis techniquesexploiting these weights to score the reliability of identified net-work alterations In subsequent biological validation experi-ments investigators can then focus on verifying only the mostreliable network alteration patterns derived from the in silicoanalyses

Translating multifactorial marker models intodiagnostic testsPathways from omics research discoveries to clinicaldiagnostic assays

High-throughput omics measurement techniques are typicallynot designed for diagnostic applications but for broad systems-level analyses hypothesis generation and the construction offirst tentative machine learning models for sample classifica-tion Such tentative models require subsequent refinement andvalidation using more sensitive and reproducible measurementtechniques to evaluate their potential for diagnostic applica-tions For example a sample classification model built andcross-validated using microarray gene expression data with anembedded feature selection to choose only the most inform-ative genes as predictors can be validated using more accuratequantitative reverse transcription polymerase chain reaction(qRT-PCR) measurements for the subset of chosen genes

Importantly to avoid wrong conclusions in the evaluation ofdiagnostic classification models adequate statistical methodshave to be chosen to assess a modelrsquos overall predictive per-formance (quantifying how close predictions are to the actualoutcome) its calibrationreliability (measuring how close to x of100 individuals with a risk prediction of x have the outcome)and its discriminative ability (determining whether individualswith the outcome have higher risk predictions than those

8 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

without) [86] Owing to the inherent uncertainty associated withdiagnostic tasks predictions should be provided in a probabilis-tic rather than deterministic form [87] and the overall per-formance should be quantified using so-called lsquoproper scoringrulesrsquo for which the expected score is optimized when the pre-dictive distribution agrees with the true distribution of thequantity to be estimated (a corresponding example is the Brierscore for binary and survival outcomes [88]) Conversely the op-timization of models with respect to conventional discontinu-ous non-error rates like the percentage correct classificationcan provide misleading results eg when the predicted proba-bilities are close to the chosen decision threshold required forthese measures [89] To assess the calibration of a model theHosmerndashLemeshow lsquogoodness-of-fitrsquo test can be used [90] andthe concordance statistic to quantify the discrimination ability[91] If a reference prediction system is already available dedi-cated measures of the relative improvement achieved with anew prediction method should additionally be computed(referred to as the lsquoskillrsquo eg quantified via the Brier Skill Score[92]) Moreover decision-analytic approaches like decision curveanalysis [93] designed to assess the net benefit achieved makingdecisions according to model predictions should be applied if themodel is to be used to direct clinical patient management [86]

For the study design initial power calculations are requiredto ensure that sufficient sample sizes are available for all statis-tical assessments [94] This also involves choosing an adequatesplitting of the measured data into training test and validationsets and selecting suitable cross-validation or resampling tech-niques for model optimization and evaluation (eg using two-level external cross-validation [95]) [96]

Importantly clinical validation does not only necessitatesignificantly larger sample sizes than most research studiesbut also independent replication tests on data from other pa-tient cohorts the clear specification of the biological rationalebehind the method and a demonstration of its clinical utility Incontrast to the regulatory framework for drugs there are mul-tiple pathways for the translation of omics-based tests intovalidated in vitro diagnostic test devices These tests can bedeveloped and validated either through review by the Food andDrug Administration (FDA) or through validation and perform-ance by a specific laboratory certified according to ClinicalLaboratory Improvement Amendments (CLIA) [97]

Because using established medical product developmentpipelines as in pharmaceutical companies is not common prac-tice in academia for many biomedical research institutions anearly collaboration with an experienced industrial partner isoften advisable Although currently no unique and widely rec-ognized standard process for translating omics researchfindings into clinical diagnostics is available common recom-mendations by widely recognized health organizations can befollowed In particular a committee by the US Institute ofMedicine has conducted a study on omics-based clinical test de-velopment and proposed a generic process for the developmentand evaluation of these tests as a recommended guideline [97]A corresponding example process which is briefly outlined forillustration purposes in Figure 1 and not meant to cover all im-portant variations starts with the discovery phase in which acandidate biomarker model is built on a training set lockeddown and evaluated on a test data set (this set of samplesshould be completely independent from the training set) In thefollowing test validation phase after institutional review boardapproval and consultation with the FDA a CLIA-certified labora-tory defines and optimizes the diagnostic test method clinicallyand biologically validates the test on a blinded sample set and

implements the test according to current clinical laboratorystandards

Interestingly the authors of the guideline highlight that afrequent shortcoming of omics-based tests is the lack of a biolo-gical rationale behind the testmdashwhile single-molecule markersare often known to play a role in the disease multifactorialomics models obtained from machine learning are often moredifficult to interpret and involve a greater risk of overfittingNew pathway- and network-based modeling techniques as dis-cussed in this review could therefore help to address some ofthese shortcomings and provide more interpretable and robustmodels as opposed to classical lsquoblack boxrsquo machine learningmodels

In the following stage of the clinical development processthe locked-down test is evaluated for clinical utility via one ofthe following approaches (i) A prospectivendashretrospective studyusing archived specimens from previous clinical trials (ii) a pro-spective clinical trial in which the test (a) directs patient man-agement or (b) does not direct patient management [97] Thecomplexity and duration of a corresponding clinical study ortrial will largely depend on the specific type of biomarker de-veloped and the proposed clinical benefit For diagnostic bio-markers focused on in this review procedures may varysignificantly depending on whether the test is designed to de-tect the presence severity or the subtype of a diseasePrognostic biomarkers which indicate the future clinical courseof a patient with regard to a specific outcome and predictivebiomarkers which predict the responders and the extent of sus-ceptibility to a particular drug effect will also require differentdevelopment and evaluation procedures than diagnosticmarkers Finally for each type of biomarker different clinicalbenefits may be envisaged and significantly influence the de-sign of a study eg the goal to choose more appropriate treat-ment options or the objective to diagnose the disease earlier toenable more effective therapies to prevent halt or slow downits progression

Previous success stories in omics-based development ofdiagnostic assays

A variety of multifactorial omics-based biomarker models havebeen translated successfully into diagnostic tests in recentyears in particular in the field of cancer sub-type stratificationA prominent example is the Oncotype DX test to assess a wom-anrsquos risk of recurrence of early-stage estrogen-receptor-positivebreast cancer and the likelihood of benefitting from chemother-apy after surgery This test measures the activity of 21 genes intumor samples and then determines a recurrence score numberbetween 0 and 100 (higher scores reflect greater risk of recur-rence within 10 years) As opposed to other diagnostic testsusing frozen samples the Oncotype DX assay uses tumor tissuesamples that are chemically preserved and sealed in paraffinwax (see [98 99] for details on the sample collection andanalytics)

The development of Oncotype DX involved typical steps ofan omics biomarker profiling and top-down filtering approachfirst by analyzing the entire transcriptome on high-throughputmicroarray data and using knowledge from the literature andgenomic databases 250 candidate marker genes were selected[98] The relation between the expression of these candidatesand the recurrence of breast cancer was then assessed in datafrom three independent clinical studies on 447 patients The re-sults were used for a final filtering providing a panel of 16 can-cer-related genes and 5 reference genes whose expression

Diagnostic specimen classification | 9

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

levels enabled the calculation of recurrence scores for tumorsamples via a machine learning model This diagnostic ap-proach was successfully validated in multiple clinical studiesand has been included in the treatment guidelines for breastcancer by the National Comprehensive Cancer Center Networkand the American Society of Clinical Oncology

While in the case of the Oncotype DX test the set of requiredmarkers could be narrowed down to a small number of geneswith prior knowledge on their relationship to the disease forother complex and more heterogeneous diseases significantlylarger numbers of molecular predictors may be needed for ac-curate diagnosis In such cases pathway- and network-basedmodeling approaches may facilitate the generation of robustand biologically interpretable models which could afterwardundergo similar diagnostic test development and validationprocedures as the initial model behind the Oncotype DX assayImportantly the success of the Oncotype DX approach is not anisolated case but other commercial diagnostic assays havebeen developed and validated using similar strategies includingMammaPrint [100] Prosigna (PAM50) [101] Mammostrat [102]Tissue of Origin [103] AlloMap [104] Corus CAD [105] and OVA1[106] among others

In summary successful translation of omics-based bio-marker models into clinically accepted commercial diagnostictests has been achieved in multiple cases in the past Given alarge number of complex diseases for which more reliable ear-lier and cheaper diagnostic tests are still needed there is a

significant potential to develop improved approaches usingomics-based biomarker modeling and exploiting prior biologicalknowledge from pathways and molecular networks

Conclusions

For diseases with complex molecular manifestations omics-based biomarker models have the potential to better reflectclinically relevant aspects of their heterogeneity than classicalsingle-molecule diagnostic markers However various statis-tical limitations in high-dimensional data analysis often result-ing in a high risk of overfitting and restricted biologicalinterpretability of standard lsquoblack boxrsquo machine learning mod-els can hamper the progress in developing new clinically usefulbiomarker signatures

Among possible strategies to reduce model complexity andintegrate prior biological knowledge into the model buildingprocedure cellular pathway- and molecular network-based pre-diction methods provide promising new routes toward more re-liable omics-based biomarker signatures A wide variety ofbioinformatics approaches have already been developed in re-cent years to identify disease-associated pathway and networkalterations in omics data and use them to classify new samplesWhile public molecular networks and pathway definitions alsohave specific limitations that need to be addressed by the usereg by assigning confidence weights to molecular interactionspathway- and network-level predictive analyses often lead to

Figure 1 Example illustration of common stages during the development of omics-based diagnostic tests (simplified version of the process presented in a study by the

US Institute of Medicine [97] focusing on the major steps in the pipeline) After the transition from the second to the third phase (highlighted by the lock symbol) the

diagnostic test must be fully defined validated and locked down Many important variants and alternatives to the outlined example process exist as well as different

realizations of generic steps in the process (eg cases in which a test directs patient management may cover different situations depending on whether clinicians are

free to use the test result as they see fit or whether predefined procedures have to be followed subject to contraindications andor subject to the test results) The setup

may also vary depending on whether it is known exactly how patients would have been treated had they been randomized to the opposite arm depending on whether

the test entails a treatment delay and whether the adequate cutoff threshold for the test is uncertain

10 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

classification models with improved interpretability and robust-ness in cross-study prediction tasks These new systems-levelapproaches will therefore provide a valuable addition to theexisting toolset for omics-based biomarker modeling

Key Points

bull Omics-based machine learning models for diagnosticspecimen classification often lack biological interpret-ability and have limited cross-study robustness andaccuracy Integrating prior knowledge from cellularpathway and molecular network into these modelscan help to address these shortcomings

bull Public pathway and molecular interaction databasescover only a subset of the biochemical interactionsoccurring in living cells typically lack annotations ontissue-specific reactions and may also contain errorsWhen applying pathway or network analysis methodsthe user should be aware of these limitations and thedifferent options to address or alleviate them eg byusing confidence weighted interactions and estimatingtheir tissue specificity using public gene or protein ex-pression data

bull Pathway- and network-based machine learning meth-ods for diagnostic specimen classification are charac-terized by different strengths and weaknesses Whilepathway-based approaches have the benefit of provid-ing models with high biological interpretability thattake advantage of human expert knowledge on dis-ease-relevant cellular processes their coverage is lim-ited to biomolecules with known annotations On thecontrary network-based models are typically more dif-ficult to interpret but can exploit information on amuch wider range of biomolecules

Funding

This work was supported by the Fonds Nationale de laRecherche Luxembourg (grant no C13BM5782168)

References1 Dubois B Feldman HH Jacova C et al Advancing research

diagnostic criteria for Alzheimerrsquos disease the IWG-2 crite-ria Lancet Neurol 201413614ndash29

2 Chahine LM Stern MB Diagnostic markers for Parkinsonrsquosdisease Curr Opin Neurol 201124309ndash17

3 Tanaka T Tanaka M Tanaka T et al Biomarkers for colorec-tal cancer Int J Mol Sci 2010113209ndash25

4 Collins LG Haines C Perkel R et al Lung cancer diagnosisand management Am Fam Physician 20077556ndash63

5 Martin P Laupland KB Frost EH et al Laboratory diagnosisof Ebola virus disease Intensive Care Med 201541895ndash8

6 Mistry K Cable G Meta-analysis of prostate-specific antigenand digital rectal examination as screening tests for pros-tate carcinoma J Am Board Fam Pract 20031695ndash101

7 Bender R Lange S Adjusting for multiple testing - when andhow J Clin Epidemiol 200154343ndash9

8 Yu L Liu H Redundancy based feature selection for micro-array data In Proceedings of the 10th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining New York NY ACM 2004 737ndash42

9 Ding C Peng H Minimum redundancy feature selectionfrom microarray gene expression data J Bioinform ComputBiol 20053185ndash205

10 Bellman R Bellman RE Bellman RE et al Adaptive control proc-esses a guided tour Princeton Princeton University Press 1961

11 Verleysen M Francois D The curse of dimensionality indata mining Analysis 20053512758ndash70

12 Kim S Kon M DeLisi C Pathway-based classification of can-cer subtypes Biol Direct 2012721

13 Ideker T Sharan R Protein networks in disease Genome Res200818644ndash52

14 Ogata H Goto S Sato K et al KEGG Kyoto encyclopedia ofgenes and genomes Nucleic Acids Res 19992729ndash34

15 Nishimura D BioCarta Biotech Softw Internet Rep 20012117ndash20

16 Pico AR Kelder T Van Iersel MP et al WikiPathways path-way editing for the people PLoS Biol 200861403ndash7

17 Joshi-Tope G Gillespie M Vastrik I et al Reactome a knowl-edgebase of biological pathways Nucleic Acids Res 200533D428ndash32

18 Schaefer CF Anthony K Krupa S et al PID thepathway interaction database Nucleic Acids Res 200937D674ndash9

19 Subramanian A Tamayo P Mootha VK et al Gene setenrichment analysis a knowledge-based approach for inter-preting genome-wide expression profiles Proc Natl Acad SciUSA 200510215545ndash50

20 Kim S-Y Volsky DJ PAGE parametric analysis of gene setenrichment BMC Bioinformatics 20056144

21 Luo W Friedman MS Shedden K et al GAGE generally ap-plicable gene set enrichment for pathway analysis BMCBioinformatics 200910161

22 Oron AP Jiang Z Gentleman R Gene set enrichment ana-lysis using linear models and diagnostics Bioinformatics2008242586ndash91

23 Glaab E Baudot A Krasnogor N et al EnrichNet network-based gene set enrichment analysis Bioinformatics 201228i451ndash7

24 Shannon P Markiel A Ozier O et al Cytoscape a softwareenvironment for integrated models of biomolecular inter-action networks Genome Res 2003132498ndash504

25 Van Iersel MP Kelder T Pico AR et al Presenting and explor-ing biological pathways with PathVisio BMC Bioinformatics20089399

26 Ilyin SE Belkowski SM Plata-Salaman CR Biomarker dis-covery and validation Technologies and integrativeapproaches Trends Biotechnol 200422411ndash16

27 Guo Z Zhang T Li X et al Towards precise classification ofcancers based on robust gene functional expression profilesBMC Bioinformatics 2005658

28 Tomfohr J Lu J Kepler TB Pathway level analysis of geneexpression using singular value decomposition BMCBioinformatics 20056225

29 Lee E Chuang H-Y Kim J-W et al Inferring pathway activitytoward precise disease classification PLoS Comput Biol 20084e1000217

30 Su J Yoon BJ Dougherty ER Accurate and reliable cancerclassification based on probabilistic inference of pathwayactivity PLoS One 20094e8161

31 Glaab E Schneider R PathVar analysis of gene and proteinexpression variance in cellular pathways using microarraydata Bioinformatics 201228446ndash7

32 Svensson JP Stalpers LJA Esveldt-van Lange REE et alAnalysis of gene expression using gene sets discriminates

Diagnostic specimen classification | 11

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer patients with and without late radiation toxicityPLoS Med 200631904ndash14

33 Efroni S Schaefer CF Buetow KH Identification of key proc-esses underlying cancer phenotypes using biologic pathwayanalysis PLoS One 20072e425

34 Vaske CJ Benz SC Sanborn JZ et al Inference of patient-specific pathway activities from multi-dimensional cancer gen-omics data using PARADIGM Bioinformatics 201026i237ndash45

35 Breslin T Krogh M Peterson C et al Signal transductionpathway profiling of individual tumor samples BMCBioinformatics 20056163

36 Ashburner M Ball CA Blake JA et al Gene ontology tool forthe unification of biology The Gene Ontology ConsortiumNat Genet 20002525ndash9

37 Glaab E Garibaldi JM Krasnogor N Learning pathway-baseddecision rules to classify microarray cancer samplesGerman Conference on Bioinformatics 2010173123ndash34

38 Glaab E Bacardit J Garibaldi JM et al Using rule-based ma-chine learning for candidate disease gene prioritization andsample classification of cancer gene expression data PLoSOne 20127e39932

39 Tuck DP Kluger HM Kluger Y Characterizing disease statesfrom topological properties of transcriptional regulatorynetworks BMC Bioinformatics 20067236

40 Ma X Lee H Wang L et al CGI a new approach for prioritiz-ing genes by combining gene expression and protein-protein interaction data Bioinformatics 200723215ndash21

41 Chuang H-Y Lee E Liu Y-T et al Network-based classifica-tion of breast cancer metastasis Mol Syst Biol 20073140

42 Taylor IW Linding R Warde-Farley D et al Dynamic modu-larity in protein interaction networks predicts breast canceroutcome Nat Biotechnol 200927199ndash204

43 Petrochilos D Shojaie A Gennari J et al Using random walksto identify cancer-associated modules in expression dataBioData Min 2013617

44 Rapaport F Zinovyev A Dutreix M et al Classification ofmicroarray data using gene networks BMC Bioinformatics2007835

45 Li C Li H Network-constrained regularization and variableselection for analysis of genomic data Bioinformatics 2008241175ndash82

46 Yang S Yuan L Lai Y-C et al Feature grouping and selectionover an undirected graph KDD 2012922ndash30

47 Lorbert A Eis D Kostina V et al Exploiting covariate similar-ity in sparse regression via the pairwise elastic net Int ConfArtif Intell Stat 20109477ndash84

48 Lorbert A Ramadge PJ The pairwise elastic net support vec-tor machine for automatic fMRI feature selection In ProcIEEE Int Conf Acoust Speech Signal Process Vancouver BC USA20131036ndash40 IEEE Press Washington DC

49 Vlassis N Glaab E GenePEN analysis of network activity al-terations in complex diseases via the pairwise elastic netStat Appl Genet Mol Biol 201514221ndash4

50 Wingender E Chen X Hehl R et al TRANSFAC an integratedsystem for gene expression regulation Nucleic Acids Res200028316ndash19

51 Newman MEJ The structure and function of complex net-works SIAM Rev 200345167ndash56

52 Geman S Geman D Stochastic relaxation Gibbs distribu-tions and the Bayesian restoration of images IEEE TransPattern Anal Mach Intell 19846721ndash41

53 Binder H Schumacher M Comment on ldquoNetwork-constrained regularization and variable selection for ana-lysis of genomic datardquo Bioinformatics 2008242566ndash8

54 Sprinzak E Sattath S Margalit H How reliable are experi-mental protein-protein interaction data J Mol Biol 2003327919ndash23

55 Saxena S Caroni P Selective neuronal vulnerability inneurodegenerative diseases from stressor thresholds todegeneration Neuron 20117135ndash48

56 Suthram S Shlomi T Ruppin E et al A direct comparison ofprotein interaction confidence assignment schemes BMCBioinformatics 20067360

57 Kamburov A Grossmann A Herwig R et al Cluster-basedassessment of protein-protein interaction confidence BMCBioinformatics 201213262

58 Li D Liu W Liu Z et al PRINCESS a protein interaction confi-dence evaluation system with multiple data sources MolCell Proteomics 200871043ndash52

59 Bader JS Chaudhuri A Rothberg JM et al Gaining confidencein high-throughput protein interaction networks NatBiotechnol 20042278ndash85

60 Von Mering C Huynen M Jaeggi D et al STRING a databaseof predicted functional associations between proteinsNucleic Acids Res 200331258ndash61

61 Franceschini A Szklarczyk D Frankild S et al STRING v91protein-protein interaction networks with increased cover-age and integration Nucleic Acids Res 201341D808ndash15

62 Sales G Calura E Cavalieri D et al Graphite - a Bioconductorpackage to convert pathway topology to gene network BMCBioinformatics 20121320

63 Su AI Wiltshire T Batalov S et al A gene atlas of the mouseand human protein-encoding transcriptomes Proc Natl AcadSci USA 20041016062ndash7

64 Ponten FK Schwenk JM Asplund A et al The human proteinatlas as a proteomic resource for biomarker discovery JIntern Med 2011270428ndash46

65 Bossi A Lehner B Tissue specificity and the human proteininteraction network Mol Syst Biol 20095260

66 Magger O Waldman YY Ruppin E et al Enhancing the pri-oritization of disease-causing genes through tissue specificprotein interaction networks PLoS Comput Biol 20128e1002690

67 Shipley B Cause and Correlation in Biology A Userrsquos Guide toPath Analysis Structural Equations and Causal Inference Vol 2Cambridge Cambridge University Press 2002 p 332

68 Society TE Granger CWJ Investigating causal relations byeconometric models and cross-spectral methodsEconometrica 196937424ndash38

69 Chindelevitch L Ziemek D Enayetallah A et al Causal rea-soning on biological networks interpreting transcriptionalchanges Bioinformatics 2012281114ndash21

70 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 2012652

71 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 20144988

72 Hudson NJ Reverter A Dalrymple BP A differential wiringanalysis of expression data correctly identifies the gene con-taining the causal mutation PLoS Comput Biol 20095e1000382

73 Reverter A Hudson NJ Nagaraj SH et al Regulatory impactfactors unraveling the transcriptional regulation ofcomplex traits from expression data Bioinformatics 201026896ndash904

74 Watson M CoXpress differential co-expression in gene ex-pression data BMC Bioinformatics 20067509

12 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1

of their branches or stems may be affected in a pathologicalcondition Approaches focusing on changes in the overall path-way activity may therefore fail in detecting significant alter-ations affecting only one pathway branch Although networkanalysis methods identifying altered subnetworks may be help-ful in this situation (using a pathway graph as input instead of agenome-scale molecular network) ideally available knowledgeon distinct pathway branches or outstanding regions in a com-plex pathway architecture should be exploited directly eg bypartitioning a pathway into corresponding lsquosubpathwaysrsquo andseparately analyzing their molecular alterations using existingpathway-level bioinformatics tools

Another challenging and still largely unsolved problem inthe analysis of alterations for pathways networks and singlebiomolecules is the distinction between omics changes repre-senting causes or effects (or lsquodriversrsquo or lsquopassengersrsquo) of a dis-ease Apart from philosophical disagreements on howcausation should be defined [67] the discrimination betweencorrelated events and causally linked events on high-through-put omics data sets is typically severely hampered by largenumbers of potential confounding variables and complexnonlinear relationships between the attributes Thus conclu-sive evidence for causal links can usually only be obtained fromin vitro and in vivo perturbation experiments rather than fromhigh-throughput data analysis even when using high-qualitytime-series data with large sample sizes and dedicated statisticsfor identifying causendasheffect relationships (eg the Granger caus-ality test [68]) Prior biological knowledge eg on transcriptionfactorndashtarget relationships can however enable the construc-tion of causal graphs to check whether omics measurementsare consistent with certain causal molecular hypotheses or toprioritize corresponding hypotheses for experimental validation[69] Bill Shipley has written a dedicated book on methods forcausal analysis of observational data in biology [67] Theseapproaches are not only relevant for identifying drug targetsand investigating disease etiology and progression but mayalso provide useful information for the biological interpretationof diagnostic biomarker models

Improving incomplete pathways and networks by identify-ing missing molecular interactions is arguably one of the mostchallenging problems in this domain Pathway definitions aretypically created in a subjective manner because no universalpredefined criteria for determining pathway membership andboundaries are available and the curators of the correspondingdatabases are required to make judgments on the relevance ofpotential pathway members These subjective definitions maybe prejudiced and erroneous and even networks assembledautomatically from public databases may contain biases eg re-sulting from prior subjective decisions on which putative PPIsare of sufficient interest for experimental validation Moreoverin diseases the pathway topology may change and genericpathway databases have recently been complemented by dedi-cated pathway maps for specific human disorders egAlzPathway [70] for Alzheimerrsquos disease and PDMap [71] forParkinsonrsquos disease Because these disease-related pathwaymaps typically cover the biomolecules involved in a specific dis-order and the potentially altered network topology more com-prehensively and accurately than generic pathway mapsdisease-specific pathways should be considered as the pre-ferred resource for pathway-based diagnostic biomarker model-ing (in the specific case of a known disease-associated topologyalteration ideally a corresponding pathway map with un-altered topology should be used additionally as a reference inthe modeling process)

Moreover dedicated bioinformatics techniques have beendeveloped to identify lsquonetwork rewiringrsquo events eg brokenchains or new interactions changing the pathwaynetwork top-ology which remain undetected by conventional pathway ana-lyses Instead of assuming that molecules close to each other ina network undergo similar changes in a disease theseapproaches identify alterations in the correlations between theactivities of these molecules as signatures for potential patho-logical changes in network topology [72ndash74] However the top-scoring results require experimental verification eg using invitro disease models and the success may strongly depend onwhether reliable prior pathway knowledge is available

To support relevant manual pathway curation efforts com-putational approaches for ab initio discovery of pathways withingenetic or molecular networks have been proposed [75ndash78] aswell as bioinformatics methods to extend existing pathway def-initions according to objective criteria for pathway compactnessand connectedness [79] Moreover various machine learningapproaches for predicting unknown PPIs using public structuraland functional information have been developed [80ndash84] egthe authors of the PrePPI method report validation results sug-gesting their prediction algorithm is comparable in accuracy tohigh-throughput experiments [85] For proteins with unknownstructure and few functional annotations these in silico predic-tion approaches are however still bound to provide limited reli-ability and only high-confidence biochemical interactionsshould be included in networks for omics data analysis

Finally to reduce the influence of false-positive predictionson network analyses molecular interactions should not only beprefiltered by confidence when assembling the network butalso weighted by confidence to apply analysis techniquesexploiting these weights to score the reliability of identified net-work alterations In subsequent biological validation experi-ments investigators can then focus on verifying only the mostreliable network alteration patterns derived from the in silicoanalyses

Translating multifactorial marker models intodiagnostic testsPathways from omics research discoveries to clinicaldiagnostic assays

High-throughput omics measurement techniques are typicallynot designed for diagnostic applications but for broad systems-level analyses hypothesis generation and the construction offirst tentative machine learning models for sample classifica-tion Such tentative models require subsequent refinement andvalidation using more sensitive and reproducible measurementtechniques to evaluate their potential for diagnostic applica-tions For example a sample classification model built andcross-validated using microarray gene expression data with anembedded feature selection to choose only the most inform-ative genes as predictors can be validated using more accuratequantitative reverse transcription polymerase chain reaction(qRT-PCR) measurements for the subset of chosen genes

Importantly to avoid wrong conclusions in the evaluation ofdiagnostic classification models adequate statistical methodshave to be chosen to assess a modelrsquos overall predictive per-formance (quantifying how close predictions are to the actualoutcome) its calibrationreliability (measuring how close to x of100 individuals with a risk prediction of x have the outcome)and its discriminative ability (determining whether individualswith the outcome have higher risk predictions than those

8 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

without) [86] Owing to the inherent uncertainty associated withdiagnostic tasks predictions should be provided in a probabilis-tic rather than deterministic form [87] and the overall per-formance should be quantified using so-called lsquoproper scoringrulesrsquo for which the expected score is optimized when the pre-dictive distribution agrees with the true distribution of thequantity to be estimated (a corresponding example is the Brierscore for binary and survival outcomes [88]) Conversely the op-timization of models with respect to conventional discontinu-ous non-error rates like the percentage correct classificationcan provide misleading results eg when the predicted proba-bilities are close to the chosen decision threshold required forthese measures [89] To assess the calibration of a model theHosmerndashLemeshow lsquogoodness-of-fitrsquo test can be used [90] andthe concordance statistic to quantify the discrimination ability[91] If a reference prediction system is already available dedi-cated measures of the relative improvement achieved with anew prediction method should additionally be computed(referred to as the lsquoskillrsquo eg quantified via the Brier Skill Score[92]) Moreover decision-analytic approaches like decision curveanalysis [93] designed to assess the net benefit achieved makingdecisions according to model predictions should be applied if themodel is to be used to direct clinical patient management [86]

For the study design initial power calculations are requiredto ensure that sufficient sample sizes are available for all statis-tical assessments [94] This also involves choosing an adequatesplitting of the measured data into training test and validationsets and selecting suitable cross-validation or resampling tech-niques for model optimization and evaluation (eg using two-level external cross-validation [95]) [96]

Importantly clinical validation does not only necessitatesignificantly larger sample sizes than most research studiesbut also independent replication tests on data from other pa-tient cohorts the clear specification of the biological rationalebehind the method and a demonstration of its clinical utility Incontrast to the regulatory framework for drugs there are mul-tiple pathways for the translation of omics-based tests intovalidated in vitro diagnostic test devices These tests can bedeveloped and validated either through review by the Food andDrug Administration (FDA) or through validation and perform-ance by a specific laboratory certified according to ClinicalLaboratory Improvement Amendments (CLIA) [97]

Because using established medical product developmentpipelines as in pharmaceutical companies is not common prac-tice in academia for many biomedical research institutions anearly collaboration with an experienced industrial partner isoften advisable Although currently no unique and widely rec-ognized standard process for translating omics researchfindings into clinical diagnostics is available common recom-mendations by widely recognized health organizations can befollowed In particular a committee by the US Institute ofMedicine has conducted a study on omics-based clinical test de-velopment and proposed a generic process for the developmentand evaluation of these tests as a recommended guideline [97]A corresponding example process which is briefly outlined forillustration purposes in Figure 1 and not meant to cover all im-portant variations starts with the discovery phase in which acandidate biomarker model is built on a training set lockeddown and evaluated on a test data set (this set of samplesshould be completely independent from the training set) In thefollowing test validation phase after institutional review boardapproval and consultation with the FDA a CLIA-certified labora-tory defines and optimizes the diagnostic test method clinicallyand biologically validates the test on a blinded sample set and

implements the test according to current clinical laboratorystandards

Interestingly the authors of the guideline highlight that afrequent shortcoming of omics-based tests is the lack of a biolo-gical rationale behind the testmdashwhile single-molecule markersare often known to play a role in the disease multifactorialomics models obtained from machine learning are often moredifficult to interpret and involve a greater risk of overfittingNew pathway- and network-based modeling techniques as dis-cussed in this review could therefore help to address some ofthese shortcomings and provide more interpretable and robustmodels as opposed to classical lsquoblack boxrsquo machine learningmodels

In the following stage of the clinical development processthe locked-down test is evaluated for clinical utility via one ofthe following approaches (i) A prospectivendashretrospective studyusing archived specimens from previous clinical trials (ii) a pro-spective clinical trial in which the test (a) directs patient man-agement or (b) does not direct patient management [97] Thecomplexity and duration of a corresponding clinical study ortrial will largely depend on the specific type of biomarker de-veloped and the proposed clinical benefit For diagnostic bio-markers focused on in this review procedures may varysignificantly depending on whether the test is designed to de-tect the presence severity or the subtype of a diseasePrognostic biomarkers which indicate the future clinical courseof a patient with regard to a specific outcome and predictivebiomarkers which predict the responders and the extent of sus-ceptibility to a particular drug effect will also require differentdevelopment and evaluation procedures than diagnosticmarkers Finally for each type of biomarker different clinicalbenefits may be envisaged and significantly influence the de-sign of a study eg the goal to choose more appropriate treat-ment options or the objective to diagnose the disease earlier toenable more effective therapies to prevent halt or slow downits progression

Previous success stories in omics-based development ofdiagnostic assays

A variety of multifactorial omics-based biomarker models havebeen translated successfully into diagnostic tests in recentyears in particular in the field of cancer sub-type stratificationA prominent example is the Oncotype DX test to assess a wom-anrsquos risk of recurrence of early-stage estrogen-receptor-positivebreast cancer and the likelihood of benefitting from chemother-apy after surgery This test measures the activity of 21 genes intumor samples and then determines a recurrence score numberbetween 0 and 100 (higher scores reflect greater risk of recur-rence within 10 years) As opposed to other diagnostic testsusing frozen samples the Oncotype DX assay uses tumor tissuesamples that are chemically preserved and sealed in paraffinwax (see [98 99] for details on the sample collection andanalytics)

The development of Oncotype DX involved typical steps ofan omics biomarker profiling and top-down filtering approachfirst by analyzing the entire transcriptome on high-throughputmicroarray data and using knowledge from the literature andgenomic databases 250 candidate marker genes were selected[98] The relation between the expression of these candidatesand the recurrence of breast cancer was then assessed in datafrom three independent clinical studies on 447 patients The re-sults were used for a final filtering providing a panel of 16 can-cer-related genes and 5 reference genes whose expression

Diagnostic specimen classification | 9

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

levels enabled the calculation of recurrence scores for tumorsamples via a machine learning model This diagnostic ap-proach was successfully validated in multiple clinical studiesand has been included in the treatment guidelines for breastcancer by the National Comprehensive Cancer Center Networkand the American Society of Clinical Oncology

While in the case of the Oncotype DX test the set of requiredmarkers could be narrowed down to a small number of geneswith prior knowledge on their relationship to the disease forother complex and more heterogeneous diseases significantlylarger numbers of molecular predictors may be needed for ac-curate diagnosis In such cases pathway- and network-basedmodeling approaches may facilitate the generation of robustand biologically interpretable models which could afterwardundergo similar diagnostic test development and validationprocedures as the initial model behind the Oncotype DX assayImportantly the success of the Oncotype DX approach is not anisolated case but other commercial diagnostic assays havebeen developed and validated using similar strategies includingMammaPrint [100] Prosigna (PAM50) [101] Mammostrat [102]Tissue of Origin [103] AlloMap [104] Corus CAD [105] and OVA1[106] among others

In summary successful translation of omics-based bio-marker models into clinically accepted commercial diagnostictests has been achieved in multiple cases in the past Given alarge number of complex diseases for which more reliable ear-lier and cheaper diagnostic tests are still needed there is a

significant potential to develop improved approaches usingomics-based biomarker modeling and exploiting prior biologicalknowledge from pathways and molecular networks

Conclusions

For diseases with complex molecular manifestations omics-based biomarker models have the potential to better reflectclinically relevant aspects of their heterogeneity than classicalsingle-molecule diagnostic markers However various statis-tical limitations in high-dimensional data analysis often result-ing in a high risk of overfitting and restricted biologicalinterpretability of standard lsquoblack boxrsquo machine learning mod-els can hamper the progress in developing new clinically usefulbiomarker signatures

Among possible strategies to reduce model complexity andintegrate prior biological knowledge into the model buildingprocedure cellular pathway- and molecular network-based pre-diction methods provide promising new routes toward more re-liable omics-based biomarker signatures A wide variety ofbioinformatics approaches have already been developed in re-cent years to identify disease-associated pathway and networkalterations in omics data and use them to classify new samplesWhile public molecular networks and pathway definitions alsohave specific limitations that need to be addressed by the usereg by assigning confidence weights to molecular interactionspathway- and network-level predictive analyses often lead to

Figure 1 Example illustration of common stages during the development of omics-based diagnostic tests (simplified version of the process presented in a study by the

US Institute of Medicine [97] focusing on the major steps in the pipeline) After the transition from the second to the third phase (highlighted by the lock symbol) the

diagnostic test must be fully defined validated and locked down Many important variants and alternatives to the outlined example process exist as well as different

realizations of generic steps in the process (eg cases in which a test directs patient management may cover different situations depending on whether clinicians are

free to use the test result as they see fit or whether predefined procedures have to be followed subject to contraindications andor subject to the test results) The setup

may also vary depending on whether it is known exactly how patients would have been treated had they been randomized to the opposite arm depending on whether

the test entails a treatment delay and whether the adequate cutoff threshold for the test is uncertain

10 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

classification models with improved interpretability and robust-ness in cross-study prediction tasks These new systems-levelapproaches will therefore provide a valuable addition to theexisting toolset for omics-based biomarker modeling

Key Points

bull Omics-based machine learning models for diagnosticspecimen classification often lack biological interpret-ability and have limited cross-study robustness andaccuracy Integrating prior knowledge from cellularpathway and molecular network into these modelscan help to address these shortcomings

bull Public pathway and molecular interaction databasescover only a subset of the biochemical interactionsoccurring in living cells typically lack annotations ontissue-specific reactions and may also contain errorsWhen applying pathway or network analysis methodsthe user should be aware of these limitations and thedifferent options to address or alleviate them eg byusing confidence weighted interactions and estimatingtheir tissue specificity using public gene or protein ex-pression data

bull Pathway- and network-based machine learning meth-ods for diagnostic specimen classification are charac-terized by different strengths and weaknesses Whilepathway-based approaches have the benefit of provid-ing models with high biological interpretability thattake advantage of human expert knowledge on dis-ease-relevant cellular processes their coverage is lim-ited to biomolecules with known annotations On thecontrary network-based models are typically more dif-ficult to interpret but can exploit information on amuch wider range of biomolecules

Funding

This work was supported by the Fonds Nationale de laRecherche Luxembourg (grant no C13BM5782168)

References1 Dubois B Feldman HH Jacova C et al Advancing research

diagnostic criteria for Alzheimerrsquos disease the IWG-2 crite-ria Lancet Neurol 201413614ndash29

2 Chahine LM Stern MB Diagnostic markers for Parkinsonrsquosdisease Curr Opin Neurol 201124309ndash17

3 Tanaka T Tanaka M Tanaka T et al Biomarkers for colorec-tal cancer Int J Mol Sci 2010113209ndash25

4 Collins LG Haines C Perkel R et al Lung cancer diagnosisand management Am Fam Physician 20077556ndash63

5 Martin P Laupland KB Frost EH et al Laboratory diagnosisof Ebola virus disease Intensive Care Med 201541895ndash8

6 Mistry K Cable G Meta-analysis of prostate-specific antigenand digital rectal examination as screening tests for pros-tate carcinoma J Am Board Fam Pract 20031695ndash101

7 Bender R Lange S Adjusting for multiple testing - when andhow J Clin Epidemiol 200154343ndash9

8 Yu L Liu H Redundancy based feature selection for micro-array data In Proceedings of the 10th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining New York NY ACM 2004 737ndash42

9 Ding C Peng H Minimum redundancy feature selectionfrom microarray gene expression data J Bioinform ComputBiol 20053185ndash205

10 Bellman R Bellman RE Bellman RE et al Adaptive control proc-esses a guided tour Princeton Princeton University Press 1961

11 Verleysen M Francois D The curse of dimensionality indata mining Analysis 20053512758ndash70

12 Kim S Kon M DeLisi C Pathway-based classification of can-cer subtypes Biol Direct 2012721

13 Ideker T Sharan R Protein networks in disease Genome Res200818644ndash52

14 Ogata H Goto S Sato K et al KEGG Kyoto encyclopedia ofgenes and genomes Nucleic Acids Res 19992729ndash34

15 Nishimura D BioCarta Biotech Softw Internet Rep 20012117ndash20

16 Pico AR Kelder T Van Iersel MP et al WikiPathways path-way editing for the people PLoS Biol 200861403ndash7

17 Joshi-Tope G Gillespie M Vastrik I et al Reactome a knowl-edgebase of biological pathways Nucleic Acids Res 200533D428ndash32

18 Schaefer CF Anthony K Krupa S et al PID thepathway interaction database Nucleic Acids Res 200937D674ndash9

19 Subramanian A Tamayo P Mootha VK et al Gene setenrichment analysis a knowledge-based approach for inter-preting genome-wide expression profiles Proc Natl Acad SciUSA 200510215545ndash50

20 Kim S-Y Volsky DJ PAGE parametric analysis of gene setenrichment BMC Bioinformatics 20056144

21 Luo W Friedman MS Shedden K et al GAGE generally ap-plicable gene set enrichment for pathway analysis BMCBioinformatics 200910161

22 Oron AP Jiang Z Gentleman R Gene set enrichment ana-lysis using linear models and diagnostics Bioinformatics2008242586ndash91

23 Glaab E Baudot A Krasnogor N et al EnrichNet network-based gene set enrichment analysis Bioinformatics 201228i451ndash7

24 Shannon P Markiel A Ozier O et al Cytoscape a softwareenvironment for integrated models of biomolecular inter-action networks Genome Res 2003132498ndash504

25 Van Iersel MP Kelder T Pico AR et al Presenting and explor-ing biological pathways with PathVisio BMC Bioinformatics20089399

26 Ilyin SE Belkowski SM Plata-Salaman CR Biomarker dis-covery and validation Technologies and integrativeapproaches Trends Biotechnol 200422411ndash16

27 Guo Z Zhang T Li X et al Towards precise classification ofcancers based on robust gene functional expression profilesBMC Bioinformatics 2005658

28 Tomfohr J Lu J Kepler TB Pathway level analysis of geneexpression using singular value decomposition BMCBioinformatics 20056225

29 Lee E Chuang H-Y Kim J-W et al Inferring pathway activitytoward precise disease classification PLoS Comput Biol 20084e1000217

30 Su J Yoon BJ Dougherty ER Accurate and reliable cancerclassification based on probabilistic inference of pathwayactivity PLoS One 20094e8161

31 Glaab E Schneider R PathVar analysis of gene and proteinexpression variance in cellular pathways using microarraydata Bioinformatics 201228446ndash7

32 Svensson JP Stalpers LJA Esveldt-van Lange REE et alAnalysis of gene expression using gene sets discriminates

Diagnostic specimen classification | 11

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer patients with and without late radiation toxicityPLoS Med 200631904ndash14

33 Efroni S Schaefer CF Buetow KH Identification of key proc-esses underlying cancer phenotypes using biologic pathwayanalysis PLoS One 20072e425

34 Vaske CJ Benz SC Sanborn JZ et al Inference of patient-specific pathway activities from multi-dimensional cancer gen-omics data using PARADIGM Bioinformatics 201026i237ndash45

35 Breslin T Krogh M Peterson C et al Signal transductionpathway profiling of individual tumor samples BMCBioinformatics 20056163

36 Ashburner M Ball CA Blake JA et al Gene ontology tool forthe unification of biology The Gene Ontology ConsortiumNat Genet 20002525ndash9

37 Glaab E Garibaldi JM Krasnogor N Learning pathway-baseddecision rules to classify microarray cancer samplesGerman Conference on Bioinformatics 2010173123ndash34

38 Glaab E Bacardit J Garibaldi JM et al Using rule-based ma-chine learning for candidate disease gene prioritization andsample classification of cancer gene expression data PLoSOne 20127e39932

39 Tuck DP Kluger HM Kluger Y Characterizing disease statesfrom topological properties of transcriptional regulatorynetworks BMC Bioinformatics 20067236

40 Ma X Lee H Wang L et al CGI a new approach for prioritiz-ing genes by combining gene expression and protein-protein interaction data Bioinformatics 200723215ndash21

41 Chuang H-Y Lee E Liu Y-T et al Network-based classifica-tion of breast cancer metastasis Mol Syst Biol 20073140

42 Taylor IW Linding R Warde-Farley D et al Dynamic modu-larity in protein interaction networks predicts breast canceroutcome Nat Biotechnol 200927199ndash204

43 Petrochilos D Shojaie A Gennari J et al Using random walksto identify cancer-associated modules in expression dataBioData Min 2013617

44 Rapaport F Zinovyev A Dutreix M et al Classification ofmicroarray data using gene networks BMC Bioinformatics2007835

45 Li C Li H Network-constrained regularization and variableselection for analysis of genomic data Bioinformatics 2008241175ndash82

46 Yang S Yuan L Lai Y-C et al Feature grouping and selectionover an undirected graph KDD 2012922ndash30

47 Lorbert A Eis D Kostina V et al Exploiting covariate similar-ity in sparse regression via the pairwise elastic net Int ConfArtif Intell Stat 20109477ndash84

48 Lorbert A Ramadge PJ The pairwise elastic net support vec-tor machine for automatic fMRI feature selection In ProcIEEE Int Conf Acoust Speech Signal Process Vancouver BC USA20131036ndash40 IEEE Press Washington DC

49 Vlassis N Glaab E GenePEN analysis of network activity al-terations in complex diseases via the pairwise elastic netStat Appl Genet Mol Biol 201514221ndash4

50 Wingender E Chen X Hehl R et al TRANSFAC an integratedsystem for gene expression regulation Nucleic Acids Res200028316ndash19

51 Newman MEJ The structure and function of complex net-works SIAM Rev 200345167ndash56

52 Geman S Geman D Stochastic relaxation Gibbs distribu-tions and the Bayesian restoration of images IEEE TransPattern Anal Mach Intell 19846721ndash41

53 Binder H Schumacher M Comment on ldquoNetwork-constrained regularization and variable selection for ana-lysis of genomic datardquo Bioinformatics 2008242566ndash8

54 Sprinzak E Sattath S Margalit H How reliable are experi-mental protein-protein interaction data J Mol Biol 2003327919ndash23

55 Saxena S Caroni P Selective neuronal vulnerability inneurodegenerative diseases from stressor thresholds todegeneration Neuron 20117135ndash48

56 Suthram S Shlomi T Ruppin E et al A direct comparison ofprotein interaction confidence assignment schemes BMCBioinformatics 20067360

57 Kamburov A Grossmann A Herwig R et al Cluster-basedassessment of protein-protein interaction confidence BMCBioinformatics 201213262

58 Li D Liu W Liu Z et al PRINCESS a protein interaction confi-dence evaluation system with multiple data sources MolCell Proteomics 200871043ndash52

59 Bader JS Chaudhuri A Rothberg JM et al Gaining confidencein high-throughput protein interaction networks NatBiotechnol 20042278ndash85

60 Von Mering C Huynen M Jaeggi D et al STRING a databaseof predicted functional associations between proteinsNucleic Acids Res 200331258ndash61

61 Franceschini A Szklarczyk D Frankild S et al STRING v91protein-protein interaction networks with increased cover-age and integration Nucleic Acids Res 201341D808ndash15

62 Sales G Calura E Cavalieri D et al Graphite - a Bioconductorpackage to convert pathway topology to gene network BMCBioinformatics 20121320

63 Su AI Wiltshire T Batalov S et al A gene atlas of the mouseand human protein-encoding transcriptomes Proc Natl AcadSci USA 20041016062ndash7

64 Ponten FK Schwenk JM Asplund A et al The human proteinatlas as a proteomic resource for biomarker discovery JIntern Med 2011270428ndash46

65 Bossi A Lehner B Tissue specificity and the human proteininteraction network Mol Syst Biol 20095260

66 Magger O Waldman YY Ruppin E et al Enhancing the pri-oritization of disease-causing genes through tissue specificprotein interaction networks PLoS Comput Biol 20128e1002690

67 Shipley B Cause and Correlation in Biology A Userrsquos Guide toPath Analysis Structural Equations and Causal Inference Vol 2Cambridge Cambridge University Press 2002 p 332

68 Society TE Granger CWJ Investigating causal relations byeconometric models and cross-spectral methodsEconometrica 196937424ndash38

69 Chindelevitch L Ziemek D Enayetallah A et al Causal rea-soning on biological networks interpreting transcriptionalchanges Bioinformatics 2012281114ndash21

70 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 2012652

71 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 20144988

72 Hudson NJ Reverter A Dalrymple BP A differential wiringanalysis of expression data correctly identifies the gene con-taining the causal mutation PLoS Comput Biol 20095e1000382

73 Reverter A Hudson NJ Nagaraj SH et al Regulatory impactfactors unraveling the transcriptional regulation ofcomplex traits from expression data Bioinformatics 201026896ndash904

74 Watson M CoXpress differential co-expression in gene ex-pression data BMC Bioinformatics 20067509

12 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1

without) [86] Owing to the inherent uncertainty associated withdiagnostic tasks predictions should be provided in a probabilis-tic rather than deterministic form [87] and the overall per-formance should be quantified using so-called lsquoproper scoringrulesrsquo for which the expected score is optimized when the pre-dictive distribution agrees with the true distribution of thequantity to be estimated (a corresponding example is the Brierscore for binary and survival outcomes [88]) Conversely the op-timization of models with respect to conventional discontinu-ous non-error rates like the percentage correct classificationcan provide misleading results eg when the predicted proba-bilities are close to the chosen decision threshold required forthese measures [89] To assess the calibration of a model theHosmerndashLemeshow lsquogoodness-of-fitrsquo test can be used [90] andthe concordance statistic to quantify the discrimination ability[91] If a reference prediction system is already available dedi-cated measures of the relative improvement achieved with anew prediction method should additionally be computed(referred to as the lsquoskillrsquo eg quantified via the Brier Skill Score[92]) Moreover decision-analytic approaches like decision curveanalysis [93] designed to assess the net benefit achieved makingdecisions according to model predictions should be applied if themodel is to be used to direct clinical patient management [86]

For the study design initial power calculations are requiredto ensure that sufficient sample sizes are available for all statis-tical assessments [94] This also involves choosing an adequatesplitting of the measured data into training test and validationsets and selecting suitable cross-validation or resampling tech-niques for model optimization and evaluation (eg using two-level external cross-validation [95]) [96]

Importantly clinical validation does not only necessitatesignificantly larger sample sizes than most research studiesbut also independent replication tests on data from other pa-tient cohorts the clear specification of the biological rationalebehind the method and a demonstration of its clinical utility Incontrast to the regulatory framework for drugs there are mul-tiple pathways for the translation of omics-based tests intovalidated in vitro diagnostic test devices These tests can bedeveloped and validated either through review by the Food andDrug Administration (FDA) or through validation and perform-ance by a specific laboratory certified according to ClinicalLaboratory Improvement Amendments (CLIA) [97]

Because using established medical product developmentpipelines as in pharmaceutical companies is not common prac-tice in academia for many biomedical research institutions anearly collaboration with an experienced industrial partner isoften advisable Although currently no unique and widely rec-ognized standard process for translating omics researchfindings into clinical diagnostics is available common recom-mendations by widely recognized health organizations can befollowed In particular a committee by the US Institute ofMedicine has conducted a study on omics-based clinical test de-velopment and proposed a generic process for the developmentand evaluation of these tests as a recommended guideline [97]A corresponding example process which is briefly outlined forillustration purposes in Figure 1 and not meant to cover all im-portant variations starts with the discovery phase in which acandidate biomarker model is built on a training set lockeddown and evaluated on a test data set (this set of samplesshould be completely independent from the training set) In thefollowing test validation phase after institutional review boardapproval and consultation with the FDA a CLIA-certified labora-tory defines and optimizes the diagnostic test method clinicallyand biologically validates the test on a blinded sample set and

implements the test according to current clinical laboratorystandards

Interestingly the authors of the guideline highlight that afrequent shortcoming of omics-based tests is the lack of a biolo-gical rationale behind the testmdashwhile single-molecule markersare often known to play a role in the disease multifactorialomics models obtained from machine learning are often moredifficult to interpret and involve a greater risk of overfittingNew pathway- and network-based modeling techniques as dis-cussed in this review could therefore help to address some ofthese shortcomings and provide more interpretable and robustmodels as opposed to classical lsquoblack boxrsquo machine learningmodels

In the following stage of the clinical development processthe locked-down test is evaluated for clinical utility via one ofthe following approaches (i) A prospectivendashretrospective studyusing archived specimens from previous clinical trials (ii) a pro-spective clinical trial in which the test (a) directs patient man-agement or (b) does not direct patient management [97] Thecomplexity and duration of a corresponding clinical study ortrial will largely depend on the specific type of biomarker de-veloped and the proposed clinical benefit For diagnostic bio-markers focused on in this review procedures may varysignificantly depending on whether the test is designed to de-tect the presence severity or the subtype of a diseasePrognostic biomarkers which indicate the future clinical courseof a patient with regard to a specific outcome and predictivebiomarkers which predict the responders and the extent of sus-ceptibility to a particular drug effect will also require differentdevelopment and evaluation procedures than diagnosticmarkers Finally for each type of biomarker different clinicalbenefits may be envisaged and significantly influence the de-sign of a study eg the goal to choose more appropriate treat-ment options or the objective to diagnose the disease earlier toenable more effective therapies to prevent halt or slow downits progression

Previous success stories in omics-based development ofdiagnostic assays

A variety of multifactorial omics-based biomarker models havebeen translated successfully into diagnostic tests in recentyears in particular in the field of cancer sub-type stratificationA prominent example is the Oncotype DX test to assess a wom-anrsquos risk of recurrence of early-stage estrogen-receptor-positivebreast cancer and the likelihood of benefitting from chemother-apy after surgery This test measures the activity of 21 genes intumor samples and then determines a recurrence score numberbetween 0 and 100 (higher scores reflect greater risk of recur-rence within 10 years) As opposed to other diagnostic testsusing frozen samples the Oncotype DX assay uses tumor tissuesamples that are chemically preserved and sealed in paraffinwax (see [98 99] for details on the sample collection andanalytics)

The development of Oncotype DX involved typical steps ofan omics biomarker profiling and top-down filtering approachfirst by analyzing the entire transcriptome on high-throughputmicroarray data and using knowledge from the literature andgenomic databases 250 candidate marker genes were selected[98] The relation between the expression of these candidatesand the recurrence of breast cancer was then assessed in datafrom three independent clinical studies on 447 patients The re-sults were used for a final filtering providing a panel of 16 can-cer-related genes and 5 reference genes whose expression

Diagnostic specimen classification | 9

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

levels enabled the calculation of recurrence scores for tumorsamples via a machine learning model This diagnostic ap-proach was successfully validated in multiple clinical studiesand has been included in the treatment guidelines for breastcancer by the National Comprehensive Cancer Center Networkand the American Society of Clinical Oncology

While in the case of the Oncotype DX test the set of requiredmarkers could be narrowed down to a small number of geneswith prior knowledge on their relationship to the disease forother complex and more heterogeneous diseases significantlylarger numbers of molecular predictors may be needed for ac-curate diagnosis In such cases pathway- and network-basedmodeling approaches may facilitate the generation of robustand biologically interpretable models which could afterwardundergo similar diagnostic test development and validationprocedures as the initial model behind the Oncotype DX assayImportantly the success of the Oncotype DX approach is not anisolated case but other commercial diagnostic assays havebeen developed and validated using similar strategies includingMammaPrint [100] Prosigna (PAM50) [101] Mammostrat [102]Tissue of Origin [103] AlloMap [104] Corus CAD [105] and OVA1[106] among others

In summary successful translation of omics-based bio-marker models into clinically accepted commercial diagnostictests has been achieved in multiple cases in the past Given alarge number of complex diseases for which more reliable ear-lier and cheaper diagnostic tests are still needed there is a

significant potential to develop improved approaches usingomics-based biomarker modeling and exploiting prior biologicalknowledge from pathways and molecular networks

Conclusions

For diseases with complex molecular manifestations omics-based biomarker models have the potential to better reflectclinically relevant aspects of their heterogeneity than classicalsingle-molecule diagnostic markers However various statis-tical limitations in high-dimensional data analysis often result-ing in a high risk of overfitting and restricted biologicalinterpretability of standard lsquoblack boxrsquo machine learning mod-els can hamper the progress in developing new clinically usefulbiomarker signatures

Among possible strategies to reduce model complexity andintegrate prior biological knowledge into the model buildingprocedure cellular pathway- and molecular network-based pre-diction methods provide promising new routes toward more re-liable omics-based biomarker signatures A wide variety ofbioinformatics approaches have already been developed in re-cent years to identify disease-associated pathway and networkalterations in omics data and use them to classify new samplesWhile public molecular networks and pathway definitions alsohave specific limitations that need to be addressed by the usereg by assigning confidence weights to molecular interactionspathway- and network-level predictive analyses often lead to

Figure 1 Example illustration of common stages during the development of omics-based diagnostic tests (simplified version of the process presented in a study by the

US Institute of Medicine [97] focusing on the major steps in the pipeline) After the transition from the second to the third phase (highlighted by the lock symbol) the

diagnostic test must be fully defined validated and locked down Many important variants and alternatives to the outlined example process exist as well as different

realizations of generic steps in the process (eg cases in which a test directs patient management may cover different situations depending on whether clinicians are

free to use the test result as they see fit or whether predefined procedures have to be followed subject to contraindications andor subject to the test results) The setup

may also vary depending on whether it is known exactly how patients would have been treated had they been randomized to the opposite arm depending on whether

the test entails a treatment delay and whether the adequate cutoff threshold for the test is uncertain

10 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

classification models with improved interpretability and robust-ness in cross-study prediction tasks These new systems-levelapproaches will therefore provide a valuable addition to theexisting toolset for omics-based biomarker modeling

Key Points

bull Omics-based machine learning models for diagnosticspecimen classification often lack biological interpret-ability and have limited cross-study robustness andaccuracy Integrating prior knowledge from cellularpathway and molecular network into these modelscan help to address these shortcomings

bull Public pathway and molecular interaction databasescover only a subset of the biochemical interactionsoccurring in living cells typically lack annotations ontissue-specific reactions and may also contain errorsWhen applying pathway or network analysis methodsthe user should be aware of these limitations and thedifferent options to address or alleviate them eg byusing confidence weighted interactions and estimatingtheir tissue specificity using public gene or protein ex-pression data

bull Pathway- and network-based machine learning meth-ods for diagnostic specimen classification are charac-terized by different strengths and weaknesses Whilepathway-based approaches have the benefit of provid-ing models with high biological interpretability thattake advantage of human expert knowledge on dis-ease-relevant cellular processes their coverage is lim-ited to biomolecules with known annotations On thecontrary network-based models are typically more dif-ficult to interpret but can exploit information on amuch wider range of biomolecules

Funding

This work was supported by the Fonds Nationale de laRecherche Luxembourg (grant no C13BM5782168)

References1 Dubois B Feldman HH Jacova C et al Advancing research

diagnostic criteria for Alzheimerrsquos disease the IWG-2 crite-ria Lancet Neurol 201413614ndash29

2 Chahine LM Stern MB Diagnostic markers for Parkinsonrsquosdisease Curr Opin Neurol 201124309ndash17

3 Tanaka T Tanaka M Tanaka T et al Biomarkers for colorec-tal cancer Int J Mol Sci 2010113209ndash25

4 Collins LG Haines C Perkel R et al Lung cancer diagnosisand management Am Fam Physician 20077556ndash63

5 Martin P Laupland KB Frost EH et al Laboratory diagnosisof Ebola virus disease Intensive Care Med 201541895ndash8

6 Mistry K Cable G Meta-analysis of prostate-specific antigenand digital rectal examination as screening tests for pros-tate carcinoma J Am Board Fam Pract 20031695ndash101

7 Bender R Lange S Adjusting for multiple testing - when andhow J Clin Epidemiol 200154343ndash9

8 Yu L Liu H Redundancy based feature selection for micro-array data In Proceedings of the 10th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining New York NY ACM 2004 737ndash42

9 Ding C Peng H Minimum redundancy feature selectionfrom microarray gene expression data J Bioinform ComputBiol 20053185ndash205

10 Bellman R Bellman RE Bellman RE et al Adaptive control proc-esses a guided tour Princeton Princeton University Press 1961

11 Verleysen M Francois D The curse of dimensionality indata mining Analysis 20053512758ndash70

12 Kim S Kon M DeLisi C Pathway-based classification of can-cer subtypes Biol Direct 2012721

13 Ideker T Sharan R Protein networks in disease Genome Res200818644ndash52

14 Ogata H Goto S Sato K et al KEGG Kyoto encyclopedia ofgenes and genomes Nucleic Acids Res 19992729ndash34

15 Nishimura D BioCarta Biotech Softw Internet Rep 20012117ndash20

16 Pico AR Kelder T Van Iersel MP et al WikiPathways path-way editing for the people PLoS Biol 200861403ndash7

17 Joshi-Tope G Gillespie M Vastrik I et al Reactome a knowl-edgebase of biological pathways Nucleic Acids Res 200533D428ndash32

18 Schaefer CF Anthony K Krupa S et al PID thepathway interaction database Nucleic Acids Res 200937D674ndash9

19 Subramanian A Tamayo P Mootha VK et al Gene setenrichment analysis a knowledge-based approach for inter-preting genome-wide expression profiles Proc Natl Acad SciUSA 200510215545ndash50

20 Kim S-Y Volsky DJ PAGE parametric analysis of gene setenrichment BMC Bioinformatics 20056144

21 Luo W Friedman MS Shedden K et al GAGE generally ap-plicable gene set enrichment for pathway analysis BMCBioinformatics 200910161

22 Oron AP Jiang Z Gentleman R Gene set enrichment ana-lysis using linear models and diagnostics Bioinformatics2008242586ndash91

23 Glaab E Baudot A Krasnogor N et al EnrichNet network-based gene set enrichment analysis Bioinformatics 201228i451ndash7

24 Shannon P Markiel A Ozier O et al Cytoscape a softwareenvironment for integrated models of biomolecular inter-action networks Genome Res 2003132498ndash504

25 Van Iersel MP Kelder T Pico AR et al Presenting and explor-ing biological pathways with PathVisio BMC Bioinformatics20089399

26 Ilyin SE Belkowski SM Plata-Salaman CR Biomarker dis-covery and validation Technologies and integrativeapproaches Trends Biotechnol 200422411ndash16

27 Guo Z Zhang T Li X et al Towards precise classification ofcancers based on robust gene functional expression profilesBMC Bioinformatics 2005658

28 Tomfohr J Lu J Kepler TB Pathway level analysis of geneexpression using singular value decomposition BMCBioinformatics 20056225

29 Lee E Chuang H-Y Kim J-W et al Inferring pathway activitytoward precise disease classification PLoS Comput Biol 20084e1000217

30 Su J Yoon BJ Dougherty ER Accurate and reliable cancerclassification based on probabilistic inference of pathwayactivity PLoS One 20094e8161

31 Glaab E Schneider R PathVar analysis of gene and proteinexpression variance in cellular pathways using microarraydata Bioinformatics 201228446ndash7

32 Svensson JP Stalpers LJA Esveldt-van Lange REE et alAnalysis of gene expression using gene sets discriminates

Diagnostic specimen classification | 11

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer patients with and without late radiation toxicityPLoS Med 200631904ndash14

33 Efroni S Schaefer CF Buetow KH Identification of key proc-esses underlying cancer phenotypes using biologic pathwayanalysis PLoS One 20072e425

34 Vaske CJ Benz SC Sanborn JZ et al Inference of patient-specific pathway activities from multi-dimensional cancer gen-omics data using PARADIGM Bioinformatics 201026i237ndash45

35 Breslin T Krogh M Peterson C et al Signal transductionpathway profiling of individual tumor samples BMCBioinformatics 20056163

36 Ashburner M Ball CA Blake JA et al Gene ontology tool forthe unification of biology The Gene Ontology ConsortiumNat Genet 20002525ndash9

37 Glaab E Garibaldi JM Krasnogor N Learning pathway-baseddecision rules to classify microarray cancer samplesGerman Conference on Bioinformatics 2010173123ndash34

38 Glaab E Bacardit J Garibaldi JM et al Using rule-based ma-chine learning for candidate disease gene prioritization andsample classification of cancer gene expression data PLoSOne 20127e39932

39 Tuck DP Kluger HM Kluger Y Characterizing disease statesfrom topological properties of transcriptional regulatorynetworks BMC Bioinformatics 20067236

40 Ma X Lee H Wang L et al CGI a new approach for prioritiz-ing genes by combining gene expression and protein-protein interaction data Bioinformatics 200723215ndash21

41 Chuang H-Y Lee E Liu Y-T et al Network-based classifica-tion of breast cancer metastasis Mol Syst Biol 20073140

42 Taylor IW Linding R Warde-Farley D et al Dynamic modu-larity in protein interaction networks predicts breast canceroutcome Nat Biotechnol 200927199ndash204

43 Petrochilos D Shojaie A Gennari J et al Using random walksto identify cancer-associated modules in expression dataBioData Min 2013617

44 Rapaport F Zinovyev A Dutreix M et al Classification ofmicroarray data using gene networks BMC Bioinformatics2007835

45 Li C Li H Network-constrained regularization and variableselection for analysis of genomic data Bioinformatics 2008241175ndash82

46 Yang S Yuan L Lai Y-C et al Feature grouping and selectionover an undirected graph KDD 2012922ndash30

47 Lorbert A Eis D Kostina V et al Exploiting covariate similar-ity in sparse regression via the pairwise elastic net Int ConfArtif Intell Stat 20109477ndash84

48 Lorbert A Ramadge PJ The pairwise elastic net support vec-tor machine for automatic fMRI feature selection In ProcIEEE Int Conf Acoust Speech Signal Process Vancouver BC USA20131036ndash40 IEEE Press Washington DC

49 Vlassis N Glaab E GenePEN analysis of network activity al-terations in complex diseases via the pairwise elastic netStat Appl Genet Mol Biol 201514221ndash4

50 Wingender E Chen X Hehl R et al TRANSFAC an integratedsystem for gene expression regulation Nucleic Acids Res200028316ndash19

51 Newman MEJ The structure and function of complex net-works SIAM Rev 200345167ndash56

52 Geman S Geman D Stochastic relaxation Gibbs distribu-tions and the Bayesian restoration of images IEEE TransPattern Anal Mach Intell 19846721ndash41

53 Binder H Schumacher M Comment on ldquoNetwork-constrained regularization and variable selection for ana-lysis of genomic datardquo Bioinformatics 2008242566ndash8

54 Sprinzak E Sattath S Margalit H How reliable are experi-mental protein-protein interaction data J Mol Biol 2003327919ndash23

55 Saxena S Caroni P Selective neuronal vulnerability inneurodegenerative diseases from stressor thresholds todegeneration Neuron 20117135ndash48

56 Suthram S Shlomi T Ruppin E et al A direct comparison ofprotein interaction confidence assignment schemes BMCBioinformatics 20067360

57 Kamburov A Grossmann A Herwig R et al Cluster-basedassessment of protein-protein interaction confidence BMCBioinformatics 201213262

58 Li D Liu W Liu Z et al PRINCESS a protein interaction confi-dence evaluation system with multiple data sources MolCell Proteomics 200871043ndash52

59 Bader JS Chaudhuri A Rothberg JM et al Gaining confidencein high-throughput protein interaction networks NatBiotechnol 20042278ndash85

60 Von Mering C Huynen M Jaeggi D et al STRING a databaseof predicted functional associations between proteinsNucleic Acids Res 200331258ndash61

61 Franceschini A Szklarczyk D Frankild S et al STRING v91protein-protein interaction networks with increased cover-age and integration Nucleic Acids Res 201341D808ndash15

62 Sales G Calura E Cavalieri D et al Graphite - a Bioconductorpackage to convert pathway topology to gene network BMCBioinformatics 20121320

63 Su AI Wiltshire T Batalov S et al A gene atlas of the mouseand human protein-encoding transcriptomes Proc Natl AcadSci USA 20041016062ndash7

64 Ponten FK Schwenk JM Asplund A et al The human proteinatlas as a proteomic resource for biomarker discovery JIntern Med 2011270428ndash46

65 Bossi A Lehner B Tissue specificity and the human proteininteraction network Mol Syst Biol 20095260

66 Magger O Waldman YY Ruppin E et al Enhancing the pri-oritization of disease-causing genes through tissue specificprotein interaction networks PLoS Comput Biol 20128e1002690

67 Shipley B Cause and Correlation in Biology A Userrsquos Guide toPath Analysis Structural Equations and Causal Inference Vol 2Cambridge Cambridge University Press 2002 p 332

68 Society TE Granger CWJ Investigating causal relations byeconometric models and cross-spectral methodsEconometrica 196937424ndash38

69 Chindelevitch L Ziemek D Enayetallah A et al Causal rea-soning on biological networks interpreting transcriptionalchanges Bioinformatics 2012281114ndash21

70 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 2012652

71 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 20144988

72 Hudson NJ Reverter A Dalrymple BP A differential wiringanalysis of expression data correctly identifies the gene con-taining the causal mutation PLoS Comput Biol 20095e1000382

73 Reverter A Hudson NJ Nagaraj SH et al Regulatory impactfactors unraveling the transcriptional regulation ofcomplex traits from expression data Bioinformatics 201026896ndash904

74 Watson M CoXpress differential co-expression in gene ex-pression data BMC Bioinformatics 20067509

12 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1

levels enabled the calculation of recurrence scores for tumorsamples via a machine learning model This diagnostic ap-proach was successfully validated in multiple clinical studiesand has been included in the treatment guidelines for breastcancer by the National Comprehensive Cancer Center Networkand the American Society of Clinical Oncology

While in the case of the Oncotype DX test the set of requiredmarkers could be narrowed down to a small number of geneswith prior knowledge on their relationship to the disease forother complex and more heterogeneous diseases significantlylarger numbers of molecular predictors may be needed for ac-curate diagnosis In such cases pathway- and network-basedmodeling approaches may facilitate the generation of robustand biologically interpretable models which could afterwardundergo similar diagnostic test development and validationprocedures as the initial model behind the Oncotype DX assayImportantly the success of the Oncotype DX approach is not anisolated case but other commercial diagnostic assays havebeen developed and validated using similar strategies includingMammaPrint [100] Prosigna (PAM50) [101] Mammostrat [102]Tissue of Origin [103] AlloMap [104] Corus CAD [105] and OVA1[106] among others

In summary successful translation of omics-based bio-marker models into clinically accepted commercial diagnostictests has been achieved in multiple cases in the past Given alarge number of complex diseases for which more reliable ear-lier and cheaper diagnostic tests are still needed there is a

significant potential to develop improved approaches usingomics-based biomarker modeling and exploiting prior biologicalknowledge from pathways and molecular networks

Conclusions

For diseases with complex molecular manifestations omics-based biomarker models have the potential to better reflectclinically relevant aspects of their heterogeneity than classicalsingle-molecule diagnostic markers However various statis-tical limitations in high-dimensional data analysis often result-ing in a high risk of overfitting and restricted biologicalinterpretability of standard lsquoblack boxrsquo machine learning mod-els can hamper the progress in developing new clinically usefulbiomarker signatures

Among possible strategies to reduce model complexity andintegrate prior biological knowledge into the model buildingprocedure cellular pathway- and molecular network-based pre-diction methods provide promising new routes toward more re-liable omics-based biomarker signatures A wide variety ofbioinformatics approaches have already been developed in re-cent years to identify disease-associated pathway and networkalterations in omics data and use them to classify new samplesWhile public molecular networks and pathway definitions alsohave specific limitations that need to be addressed by the usereg by assigning confidence weights to molecular interactionspathway- and network-level predictive analyses often lead to

Figure 1 Example illustration of common stages during the development of omics-based diagnostic tests (simplified version of the process presented in a study by the

US Institute of Medicine [97] focusing on the major steps in the pipeline) After the transition from the second to the third phase (highlighted by the lock symbol) the

diagnostic test must be fully defined validated and locked down Many important variants and alternatives to the outlined example process exist as well as different

realizations of generic steps in the process (eg cases in which a test directs patient management may cover different situations depending on whether clinicians are

free to use the test result as they see fit or whether predefined procedures have to be followed subject to contraindications andor subject to the test results) The setup

may also vary depending on whether it is known exactly how patients would have been treated had they been randomized to the opposite arm depending on whether

the test entails a treatment delay and whether the adequate cutoff threshold for the test is uncertain

10 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

classification models with improved interpretability and robust-ness in cross-study prediction tasks These new systems-levelapproaches will therefore provide a valuable addition to theexisting toolset for omics-based biomarker modeling

Key Points

bull Omics-based machine learning models for diagnosticspecimen classification often lack biological interpret-ability and have limited cross-study robustness andaccuracy Integrating prior knowledge from cellularpathway and molecular network into these modelscan help to address these shortcomings

bull Public pathway and molecular interaction databasescover only a subset of the biochemical interactionsoccurring in living cells typically lack annotations ontissue-specific reactions and may also contain errorsWhen applying pathway or network analysis methodsthe user should be aware of these limitations and thedifferent options to address or alleviate them eg byusing confidence weighted interactions and estimatingtheir tissue specificity using public gene or protein ex-pression data

bull Pathway- and network-based machine learning meth-ods for diagnostic specimen classification are charac-terized by different strengths and weaknesses Whilepathway-based approaches have the benefit of provid-ing models with high biological interpretability thattake advantage of human expert knowledge on dis-ease-relevant cellular processes their coverage is lim-ited to biomolecules with known annotations On thecontrary network-based models are typically more dif-ficult to interpret but can exploit information on amuch wider range of biomolecules

Funding

This work was supported by the Fonds Nationale de laRecherche Luxembourg (grant no C13BM5782168)

References1 Dubois B Feldman HH Jacova C et al Advancing research

diagnostic criteria for Alzheimerrsquos disease the IWG-2 crite-ria Lancet Neurol 201413614ndash29

2 Chahine LM Stern MB Diagnostic markers for Parkinsonrsquosdisease Curr Opin Neurol 201124309ndash17

3 Tanaka T Tanaka M Tanaka T et al Biomarkers for colorec-tal cancer Int J Mol Sci 2010113209ndash25

4 Collins LG Haines C Perkel R et al Lung cancer diagnosisand management Am Fam Physician 20077556ndash63

5 Martin P Laupland KB Frost EH et al Laboratory diagnosisof Ebola virus disease Intensive Care Med 201541895ndash8

6 Mistry K Cable G Meta-analysis of prostate-specific antigenand digital rectal examination as screening tests for pros-tate carcinoma J Am Board Fam Pract 20031695ndash101

7 Bender R Lange S Adjusting for multiple testing - when andhow J Clin Epidemiol 200154343ndash9

8 Yu L Liu H Redundancy based feature selection for micro-array data In Proceedings of the 10th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining New York NY ACM 2004 737ndash42

9 Ding C Peng H Minimum redundancy feature selectionfrom microarray gene expression data J Bioinform ComputBiol 20053185ndash205

10 Bellman R Bellman RE Bellman RE et al Adaptive control proc-esses a guided tour Princeton Princeton University Press 1961

11 Verleysen M Francois D The curse of dimensionality indata mining Analysis 20053512758ndash70

12 Kim S Kon M DeLisi C Pathway-based classification of can-cer subtypes Biol Direct 2012721

13 Ideker T Sharan R Protein networks in disease Genome Res200818644ndash52

14 Ogata H Goto S Sato K et al KEGG Kyoto encyclopedia ofgenes and genomes Nucleic Acids Res 19992729ndash34

15 Nishimura D BioCarta Biotech Softw Internet Rep 20012117ndash20

16 Pico AR Kelder T Van Iersel MP et al WikiPathways path-way editing for the people PLoS Biol 200861403ndash7

17 Joshi-Tope G Gillespie M Vastrik I et al Reactome a knowl-edgebase of biological pathways Nucleic Acids Res 200533D428ndash32

18 Schaefer CF Anthony K Krupa S et al PID thepathway interaction database Nucleic Acids Res 200937D674ndash9

19 Subramanian A Tamayo P Mootha VK et al Gene setenrichment analysis a knowledge-based approach for inter-preting genome-wide expression profiles Proc Natl Acad SciUSA 200510215545ndash50

20 Kim S-Y Volsky DJ PAGE parametric analysis of gene setenrichment BMC Bioinformatics 20056144

21 Luo W Friedman MS Shedden K et al GAGE generally ap-plicable gene set enrichment for pathway analysis BMCBioinformatics 200910161

22 Oron AP Jiang Z Gentleman R Gene set enrichment ana-lysis using linear models and diagnostics Bioinformatics2008242586ndash91

23 Glaab E Baudot A Krasnogor N et al EnrichNet network-based gene set enrichment analysis Bioinformatics 201228i451ndash7

24 Shannon P Markiel A Ozier O et al Cytoscape a softwareenvironment for integrated models of biomolecular inter-action networks Genome Res 2003132498ndash504

25 Van Iersel MP Kelder T Pico AR et al Presenting and explor-ing biological pathways with PathVisio BMC Bioinformatics20089399

26 Ilyin SE Belkowski SM Plata-Salaman CR Biomarker dis-covery and validation Technologies and integrativeapproaches Trends Biotechnol 200422411ndash16

27 Guo Z Zhang T Li X et al Towards precise classification ofcancers based on robust gene functional expression profilesBMC Bioinformatics 2005658

28 Tomfohr J Lu J Kepler TB Pathway level analysis of geneexpression using singular value decomposition BMCBioinformatics 20056225

29 Lee E Chuang H-Y Kim J-W et al Inferring pathway activitytoward precise disease classification PLoS Comput Biol 20084e1000217

30 Su J Yoon BJ Dougherty ER Accurate and reliable cancerclassification based on probabilistic inference of pathwayactivity PLoS One 20094e8161

31 Glaab E Schneider R PathVar analysis of gene and proteinexpression variance in cellular pathways using microarraydata Bioinformatics 201228446ndash7

32 Svensson JP Stalpers LJA Esveldt-van Lange REE et alAnalysis of gene expression using gene sets discriminates

Diagnostic specimen classification | 11

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer patients with and without late radiation toxicityPLoS Med 200631904ndash14

33 Efroni S Schaefer CF Buetow KH Identification of key proc-esses underlying cancer phenotypes using biologic pathwayanalysis PLoS One 20072e425

34 Vaske CJ Benz SC Sanborn JZ et al Inference of patient-specific pathway activities from multi-dimensional cancer gen-omics data using PARADIGM Bioinformatics 201026i237ndash45

35 Breslin T Krogh M Peterson C et al Signal transductionpathway profiling of individual tumor samples BMCBioinformatics 20056163

36 Ashburner M Ball CA Blake JA et al Gene ontology tool forthe unification of biology The Gene Ontology ConsortiumNat Genet 20002525ndash9

37 Glaab E Garibaldi JM Krasnogor N Learning pathway-baseddecision rules to classify microarray cancer samplesGerman Conference on Bioinformatics 2010173123ndash34

38 Glaab E Bacardit J Garibaldi JM et al Using rule-based ma-chine learning for candidate disease gene prioritization andsample classification of cancer gene expression data PLoSOne 20127e39932

39 Tuck DP Kluger HM Kluger Y Characterizing disease statesfrom topological properties of transcriptional regulatorynetworks BMC Bioinformatics 20067236

40 Ma X Lee H Wang L et al CGI a new approach for prioritiz-ing genes by combining gene expression and protein-protein interaction data Bioinformatics 200723215ndash21

41 Chuang H-Y Lee E Liu Y-T et al Network-based classifica-tion of breast cancer metastasis Mol Syst Biol 20073140

42 Taylor IW Linding R Warde-Farley D et al Dynamic modu-larity in protein interaction networks predicts breast canceroutcome Nat Biotechnol 200927199ndash204

43 Petrochilos D Shojaie A Gennari J et al Using random walksto identify cancer-associated modules in expression dataBioData Min 2013617

44 Rapaport F Zinovyev A Dutreix M et al Classification ofmicroarray data using gene networks BMC Bioinformatics2007835

45 Li C Li H Network-constrained regularization and variableselection for analysis of genomic data Bioinformatics 2008241175ndash82

46 Yang S Yuan L Lai Y-C et al Feature grouping and selectionover an undirected graph KDD 2012922ndash30

47 Lorbert A Eis D Kostina V et al Exploiting covariate similar-ity in sparse regression via the pairwise elastic net Int ConfArtif Intell Stat 20109477ndash84

48 Lorbert A Ramadge PJ The pairwise elastic net support vec-tor machine for automatic fMRI feature selection In ProcIEEE Int Conf Acoust Speech Signal Process Vancouver BC USA20131036ndash40 IEEE Press Washington DC

49 Vlassis N Glaab E GenePEN analysis of network activity al-terations in complex diseases via the pairwise elastic netStat Appl Genet Mol Biol 201514221ndash4

50 Wingender E Chen X Hehl R et al TRANSFAC an integratedsystem for gene expression regulation Nucleic Acids Res200028316ndash19

51 Newman MEJ The structure and function of complex net-works SIAM Rev 200345167ndash56

52 Geman S Geman D Stochastic relaxation Gibbs distribu-tions and the Bayesian restoration of images IEEE TransPattern Anal Mach Intell 19846721ndash41

53 Binder H Schumacher M Comment on ldquoNetwork-constrained regularization and variable selection for ana-lysis of genomic datardquo Bioinformatics 2008242566ndash8

54 Sprinzak E Sattath S Margalit H How reliable are experi-mental protein-protein interaction data J Mol Biol 2003327919ndash23

55 Saxena S Caroni P Selective neuronal vulnerability inneurodegenerative diseases from stressor thresholds todegeneration Neuron 20117135ndash48

56 Suthram S Shlomi T Ruppin E et al A direct comparison ofprotein interaction confidence assignment schemes BMCBioinformatics 20067360

57 Kamburov A Grossmann A Herwig R et al Cluster-basedassessment of protein-protein interaction confidence BMCBioinformatics 201213262

58 Li D Liu W Liu Z et al PRINCESS a protein interaction confi-dence evaluation system with multiple data sources MolCell Proteomics 200871043ndash52

59 Bader JS Chaudhuri A Rothberg JM et al Gaining confidencein high-throughput protein interaction networks NatBiotechnol 20042278ndash85

60 Von Mering C Huynen M Jaeggi D et al STRING a databaseof predicted functional associations between proteinsNucleic Acids Res 200331258ndash61

61 Franceschini A Szklarczyk D Frankild S et al STRING v91protein-protein interaction networks with increased cover-age and integration Nucleic Acids Res 201341D808ndash15

62 Sales G Calura E Cavalieri D et al Graphite - a Bioconductorpackage to convert pathway topology to gene network BMCBioinformatics 20121320

63 Su AI Wiltshire T Batalov S et al A gene atlas of the mouseand human protein-encoding transcriptomes Proc Natl AcadSci USA 20041016062ndash7

64 Ponten FK Schwenk JM Asplund A et al The human proteinatlas as a proteomic resource for biomarker discovery JIntern Med 2011270428ndash46

65 Bossi A Lehner B Tissue specificity and the human proteininteraction network Mol Syst Biol 20095260

66 Magger O Waldman YY Ruppin E et al Enhancing the pri-oritization of disease-causing genes through tissue specificprotein interaction networks PLoS Comput Biol 20128e1002690

67 Shipley B Cause and Correlation in Biology A Userrsquos Guide toPath Analysis Structural Equations and Causal Inference Vol 2Cambridge Cambridge University Press 2002 p 332

68 Society TE Granger CWJ Investigating causal relations byeconometric models and cross-spectral methodsEconometrica 196937424ndash38

69 Chindelevitch L Ziemek D Enayetallah A et al Causal rea-soning on biological networks interpreting transcriptionalchanges Bioinformatics 2012281114ndash21

70 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 2012652

71 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 20144988

72 Hudson NJ Reverter A Dalrymple BP A differential wiringanalysis of expression data correctly identifies the gene con-taining the causal mutation PLoS Comput Biol 20095e1000382

73 Reverter A Hudson NJ Nagaraj SH et al Regulatory impactfactors unraveling the transcriptional regulation ofcomplex traits from expression data Bioinformatics 201026896ndash904

74 Watson M CoXpress differential co-expression in gene ex-pression data BMC Bioinformatics 20067509

12 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1

classification models with improved interpretability and robust-ness in cross-study prediction tasks These new systems-levelapproaches will therefore provide a valuable addition to theexisting toolset for omics-based biomarker modeling

Key Points

bull Omics-based machine learning models for diagnosticspecimen classification often lack biological interpret-ability and have limited cross-study robustness andaccuracy Integrating prior knowledge from cellularpathway and molecular network into these modelscan help to address these shortcomings

bull Public pathway and molecular interaction databasescover only a subset of the biochemical interactionsoccurring in living cells typically lack annotations ontissue-specific reactions and may also contain errorsWhen applying pathway or network analysis methodsthe user should be aware of these limitations and thedifferent options to address or alleviate them eg byusing confidence weighted interactions and estimatingtheir tissue specificity using public gene or protein ex-pression data

bull Pathway- and network-based machine learning meth-ods for diagnostic specimen classification are charac-terized by different strengths and weaknesses Whilepathway-based approaches have the benefit of provid-ing models with high biological interpretability thattake advantage of human expert knowledge on dis-ease-relevant cellular processes their coverage is lim-ited to biomolecules with known annotations On thecontrary network-based models are typically more dif-ficult to interpret but can exploit information on amuch wider range of biomolecules

Funding

This work was supported by the Fonds Nationale de laRecherche Luxembourg (grant no C13BM5782168)

References1 Dubois B Feldman HH Jacova C et al Advancing research

diagnostic criteria for Alzheimerrsquos disease the IWG-2 crite-ria Lancet Neurol 201413614ndash29

2 Chahine LM Stern MB Diagnostic markers for Parkinsonrsquosdisease Curr Opin Neurol 201124309ndash17

3 Tanaka T Tanaka M Tanaka T et al Biomarkers for colorec-tal cancer Int J Mol Sci 2010113209ndash25

4 Collins LG Haines C Perkel R et al Lung cancer diagnosisand management Am Fam Physician 20077556ndash63

5 Martin P Laupland KB Frost EH et al Laboratory diagnosisof Ebola virus disease Intensive Care Med 201541895ndash8

6 Mistry K Cable G Meta-analysis of prostate-specific antigenand digital rectal examination as screening tests for pros-tate carcinoma J Am Board Fam Pract 20031695ndash101

7 Bender R Lange S Adjusting for multiple testing - when andhow J Clin Epidemiol 200154343ndash9

8 Yu L Liu H Redundancy based feature selection for micro-array data In Proceedings of the 10th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining New York NY ACM 2004 737ndash42

9 Ding C Peng H Minimum redundancy feature selectionfrom microarray gene expression data J Bioinform ComputBiol 20053185ndash205

10 Bellman R Bellman RE Bellman RE et al Adaptive control proc-esses a guided tour Princeton Princeton University Press 1961

11 Verleysen M Francois D The curse of dimensionality indata mining Analysis 20053512758ndash70

12 Kim S Kon M DeLisi C Pathway-based classification of can-cer subtypes Biol Direct 2012721

13 Ideker T Sharan R Protein networks in disease Genome Res200818644ndash52

14 Ogata H Goto S Sato K et al KEGG Kyoto encyclopedia ofgenes and genomes Nucleic Acids Res 19992729ndash34

15 Nishimura D BioCarta Biotech Softw Internet Rep 20012117ndash20

16 Pico AR Kelder T Van Iersel MP et al WikiPathways path-way editing for the people PLoS Biol 200861403ndash7

17 Joshi-Tope G Gillespie M Vastrik I et al Reactome a knowl-edgebase of biological pathways Nucleic Acids Res 200533D428ndash32

18 Schaefer CF Anthony K Krupa S et al PID thepathway interaction database Nucleic Acids Res 200937D674ndash9

19 Subramanian A Tamayo P Mootha VK et al Gene setenrichment analysis a knowledge-based approach for inter-preting genome-wide expression profiles Proc Natl Acad SciUSA 200510215545ndash50

20 Kim S-Y Volsky DJ PAGE parametric analysis of gene setenrichment BMC Bioinformatics 20056144

21 Luo W Friedman MS Shedden K et al GAGE generally ap-plicable gene set enrichment for pathway analysis BMCBioinformatics 200910161

22 Oron AP Jiang Z Gentleman R Gene set enrichment ana-lysis using linear models and diagnostics Bioinformatics2008242586ndash91

23 Glaab E Baudot A Krasnogor N et al EnrichNet network-based gene set enrichment analysis Bioinformatics 201228i451ndash7

24 Shannon P Markiel A Ozier O et al Cytoscape a softwareenvironment for integrated models of biomolecular inter-action networks Genome Res 2003132498ndash504

25 Van Iersel MP Kelder T Pico AR et al Presenting and explor-ing biological pathways with PathVisio BMC Bioinformatics20089399

26 Ilyin SE Belkowski SM Plata-Salaman CR Biomarker dis-covery and validation Technologies and integrativeapproaches Trends Biotechnol 200422411ndash16

27 Guo Z Zhang T Li X et al Towards precise classification ofcancers based on robust gene functional expression profilesBMC Bioinformatics 2005658

28 Tomfohr J Lu J Kepler TB Pathway level analysis of geneexpression using singular value decomposition BMCBioinformatics 20056225

29 Lee E Chuang H-Y Kim J-W et al Inferring pathway activitytoward precise disease classification PLoS Comput Biol 20084e1000217

30 Su J Yoon BJ Dougherty ER Accurate and reliable cancerclassification based on probabilistic inference of pathwayactivity PLoS One 20094e8161

31 Glaab E Schneider R PathVar analysis of gene and proteinexpression variance in cellular pathways using microarraydata Bioinformatics 201228446ndash7

32 Svensson JP Stalpers LJA Esveldt-van Lange REE et alAnalysis of gene expression using gene sets discriminates

Diagnostic specimen classification | 11

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

cancer patients with and without late radiation toxicityPLoS Med 200631904ndash14

33 Efroni S Schaefer CF Buetow KH Identification of key proc-esses underlying cancer phenotypes using biologic pathwayanalysis PLoS One 20072e425

34 Vaske CJ Benz SC Sanborn JZ et al Inference of patient-specific pathway activities from multi-dimensional cancer gen-omics data using PARADIGM Bioinformatics 201026i237ndash45

35 Breslin T Krogh M Peterson C et al Signal transductionpathway profiling of individual tumor samples BMCBioinformatics 20056163

36 Ashburner M Ball CA Blake JA et al Gene ontology tool forthe unification of biology The Gene Ontology ConsortiumNat Genet 20002525ndash9

37 Glaab E Garibaldi JM Krasnogor N Learning pathway-baseddecision rules to classify microarray cancer samplesGerman Conference on Bioinformatics 2010173123ndash34

38 Glaab E Bacardit J Garibaldi JM et al Using rule-based ma-chine learning for candidate disease gene prioritization andsample classification of cancer gene expression data PLoSOne 20127e39932

39 Tuck DP Kluger HM Kluger Y Characterizing disease statesfrom topological properties of transcriptional regulatorynetworks BMC Bioinformatics 20067236

40 Ma X Lee H Wang L et al CGI a new approach for prioritiz-ing genes by combining gene expression and protein-protein interaction data Bioinformatics 200723215ndash21

41 Chuang H-Y Lee E Liu Y-T et al Network-based classifica-tion of breast cancer metastasis Mol Syst Biol 20073140

42 Taylor IW Linding R Warde-Farley D et al Dynamic modu-larity in protein interaction networks predicts breast canceroutcome Nat Biotechnol 200927199ndash204

43 Petrochilos D Shojaie A Gennari J et al Using random walksto identify cancer-associated modules in expression dataBioData Min 2013617

44 Rapaport F Zinovyev A Dutreix M et al Classification ofmicroarray data using gene networks BMC Bioinformatics2007835

45 Li C Li H Network-constrained regularization and variableselection for analysis of genomic data Bioinformatics 2008241175ndash82

46 Yang S Yuan L Lai Y-C et al Feature grouping and selectionover an undirected graph KDD 2012922ndash30

47 Lorbert A Eis D Kostina V et al Exploiting covariate similar-ity in sparse regression via the pairwise elastic net Int ConfArtif Intell Stat 20109477ndash84

48 Lorbert A Ramadge PJ The pairwise elastic net support vec-tor machine for automatic fMRI feature selection In ProcIEEE Int Conf Acoust Speech Signal Process Vancouver BC USA20131036ndash40 IEEE Press Washington DC

49 Vlassis N Glaab E GenePEN analysis of network activity al-terations in complex diseases via the pairwise elastic netStat Appl Genet Mol Biol 201514221ndash4

50 Wingender E Chen X Hehl R et al TRANSFAC an integratedsystem for gene expression regulation Nucleic Acids Res200028316ndash19

51 Newman MEJ The structure and function of complex net-works SIAM Rev 200345167ndash56

52 Geman S Geman D Stochastic relaxation Gibbs distribu-tions and the Bayesian restoration of images IEEE TransPattern Anal Mach Intell 19846721ndash41

53 Binder H Schumacher M Comment on ldquoNetwork-constrained regularization and variable selection for ana-lysis of genomic datardquo Bioinformatics 2008242566ndash8

54 Sprinzak E Sattath S Margalit H How reliable are experi-mental protein-protein interaction data J Mol Biol 2003327919ndash23

55 Saxena S Caroni P Selective neuronal vulnerability inneurodegenerative diseases from stressor thresholds todegeneration Neuron 20117135ndash48

56 Suthram S Shlomi T Ruppin E et al A direct comparison ofprotein interaction confidence assignment schemes BMCBioinformatics 20067360

57 Kamburov A Grossmann A Herwig R et al Cluster-basedassessment of protein-protein interaction confidence BMCBioinformatics 201213262

58 Li D Liu W Liu Z et al PRINCESS a protein interaction confi-dence evaluation system with multiple data sources MolCell Proteomics 200871043ndash52

59 Bader JS Chaudhuri A Rothberg JM et al Gaining confidencein high-throughput protein interaction networks NatBiotechnol 20042278ndash85

60 Von Mering C Huynen M Jaeggi D et al STRING a databaseof predicted functional associations between proteinsNucleic Acids Res 200331258ndash61

61 Franceschini A Szklarczyk D Frankild S et al STRING v91protein-protein interaction networks with increased cover-age and integration Nucleic Acids Res 201341D808ndash15

62 Sales G Calura E Cavalieri D et al Graphite - a Bioconductorpackage to convert pathway topology to gene network BMCBioinformatics 20121320

63 Su AI Wiltshire T Batalov S et al A gene atlas of the mouseand human protein-encoding transcriptomes Proc Natl AcadSci USA 20041016062ndash7

64 Ponten FK Schwenk JM Asplund A et al The human proteinatlas as a proteomic resource for biomarker discovery JIntern Med 2011270428ndash46

65 Bossi A Lehner B Tissue specificity and the human proteininteraction network Mol Syst Biol 20095260

66 Magger O Waldman YY Ruppin E et al Enhancing the pri-oritization of disease-causing genes through tissue specificprotein interaction networks PLoS Comput Biol 20128e1002690

67 Shipley B Cause and Correlation in Biology A Userrsquos Guide toPath Analysis Structural Equations and Causal Inference Vol 2Cambridge Cambridge University Press 2002 p 332

68 Society TE Granger CWJ Investigating causal relations byeconometric models and cross-spectral methodsEconometrica 196937424ndash38

69 Chindelevitch L Ziemek D Enayetallah A et al Causal rea-soning on biological networks interpreting transcriptionalchanges Bioinformatics 2012281114ndash21

70 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 2012652

71 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 20144988

72 Hudson NJ Reverter A Dalrymple BP A differential wiringanalysis of expression data correctly identifies the gene con-taining the causal mutation PLoS Comput Biol 20095e1000382

73 Reverter A Hudson NJ Nagaraj SH et al Regulatory impactfactors unraveling the transcriptional regulation ofcomplex traits from expression data Bioinformatics 201026896ndash904

74 Watson M CoXpress differential co-expression in gene ex-pression data BMC Bioinformatics 20067509

12 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1

cancer patients with and without late radiation toxicityPLoS Med 200631904ndash14

33 Efroni S Schaefer CF Buetow KH Identification of key proc-esses underlying cancer phenotypes using biologic pathwayanalysis PLoS One 20072e425

34 Vaske CJ Benz SC Sanborn JZ et al Inference of patient-specific pathway activities from multi-dimensional cancer gen-omics data using PARADIGM Bioinformatics 201026i237ndash45

35 Breslin T Krogh M Peterson C et al Signal transductionpathway profiling of individual tumor samples BMCBioinformatics 20056163

36 Ashburner M Ball CA Blake JA et al Gene ontology tool forthe unification of biology The Gene Ontology ConsortiumNat Genet 20002525ndash9

37 Glaab E Garibaldi JM Krasnogor N Learning pathway-baseddecision rules to classify microarray cancer samplesGerman Conference on Bioinformatics 2010173123ndash34

38 Glaab E Bacardit J Garibaldi JM et al Using rule-based ma-chine learning for candidate disease gene prioritization andsample classification of cancer gene expression data PLoSOne 20127e39932

39 Tuck DP Kluger HM Kluger Y Characterizing disease statesfrom topological properties of transcriptional regulatorynetworks BMC Bioinformatics 20067236

40 Ma X Lee H Wang L et al CGI a new approach for prioritiz-ing genes by combining gene expression and protein-protein interaction data Bioinformatics 200723215ndash21

41 Chuang H-Y Lee E Liu Y-T et al Network-based classifica-tion of breast cancer metastasis Mol Syst Biol 20073140

42 Taylor IW Linding R Warde-Farley D et al Dynamic modu-larity in protein interaction networks predicts breast canceroutcome Nat Biotechnol 200927199ndash204

43 Petrochilos D Shojaie A Gennari J et al Using random walksto identify cancer-associated modules in expression dataBioData Min 2013617

44 Rapaport F Zinovyev A Dutreix M et al Classification ofmicroarray data using gene networks BMC Bioinformatics2007835

45 Li C Li H Network-constrained regularization and variableselection for analysis of genomic data Bioinformatics 2008241175ndash82

46 Yang S Yuan L Lai Y-C et al Feature grouping and selectionover an undirected graph KDD 2012922ndash30

47 Lorbert A Eis D Kostina V et al Exploiting covariate similar-ity in sparse regression via the pairwise elastic net Int ConfArtif Intell Stat 20109477ndash84

48 Lorbert A Ramadge PJ The pairwise elastic net support vec-tor machine for automatic fMRI feature selection In ProcIEEE Int Conf Acoust Speech Signal Process Vancouver BC USA20131036ndash40 IEEE Press Washington DC

49 Vlassis N Glaab E GenePEN analysis of network activity al-terations in complex diseases via the pairwise elastic netStat Appl Genet Mol Biol 201514221ndash4

50 Wingender E Chen X Hehl R et al TRANSFAC an integratedsystem for gene expression regulation Nucleic Acids Res200028316ndash19

51 Newman MEJ The structure and function of complex net-works SIAM Rev 200345167ndash56

52 Geman S Geman D Stochastic relaxation Gibbs distribu-tions and the Bayesian restoration of images IEEE TransPattern Anal Mach Intell 19846721ndash41

53 Binder H Schumacher M Comment on ldquoNetwork-constrained regularization and variable selection for ana-lysis of genomic datardquo Bioinformatics 2008242566ndash8

54 Sprinzak E Sattath S Margalit H How reliable are experi-mental protein-protein interaction data J Mol Biol 2003327919ndash23

55 Saxena S Caroni P Selective neuronal vulnerability inneurodegenerative diseases from stressor thresholds todegeneration Neuron 20117135ndash48

56 Suthram S Shlomi T Ruppin E et al A direct comparison ofprotein interaction confidence assignment schemes BMCBioinformatics 20067360

57 Kamburov A Grossmann A Herwig R et al Cluster-basedassessment of protein-protein interaction confidence BMCBioinformatics 201213262

58 Li D Liu W Liu Z et al PRINCESS a protein interaction confi-dence evaluation system with multiple data sources MolCell Proteomics 200871043ndash52

59 Bader JS Chaudhuri A Rothberg JM et al Gaining confidencein high-throughput protein interaction networks NatBiotechnol 20042278ndash85

60 Von Mering C Huynen M Jaeggi D et al STRING a databaseof predicted functional associations between proteinsNucleic Acids Res 200331258ndash61

61 Franceschini A Szklarczyk D Frankild S et al STRING v91protein-protein interaction networks with increased cover-age and integration Nucleic Acids Res 201341D808ndash15

62 Sales G Calura E Cavalieri D et al Graphite - a Bioconductorpackage to convert pathway topology to gene network BMCBioinformatics 20121320

63 Su AI Wiltshire T Batalov S et al A gene atlas of the mouseand human protein-encoding transcriptomes Proc Natl AcadSci USA 20041016062ndash7

64 Ponten FK Schwenk JM Asplund A et al The human proteinatlas as a proteomic resource for biomarker discovery JIntern Med 2011270428ndash46

65 Bossi A Lehner B Tissue specificity and the human proteininteraction network Mol Syst Biol 20095260

66 Magger O Waldman YY Ruppin E et al Enhancing the pri-oritization of disease-causing genes through tissue specificprotein interaction networks PLoS Comput Biol 20128e1002690

67 Shipley B Cause and Correlation in Biology A Userrsquos Guide toPath Analysis Structural Equations and Causal Inference Vol 2Cambridge Cambridge University Press 2002 p 332

68 Society TE Granger CWJ Investigating causal relations byeconometric models and cross-spectral methodsEconometrica 196937424ndash38

69 Chindelevitch L Ziemek D Enayetallah A et al Causal rea-soning on biological networks interpreting transcriptionalchanges Bioinformatics 2012281114ndash21

70 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 2012652

71 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 20144988

72 Hudson NJ Reverter A Dalrymple BP A differential wiringanalysis of expression data correctly identifies the gene con-taining the causal mutation PLoS Comput Biol 20095e1000382

73 Reverter A Hudson NJ Nagaraj SH et al Regulatory impactfactors unraveling the transcriptional regulation ofcomplex traits from expression data Bioinformatics 201026896ndash904

74 Watson M CoXpress differential co-expression in gene ex-pression data BMC Bioinformatics 20067509

12 | Glaab

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1

75 Kelley R Ideker T Systematic interpretation of genetic inter-actions using protein networks Nat Biotechnol 200523561ndash6

76 Ulitsky I Shamir R Pathway redundancy and protein essen-tiality revealed in the Saccharomyces cerevisiae interactionnetworks Mol Syst Biol 20073104

77 Ma X Tarone AM Li W Mapping genetically compensatorypathways from synthetic lethal interactions in yeast PLoSOne 20083e1922

78 Brady A Maxwell K Daniels N et al Fault tolerance inprotein interaction networks Stable bipartite subgraphsand redundant pathways PLoS One 20094e5364

79 Glaab E Baudot A Krasnogor N et al Extending pathwaysand processes using molecular interaction networks to ana-lyse cancer genome data BMC Bioinformatics 201011597

80 Martin S Roe D Faulon JL Predicting protein-proteininteractions using signature products Bioinformatics 200521218ndash26

81 Singh R Xu J Berger B Struct2net integrating structure intoprotein-protein interaction prediction Pac Symp Biocomput200611403ndash14

82 Skrabanek L Saini HK Bader GD et al Computational pre-diction of protein-protein interactions Mol Biotechnol 2008381ndash17

83 Zhang QC Petrey D Garzon JI et al PrePPI a structure-in-formed database of protein-protein interactions NucleicAcids Res 201341D828ndash33

84 Zahiri J Mohammad-Noori M Ebrahimpour R et al LocFusehuman proteinndashprotein interaction prediction via classifierfusion using protein localization information Genomics2014104496ndash503

85 Zhang QC Petrey D Deng L et al Structure-based predictionof proteinndashprotein interactions on a genome-wide scaleNature 2012490556ndash60

86 Steyerberg EW Vickers AJ Cook NR et al Assessing the per-formance of prediction models a framework for traditionaland novel measures Epidemiology 201021128ndash38

87 Habbema JD Hilden J Bjerregaard B et al The measurementof performance in probabilistic diagnosis I the problem de-scriptive tools and measures based on classification matri-ces Methods Inf Med 197817217ndash26

88 Gerds TA Cai T Schumacher M The performance of riskprediction models Biom J 200850457ndash79

89 Hilden J Habbema JD Bjerregaard B The measurement ofperformance in probabilistic diagnosis III Methods basedon continuous functions of the diagnostic probabilitiesMethods Inf Med 197817238

90 Hosmer DW Hosmer T Le Cessie S et al A comparison ofgoodness-of-fit tests for the logistic regression model StatMed 199716965ndash80

91 Obuchowski NA Receiver operating characteristic curvesand their use in radiology Radiology 20032293ndash8

92 Brier GW Verification of forecasts expressed in terms ofprobability Mon Weather Rev 1950781ndash3

93 Vickers AJ Elkin EB Decision curve analysis a novel methodfor evaluating prediction models Med Decis Mak 200626565ndash74

94 Pintilie M Iakovlev V Fyles A et al Heterogeneity andpower in clinical biomarker studies J Clin Oncol 2009271517ndash21

95 Wood IA Visscher PM Mengersen KL Classification basedupon gene expression data bias and precision of error ratesBioinformatics 2007231363ndash70

96 Molinaro AM Simon R Pfeiffer RM Prediction error estima-tion a comparison of resampling methods Bioinformatics2005213301ndash7

97 Micheel CM Nass SJ Omenn GS et al Evolution ofTranslational Omics Lessons Learned and the Path ForwardWashington DC The National Academic Press 2012

98 Paik S Shak S Tang G et al A multigene assay to predict re-currence of tamoxifen-treated node-negative breast cancerN Engl J Med 20043512817ndash26

99 Dowsett M Cuzick J Wale C et al Prediction of risk of distantrecurrence using the 21-gene recurrence score in node-negative and node-positive postmenopausal patients withbreast cancer treated with anastrozole or tamoxifen aTransATAC study J Clin Oncol 2010281829ndash34

100 Bedard PL Mook S Piccart-Gebhart MJ et al MammaPrint70-gene profile quantifies the likelihood of recurrence forearly breast cancer Expert Opin Med Diagn 20093193ndash205

101 Nielsen T Wallden B Schaper C et al Analytical validationof the PAM50-based Prosigna Breast Cancer Prognostic GeneSignature Assay and nCounter Analysis System usingFormalin-fixed paraffin-embedded breast tumor specimensBMC Cancer 201414177

102 Bartlett JMS Bloom KJ Piper T et al Mammostrat as animmunohistochemical multigene assay for prediction ofearly relapse risk in the tamoxifen versus exemestane adju-vant multicenter trial pathology study J Clin Oncol 2012304477ndash84

103 Pillai R Deeter R Rigl CT et al Validation and reproducibilityof a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensJ Mol Diagnostics 20111348ndash56

104 Deng MC Eisen HJ Mehra MR et al Noninvasive discrimin-ation of rejection in cardiac allograft recipients using geneexpression profiling Am J Transplant 20066150ndash60

105 Elashoff MR Wingrove JA Beineke P et al Development of ablood-based gene expression algorithm for assessment ofobstructive coronary artery disease in non-diabetic patientsBMC Med Genomics 2011426

106 Abraham J OVA1 test for preoperative assessment of ovar-ian cancer Community Oncol 20107249ndash50

Diagnostic specimen classification | 13

at University of L

uxembourg on A

ugust 28 2015httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv044-TF1