
Comparative and Functional Genomics
Comp Funct Genom 2004; 5: 201–204.
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.385

Conference Review

Data integration and analysis for medical systems biology

Johannes H. G. M. van Beek*
Data Integration, Analysis and Logistics (DIAL), Centre for Medical Systems Biology, Leiden, Rotterdam and Amsterdam, The Netherlands

*Correspondence to: Johannes H. G. M. van Beek, Vrije Universiteit, Faculty of Earth and Life Sciences, Department of Molecular Cell Physiology, De Boelelaan 1085, 1081 HV Amsterdam, The Netherlands. E-mail: [email protected]

Received: 7 November 2003
Revised: 21 December 2003
Accepted: 22 December 2003

Keywords: data integration; genomics; systems biology; databases; data mining

Introduction

It is like listening to a stewardess in a jet airliner who is explaining the safety measures: you have heard 1000 times before that the human genome has been sequenced and that a flood of data is coming over us. The question is how the massively parallel measurements of large numbers of genes, messenger RNAs, proteins and metabolites are going to help us in prognosis and diagnosis of common human diseases. Is it a manageable problem to explain the behaviour of thousands of biomolecules from our knowledge of the molecular interactions in the cells of the human body? Can we infer from the large molecular datasets how the molecular pathways are organized and interact?

It has been argued that the life sciences are developing into a discovery- and data-driven science, with less emphasis on the hypothesis-driven experimental cycle. However, reasoning from experimentally determined facts to a well-founded theory of the underlying system is problematic. In his book on the structure of scientific revolutions, Kuhn [2] wrote, ‘But though this sort of fact-collecting has been essential to the origin of many significant sciences, anyone ... will discover that it produces a morass’. Is data mining in integrated experimental databases containing large quantities of genomic and systems biology data going to produce a morass, or is this approach useful for generating hypotheses and theories which, after corroboration, lead to valid knowledge?

Medical systems biology

Such questions are particularly important for medical systems biology. Systems biology may be defined as the study of the interactions of the large numbers of molecules (DNA, mRNAs, proteins, metabolites) that form the biological system. Systems biology combines high-density measurement methods, such as DNA chips and proteomics, with computational analysis.

The Centre for Medical Systems Biology (CMSB) in The Netherlands was opened on 1 July 2003. The CMSB is funded in the framework of a 5 year stimulation programme for genomics by the government and implemented by the Netherlands Genomics Initiative [4]. In the CMSB several medical centres (Leiden University Medical Centre, Vrije Universiteit Medical Centre, and Erasmus Medical Centre) collaborate with the Vrije Universiteit Amsterdam, Leiden University and the TNO Prevention and Health Research Institute, under the director Gertjan van Ommen [1].

Copyright 2004 John Wiley & Sons, Ltd.

At the CMSB, genomics and systems biology are used for identifying hidden connections between common diseases, such as Alzheimer’s, depression, migraine, metabolic syndrome, vascular disease, thrombosis, arthritis, cancer and infectious diseases. Such connections between common diseases reflect underlying common biological pathways and may become manifest in the form of co-morbidity. Besides systems biology, another key approach in the CMSB is epidemiology, for which large population and patient groups, tissue sample and data collections are available.

The CMSB’s systems biology research strategy is to combine measurements at several biomolecular levels (genes, gene expression, proteins and metabolites). The CMSB’s working hypothesis is that interconnected changes at these vertical levels provide sensitive signatures of pathology that can be of early prognostic and diagnostic value. An even bigger challenge is to understand the measured changes in thousands of molecules simultaneously in terms of the processes inside the cell. Understanding and controlling the causal relations in the networks of intracellular signalling, transcriptional regulation and metabolism, among others, is important for understanding and influencing the progress of disease. Therapeutic interventions can then be aimed at strategically important points in the system. This goes beyond a single molecular target approach and increases the efficiency of intervention.

DIAL (Data Integration, Analysis and Logistics)

The integration of high-density data in such a medical genomics and systems biology centre requires extensive use of computer-based approaches: integration of databases; statistical analysis of correlations amongst molecular signatures and pathology; data mining to generate hypotheses by induction; and computational analysis of pathways by relating newly measured data to external molecular and pathway databases. Therefore, the CMSB established a central project for data integration, analysis and logistics, termed DIAL.

To define interrelationships between phenotype, genotype and the intermediate biomolecular levels, linking population-based and patient-based cohort databases containing data on pathology with databases of molecular laboratory measurements (e.g. SNPs, microarrays) is a first requirement that is addressed. Further, there is a need to link the CMSB’s new experimental data to external databases containing prior biological knowledge (gene annotations, pathways, etc.) to help in the interpretation of the data. Given the high data volume, the CMSB’s scientists should be supported by artificial intelligence, text mining and efficient links between databases.
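As a rough illustration of what linking a cohort database to a database of molecular measurements can look like, the following minimal sketch joins two tables on a shared patient identifier. The table names, column names and records are hypothetical and are not taken from the CMSB's systems; a real integration would also have to deal with identifier mapping, consent and data quality.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cohort (patient_id TEXT PRIMARY KEY, diagnosis TEXT)")
    conn.execute("CREATE TABLE genotypes (patient_id TEXT, snp_id TEXT, genotype TEXT)")
    # Invented records, for illustration only.
    conn.executemany(
        "INSERT INTO cohort VALUES (?, ?)",
        [("P1", "migraine"), ("P2", "control")],
    )
    conn.executemany(
        "INSERT INTO genotypes VALUES (?, ?, ?)",
        [("P1", "snp_a", "AG"), ("P2", "snp_a", "AA"), ("P3", "snp_a", "GG")],
    )
    return conn


def link_cohort_to_genotypes(conn):
    """Return (patient_id, diagnosis, snp_id, genotype) rows joined on patient_id.

    An inner join keeps only patients present in both tables, so a laboratory
    record without a matching cohort record (P3 here) is dropped.
    """
    query = """
        SELECT c.patient_id, c.diagnosis, g.snp_id, g.genotype
        FROM cohort AS c
        JOIN genotypes AS g ON g.patient_id = c.patient_id
        ORDER BY c.patient_id, g.snp_id
    """
    return conn.execute(query).fetchall()


linked_rows = link_cohort_to_genotypes(build_example_db())
for row in linked_rows:
    print(row)
```

The inner join is one possible design choice: it silently discards unmatched records, which may or may not be desirable when auditing data logistics.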

A fundamental question in the background is: how can valuable biological hypotheses be derived by induction from such large amounts of experimental data, avoiding Kuhn’s morass? The inductive process during data mining should help to construct valuable hypotheses without creating a swamp of distracting findings reflecting noise in the data or artefacts of the data mining method.

Knowledge by induction and data mining

At present, some life scientists seem to think that if huge masses of data are correctly stored in database systems, properly integrated and analysed, comprehensive and valid biological knowledge will emerge. This is expressed by terms such as ‘discovery-driven science’, as opposed to ‘hypothesis-driven research’.

In the seventeenth century, Sir Francis Bacon [8] thought that if all known facts are systematically ordered, a theory of the underlying system could be derived and verified by induction. David Hume, and later Karl Popper, argued that this strategy for arriving at scientific knowledge was erroneous. True progress in science comes about by posing a hypothesis based on existing incomplete knowledge and testing the hypothesis by trying to falsify it in carefully designed experiments which yield new data to fill in gaps [3,6]. If the hypothesis could not be falsified, then the hypothesis was considered corroborated. Definitive logical proof of the correctness of a hypothesis was unattainable. However, the continuing cycle of testing of progressively refined hypotheses reflects the true nature of scientific progress, in Popper’s view.


Given the existence of database technology for establishing and coupling large databases, many scientists now seem to expect a lot from detecting meaningful relations in the databases by computer methods. Clustering of groups of genes with similar gene expression patterns across multiple experiments is an example. However, such correlations should lead to new hypotheses and theories on the organization of the underlying biological system, which still require corroboration. Although the data-driven part of the research is very useful, the hypothesis-driven part must follow to lead to valid knowledge.

The term ‘data mining’ suggests that lots of rubble and rock without value will be dug up along with precious metal. Figuratively speaking, ways of separating shining nuggets of gold from the stone in which they are buried are then a prerequisite for a profitable process. If thousands of molecular changes occur, many correlations are expected based on random fluctuations. Data mining thus supports the inductive part of the scientific process: correlations are found, but it has yet to be determined whether relations are causal.

This fundamental difficulty is at present compounded by the practical problem that the higher density of data often seems to come with lower precision and accuracy. It required great care to obtain ‘old-fashioned’ low-density laboratory measurements, such as biochemical assays. If we perform hypothesis-driven research, a lot of attention is directed to those measurements that are critical for testing the hypothesis. If such a focus on a limited dataset is lacking, special attention is required for data reliability during mass production of high-throughput data. As is true for the mass production of goods, quality control becomes a necessary step.

In the worst case, analysis of large, heterogeneous, and to some extent unreliable, datasets might produce a much too large proportion of spurious correlations to be helpful. In the ideal case, with accurate high-throughput measurements recorded error-free in databases, the experimental work goes forward at tremendous speed, but the question is raised whether data interpretation and understanding can keep up with this. Hypotheses are easy to generate, and proliferate even faster than the data needed to critically examine them, as Robert Pirsig eloquently explained in his novel [5]. Indeed, at a recent genomics meeting, Holstege identified the challenge that in genomics the rate of generation of hypotheses is faster than the rate of verification [9]. Thus, the trouble with data analysis of high-throughput data might become that many more hypotheses can be derived from patterns in the data than can be critically examined.

Bottom-up and top-down data mining

Analysis of large integrated databases of experimental data is going to be an inevitable development. To think of an analogy: while explorations of the earth were done in past centuries by ship, perhaps based on hypotheses of some kind (‘if we go west, we will find a new route to India’), it is definitely not necessary to pose a hypothesis before starting to chart the earth with sensors and imaging equipment using satellites in orbit. However, it is not yet entirely clear how we can circumvent the limitations of the inductive mode of data mining in biomedical databases and follow this up with the necessary critical testing of the hypotheses that are generated. It becomes necessary not only to analyse the integrated databases with inductive methods but also to test hypotheses at a high rate. The integration of inductive and deductive reasoning for data mining has been described in the context of financial and commercial data [7]. The inductive pattern discovery part is termed ‘bottom-up data mining’; the hypothesis-testing part is termed ‘top-down data mining’.
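The two modes can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the method of reference [7]: the data are simulated noise with one planted relation, the variable names (`m00`, `m01`, ...), the correlation threshold of 0.8 and the use of an independent second dataset for the top-down step are all choices made here for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def bottom_up(data, threshold):
    """Inductive step: scan all variable pairs and keep those with |r| >= threshold."""
    names = sorted(data)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(pearson(data[a], data[b])) >= threshold]


def top_down(data, hypotheses, threshold):
    """Hypothesis-testing step: re-examine only the stated pairs on other data."""
    return [(a, b) for a, b in hypotheses
            if abs(pearson(data[a], data[b])) >= threshold]


def simulate(n_vars=60, n_samples=12, seed=0):
    """Two independent hypothetical datasets: pure noise plus one planted relation."""
    rng = random.Random(seed)

    def draw():
        data = {f"m{i:02d}": [rng.gauss(0, 1) for _ in range(n_samples)]
                for i in range(n_vars)}
        # Planted true relation: m01 closely follows m00.
        data["m01"] = [v + rng.gauss(0, 0.1) for v in data["m00"]]
        return data

    return draw(), draw()


discovery, validation = simulate()
candidates = bottom_up(discovery, threshold=0.8)
confirmed = top_down(validation, candidates, threshold=0.8)
print(len(candidates), "candidate pairs;", len(confirmed), "survive the independent test")
```

With few samples and many variables, the bottom-up scan typically returns some pairs that arise from noise alone; the point of the second step is that such pairs rarely reappear in independent data, whereas the planted relation does.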

A relevant philosophical question is whether, if the high-density molecular measurements cover a critical fraction of all the molecules in the system under study, the inductive method can to some extent replace the cycle of hypothesis falsification and formulation of improved hypotheses. Given the large number of molecules present in biological systems, it will be very difficult to keep track of the hypotheses necessary to cover so many molecular measurements. If we were to include all the molecular details, the comprehensive hypothesis would most likely be wrong in at least some of the details.

When searching on the World Wide Web it is not difficult to find statements such as ‘Biology is data-driven science’. However, if the next step of critically investigating hypotheses is neglected, biology may become Kuhn’s morass. Thus, the development of top-down data mining, i.e. hypothesis testing, for the analysis of high-density biological data is important.


Not the trees, but the forest

With regard to the multiple hypothesis testing problem, where a large number of false positive answers arise when a huge number of tests is performed simultaneously, there may be various answers. Some degree of coarse graining may be helpful for some questions. The measurement of the level of a single molecule often does not yield the answer, because what matters is an interconnected, parallel change in many molecules belonging to a pathway. When correlated changes between two molecules appear while testing many possible combinations of molecules, this may be due to random fluctuations, but when many molecules belonging to the same pathway change in a certain direction this provides a more reliable signature of a meaningful change in the system. Therefore, at the CMSB such interconnected changes will be used for prognosis, diagnosis and classification of disease.
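Why a coordinated change across a pathway is a more reliable signature than a change in one or two molecules can be sketched with a simple binomial calculation. The assumptions here are strong and are introduced only for illustration: under the null hypothesis each pathway member is taken to move up or down independently with probability 0.5, whereas real pathway members are often co-regulated, so the resulting probabilities would be too optimistic in practice.

```python
from math import comb


def same_direction_tail(n_members, n_same, p_up=0.5):
    """P(at least n_same of n_members move in one given direction).

    Assumes, for illustration only, that under the null each member moves in
    that direction independently with probability p_up.
    """
    return sum(
        comb(n_members, k) * p_up ** k * (1 - p_up) ** (n_members - k)
        for k in range(n_same, n_members + 1)
    )


# Two molecules moving together versus nine of ten pathway members.
print(same_direction_tail(2, 2))
print(same_direction_tail(10, 9))
```

Under these assumptions, two of two molecules moving in a given direction has a chance probability of 0.25, while nine of ten has a chance probability of 11/1024, roughly 0.011.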

Alternatively, one can concentrate on large correlations or changes whose magnitude is such that, on statistical grounds, fewer than one instance at least that extreme is expected in the total integrated dataset under the null hypothesis, i.e. without a real underlying change or relation. This is analogous to using small E-values for selecting sequence alignments from a BLAST search. For large datasets this is a much more stringent criterion than the common criteria for significance (traditionally p < 0.05 or p < 0.01). However, the E-value has great practical value: if there is one true relation in the dataset and the ‘E-value’ used is 10, the ratio of true to false positives is 1 to 10. The task of weeding out the false positives becomes uncomfortably large at high ‘E-values’.
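The arithmetic behind this criterion can be written out in a few lines. The sketch below treats the ‘E-value’ as the expected number of chance hits, i.e. the number of tests multiplied by the per-test p cutoff; the function names and the example of 1000 molecules are hypothetical, and the ratio calculation assumes that every true relation is detected.

```python
def expected_false_positives(n_tests, p_cutoff):
    """Expected number of chance hits when n_tests null comparisons
    are each called significant at p < p_cutoff."""
    return n_tests * p_cutoff


def p_cutoff_for_e_value(n_tests, e_value):
    """Per-test p cutoff that keeps the expected number of chance hits at e_value."""
    return e_value / n_tests


def true_to_false_ratio(n_true, e_value):
    """Expected ratio of true to false positives, assuming (optimistically)
    that every true relation passes the cutoff."""
    return n_true / e_value


# All pairwise comparisons among 1000 hypothetical molecules.
n_pairs = 1000 * 999 // 2
print(n_pairs, "pairs")
print(expected_false_positives(n_pairs, 0.05), "chance hits expected at p < 0.05")
print(p_cutoff_for_e_value(n_pairs, 1.0), "per-test cutoff for an E-value of 1")
print(true_to_false_ratio(1, 10), "true per false positive at an E-value of 10")
```

For 499,500 pairs, a cutoff of p < 0.05 leaves roughly 25,000 expected chance hits, which is why the E-value criterion is so much more stringent than the traditional thresholds.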

To analyse the data, it is particularly worthwhile to investigate how much of the measured changes can be predicted from reliable prior biological knowledge, preferably formalized in a computational model. If debatable assumptions have to be introduced into the model to explain measured data, these constitute new hypotheses to be tested with new results. Critical re-examination of measured data is sometimes also indicated and helps in data quality control. It will be a big challenge for the future to build a reliable model for sizable parts of the whole biomolecular system.

Conclusion

Integration of databases containing experimental data in genomics and systems biology is going to be an inevitable development. When accurate high-throughput measurements speed up the experimental part of the scientific discovery cycle, the interpretation and analysis part of the scientific process will become more limiting. Many data-mining techniques for use on the integrated databases are inductive in nature and may help the formulation of hypotheses. However, creative scientific reasoning, the design of new experiments, and critical testing of hypotheses, theories and computational models remain of vital importance now that data collection is increased in scale.

Acknowledgements

Work in the Centre for Medical Systems Biology is supported by a Centre of Excellence grant from the Netherlands Genomics Initiative.

References

1. Centre for Medical Systems Biology: http://www.cmsb.nl

2. Kuhn TS. 1970. The Structure of Scientific Revolutions, 2nd edn. University of Chicago Press: Chicago, IL.

3. Magee B. 1973. Popper. Collins: London.

4. Netherlands Genomics Initiative: http://www.genomics.nl

5. Pirsig RM. 1974. Zen and the Art of Motorcycle Maintenance. Corgi Books: London.

6. Popper KR. 1976. Unended Quest. Library of Living Philosophers. Fontana: London.

7. Simoudis E, Livezey B, Kerber R. 1996. Integrating inductive and deductive reasoning for data mining. In Advances in Knowledge Discovery and Data Mining, Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds). AAAI Press: Menlo Park, CA; and MIT Press: Cambridge, MA; 353–373.

8. Sir Francis Bacon. 1620. The New Organon or True Directions Concerning the Interpretation of Nature: http://www.constitution.org/bacon/nov_org.htm

9. Wixon J, Marsh J. 2003. Meeting Report: ESF Programme on Functional Genomics 1st European Conference: Functional Genomics and Disease. Comp Funct Genom 4: 549–557.
