
DESIGN AND DEVELOPMENT OF A BIOINFORMATICS

PLATFORM FOR CANCER IMMUNOGENOMICS

ROBERT MOLIDOR

DOCTORAL THESIS

Graz University of Technology

Institute for Genomics and Bioinformatics

Petersgasse 14, 8010 Graz, Austria

Graz, December 2004


Abstract

The objective of this thesis was to develop a bioinformatics platform for cancer immunogenomics which enables the classification of patients suffering from cancer into clinically useful categories based on molecular markers and gene expression profiles. Furthermore, the discovery as well as the evaluation of markers applied in cancer research should be facilitated.

To this end we have developed a comprehensive and versatile platform consisting of the Microarray Analysis and Retrieval System (MARS) and the Tumoral MicroEnvironment Database (TME.db), two Web-based databases built on a 3-tier architecture implemented in Java 2 Enterprise Edition (J2EE). This unique combination yields a complex data management system which integrates public and proprietary databases, clinical data sets, and results from high-throughput screening technologies.

MARS is a MIAME-compliant microarray database that enables the storage, retrieval, and analysis of multi-color microarray experiment data produced by multiple users and institutions. It covers the complete microarray workflow, from microarray production to the concluding data analysis. By providing a Web service interface and Application Programming Interfaces (APIs), third-party applications can connect to MARS to query data, perform high-level analysis, and finally write back the transformed data sets. The intention of TME.db is to provide researchers with access to means and data particularly relevant to tumor biology and immunology, diagnosis management, and the treatment of cancer patients. TME.db integrates clinical data sets and results from high-throughput screening technologies, with special emphasis on phenotypic and functional data gathered using flow cytometry analysis. Furthermore, the integration of TME.db and MARS makes it possible to directly link gene expression data stored in MARS to patients in TME.db. Since TME.db handles sensitive patient-related data, special attention has been paid to security and to the arising ethical, legal, and social implications, so that specific patients cannot be traced by unauthorized people.

As preliminary studies have shown, this unique combination of Web-based applications and the resulting integration of microarray and flow cytometry data, clinical data sets, and proposed analysis methods provides powerful means to successfully accomplish immunogenomic studies that may ultimately help to decipher cancer-related immunological questions and improve cancer diagnosis and treatment.

Keywords: microarray, tumoral microenvironment, database, cancer, immunogenomics, MIAME, J2EE


Publications

This thesis is based on the following publications as well as upon unpublished observations:

Papers

Maurer M, Molidor R, Sturn A, Hartler J, Prokesch A, Scheideler M, and Trajanoski Z. MARS: Microarray Analysis and Retrieval System, in preparation

Molidor R, Mlecnik B, Maurer M, Galon J, and Trajanoski Z. TME.db - Tumoral MicroEnvironment Database, in preparation

Molidor R, Sturn A, Maurer M, and Trajanoski Z. New Trends in Bioinformatics: From Genome Sequence to Personalized Medicine. Experimental Gerontology, 38(10): 1031-1036, 2003

Book Chapters

Alexander Sturn, Michael Maurer, Robert Molidor, and Zlatko Trajanoski. Systems for Management of Pharmacogenomic Information. In: Pharmacogenomics Methods and Protocols, Humana Press, Totowa, USA, 2004, in press

Conference Proceedings and Abstracts

Molidor R, Mlecnik B, Maurer M, Galon J, Trajanoski Z. TME.db - Tumoral Microenvironment Database, OEGBMT Symposium, Graz, Austria, 2004

Sturn A, Maurer M, Molidor R, Pieler R, Rainer J, Trajanoski Z. MARS: Microarray Analysis and Retrieval System, Keystone Symposia: Biological discovery using diverse high throughput data, Keystone, CO, USA, 2004


Maurer M, Molidor R, Sturn A, Trajanoski Z. MARS: Microarray Analysis and Retrieval System, 6th International Meeting of the Microarray Gene Expression Data Society (MGED6), Aix-en-Provence, France, 2003

Pieler R, Sanchez-Cabo F, Hackl H, Thallinger G, Molidor R, Trajanoski Z. ArrayNorm: A versatile Java tool for microarray normalization, 6th International Meeting of the Microarray Gene Expression Data Society (MGED6), Aix-en-Provence, France, 2003

Maurer M, Gold P.W, Hartler J, Martinez P, Molidor R, Prokesch A, Trajanoski Z. eSCID: A Relational Database System for Mental Health Clinical Research, 3rd Forum of European Neuroscience (FENS 2002), Paris, France, 2002

Molidor R, Galon J, Hackl H, Maurer M, Ramsauer T, Sturn A, Trajanoski Z. Identifying Gene Expression Patterns in Anergic T-Cells and Tumoral Micro-Environment T-Cells, Austrian Academy of Sciences, Vienna, Austria, 2002


Contents

List of Figures
List of Tables
Glossary

1 Introduction
1.1 Background
1.1.1 Cancer
1.1.2 Tumor Microenvironment
1.1.3 Classification and Analysis
1.1.4 Data Integration
1.2 Objectives

2 Methods
2.1 Biological Methods
2.1.1 DNA Microarrays
2.1.2 Flow Cytometry
2.2 Computational Methods
2.2.1 Java 2 Enterprise Edition
2.2.2 Relational Databases
2.2.3 Design Patterns
2.2.4 Model Driven Architecture
2.2.5 Authentication and Authorization

3 Results
3.1 Overview
3.2 MARS
3.2.1 Overview
3.2.2 Asynchronous Tasks
3.2.3 Report Beans
3.2.4 Web Services
3.2.5 MARS Management using JMX
3.3 TME.db
3.3.1 Overview
3.3.2 Security
3.3.3 Integration of MARS and TME.db
3.4 Microarray Analysis
3.5 Analysis of Phenotypic Data

4 Discussion

References

Acknowledgements

Publications


List of Figures

1.1 The tumor microenvironment: Molecular cross-talk at the invasion front
2.1 Two-color cDNA microarray analysis principle
2.2 The principle of oligonucleotide array analysis
2.3 Principal components of a flow cytometer
2.4 Flow cytometry: histogram
2.5 Flow cytometry: dot plot
2.6 3-tier architecture
2.7 Web services
2.8 JMX architecture
2.9 The core J2EE patterns
2.10 The Data Access Object pattern
2.11 The Value List Handler pattern
2.12 MDA pipeline
3.1 Microarray workflow
3.2 MARS plate handling
3.3 MARS raw bioassay data
3.4 MARS status information
3.5 Class diagram of the 'Report Beans'
3.6 Sequence diagram of the 'Report Beans'
3.7 ArrayNorm using MARSExplorer
3.8 TME.db patient information
3.9 TME.db experiment management
3.10 TME.db sample management
3.11 TME.db custom queries
3.12 Integration of TME.db and MARS
3.13 Quality control using MA- and regression plots
3.14 Comparison of duplicated spots using HCL
3.15 Correspondence analysis applied to microarray experiment data
3.16 Comparison of lung and colon cancer patients
3.17 Sample material comparison
3.18 Sample material comparison for colon cancer patients
3.19 Significant markers between tumoral and non-tumoral samples


List of Tables

2.1 Possible FACS tube list to accomplish phenotypic analysis
3.1 Overview of conducted microarray hybridizations
3.2 Stimulation effects using a combination of CD3, CD28, IL-10, or IL-2
3.3 Overview of patients/donors stored in TME.db


Glossary

AAS Authentication and Authorization System

ACL Access Control List

API Application Programming Interface

bp base pair

Cancer A group of diseases characterized by uncontrolled cell division

cDNA complementary DNA

CD Cluster of differentiation

CD4 A glycoprotein that serves as a coreceptor on MHC class II-restricted T-cells. Most helper T-cells are CD4+

CD8 A dimeric protein that serves as a coreceptor on MHC class I-restricted T-cells. Most cytotoxic T-cells are CD8+

CPU Central Processing Unit

DBMS Database Management System

DNA Deoxyribonucleic acid

Dukes Staging System especially applied for colorectal cancer

EIS Enterprise Information System

EJB Enterprise Java Bean

EST Expressed Sequence Tag

FACS Fluorescence-Activated Cell Sorter

GUI Graphical User Interface

HTTP HyperText Transfer Protocol

HTTPS Secure version of HTTP

HUPO Human Proteome Organization

IIOP Internet Inter-Orb Protocol

IP Internet Protocol


JDBC Java Database Connectivity

JMS Java Messaging Service

JMX Java Management Extensions

JSP Java Server Page

J2EE Java 2 Enterprise Edition

LIMS Laboratory Information Management System

MAGE-ML Microarray Gene Expression Markup Language

MDA Model Driven Architecture

Metastasis The spread of cancer from its primary site to other places in the body

MGED Microarray Gene Expression Data Society

MHC A complex of genes encoding cell-surface molecules that are required for antigen presentation to T cells and for rapid graft rejection

MIAME Minimum Information About a Microarray Experiment

mRNA Messenger ribonucleic acid

Neoplasm Any new and abnormal growth; a benign or malignant tumor

NK-cell Natural Killer Cell

PCR Polymerase Chain Reaction

RMI Remote Method Invocation

RNA Ribonucleic acid

SAGE Serial Analysis of Gene Expression

SMTP Simple Mail Transfer Protocol

SOAP Simple Object Access Protocol

SQL Structured Query Language

SSL Secure Sockets Layer

TCP Transmission Control Protocol

TNM Staging system which describes the primary tumor (T), whether or not the cancer has reached nearby lymph nodes (N), and whether there are distant metastases (M)

UDDI Universal Description, Discovery, and Integration

UML Unified Modeling Language

URL Uniform Resource Locator

WSDL Web Service Description Language

XML Extensible Markup Language


Chapter 1

Introduction

1.1 Background

Studies of the immunological response to tumors in the 20th century were paralleled by skepticism concerning the existence of an immune response and doubt concerning its applicability in humans. Within the past decade, increasing information about the molecular basis of tumor-host interactions became available from basic studies in cellular immunology. Along with advances in biotechnology, this information opened novel possibilities for the development of effective immunotherapies for patients with cancer.

Studies examining CD8+ and CD4+ T-cell responses to melanoma and other tumors have uncovered several classes of proteins that give rise to peptides presented by MHC (major histocompatibility complex) molecules [1]. The demonstration that evasive tumors can undergo regression after appropriate immune stimulation by IL-12 (interleukin 12) has shown that it is indeed possible to treat cancer by immune manipulation [2, 3]. Several lines of evidence now indicate that an endogenous immune response against cancer develops in some patients. Tumor-infiltrating lymphocytes can be isolated and expanded in vitro with IL-2. Furthermore, tumor-associated antigen specific T-cells have been detected in some lymph nodes, and it has been suggested that antigen-specific unresponsiveness might explain why such cells are unable to control tumor growth [4]. These recent advances in tumor immunology have fostered the clinical implementation of different immunotherapy modalities. However, the complexity of the immune system network as well as the multidimensionality of tumor-host interactions are still poorly understood. This can also be seen in the 5-year survival rate, which is 63% combined for all cancers, but varies greatly with cancer type, stage, and diagnosis [5].


1.1.1 Cancer

Cancer is a complex disease that involves several genes during development and progression. All cancers begin in cells and are a result of abnormal changes to the genetic material contained in those cells. Usually, in most organs and tissues of a mature organism, a balance between cell renewal and cell death is maintained. A single cell or group of cells can undergo genetic events such as mutations, influenced by inherited or environmental factors as well as by certain levels of hormones or growth factors, which may change the cells' behavior. Since mature cells in the body have a given life span, new cells are generated by proliferation and differentiation when old cells die. Under normal circumstances the number of a particular cell type remains constant. A neoplasm or tumor is formed when new cells are produced but old cells do not die. A tumor can be either benign or malignant. In contrast to benign tumors, which are not capable of indefinite growth, malignant tumors (cancers) continue to grow and become progressively invasive. The process whereby cancerous cells dislodge from a tumor, invade the lymphatic or blood vessels, and are carried to other tissues, where they continue to proliferate, is called metastasis. The primary classification of cancers is based on the embryonic origin of the tissue from which the tumor is derived. Adenomas originate from glandular tissue, carcinomas emerge from epithelial cells, leukemia starts in the bone marrow stem cells, and lymphoma is a cancer originating in lymphatic tissue. Melanoma arises in melanocytes, and sarcoma begins in the connective tissue of bone or muscle [6].

1.1.2 Tumor Microenvironment

Throughout tumor aetiology, progression, and final metastasis, the tumor microenvironment of the local host tissue represents an active participant in the tumor-host interaction [7]. An invasive carcinoma is viewed as a pathology of multiple cell societies inhabiting the epithelial/mesenchymal stromal unit. Preceding the transition to an invasive carcinoma, host fibroblasts, immune cells, and endothelial cells are activated. During invasion, cytokine and enzyme exchange between the participating stromal and premalignant epithelial cells stimulates the migration of both cell types towards each other. This results in a modification of the extracellular matrix (ECM) and finally in a breakdown of the normal tissue boundaries (see figure 1.1).


Figure 1.1: Molecular cross-talk at the invasion front. Cytokine and enzyme exchange between the participating cells stimulates migration of tumor and stroma cells towards each other and modifies the adjacent extracellular matrix/basement membrane. The result is a breakdown of normal tissue boundaries [7].


1.1.3 Classification and Analysis

During the past century, the clinical behavior of human cancer has been predicted using its microscopic appearance. Unfortunately, this approach can only predict general categories; it cannot accurately subcategorize cancer or even predict the clinical outcome. The use of serum markers for cancer screening has led to the perception that single markers will be the answer to early diagnosis and prognosis. Although this approach is valuable, it definitely is an oversimplification of cancer aetiology and progression [8]. Therefore, tumor immunology might greatly benefit from high-throughput technologies like cDNA microarray analysis [9] and SAGE (serial analysis of gene expression) [10], which make it possible to analyze molecular differences between normal and cancer cells at a genome-wide level in a comprehensive and unbiased way. These technologies are promising for directly analyzing tissue [8, 11], but currently require a rather large amount of sample material. Tumor-based gene expression data from DNA microarrays add immense detail and complexity to the information available from traditional clinical and pathological sources and furthermore provide a snapshot of the total gene activity of a tumor. Complex and detailed data on the inherent state of the patient and on the current characteristics of the tumor disease state are provided [12] via the measurement of the mRNA expression levels of thousands of genes simultaneously in a single assay [13, 14]. Additionally, several studies [15,16,17,18,19,20,21,8] demonstrated that it is possible to reveal biologically and/or clinically relevant subgroups of tumors using gene expression profiling and thus improve the understanding of oncogenesis. Besides these high-throughput technologies, methods like FACS (fluorescence-activated cell sorting) [22] or ELISA (enzyme-linked immunosorbent assay) [23] are well established in the field of cancer immunology. In contrast to DNA microarrays, FACS experiments produce information about many different molecules in a single cell and thus also allow differentiation between different stages of normal development and between different malignancies [24].

While all these types of investigations provide important information about the immune system and the complex tumor-host interactions per se, many more conclusions can be drawn when these data are integrated and merged together with clinical records and patient-related data like CT and MRI scans, mammography, or ultrasound. With the advent of bioinformatics and further advances in computing technologies, the objective is to provide the data and tools necessary to identify key components that play a role in cancer and to discover changes contributing to cancer progression and resistance to therapy [25,26].


1.1.4 Data Integration

Complex data management systems can facilitate the identification of key components and the discovery of tiny differences between certain types of cancer in order to improve diagnosis and treatment. These systems have to integrate public and proprietary databases, clinical data sets, and results from high-throughput screening technologies. Furthermore, sophisticated data mining tools have to be provided and integrated into these systems [27,28,29]. The establishment of such complex systems represents a significant challenge for both designers and administrators, and in general requires a high-performance computing environment including parallel processing systems, sophisticated storage and network technologies, databases and database management systems, and application services. An additional layer of complexity is introduced by the security issues arising from the sensitivity of certain types of data. Even if anonymization is enforced, a specific person could be traced back either exactly or probabilistically, due to the amount of remaining information [30]. Therefore special attention has to be paid to Ethical, Legal and Social Implications (ELSI) [31,32]. The management and integration of heterogeneous data sources with widely varying formats and different object semantics is a difficult task due to the lack of international as well as national standards [33,34]. Currently, several efforts are underway to standardize relational data models and/or object semantics in the field of biomedical and molecular biology databases [35]. In the area of microarray databases, for instance [36], the MGED (Microarray Gene Expression Data) Society (http://www.mged.org) proposes MAGE as an object model and MIAME (Minimum Information About a Microarray Experiment) [37] as a standard describing the minimum information required to unambiguously interpret, repeat, and verify microarray experiments. Additionally, the Human Proteome Organization (HUPO) (http://www.hupo.org) is currently working to define community standards for data representation in proteomics to facilitate data comparison, exchange, and verification [38]. The BioPathways Consortium is elaborating a standard exchange format to enable the sharing of pathway information, such as signal transduction, metabolic, and gene regulatory pathways (http://www.biopathways.org). In addition, the Gene Ontology Consortium (http://www.geneontology.org) provides a structured and standardized vocabulary to describe gene products in any organism [39]. In clinical settings, SNOMED (http://www.snomed.org) and ICD (http://www.icd.org) have been established for a standardized classification of diseases and health-related problems [40].


1.2 Objectives

The ultimate goal of this thesis was to develop a system for cancer immunogenomics which enables the classification of patients suffering from cancer into clinically useful categories based on molecular markers and gene expression profiles. Furthermore, this system should facilitate the discovery as well as the evaluation of markers applied in cancer research. By gathering, integrating, and finally bioinformatically analyzing genomic and classical immunological data, the system should make it possible to elucidate fundamental patterns intrinsic to cancer.

Hence the following specific aims were defined:

• Design and implementation of the Tumoral MicroEnvironment Database (TME.db), which enables

– the management of sensitive patient-related data (medical history, surgery, treatment, etc.),
– the management of experimental data coming from various types of investigations,
– the maintenance of the used sample materials and their treatment, and
– the integration of sophisticated query mechanisms to facilitate analysis.

• Design and development of a MIAME-compliant Microarray Analysis and Retrieval System (MARS) that includes

– a laboratory information management system (LIMS) to keep track of all procedures during the microarray production process,
– a laboratory notebook to store and retrieve all information during biomaterial manipulation,
– an application programming interface and a Web service interface to be able to connect tools like normalization, clustering, or pathway annotation applications.

• Integration of TME.db and MARS.


Chapter 2

Methods

2.1 Biological Methods

2.1.1 DNA Microarrays

Gene expression profiling using DNA microarrays has become an essential tool in biomedical research in recent years. Monitoring expression levels for thousands of genes in parallel provides insights into cellular processes and responses that cannot be obtained by looking at one or a few genes [41,42].

The two major types of nucleic acid arrays used in gene expression profiling are a) spotted arrays and b) in situ or oligonucleotide arrays. Spotted arrays are created by robotic deposition of nucleic acids (PCR products, plasmids or oligonucleotides) with a length of up to 3,000 bp onto a glass slide, either by contact or inkjet printing. For in situ arrays, copies of each selected oligonucleotide (usually 20 to 25 nucleotides in length) are synthesized via photolithography and combinatorial chemistry techniques. In this approach, each gene or EST (expressed sequence tag) is represented by a perfect match (PM) and a mismatch (MM) oligonucleotide sequence on the array. The PM and MM differ only by a single base in the center position of their sequence. The purpose of the MM is to capture non-specific binding that would otherwise distort the measured signal of the PM [43]. In general, spotted arrays are used for two-color analysis, where RNA samples are labeled separately with different fluorescent tags like cyanine 3 (Cy3) and cyanine 5 (Cy5). After labeling, the samples are mixed and hybridized to a single array; afterwards the slide is scanned to generate fluorescent images from two channels. The graphical overlay of these two channels can then be used to visualize activated or repressed genes (see figure 2.1).

Figure 2.1: For two-color analysis two populations of cells (treated vs. untreated, healthy vs. diseased, etc.) are prepared. RNA is isolated and each sample is labelled with a different fluorochrome. The samples are pooled and hybridized to a single microarray. Non-specific probes are washed away and the fluorescent image is detected using a scanner [48].

In contrast, for one-color analysis expression profiles of each sample are generated on a different chip using a single fluorescent label (e.g., phycoerythrin), and the resulting images are then compared (see figure 2.2) [44]. The image acquisition step can be accomplished using a variety of fluorescent detection technologies, including instruments containing confocal optics, photomultiplier tubes (PMTs), and charge-coupled devices (CCDs) [45]. Generally, all these kinds of scanners produce images in standard tagged-image file format (TIFF) [46], which are two-dimensional, 16-bit numerical representations of microarray surfaces with intensity values ranging from 0 to 65,535. Thereafter, the image analysis step assigns an expression value to each gene. The resulting data are typically saved in tab-delimited text format to allow further processing. Both the one- and the two-color analysis strategies can be used to compare tissue types such as heart versus brain or normal versus diseased tissue samples, or in time-course experiments, where a sample is subjected to different treatments or conditions [47]. During the last years, further technologies for gene expression profiling have evolved. Baum et al. [49] described a novel microarray platform that integrates all functions needed to perform any array-based experiment in a compact instrument on a researcher's laboratory benchtop. Oligonucleotide probes are synthesized in situ via a light-activated process within the channels of a three-dimensional microfluidic reaction carrier. Arrays can be designed and produced within hours according to the user's requirements and are processed in a fully automatic workflow. Wu et al. [50] presented a novel microarray technology utilizing a porous aluminum-oxide substrate for rapid molecular biological testing.

Figure 2.2: One-color analysis uses a single fluorochrome and two chips to generate expression profiles for two or more cell samples [48].

Experimental Design

Since all resulting conclusions and predictions from microarray experiments rely on the quality of the generated data, it is important to understand the crucial steps that can affect the outcome of the analysis. Besides the technology used, a key issue concerning the success or failure of a microarray experiment is the experimental design [51,52].

The sources of variation in a microarray experiment can be partitioned into three layers [53]. First, there is the biological variation, which is intrinsic to any organism. It may be influenced by genetic or environmental factors, as well as by whether the samples are pooled or individual. To reduce the biological variability it is necessary to perform repeated hybridizations with RNA samples from independent sources. Second, the technical variation is introduced during extraction, labeling, and hybridization of samples. To overcome these shortcomings, at least two samples should be obtained from one biological source to increase precision and to provide a basis for testing differences within a sample group. Third, the measurement error is associated with reading fluorescent signals, which may be affected by factors such as dust on the array. Duplication of spotted clones on a slide should increase precision as well, and should provide quality control and robustness to the experiment.


Normalization and Data Analysis

The hypothesis underlying microarray analysis is that the measured intensities for each arrayed gene represent its relative expression level. Before these levels can be compared appropriately, a number of transformations must be carried out on the data to eliminate questionable or low-quality measurements, to adjust the measured intensities, and to select genes that are significantly regulated between classes of samples. The experimental design and the use of biological and/or technical replicates affect the choice of normalization methods. Generally, normalization tries to remove all non-biological variation introduced into the measurements. This can be achieved either by using self-consistency methods like global normalization, linear regression, and LOWESS (Locally Weighted Linear Regression), or by using quality elements such as self-normalization, controls, and reference channel methods [54,51,55,56,57].

Following normalization, the data have to be analyzed to identify genes that are differentially expressed. Most published studies measured differential expression by the ratio of expression levels, where ratios above a fixed cut-off were said to be differentially expressed [58,59]. Since this fold-change approach is statistically inefficient and leads to an increase in false-positive or false-negative rates [52], several methods based on variants of common statistical tests have been introduced [60]. In general, these methods consist of calculating a test statistic and determining the significance of the observed statistic. A standard test for the detection of differential expression is the t-test or, more commonly, the ANOVA F-test if multiple groups are involved. Besides these parametric tests, which only work on data following a normal distribution, non-parametric rank-based tests like Mann-Whitney [61] can also be used [62]. To discover genes following similar expression patterns, cluster analysis methods can be used [63]. The most prominent are hierarchical clustering [64], k-means [65], self-organizing maps [66], principal component analysis [67], and support-vector machines [68]. All these methods allow simple discrimination of differentially expressed genes, functional annotation of coexpressed genes, diagnostic classification, and investigation of regulatory mechanisms.
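To make the contrast between the fold-change and the test-statistic approach concrete, the following minimal Java sketch compares the two for a single hypothetical gene. The replicate values and cutoffs are illustrative only and are not taken from the studies cited above.

```java
// Minimal sketch contrasting a fixed fold-change cutoff with a two-sample
// t-statistic for one gene; values and cutoffs are illustrative only.
public class DifferentialExpression {

    static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    static double variance(double[] x) {
        double m = mean(x), s = 0;
        for (double v : x) s += (v - m) * (v - m);
        return s / (x.length - 1);
    }

    /** Welch t-statistic for two independent sample groups. */
    static double tStatistic(double[] a, double[] b) {
        double se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / se;
    }

    public static void main(String[] args) {
        // log2 expression ratios of one gene, replicated in two sample groups
        double[] tumor  = {1.8, 2.1, 1.6, 2.3};
        double[] normal = {0.2, -0.1, 0.4, 0.0};

        // fold-change approach: call the gene regulated if the mean
        // log2-ratio difference exceeds a fixed cutoff (here 1.0, i.e. 2-fold)
        boolean foldChangeHit = Math.abs(mean(tumor) - mean(normal)) > 1.0;

        // test-statistic approach: compare |t| against a critical value
        // (in a real analysis this would come from the t-distribution)
        boolean tTestHit = Math.abs(tStatistic(tumor, normal)) > 2.45;

        System.out.println("fold-change hit: " + foldChangeHit
                + ", t-test hit: " + tTestHit);
    }
}
```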

Gene expression profiling of Jurkat T-cells and intratumoral T-cells

For gene expression profiling, Jurkat T-cells are cultivated, and RNA is extracted from them as well as from samples drawn from patients or healthy volunteers.

The Jurkat T-cell line is cultured using RPMI 1640 with Glutamax-I (Invitrogen™ Life Technologies, 95613 Cergy Pontoise Cedex, France) as medium. To this medium 5 vol% of Penicillin (5 units/ml) / Streptomycin (5 µg/ml) (Invitrogen™ Life Technologies) and 10 vol% fetal calf serum (FCS) are added. Jurkat T-cells usually grow at between 0.5 and 2 million cells per ml. To obtain T-cells from patients and healthy donors, the lymphocytes are isolated first. To this end the common Ficoll-Hypaque™ centrifugation protocol using lymphocyte separation medium from Eurobio (91953 Les Ulis Cedex B, France) is applied. T-cell isolation is done using an indirect magnetic labeling system (Pan T Cell Isolation Kit, Miltenyi Biotec, 75011 Paris, France), usually resulting in more than 95% pure T-cells.

Before extracting RNA from T lymphocytes, a FACS analysis is performed to ensure that enough T-cells are available. This analysis is conducted with CellQuest Pro (BD Biosciences, San Jose, CA 95131-1807, USA). For RNA extraction, two different kits are currently in use: the RNA-plus™ kit (Qbiogene, 67402 Illkirch Cedex, France) is primarily used to extract RNA from the Jurkat T-cell line, while the RNAgents® Total RNA Isolation Kit (Promega Corporation, Madison, WI 53711-53399, USA) is used for the T lymphocytes gained from patients and healthy donors. RNA concentration is measured with a spectrophotometer: the UV absorbance at 260 nm and 280 nm is detected and their ratio is calculated. To ensure the quality of the RNA, standard 1% agarose gels are run, and additionally the Bioanalyzer 2100 with the RNA LabChip® (Agilent Technologies, 91745 Massy, France) is used.
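The concentration and purity calculation behind the spectrophotometer reading is simple arithmetic; the following sketch illustrates it using the standard conversion of one A260 unit ≈ 40 µg/ml RNA at 1 cm path length. The readings and dilution factor below are invented for illustration.

```java
// Back-of-the-envelope RNA quantification from spectrophotometer readings;
// assumes the standard factor of 1 A260 unit ≈ 40 µg/ml for RNA (1 cm path).
public class RnaQuantification {
    public static void main(String[] args) {
        double a260 = 0.25;     // absorbance at 260 nm (illustrative)
        double a280 = 0.13;     // absorbance at 280 nm (illustrative)
        double dilution = 10.0; // sample dilution factor

        double concentration = a260 * dilution * 40.0; // µg/ml
        double purity = a260 / a280;                   // ~2.0 for pure RNA

        System.out.printf("RNA: %.1f ug/ml, A260/A280 = %.2f%n",
                concentration, purity);
    }
}
```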

Microarrays are done using the Human 1 cDNA Microarray Kit (Agilent Technologies). This commercial array contains more than 14,000 unique clones sourced from Incyte's LifeSeq Foundation 1 & 2 clone sets and, additionally, various control samples. To obtain labeled cDNA from total RNA, Agilent's Fluorescent Direct Label Kit (Agilent Technologies) is used. For hybridization and washing, the protocol provided by Agilent Technologies is followed. After labeling cDNA with Cy3 (probe) and Cy5 (reference), the solutions are mixed well and incubated at 98°C for two minutes to denature the cDNA. Afterwards the hybridization mixture is cooled down to room temperature and centrifuged at about 13,000 rpm for 5 minutes. Then the probe is applied to the microarray and the slide is placed in a hybridization chamber. The chamber is submerged in a 65°C waterbath for 17 hours. After hybridization, the slides are washed in three consecutive washing cycles and dried. Microarrays are scanned using an Agilent scanner, and image analysis is performed with Agilent's feature extraction software.

2.1.2 Flow Cytometry

Flow cytometry has strongly influenced and revolutionized modern cell biology. It has evolved into an almost all-purpose method with many applications in numerous areas of biomedical research, including immunology, pathology, and medicine.

Flow cytometry is a technology that measures intrinsic and evoked optical signals from cells or cell-sized particles in a moving fluid stream. A flow cytometer or Fluorescence-Activated Cell Sorter (FACS) consists of several different systems that work together to generate the final data product (figure 2.3):

1. A light or excitation source, typically a laser that emits light at a particular wavelength.

2. A liquid flow that moves the suspended cells through the instrument and past the laser.

3. A detector, e.g. a photomultiplier tube (PMT), which is able to measure the brief flashes of light emitted as cells flow past the laser beam.

Figure 2.3: A concentrated suspension of cells is allowed to react with a fluorescent antibody. The cells in the suspension, which is mixed with a buffer, pass one by one through a laser beam, and the fluorescent light emitted as well as the light scattered by each cell is measured. The suspension is then forced through a nozzle which forms tiny droplets containing at most a single cell. Each droplet is given an electric charge proportional to the amount of fluorescence of the cell. The droplets are then separated and sorted based on their charge [69,70].

When cells pass the laser beam, the laser light is deflected from the detector, and this interruption of the laser signal is recorded. Cells having a fluorescently tagged antibody bound to their cell surface antigens are excited by the laser and emit light that is recorded by a second detector system. In its simplest form, a FACS instrument counts each cell as it passes the laser beam and records the level of fluorescence the cell emits.

Attached to the flow cytometer is a computer system which generates the corresponding plots. These plots, mostly histograms or dot plots, display the amount of light a cell scatters or emits. Histograms display the intensity of the detected signal on the X-axis and the number of cells counted on the Y-axis (figure 2.4). If two or more antibodies coupled to different fluorescent dyes are used, the data are more usually displayed in the form of a two-dimensional dot plot (figure 2.5), where the fluorescence of one dye-labeled antibody is plotted against that of a second, with the result that a population of cells labeled with one antibody can be further subdivided by its labeling with the second antibody. These dot or scatter plots generally provide more information than histograms.

Figure 2.4: The histogram displays fluorescence intensity versus the number of cells. It is used to display the distribution of cells expressing a single measured parameter.

Figure 2.5: The standard dot plot places a single dot for each cell whose fluorescence is measured. Dot plots allow the recognition of cells that are 'bright' for both colors, 'dull' for one and bright for the other, dull for both, negative for both, and so on, and can therefore be used to distinguish between different types of cells depending on the markers used.


Immunophenotypic Studies

Immunophenotypic studies try to examine different cell populations within a multicomponent mixture of cells. They are based on the detection of cell surface antigens, which are defined by reactivity with monoclonal antibodies. All monoclonal antibodies that react with the same membrane molecule are grouped into a common cluster of differentiation (CD) [71,72,73]. This means that monoclonal antibodies mark features that are common within a particular cell population and can therefore be used to distinguish between different cell populations.

Phenotypic Analysis of Intra- and Extratumoral T-cells

The samples drawn from patients or healthy volunteers are treated as described in section 2.1.1. For the phenotypic analysis the prepared cell dilution has to be proportionally distributed into several tubes. Each tube's cell mixture is stained with different antibodies marked with different fluorescent dyes, which subsequently bind to their specific antigen receptors. A possible list of tubes is given in table 2.1; FL1H to FL4H denote the different fluorescent dyes used.

Tube  FL1H  FL2H   FL3H  FL4H
1     IgG1  IgG1   IgG1  IgG1
2     CD19  CD56   CD3   CD14
3     CD4   CD103  CD3   CD69
4     CD1a  CD83   CD45  CD14

Table 2.1: Possible FACS tube list to accomplish phenotypic analysis.

IgG1 is an immunoglobulin which specifically binds to a receptor and can thus be used for calibration purposes. Revealing the amount of these special receptors on the cell surface is necessary since other specific antigens may also bind to these receptors and would therefore falsify the result. The calibration process starts by inserting tube 1 into the FACS, whereby each of the acquired dot plots shows a cloud of points lying close together. This cloud is used to calibrate the axes of the dot plot by moving it into the lower left quadrant. All following analyses use this adjustment, which defines whether a cell has positive (e.g. CD3+) or negative (e.g. CD8−) expression of the particular surface markers. As an example, using the tube configuration described in table 2.1 the following cell populations can be distinguished:

• Tube 1: used for calibration purposes.


• Tube 2: CD19+ represents B-cells, CD56+ NK-cells, CD3+ are T-cells, and CD14+ are monocytes.

• Tube 3: CD3+CD4+ cells are TH-cells and CD3+CD69+ are activated T-cells; therefore CD3+CD4+CD69+ cells are activated TH-cells, whereas CD3+CD103+ cells represent a subpopulation of regulatory T-cells.

• Tube 4: CD45+ cells represent all hematopoietic cells, CD1a+CD83+ are mature dendritic cells, CD1a+CD83− immature dendritic cells, and CD14+CD1a− are monocytes.
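The mapping from marker combinations to cell populations can be pictured as a simple lookup table, as in the following minimal sketch. The actual gating is of course performed in the FACS software; the class and the labels here are a hypothetical illustration of the assignments listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the interpretation logic described above: a lookup from
// surface-marker combinations to cell-population labels.
public class PhenotypeLookup {
    public static void main(String[] args) {
        Map<String, String> populations = new LinkedHashMap<>();
        populations.put("CD19+", "B-cells");
        populations.put("CD56+", "NK-cells");
        populations.put("CD3+", "T-cells");
        populations.put("CD14+", "monocytes");
        populations.put("CD3+CD4+", "T helper cells");
        populations.put("CD3+CD4+CD69+", "activated T helper cells");
        populations.put("CD3+CD103+", "regulatory T-cell subpopulation");
        populations.put("CD45+", "hematopoietic cells");
        populations.put("CD1a+CD83+", "mature dendritic cells");
        populations.put("CD1a+CD83-", "immature dendritic cells");

        // print the full marker-to-population table
        populations.forEach((markers, label) ->
                System.out.println(markers + " -> " + label));
    }
}
```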

Proliferation Analysis

The proliferation analysis examines the behavior of cell division and augmentation under certain activation and stimulation conditions. For this investigation CFSE (carboxyfluorescein diacetate, succinimidyl ester) staining is used [74]. CFSE is a non-polar molecule which contains a fluorescein molecule and is able to pass through the cell membrane spontaneously. Inside the cell, CFSE is modified by cytosolic esterases; this makes the CFSE fluorescent and deforms the molecule in a way that it is no longer able to pass the cell membrane. With each cell division the amount of CFSE, and thus the measured fluorescence intensity, is halved.
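As an illustration of this halving behavior, the following sketch computes the expected fluorescence intensity after n divisions; the starting intensity is an arbitrary example value.

```java
// The measured CFSE intensity halves with each cell division, so after n
// divisions a starting intensity I0 is expected at I0 / 2^n.
public class CfseDilution {
    public static void main(String[] args) {
        double i0 = 10000.0; // initial fluorescence intensity (arbitrary units)
        for (int n = 0; n <= 6; n++) {
            double expected = i0 / Math.pow(2, n);
            System.out.printf("after %d divisions: %.0f%n", n, expected);
        }
    }
}
```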

The proliferation assays start with the same proportioning of the cell mixture into tubes as described above. After that, the CFSE-stained cells in each tube are stimulated with a different combination of cytokines, followed by incubation for several days. To activate cells, a combination of CD3/CD28 or PHA has been used. The FACS experiments are conducted at different time points during incubation to observe the cells' behavior with respect to the particular stimulations. One of the four FACS channels is used to detect the CFSE staining.

2.2 Computational Methods

2.2.1 Java 2 Enterprise Edition

Today, enterprise applications deliver services to a broad range of users. These services should be:

• Highly available, to meet the needs of today's global business environment.

• Secure, to protect the privacy of users and the integrity of the enterprise.

• Reliable and scalable, to ensure that business transactions are accurately and promptly processed.

In most cases, these enterprise services are implemented as multitier applications divided into three principal layers, namely the presentation, business, and Enterprise Information System (EIS) layers [75].

• Presentation logic is about how to handle the interaction between the user and the software. This can be as simple as a command line, but most likely it will be a rich graphical user interface (GUI) or a web browser. The primary responsibility of the presentation logic is to display information to the user and to interpret commands from the user into actions upon the domain and data source.

• The business logic or middle tier is primarily responsible for the work that the application needs to do. It involves calculations based on inputs and stored data, validation of data coming from the presentation layer, and figuring out exactly what data source logic to dispatch depending on commands received from the presentation layer.

• The EIS layer or data source layer communicates with other systems that carry out tasks on behalf of the application, like transaction monitors or messaging systems. For most enterprise applications the biggest piece of data source logic is a database, which is mainly responsible for storing persistent data.

The Java 2 Platform, Enterprise Edition (J2EE) [76] provides a component-based approach to the design, development, assembly, and deployment of multitier enterprise applications (see figure 2.6). The application model encapsulates the layers of functionality in specific types of reusable components. Business logic is encapsulated in Enterprise JavaBeans (EJB) components. Client interaction can be presented through plain HTML web pages, through web pages powered by applets, Java Servlets, or JavaServer Pages technology, or through stand-alone Java applications [77]. All these components communicate transparently using various standards like HTML, XML, HTTP, SSL, RMI, IIOP, and others [78,79]. Besides being reusable, J2EE component-based solutions are platform-independent and not tied to the products and application programming interfaces (APIs) of any vendor; properly designed applications can therefore be deployed on any J2EE-compliant application server.
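As a sketch of this vendor independence, the following shows how a stand-alone Java client could reach server-side business logic on any J2EE-compliant server via a JNDI lookup. The PatientManager interfaces and the JNDI name are hypothetical examples, not part of MARS or TME.db as described here.

```java
import java.rmi.RemoteException;
import javax.ejb.CreateException;
import javax.ejb.EJBHome;
import javax.ejb.EJBObject;
import javax.naming.InitialContext;
import javax.rmi.PortableRemoteObject;

// Hypothetical remote and home interfaces of a session bean.
interface PatientManager extends EJBObject {
    int countPatients() throws RemoteException;
}

interface PatientManagerHome extends EJBHome {
    PatientManager create() throws RemoteException, CreateException;
}

// Stand-alone client: looks up the bean's home interface via JNDI and
// invokes a business method; the provider URL would come from jndi.properties.
public class RemoteClient {
    public static void main(String[] args) throws Exception {
        InitialContext ctx = new InitialContext();
        Object ref = ctx.lookup("ejb/PatientManager");
        PatientManagerHome home = (PatientManagerHome)
                PortableRemoteObject.narrow(ref, PatientManagerHome.class);
        PatientManager manager = home.create();
        System.out.println("patients: " + manager.countPatients());
        manager.remove();
    }
}
```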


Figure 2.6: The 3-tier architecture enforces the separation of presentation, business, and data tiers. This architecture is intended to allow any of the three tiers to be upgraded or replaced independently as requirements change.

Enterprise JavaBeans

Written in the Java programming language, an enterprise bean is a server-side component that encapsulates the business logic of an application, which is the code that fulfills the purpose of the application [79]. Enterprise beans can simplify the development of large, distributed applications, because

• the EJB container provides system-level services to enterprise beans, which means that business developers can concentrate on solving business problems and do not have to worry about low-level system services;

• the beans, and not the clients, contain the application's business logic; therefore the client developers can focus on the client side and do not have to code routines that implement business rules, which would make clients "fat";

• the enterprise beans are portable and reusable components which allow new applications to be assembled from existing beans.

The current EJB specification (EJB 2.1) [80] defines three types of enterprise beans, namely session, entity, and message-driven beans [81].

Entity beans model real-world objects, which usually are persistent records in some kind of database. They provide an object-oriented interface that makes it easier for developers to create, modify, and delete data in the database. The two basic types of entity beans are distinguished by the way they manage persistence.

• Container-managed persistence beans: the container knows how a bean instance's persistence and relationship fields map to the database and automatically takes care of inserting, updating, and deleting the data associated with entities in the database.

• Bean-managed persistence beans do all this work manually: the bean developer has to take care of all data manipulation. The container is just responsible for telling the bean when it is safe to insert, update, or delete data.
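A minimal sketch of a container-managed persistence entity bean under the EJB 2.x contract is shown below; the Patient entity and its fields are hypothetical. The persistent fields are declared as abstract accessors, which the container implements and maps to database columns via the deployment descriptor.

```java
import javax.ejb.CreateException;
import javax.ejb.EntityBean;
import javax.ejb.EntityContext;

// Hedged sketch of a CMP 2.x entity bean; entity and fields are hypothetical.
public abstract class PatientBean implements EntityBean {

    private EntityContext context;

    // container-managed persistent fields
    public abstract String getPatientId();
    public abstract void setPatientId(String id);
    public abstract String getDiagnosis();
    public abstract void setDiagnosis(String diagnosis);

    // creation: initialize the CMP fields; the container performs the INSERT
    public String ejbCreate(String id, String diagnosis) throws CreateException {
        setPatientId(id);
        setDiagnosis(diagnosis);
        return null; // with CMP, the container supplies the primary key
    }
    public void ejbPostCreate(String id, String diagnosis) {}

    // life-cycle callbacks; loading and storing are handled by the container
    public void setEntityContext(EntityContext ctx) { this.context = ctx; }
    public void unsetEntityContext() { this.context = null; }
    public void ejbActivate() {}
    public void ejbPassivate() {}
    public void ejbLoad() {}
    public void ejbStore() {}
    public void ejbRemove() {}
}
```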

Session beans represent the workflow, describing all steps required to accomplish a specific task. They manage the interactions between entity beans as well as other session beans and describe how they work together. Session beans are divided into two basic types, distinguished by their conversational state.

• Stateless session beans do not maintain a conversational state for their clients. When a client invokes a method of a stateless bean, the bean's instance variables may contain a state, but only for the duration of the invocation. As soon as the method is finished, the state is no longer retained. Except during method invocation, all instances of a stateless bean are equivalent, allowing an EJB container to assign any instance to any client. Since stateless session beans are able to support multiple clients, they offer very good scalability for applications that require large numbers of clients. Stateless session beans tend to be general-purpose and reusable, such as software services.

• Stateful session beans can be seen as extensions of the client application. They maintain a conversational state with the client and perform tasks on its behalf specific to a particular scenario. Stateful session beans are appropriate as soon as it is necessary to hold information across method invocations. In contrast to entity beans, the conversational state of stateful session beans is not persisted and is lost as soon as the session expires or is closed.
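The following minimal sketch shows the shape of a stateless session bean under the EJB 2.x contract, matching the hypothetical PatientManager interfaces sketched earlier; the business method body is a placeholder.

```java
import javax.ejb.SessionBean;
import javax.ejb.SessionContext;

// Hedged sketch of a stateless session bean holding workflow logic. No
// conversational state is kept between calls, so the container may serve
// any client with any pooled instance.
public class PatientManagerBean implements SessionBean {

    private SessionContext context;

    // business method exposed through the bean's component interface
    public int countPatients() {
        // would typically delegate to entity beans or a data access object
        return 0;
    }

    // life-cycle callbacks required by the SessionBean contract
    public void ejbCreate() {}
    public void ejbActivate() {}
    public void ejbPassivate() {}
    public void ejbRemove() {}
    public void setSessionContext(SessionContext ctx) { this.context = ctx; }
}
```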

A special kind of enterprise bean is the message-driven bean (MDB). MDBs, unlike session and entity beans, are not accessed via component interfaces but are invoked via asynchronous messages. They are stateless, server-side, transaction-aware components and act as Java Messaging Service (JMS) message listeners, which are conceptually similar to event listeners except that they receive JMS messages instead of events.

TheJava Messaging Service[82] provides a way to exchange events and data asynchronously

which means that clients can send information to a messaging system and continue processing,

without having to wait for a response from the system. The messaging system is responsible for

the correct distribution of the messages. A standard way to access any messaging service is de-

scribed by the JMS API. Applications using JMS are called JMS clients whereas the messaging

system itself is the JMS provider. JMS clients able to send messages are termed producers, while JMS clients able to receive messages are termed consumers. JMS clients can be both producers and consumers. The JMS API provides two types of messaging models or messaging domains: publish-and-subscribe and point-to-point. The publish-and-subscribe model is intended for a one-to-many broadcast of messages and is especially useful when consumers can choose whether they want to subscribe to a particular topic. In contrast, the intention of the point-to-point model is a one-to-one delivery of messages, where each message has only one consumer.
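A minimal point-to-point producer is sketched below; the JNDI names are hypothetical and depend on the configuration of the JMS provider.

import javax.jms.Queue;
import javax.jms.QueueConnection;
import javax.jms.QueueConnectionFactory;
import javax.jms.QueueSender;
import javax.jms.QueueSession;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.InitialContext;

public class TaskProducer {
    public static void main(String[] args) throws Exception {
        // Look up the factory and the queue under provider-specific
        // (here hypothetical) JNDI names.
        InitialContext ctx = new InitialContext();
        QueueConnectionFactory factory =
                (QueueConnectionFactory) ctx.lookup("ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("queue/taskQueue");

        // Create connection, session, and sender for the point-to-point domain.
        QueueConnection connection = factory.createQueueConnection();
        QueueSession session =
                connection.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
        QueueSender sender = session.createSender(queue);

        // Send the message and return immediately; a consumer processes
        // it asynchronously.
        TextMessage message = session.createTextMessage("start long running task");
        sender.send(message);
        connection.close();
    }
}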

Together with the EJB container architecture the JMS API allows the concurrent consumption

of messages and provides support for distributed transactions, so that database updates, message

processing, and connections to EIS systems using the J2EE Connector Architecture [83] can all

participate in the same transactional context [82]. The MDB itself is responsible for processing the messages it receives, whereas its container takes care of managing the component's environment, including transactions, security, resources, concurrency, and message acknowledgment. MDBs support both the publish-and-subscribe and the point-to-point domain model and can also act as senders for new messages. Like stateless session beans, MDBs do not hold a conversational state.
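A minimal message driven bean consuming such messages could look like the following sketch (EJB 2.x style); the class name is hypothetical, and the destination it listens to would be declared in the deployment descriptor.

import javax.ejb.MessageDrivenBean;
import javax.ejb.MessageDrivenContext;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

// Hypothetical MDB: the container invokes onMessage() for every JMS
// message arriving at the configured destination; no conversational
// state is held between messages.
public class TaskConsumerBean implements MessageDrivenBean, MessageListener {

    private MessageDrivenContext context;

    public void onMessage(Message message) {
        try {
            if (message instanceof TextMessage) {
                String text = ((TextMessage) message).getText();
                // ... perform the long running task described by 'text' ...
            }
        } catch (Exception e) {
            // Let the container roll back the transaction and redeliver.
            context.setRollbackOnly();
        }
    }

    public void ejbCreate() {}
    public void ejbRemove() {}
    public void setMessageDrivenContext(MessageDrivenContext ctx) {
        this.context = ctx;
    }
}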

JavaServer Pages

The JavaServer Pages (JSP) technology enables developers to rapidly develop and easily maintain information-rich, dynamic Web pages that leverage existing enterprise systems. The JSP technology is a high-level extension of the Java Servlet technology [84], enabling developers to embed Java code in HTML pages. Servlets are platform-independent, server-side modules that can be used to extend the capabilities of a Web server with minimal overhead, maintenance, and support. Unlike other scripting languages, servlets are application components that are downloaded, on demand, to the part of the system that needs them and involve no platform-specific considerations or modifications.


A typical JSP page is a text document containing two types of content: static data, which can be expressed in any text-based format (such as HTML, XML, etc.), and JSP elements, which construct dynamic content [85].
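A minimal page mixing both types of content could look as follows; the request attribute name is hypothetical.

<%-- greeting.jsp: static HTML template data combined with JSP elements --%>
<html>
  <body>
    <%-- scriptlet: embedded Java code executed on each request --%>
    <% String user = (String) request.getAttribute("userName"); %>
    <%-- expression: its value is written into the generated page --%>
    <h1>Welcome, <%= (user != null) ? user : "guest" %>!</h1>
  </body>
</html>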

The Jakarta Struts project [86] provides a powerful open-source framework for developing J2EE Web applications and uses both JSP and Java Servlet technology. Struts enforces the implementation of the Model-View-Controller (MVC) software architecture [87,88,89], which separates a Web application's data model, user interface, and control logic into three distinct components. The consequence of using this architecture is that modifications to the view component can be made with minimal impact on the data model component. This is especially useful since models typically are quite stable, whereas user interface code usually undergoes frequent and sometimes dramatic change. Separating the view from the model makes the model more robust, because the developer is less likely to "break" the model while reworking the view. The general workflow of the MVC pattern is as follows (a minimal controller sketch follows the list):

1. user interacts with the user interface in some way;

2. controller (a Java servlet) receives notification of the user action from the user interface

objects;

3. controller accesses the model, possibly updating it in a way appropriate to the user’s

action;

4. controller delegates to the view for presentation;

5. view uses the model to generate an appropriate user interface;

6. user interface waits for further user interactions.
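In Struts, the controller role of steps 2 to 4 is typically implemented by subclassing Action. The following sketch uses hypothetical names; the forward "success" would be mapped to a JSP in struts-config.xml.

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.struts.action.Action;
import org.apache.struts.action.ActionForm;
import org.apache.struts.action.ActionForward;
import org.apache.struts.action.ActionMapping;

// Hypothetical Struts action: the controller receives the user action,
// updates the model, and delegates to the view for presentation.
public class ShowExperimentAction extends Action {

    public ActionForward execute(ActionMapping mapping, ActionForm form,
                                 HttpServletRequest request,
                                 HttpServletResponse response)
            throws Exception {
        // Access the model (e.g. via the business tier).
        Object experiment = lookupExperiment(request.getParameter("id"));
        // Expose the model to the view (a JSP).
        request.setAttribute("experiment", experiment);
        // Delegate to the view named in the Struts configuration.
        return mapping.findForward("success");
    }

    private Object lookupExperiment(String id) {
        // ... call into the business tier ...
        return id;
    }
}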

Web Services

Web services are a collection of protocols and standards used to exchange data between appli-

cations in a platform-agnostic way. In bioinformatics applications they are becoming more and more important, since software written in various programming languages and running on various platforms can use them to exchange data over computer networks like the Internet [34].

Web services are self-contained, self-describing, modular applications that can be published,

located, and invoked across the Web [90]. Usually they communicate using HTTP and XML and interact with other Web services using standards like the Simple Object Access Protocol (SOAP) [91], the Web Service Description Language (WSDL) [92], and Universal Description, Discovery and Integration (UDDI) [93] services, which are supported by major software suppliers.

These standards enable Web services to achieve wide acceptance and interoperability [94].

Service providers and service requesters must be able to communicate with each other. This

communication requires the use of a common terminology or language. In case of Web ser-

vices, the language of choice is XML. A common language facilitates communication between

the two partners, as each party is able to read and understand the exchanged information based

on the embedded markup tags. Although providers and requesters can communicate using inter-

preters or translators, this is fairly impractical because such intermediary agents are inefficient

and not cost effective.

The establishment of a common language is important, but it is not sufficient for two parties to properly communicate. Effective communication can take place when both parties are able to exchange messages according to an agreed-upon format. SOAP provides such a common message format for Web services and enables unknown parties to communicate effectively with each other.

SOAP is a text-based protocol that uses an XML-based data encoding format and HTTP/SMTP

to transport messages. It is independent of both the programming language and the operational

platform, and it does not require any specific technology at its endpoints. This makes SOAP

completely independent from vendors, platforms, and technologies. Furthermore, because of the text format, it is also possible to tunnel SOAP messages through firewalls using HTTP. The general SOAP message structure is depicted in figure 2.7(a).

In addition to a common message format and markup language, there must be a common format that all service providers can use to specify service details, like the service type, how to access the service, and so on. Such mechanisms enable providers to specify their services so that any requestor can understand and use them. WSDL, the Web Service Description Language, pro-

vides this common specification format for Web services. It specifies a grammar that describes

Web services as a collection of communication endpoints or ports. The data being exchanged

are specified as part of messages. Every type of action allowed at an endpoint is considered

an operation. Collections of operations possible on an endpoint are grouped together into port

types. The messages, operations, and port types are all abstract definitions, which means the

definitions do not carry deployment-specific details to enable their reuse. The protocol and data

format specifications for a particular port type are specified as a binding. A port is defined by

associating a network address with a reusable binding, and a collection of ports define a service.


In addition, WSDL specifies a common binding mechanism to bring together all protocol and

data formats with an abstract message, operation, or endpoint (figure 2.7(b)).

UDDI provides a way for service requesters to look up and obtain details of a service similar

to a telephone system’s yellow pages. This is accomplished by having common, well-known

locations where providers can register their service specifications and where requesters know

to go to find services. By having these common, well-known locations and a standard way to

access them, services can be universally accessed by all providers and requesters (figure 2.7(c)).

(a) The structure of a SOAP message. The SOAP envelope defines the content of the message. The optional header entries may contain information of use to recipients. The body contains the message contents, which are consumed by the recipient. Furthermore, a SOAP message can contain an unlimited number of attachments.

(b) The typical structure of a WSDL document. Shown at the top are the sections that allow a service interface to be described in an abstract way. The sections at the bottom describe how and where to access the concrete service that implements the abstract interface.

(c) The role of a registry in a Web service. A service provider registers its services at a UDDI registry. A requestor looks for a particular service at the UDDI registry, finds it, and gets the information to bind to the Web service provider.

Figure 2.7: Web services
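From a Java client's point of view, a service published in this way can be invoked, for example, via the JAX-RPC dynamic invocation interface. The following sketch is illustrative only; the endpoint address, namespace, and operation name are hypothetical.

import javax.xml.namespace.QName;
import javax.xml.rpc.Call;
import javax.xml.rpc.ParameterMode;
import javax.xml.rpc.Service;
import javax.xml.rpc.ServiceFactory;

public class EchoClient {
    public static void main(String[] args) throws Exception {
        String xsd = "http://www.w3.org/2001/XMLSchema";

        // Create a service and a call dynamically, without generated stubs.
        ServiceFactory factory = ServiceFactory.newInstance();
        Service service = factory.createService(new QName("EchoService"));
        Call call = service.createCall();

        // Describe the endpoint, operation, parameter, and return type.
        call.setTargetEndpointAddress("http://localhost:8080/services/Echo");
        call.setOperationName(new QName("urn:echo", "echo"));
        call.addParameter("message", new QName(xsd, "string"), ParameterMode.IN);
        call.setReturnType(new QName(xsd, "string"));

        // Invoke the remote operation via SOAP over HTTP.
        String result = (String) call.invoke(new Object[] { "hello" });
        System.out.println(result);
    }
}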

Java Management Extensions

For application servers as well as other applications which are running for a long time without

interruption it is sometimes necessary to get some status information or to change parameters

at run-time. For such management tasks it is therefore not appropriate to shut down a complete

application server and thus all the running applications. The Java Management Extensions

(JMX) [95] provide a framework which facilitates the management of such long-running appli-

cations. They define an architecture, the design patterns, the APIs, and the services for application

and network management and monitoring in the Java programming language [96].

The JMX specification [97] divides the management architecture into three distinct layers (see figure 2.8; a minimal code sketch follows the list):

• The instrumentation level provides a specification for implementing JMX manageable resources. These can be an application, an implementation of a service, a device, etc. The instrumentation of a given resource is provided by one or more Managed Beans, or MBeans, that are either standard or dynamic. Standard MBeans are Java objects that conform to certain design patterns derived from the JavaBeans component model. Dynamic MBeans conform to a specific interface that offers more flexibility at runtime. Both standard and dynamic MBeans are designed to be flexible, simple, and easy to implement. Additionally, the instrumentation level also specifies a notification mechanism which allows MBeans to generate and propagate notification events to components of the other levels.

• The agent level provides a specification for implementing management agents, which directly control the resources and make them available to remote management applications. It builds upon and makes use of the instrumentation level to define a standardized agent to manage JMX manageable resources. A JMX agent consists of an MBean server and a set of services for handling MBeans. The MBean server implementation and the agent services are mandatory in an implementation of the specification.

• The distributed services level provides the interfaces for implementing JMX managers. This level defines management interfaces and components that can operate on agents or hierarchies of agents. They can a) provide interfaces for management applications to interact transparently with a JMX agent or its resources, b) distribute management information from high-level management platforms to numerous JMX agents, c) consolidate management information coming from numerous JMX agents into logical views that are relevant to the end user's business operations, or d) simply provide security. Management components cooperate with one another across the network to provide distributed, scalable management functions.

Figure 2.8: Relationship between the components of the JMX architecture within the different levels (adapted from [97]). The instrumentation level provides the specification for JMX manageable resources (MBeans). The agent level contains the MBean server as well as a set of services to handle MBeans. The distributed services level defines management interfaces and components that can operate on agents.
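The first two levels can be made concrete with a minimal sketch: a standard MBean registered with an MBean server. The resource, attribute, and object name below are hypothetical; in a real deployment the interface and classes would be public and reside in separate source files.

import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

// Management interface of a standard MBean; by convention it carries
// the name of the implementing class plus the suffix "MBean".
interface CacheMBean {
    int getSize();
    void setSize(int size);
}

// The managed resource itself (instrumentation level).
class Cache implements CacheMBean {
    private int size = 100;
    public int getSize() { return size; }
    public void setSize(int size) { this.size = size; }
}

// Agent level: the MBean server holds the MBeans and exposes their
// attributes and operations to management applications at run-time.
public class Agent {
    public static void main(String[] args) throws Exception {
        MBeanServer server = MBeanServerFactory.createMBeanServer();
        ObjectName name = new ObjectName("example:service=Cache");
        server.registerMBean(new Cache(), name);
        // Attributes can now be read and changed without a restart.
        System.out.println(server.getAttribute(name, "Size"));
    }
}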

2.2.2 Relational Databases

Relational databases rest upon the theory of relational mathematics, which is based on set theory [98]. The basic idea behind relational database models is a collection of two-dimensional tables, linked among themselves by different keys. Real world objects are mapped to abstract entities which are represented by their corresponding tables. Tables store information about instances of entities, their attributes, and their relations to one another. Every entity instance, consisting of a unique record (tuple) of data, represents a row in the table. The uniqueness of each data record is based on a well-defined primary key (PK) whose occurrence must be unique within each table. To realize one-to-many (1:N) or many-to-one (N:1) interrelations, data records must contain primary keys of foreign tables, called foreign keys (FK). An implementation of many-to-many (N:N) relations requires an additional table storing explicit allocations of different foreign keys. These defined relations allow the stored data to be broken down into smaller, logical, and more easily maintainable units.

The usage of relational database management systems (DBMS) leads to the following advantages:

• Reduction of duplicate data: Leads to improved data integrity.

• Data independence: Data can be thought of as being stored in tables regardless of the

physical storage.

• Application independence: The database is independent of the systems and programs

which are accessing it.

• Concurrency: Data sharing with a multiplicity of users.

• Complex queries: Single queries may retrieve data from more than just one table.

To create, modify, and query databases, the Structured Query Language (SQL) [99] has been

developed. Although SQL has been standardized by the American National Standards Institute

(ANSI) in 1986, many dialects have evolved. The command set of SQL can be divided into the

following three sections:

• The Data Definition Language (DDL) allows new tables and their associated elements, like indexes or constraints, to be defined, altered, and deleted.

• The Data Manipulation Language (DML) is used to query a database and to add, update, and delete data.

• The Data Control Language (DCL) handles authorization aspects of data and permits the user to control who has access to see or manipulate data within the database.
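A brief JDBC sketch illustrates all three command groups from within Java; the driver, connection URL, table, and user names are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlExample {
    public static void main(String[] args) throws Exception {
        // Load the JDBC driver and open a connection (hypothetical URL).
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@localhost:1521:db", "user", "password");
        Statement stmt = con.createStatement();

        // DDL: define a new table with a primary key.
        stmt.executeUpdate("CREATE TABLE person "
                + "(id NUMBER PRIMARY KEY, name VARCHAR2(100))");

        // DML: add and query data.
        PreparedStatement insert = con.prepareStatement(
                "INSERT INTO person (id, name) VALUES (?, ?)");
        insert.setLong(1, 1L);
        insert.setString(2, "Doe");
        insert.executeUpdate();

        ResultSet rs = stmt.executeQuery("SELECT id, name FROM person");
        while (rs.next()) {
            System.out.println(rs.getLong("id") + " " + rs.getString("name"));
        }

        // DCL: grant another user read access to the table.
        stmt.executeUpdate("GRANT SELECT ON person TO some_user");
        con.close();
    }
}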


2.2.3 Design Patterns

Patterns have been around for a long time. Although there is no generally accepted definition of patterns in the informatics community, perhaps the best one comes from the architect Christopher Alexander: "Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice" [75, 100]. Although Alexander was talking about patterns in buildings, the same is true of object-oriented design patterns. They describe proven solutions to recurring design problems and generally consist of the following elements:

• The pattern name is a handle that can be used to describe a problem, its solutions, and consequences in a word or two.

• The problem describes when to apply a particular pattern.

• The solution describes the elements that make up the design, their relationships, responsibilities, and collaborations.

• The consequences, finally, are the results and possible trade-offs of applying the pattern.

Although software design patterns have been proven and are reusable, it is important to remember that using patterns does not guarantee success per se. Figure 2.9 depicts the core J2EE design patterns and their relationships to each other. From this catalog, the following patterns have been extensively used in this thesis:

• Transfer Object

• Session Facade

• Data Access Object

• Value List Handler

Figure 2.9: The core J2EE patterns catalog [101]. This list of patterns can be divided into the following categories: Presentation-Tier, Business-Tier, and Data-Tier patterns.

The Transfer Object pattern, often called Value Object pattern, belongs to the category of business-tier patterns [75]. As soon as one is working with remote calls, each call is very expensive, since data may have to be marshaled, security may need to be checked, packets may need to be routed through switches, etc. Therefore the idea is to reduce the number of calls, with the consequence of increasing the amount of data which has to be transferred with each call. A Transfer Object can hold all the data for one call. Usually Transfer Objects are assembled by a Transfer Object Assembler and are serializable objects which can go across network connections [85].
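As a minimal sketch with hypothetical fields, such a Transfer Object is nothing more than a serializable data holder, so that one remote call can return all attributes at once.

import java.io.Serializable;

// Hypothetical Transfer Object: a plain serializable data holder that
// carries all attributes needed by the client across the network in
// one remote call.
public class PatientTO implements Serializable {

    private Long id;
    private String name;

    public PatientTO(Long id, String name) {
        this.id = id;
        this.name = name;
    }

    public Long getId() { return id; }
    public String getName() { return name; }
}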

The Session Facade is a coarse-grained facade on fine-grained objects that improves network efficiency and can also contain business logic. It tackles the distribution problem with the standard object-oriented approach of separating distinct responsibilities into different objects. Fine-grained objects are the right answer for complex logic. The session facade uses these fine-grained objects, which collaborate within a single process to perform the workflow logic. A detailed description of the session facade pattern and the resulting consequences can be found in [85].

The Data Access Object (DAO) pattern encapsulates all access to a data source and manages the connection with the data source to obtain and store data.

The problem is that many real-world applications need to use persistent data at some point. For

many applications, persistent storage is implemented with different mechanisms, and there are

marked differences in the APIs used to access these different persistent storage mechanisms.

Such disparate data sources offer challenges to the applications and can potentially create a di-

rect dependency between application code and data access code. Including the connectivity and

data access code within business components like entity beans, session beans, and even pre-

sentation components such as servlets and helper objects for JSPs introduces a tight coupling

between the components and the data source implementation. The resulting code dependencies

in these components make it difficult and tedious to migrate the application from one type of

data source to another. When the data source changes, the components need to be changed to

handle the new type of data source.

The DAO implements an access mechanism required to work with disparate data sources.

It completely hides the data source implementation details from its clients and provides a

much simpler interface to them. Because this interface does not change when the underlying

data source implementation changes, this pattern allows the DAO to adapt to different storage

schemes without affecting its clients or business components. Essentially, the DAO acts as an

adapter between the component and the data source. When the underlying storage is subject to change from one implementation to another, this DAO strategy may be implemented using the Abstract Factory pattern described in [100]. In this case, the strategy provides an abstract DAO factory object that can construct various types of concrete DAO factories, each factory supporting a different type of persistent storage implementation. Once a concrete DAO factory is obtained for a specific implementation, it can be used to produce the supported DAOs (figures 2.10(a) and 2.10(b)). The consequences of using the DAO pattern are:

• Enhanced transparency, since the clients do not need to know specific details of a data source.

• Easier migration from one data source to another, since clients are not aware of the underlying details and it is just necessary to adapt the DAO layer.

• Centralization of all data access in one layer, which makes an application more maintainable.
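A compact sketch of this strategy, reusing the hypothetical PatientTO from above; in a real project each public type would reside in its own source file.

// PatientDAO.java -- clients program against this interface and never
// see the underlying storage mechanism.
public interface PatientDAO {
    PatientTO findByPrimaryKey(Long id);
    void insert(PatientTO patient);
}

// DAOFactory.java -- abstract factory producing DAOs for one concrete
// persistent storage implementation.
public abstract class DAOFactory {
    public static final int ORACLE = 1;

    public static DAOFactory getDAOFactory(int whichFactory) {
        switch (whichFactory) {
            case ORACLE:
                return new OracleDAOFactory();
            default:
                throw new IllegalArgumentException("unknown factory");
        }
    }

    public abstract PatientDAO getPatientDAO();
}

// OracleDAOFactory.java -- switching to another DBMS only requires an
// additional factory and DAO implementation; clients stay untouched.
class OracleDAOFactory extends DAOFactory {
    public PatientDAO getPatientDAO() {
        return new OraclePatientDAO();
    }
}

// OraclePatientDAO.java -- the JDBC details are elided in this sketch.
class OraclePatientDAO implements PatientDAO {
    public PatientTO findByPrimaryKey(Long id) { /* SELECT ... */ return null; }
    public void insert(PatientTO patient) { /* INSERT ... */ }
}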

The Value List Handler pattern is useful to control the search, cache the results, and provide the results to the client in a result set whose size and traversal meet the client's requirements.

It is applicable when applications are required to search and list certain data. In some cases,

such a search and query operation could yield results that can be quite large. Therefore it is

impractical to return the full result set when the client’s requirements are to traverse the results,

rather than process the complete set. Typically, a client uses the results of a query for read-only

purposes, such as displaying the result list. Often, the client views only the first few records,

and then may discard the remaining records and attempt a new query. This search activity often

does not involve an immediate transaction on the matching objects. The practice of getting a list of values represented in entity beans by calling an ejbFind() method, which returns a collection of remote objects, and then calling each entity bean to get the value, is very network expensive and is considered bad practice.

Therefore, the solution is to use the value list handler pattern. This pattern creates a handler

which controls query functionality and results caching. The handler directly accesses a DAO

that can execute the required query. Results obtained from the query as a collection of Transfer

Objects are afterwards stored in the handler. The client requests the handler to provide the query

results as needed. The handler generally implements an Iterator pattern [100] to provide the results (see figures 2.11(a) and 2.11(b)).

The usage of the Value List Handler pattern leads to the following consequences:

• It provides a perfect alternative to EJB finders for large queries, since the EJB overhead is not involved.

• The query results can be cached on the server side.

• It provides better querying flexibility, since the Enterprise Java Beans Query Language (EJB-QL) is partly impractical.

• It may improve network performance, since only requested data is shipped to the client on an as-needed basis.
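A minimal sketch of such a handler (hypothetical names; the DAO call is indicated only as a comment) caches the result list and hands it out in portions.

import java.util.ArrayList;
import java.util.List;

// Hypothetical Value List Handler: executes the query once through a
// DAO, caches the resulting Transfer Objects, and returns them to the
// client in small portions instead of as one huge result set.
public class PatientListHandler {

    private List results = new ArrayList();
    private int position = 0;

    public int executeQuery(String criteria) {
        // results = dao.executeSelect(criteria);  // fill the cache via a DAO
        position = 0;
        return results.size();
    }

    // Iterator-style access: hand out the next 'count' cached elements.
    public List getNextElements(int count) {
        int end = Math.min(position + count, results.size());
        List page = new ArrayList(results.subList(position, end));
        position = end;
        return page;
    }
}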


(a) Class Diagram

(b) Sequence Diagram

Figure 2.10: The Data Access Object pattern.


(a) Class diagram

(b) Sequence diagram

Figure 2.11: The Value List Handler pattern.



2.2.4 Model Driven Architecture

The generation of source-code to increase efficiency is a widely accepted concept in today’s

software development projects. Generally, the code generation process is a systematic concretization of an abstract specification which allows an automatic transformation to a concrete target.

This means that the specification of the operation of a system is separated from the details of

the way that a system uses the capabilities of its platform. Platform, in this context, means

something like .NET [102] or J2EE [76]. This principle is well established in informatics, since each high-level programming language is based on this paradigm. Automatic code generation simply resides one level above. The primary goals of the Model Driven Architecture (MDA) [103]

approach are portability, interoperability and reusability through architectural separation of con-

cerns [104].

Ideally MDA generators like AndroMDA [105] build full applications, constructed from a set

of interlocking components, from one or more input models. One of the important concepts

in MDA is the idea of multiple levels of models. This starts at the highest level with the plat-

form independent model (PIM), which means a model free from any system specifics, like bean

specifications (in the case of J2EE). The PIM specifies and maps entities in the most abstract

form possible which still captures all of the requirements. The next level down is the platform

specific model (PSM). This model is an augmented version of the PIM which adds all of the

structures required to implement the PIM, but still modeled in a UML [106] form. Genera-

tors use the PSM to create the output code (see figure 2.12). Between the different models are transformations, which are defined by transformation rules. MDA prescribes the existence of transformation rules, but it is up to the user to define what those rules are.

Figure 2.12: MDA generates code by applying transformations to the platform independent and platform specific models using templates.

The differentiation between the PIM and PSM is a critical piece of the MDA design for gener-

ation. The layer between the model and the templates handles all of the mapping between the


high-level model and the implementation specifics of the system. This means that the templates

can be free to concentrate on rendering code instead of handling both the rendering and the

interpretation of the model. This separation of concerns in the MDA model can be linked to

the architectural model in a 3-tier Web server. The concerns of the platform independent model are held separate from the platform specific model, just as the concerns of the business model are held separate from the user interface. This multi-layer approach is standard software engineering factoring

which makes it easier to abstract the business model and to produce cleaner code. The attempt

to put all transformations into a single layer would lead to more system-related information in the abstract model and thus, finally, to poor quality of the generated code.

2.2.5 Authentication and Authorization

The development of multi-user systems requires sophisticated authentication and authorization

systems especially when sensitive patient related data has to be stored. Authentication normally

is a prerequisite to authorization, but the concepts are different [107,85].

Authentication is the process of determining whether someone or something is, in fact, who or what it is declared to be. In private and public computer networks (including the Internet), authentication is commonly done through the use of logon passwords. Knowledge of the password is assumed to guarantee that the user is authentic. Besides username/password combinations, other principles like digital certificates, keycards, or even more sophisticated fingerprint or retina scans are evolving.

Authorization, in contrast, is the process of giving someone permission to do or have something. In multi-user computer system environments, a system administrator typically defines which users are allowed access to the system and what privileges they have for resources such as file directories, Web sites, etc. This is achieved by assigning controls such as read, write, and delete to these resources. The controls, along with the users or groups they apply to, are typically stored in Access Control Lists (ACLs).
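A strongly simplified sketch, using plain Java collections and hypothetical names, illustrates the idea behind such a list.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified ACL sketch: for each resource, a set of "user:control"
// entries such as "alice:read" or "alice:write" is stored.
public class AccessControlList {

    private Map permissions = new HashMap(); // resource -> Set of entries

    public void grant(String user, String resource, String control) {
        Set entries = (Set) permissions.get(resource);
        if (entries == null) {
            entries = new HashSet();
            permissions.put(resource, entries);
        }
        entries.add(user + ":" + control);
    }

    public boolean isAllowed(String user, String resource, String control) {
        Set entries = (Set) permissions.get(resource);
        return entries != null && entries.contains(user + ":" + control);
    }
}

After a call such as grant("alice", "/reports", "read"), isAllowed() returns true for exactly this combination of user, resource, and control.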


Chapter 3

Results

3.1 Overview

We have developed a bioinformatics platform for cancer immunogenomics based on the Mi-

croarray Analysis and Retrieval System (MARS) and the Tumoral Microenvironment Database

(TME.db). Both MARS and TME.db have the following architectural and software specific components in common: they are Web based applications built on a 3-tier architecture implemented in J2EE technology (see 2.2.1) and accessible via a standard Web browser of the latest generation (e.g. IE5, Netscape 7, Mozilla Firefox). Furthermore, both applications use

the open-source application server JBoss [108] with the integrated Web container Tomcat [109].

This combination provides very good performance and stability, and is freely available. The presentation logic has been developed using Jakarta Struts (see 2.2.1), which facilitates the implementation of the Model View Controller (MVC) pattern. Currently, an Oracle [110] relational

Database Management System (DBMS) serves as Enterprise Information System (EIS), but this can be changed to another DBMS, depending on the requirements as well as availability, due to the application architecture used. To provide both applications with a user management

system, the development and subsequent integration of a sophisticated Authentication and Au-

thorization System (AAS) has been initiated [107]. This AAS integrates a central management

for all users, applications and the application based user access levels. Administration can be

conducted using a Web based environment, either for all or for specific applications. Furthermore, using this central AAS enables single sign-on and therefore application-overlapping authentication and, subsequently, authorization and access to the resources of multiple applications.

The next sections provide more detailed information on the developed systems.


3.2 Microarray Analysis and Retrieval System (MARS)

The objective was to develop a Web based, MIAME [37] compliant microarray database that

enables the storage, retrieval, and analysis of multi-color microarray experiment data produced

by multiple users and institutions. Furthermore well established applications should be able to

connect to MARS through either Web services or APIs to query the data, perform high-level

analysis, and finally write back the transformed data sets.

After elaborating the typical microarray workflow starting from the microarray production to the

image analysis process, we have recorded all parameters necessary to comply with the proposed

MIAME standard as well as additional quality control parameters which are indispensable for

high-quality microarray data analysis. Furthermore already available microarray databases or

data repositories like BASE [111], TIGR Madam [112], RAD [113,114], or ArrayExpress [115]

have been evaluated. Their advantages as well as disadvantages have been taken into account for the design of MARS. Starting from the elaborated microarray workflow (figure 3.1), a relational data model comprising all gathered parameters and possible interrelations has been developed. The database schema as well as the Web based user interface are separated into several related components, namely a) microarray production, including clone tracking, capture probe production, and array production; b) sample preparation, ranging from sample annotation via RNA extraction to the labeling; c) hybridization, starting from pre-processing to raw data processing; d) experiment annotation, including the definition of experiment classes and the relationship between the hybridizations; e) the management section, providing possibilities to up- and download data, to define file formats and parsers, and to get status information; and f) an administration section to manage users and providers of a particular institute.

Figure 3.1: Typical microarray workflow including production, experiment description, sample preparation, hybridization, and image analysis.

3.2.1 Overview

Microarray Production

To address the needs of many laboratories which are producing their own microarrays, MARS includes a sophisticated Array Production LIMS (Laboratory Information Management System). This LIMS is dedicated to managing all data accumulating during the microarray production

process. Typically, all clones which will be spotted on arrays are available in microtiter plates.

Since these plates can contain different products such as bacteria, oligos, PCR products, and can


have a different number of wells (96, 384, 1536), different types of plates must be definable. Furthermore, since all these plates can undergo certain manipulations, i.e. events like PCR reactions or PCR purifications, a possibility to trace back all modifications has to be provided. The flexible design of the LIMS facilitates both without having to amend the source code. It is possible to assign a specific plate type to a plate. Besides name and abbreviation, the plate type contains the well number, well type, and provider. To follow up plate modifications, plate events can be used to describe them. Usually, plate events have a plate event type like PCR attached, as well as an event specific protocol which should guarantee reproducibility. Moreover, in order to support user-defined pipetting schemas for the production of new plates from existing ones, arbitrary plate merging algorithms can be defined and finally applied to the plates. Plate events and plate merging ensure a parent-child relationship between two or more plates (figure 3.2).

Having all necessary plates in the database, a spotting run can be set up. The output of a spotting run is a file which includes a list of plates and their corresponding molecules and is used by the spotting robot software to create an array design file. This file provides a reproducible and precise

mathematical map from spots on the microarrays to wells in the microtiter plates, and therefore

to the cDNA clones and the genes they represent. It has to be uploaded to MARS again in order

to create an array batch which assigns the array design to all arrays spotted during a certain

spotting run. Additionally, parameters regarding the spotting run such as temperature, duration,

or humidity can be added to each array batch for annotation purposes.

Laboratories using commercial arrays also have to upload the array design of their arrays pro-

vided by the manufacturer and to define an array batch afterwards.

Sample Preparation

A sample is the biological material (biomaterial), from which the nucleic acids have been ex-

tracted for subsequent labeling and hybridization. In this section all steps preceding the hy-

bridization with the array are recorded. Usually one can distinguish between the source of the sample (e.g., organism, cell type or line), its treatment, the extract preparation, and its labeling. MARS provides user definable ways to annotate the biological materials. Since annotations in free text fields would make them difficult to query, three different annotation types are provided, namely enumerations, numbers, and free text fields. Moreover, annotations based on these types can be either mandatory or optional and assigned to specific treatments or manipulations. Using this annotation mechanism, samples can be described with an unprecedented degree of accuracy. After entering, annotating, and possibly pooling samples, extracts have to be created.


(a) Data model for plate handling.

(b) List of plates with the possibility to query. (c) Creating a new plate.

Figure 3.2: MARS plate handling


Again, numerous parameters such as the used sample, extraction method, type, quantity,

or protocol can be recorded. The same is true for labeled extracts, where RNA samples are

labeled with different dyes.

Hybridization

Hybridization records the wet lab hybridization process by associating labeled extracts to the

arrays and describing the protocols and conditions during hybridization, blocking, and wash-

ing. To each hybridization any number of image sets can be attached. An image set consists

of the images resulting from scans using a single scanner setting plus the resulting raw data

sets. In contrast to already available microarray databases, MARS allows the storage of multi-color

hybridizations, which means that the number of labeled extracts used for single hybridizations

can be greater than two. The storage of raw data sets for multi-color hybridizations is achieved

by defining a common set of data and a channel dependent set of data. The common set includes

information on the reporters, their specific position on the array and a false-color image thumb-

nail whereas the channel dependent data deals with the measured parameters like foreground

and background values, their standard deviations, and so on. Specific information on the image

analysis process can be registered, too (figure 3.3).

Experiment Annotation

Experiment annotation and the definition of the experimental design are crucial for further pro-

cessing like normalization and cluster analysis. In MARS, an experiment consists of several

sub-experiments. A sub-experiment is a collection of hybridizations to investigate particular

questions of interest. The annotation of the experiments can be conducted using the MGED

Ontology [116], which is a large set of controlled vocabularies and ontologies [117] that can

be used to describe biological samples, their manipulations as well as the experimental de-

sign. It is possible to define perturbational, methodological, epidemiological design, biological

properties, and biomolecular annotations. The exact experimental design and the relations between hybridizations can be described by defining experiment classes, which correspond to biological

conditions, and associated raw data sets [118].


(a) Data model for handling multi-color rawbioassay data.

(b) Detailed view on the results of a hybridization.

Figure 3.3: MARS rawbioassay data


Management

The idea of the management section in MARS was to provide a single point where all file

uploads to the database can take place. In order to avoid storing all data somewhere on a file server, generated data should be uploaded to MARS immediately. Once stored in the upload area, data can be processed from any computer with an Internet connection, and the required fields can be filled in subsequently. Besides the upload area, the management section of MARS

enables the definition of file formats, which are required to parse files containing information

regarding plates, array designs, raw- and transformed data sets. Furthermore it is possible to get

status information of long running tasks that are started asynchronously to enable users to go

on working without having to wait until the process has finished.

Administration

The administration section allows the management and creation of new users as well as the administration of providers which deliver consumables or provide samples, commercially

available arrays, and so forth.

3.2.2 Asynchronous Tasks

Asynchronous tasks using Message Driven Beans (MDB) have been integrated to enable users

to continue their work without having to wait until long running tasks have finished. Among

these are the upload of plates, array designs, raw or transformed data sets, the creation of

spotting runs as well as the generation of raw data results files which can be downloaded and

further processed with other software packages. The generation of MAGE-ML [119] files,

which are used to deposit microarray data in public repositories like ArrayExpress [115] or

Gene Expression Omnibus (GEO) [120], also belongs to this category of functions. The benefits of this implementation approach are twofold: first, it enables users to continue working, and second, it facilitates the distribution of computationally intensive and memory critical processes. Computationally intensive processes which require less memory should be able to use all CPUs on a multi-processor machine. Therefore the number of processors can be equal to the number

of tasks performed in parallel. In contrast, memory critical processes should run one by one

in order to be able to use the whole memory available. MARS takes this into account and

distributes the MDBs according to the aforementioned requirements. In order to keep users

informed about the progress of their tasks, the current status can be displayed upon request (figure 3.4).

Figure 3.4: Status information page showing queued, currently processed, and finished tasks as well as possible failures.

3.2.3 Report Beans

'Report Beans' have been introduced to facilitate reporting, querying, as well as sorting of huge data sets. Queries often yield quite large results with several thousand data sets. For performance reasons it is impractical to return all data sets to the client, particularly when the client's requirements are just to process a snippet of them. Report beans are designed to process such requests and to return for each request just a subset of the results to the client, which can subsequently browse through them. A report bean can be seen as an extension to the client and is therefore implemented as a stateful session bean (see 2.2.1), because the conversational state of the bean has to be conserved during a particular session. Besides making extensive use of the Value List Handler pattern (see 2.2.3), all report beans implement a common interface which describes the required operations like getOperators(), getSearchableFields(), executeQuery(), getPage(), or getNextElements(). Using this common interface allowed the development of a single JSP, which can be included into various report pages, to access beans for different reports. The participating classes and their relationships are depicted in figure 3.5. The general workflow to fulfill a request to a report bean can be described in the following way (figure 3.6; a sketch of the common interface follows the list):

Figure 3.5: Class diagram of the 'Report Beans'

1. client requests a report session bean (step 1)

2. report session bean creates all underlying objects and finally gets a reference to the data access object (DAO) (step 2-6)

3. client calls executeQuery() (step 7)

4. report bean delegates the call to the DAO which generates the SQL statement (step 8)

5. report bean executes the SQL statement and receives the resulting data, which in fact is a

list of all primary keys (PK) ordered by the field selected for sorting (step 9-11)

6. client calls getNextElements() (step 12)

7. report bean delegates this call to the Transfer Object Assembler that collects all necessary

data and fills a Transfer Object (step 13-18)

8. the resulting Transfer Object is returned to the report bean (step 19)

9. client receives a list of Transfer Objects, dependent on the amount of data which can currently be presented to the user (step 20)
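As announced above, the following sketch shows the common interface; the operation names are the ones mentioned in the text, while parameter and return types are assumptions made for illustration.

import java.util.List;

// Sketch of the common report interface; method names follow the text
// above, signatures are assumed.
public interface Report {

    // Fields a user may search and sort by, and the operators
    // (e.g. equals, contains) that may be applied to them.
    String[] getSearchableFields();
    String[] getOperators();

    // Executes the query in the database and returns the number of hits;
    // internally only the ordered list of primary keys is retrieved.
    int executeQuery(String field, String operator, String value, String sortField);

    // Paging access to the cached result: a specific page, or the next
    // portion of Transfer Objects to be displayed.
    List getPage(int pageNumber);
    List getNextElements();
}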


Figure 3.6: Sequence diagram of the ’Report Beans’


The consequences of the presented approach of report beans are as follows:

• Users can compose queries based on a predefined set of fields and different operators.

• Results can be sorted using these fields.

• Queries are executed directly in the database and therefore no unnecessary network traffic

to the middle tier and especially to the client is caused.

• Because of the common interface provided, application developers can reuse already de-

veloped clients and just have to perform minor changes in the client application.

3.2.4 Web Services

To provide users access to MARS with software they are familiar with (e.g. BioConduc-

tor [121], Matlab [122]), MARS provides a well defined SOAP interface and MARSExplorer,

a Java API software developers can use to extend their Java based programs with data access

functionality. Incorporating these interfaces into already existing software allows, after minor adaptations, to authenticate against MARS, to browse datasets, to filter and download data, and finally to upload transformed data. To demonstrate the applicability of the Java API, we have integrated

MARSExplorer into ArrayNorm [118], a software toolkit for microarray normalization. MAR-

SExplorer enables ArrayNorm to connect to MARS via the provided Web service interface (see figure 3.7). After a successful connection has been established, it is possible to browse through

experiments, sub-experiments and the assigned experiment classes and to get information on

them. A particular sub-experiment including all attached raw data can be downloaded just by selecting the sub-experiment and clicking on the download button. After normalizing the experiment, the transformed data sets can be transferred to MARS where they are stored for further usage.

Figure 3.7: ArrayNorm using MARSExplorer to browse, download, and upload microarray experiment data.

3.2.5 MARS Management using JMX

The basic idea of the Java Management Extensions (JMX) is to manage applications which are running for a long time without interruption. Therefore a ServerSettings MBean, based on the standard MBean interface (see 2.2.1) and managing basic mechanisms, has been integrated into MARS. Basically, it is responsible a) for providing the identification of the underlying Enterprise Information System (EIS), b) for enabling or disabling guest user accounts, and c) for


setting the path to the MARS data root directory, which is a part of the file system exclusively

used for MARS and its data. Under this directory MARS has a defined file structure which

primarily consists of the user’s userID and the name of the section to which a particular file

belongs. Symbolically the file structure looks like

/MARSDATAROOTDIRECTORY/USERID/SECTIONNAME/file.

Furthermore, additional parameters can be added to manage MARS without extra effort. The management is done via the JMX console provided by JBoss.
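Following the standard MBean naming convention (see 2.2.1), the management interface of this bean could look like the sketch below; the attribute names are assumptions derived from the responsibilities a) to c) listed above.

// Sketch of the ServerSettings management interface; attribute names
// are assumptions based on the responsibilities described above.
public interface ServerSettingsMBean {

    // a) identification of the underlying Enterprise Information System
    String getEisIdentification();

    // b) enabling or disabling guest user accounts
    boolean isGuestAccessEnabled();
    void setGuestAccessEnabled(boolean enabled);

    // c) path to the MARS data root directory
    String getDataRootDirectory();
    void setDataRootDirectory(String path);
}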

3.3 Tumoral Microenvironment Database (TME.db)

Since complex data management systems can facilitate the identification of the key components

and the discovery of tiny differences between certain types of cancer and therefore improve di-

agnosis and treatment [27,28], we have initiated the development of TME.db. TME.db aims to

integrate clinical data sets and results from high-throughput screening technologies with special

emphasis on flow cytometry analysis. Its intention is to provide researchers access to data par-

ticularly relevant in tumor biology and immunology, diagnosis management and the treatment

of cancer patients. The key components of this system are a) the patient information system including the medical history, b) the patient related experiment management system which handles data from high-throughput screening technologies like FACS phenotype and proliferation analysis, c) the sample material management, d) the sophisticated query tools, and finally e) an administration section. Since TME.db records sensitive patient related data, several security

mechanisms have been applied.

TME.db has been developed using Model Driven Architecture (MDA) (see section 2.2.4) with

AndroMDA [105] and templates and cartridges specially designed for developing J2EE appli-

cations [123].

3.3.1 Overview

Patient Information

The patient information component is the core of the whole system. It enables the storage of

patient related data including name, gender, age, and several IDs to be able to identify a particu-

lar patient unambiguously in a clinical information system. Furthermore, additional data can be stored for each patient. These data range from personal problems to detailed information on the

medical history as well as on surgery and cancer treatment (figure 3.8). Additionally, cancer can

be described in detail including cancer type, subtype, and localization. It is possible to assign

each cancer a particular stage based on the TNM [124] or Dukes [125] cancer staging schema,

which is especially important since staging is the process of finding out how much cancer there

is in the body and where it is located. This information can be used to plan appropriate treatment

and to help determine the prognosis. A Patientcard can be displayed and printed to give a brief

overview of all clinical data and the experiments conducted for a particular patient. Information

and data which are restricted to particular users will not be displayed, depending on the access controls (see 3.3.2).


(a) Detailed view on a particular patient (b) Personal problems

(c) Medical history (d) Cancer information

(e) Surgical information (f) Patientcard

Figure 3.8: TME.db patient information


Experiment Management

All information regarding experiments conducted for a certain patient is stored in the experi-

ment management. An experiment, in terms of TME.db, combines phenotype and proliferation

analyses using FACS, biological marker investigations, as well as cytokine secretion assays (figure 3.9). Typically such an experiment is annotated with a description and an experiment date and is linked to the used sample material and the relevant cancer information.

(a) Experiment overview (b) Detailed view on proliferation data

(c) Detailed view on phenotypical data (d) Detailed view on biological markers

Figure 3.9: TME.db experiment management

In order to fill the

different sections with information it is either possible to upload or to edit the data manually.


Although it is necessary to provide the possibility to edit each data set manually, uploading data is particularly preferable where many different data points have to be processed at once or come from machines, e.g. phenotype and proliferation data coming from FACS.

Sample Material Management

This section allows managing all data regarding the sample materials gathered from a patient. Besides the detailed description of the material, it is possible to track all treatments a sample passes through between biopsy and final analysis. These include separation, cell purification, and extraction steps. Furthermore, it is possible to manage the resulting tubes and their storage positions (figure 3.10).

(a) Sample treatment overview (b) Detailed view on a tube stored in TME

Figure 3.10: TME.db sample management

Queries

In order to query the database, sophisticated query pages have been added. Basic queries are

used to get information about the patients or donors or to retrieve an overview of the experi-

ments. Additionally, a sophisticated custom query mechanism has been implemented. Within

these custom query pages, it is possible to combine almost all of the available fields of the

database in order to find exactly the piece of information one is looking for. Custom query

pages are divided into two main sections: the listings part provides lists on cancer data, exper-


imental data, and sample material data and can be used to restrict the number of patients used

for the query part, which returns data that can be further processed (figure 3.11). Furthermore,

the parameters selected or entered into the custom query pages can be stored and reused later.

(a) Query page showing cancer information fields (b) List of cancer data

Figure 3.11: TME.db custom queries

Administration

The administration section allows TME administrators to add and edit users, hospitals, and

institutes. Each TME user belongs to one or more user groups and can view or edit information

on patients of particular hospitals (see 3.3.2). Furthermore, the gate-antigen conditions used

for the flow cytometry experiments can be defined as well as the expected expression and mean

fluorescence intensity ranges corresponding to these conditions.

3.3.2 Security

Since TME.db integrates clinical data sets, it is crucial to pay attention to security as well as to the resulting ethical, legal, and social implications (ELSI). Therefore, several mechanisms have been applied to ensure data confidentiality. TME.db is placed behind a firewall which allows only users from specific IP addresses to visit the TME.db Web site, which itself is secured by HTTPS. To log in, these users have to provide a username/password combination. In order to ensure secure passwords, they have to comply with several rules (a minimal sketch of such a check follows the list):


• The password has to contain at least one special character and one number.

• It must be longer than 8 characters.

• Passwords have to be changed at least once a month.

• Already used passwords are kept and cannot be reused until the password has been changed at least 4 times.
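A minimal sketch of such a password check is shown below; the helper class and method names are hypothetical and not the actual AAS code:

    // Hypothetical sketch of the password rules listed above;
    // not the actual AAS implementation.
    public class PasswordPolicy {

        // Returns true if the candidate password satisfies the rules;
        // 'previous' holds the most recently used passwords.
        public static boolean isAcceptable(String password, String[] previous) {
            if (password == null || password.length() <= 8) {
                return false; // must be longer than 8 characters
            }
            boolean hasDigit = false;
            boolean hasSpecial = false;
            for (int i = 0; i < password.length(); i++) {
                char c = password.charAt(i);
                if (Character.isDigit(c)) {
                    hasDigit = true;
                } else if (!Character.isLetterOrDigit(c)) {
                    hasSpecial = true; // at least one special character
                }
            }
            if (!hasDigit || !hasSpecial) {
                return false;
            }
            // history rule: the last 4 passwords may not be reused
            int n = Math.min(previous.length, 4);
            for (int i = 0; i < n; i++) {
                if (password.equals(previous[i])) {
                    return false;
                }
            }
            return true;
        }
    }

The monthly expiry rule is a server-side check against the date of the last password change and is therefore not part of this sketch.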

Each successful and failed login attempt is recorded by the underlying Authentication and Authorization System (AAS) [107] and by TME.db. If a username/password combination has been entered incorrectly three times, the AAS locks the account and informs the administrator. TME.db logs each successful login in order to keep administrators informed about who had access to the database. Since TME.db stores patients treated in different hospitals, a mechanism ensures that only users who are allowed to see patients from a particular hospital have access to their data. Furthermore, the Access Controls (see section 2.2.5) make it possible to hide sensitive data from certain users as well as to restrict special operations. To prevent anybody from obtaining sensitive information directly from the data source, all sensitive information is stored encrypted using password based encryption [126].
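The cited PKCS #5 scheme is available through the standard Java Cryptography Extension; the following minimal sketch illustrates the idea (the algorithm name, salt, and iteration count are illustrative assumptions, not the values used by TME.db):

    import javax.crypto.Cipher;
    import javax.crypto.SecretKey;
    import javax.crypto.SecretKeyFactory;
    import javax.crypto.spec.PBEKeySpec;
    import javax.crypto.spec.PBEParameterSpec;

    // Minimal PKCS #5 password-based encryption sketch using the JCE.
    public class PbeExample {

        public static byte[] encrypt(char[] password, byte[] plain) throws Exception {
            byte[] salt = { 0x7d, 0x60, 0x43, 0x5f, 0x02,
                            (byte) 0xe9, (byte) 0xe0, (byte) 0xae };
            int iterations = 1000;

            // Derive a secret key from the password.
            PBEKeySpec keySpec = new PBEKeySpec(password);
            SecretKeyFactory factory = SecretKeyFactory.getInstance("PBEWithMD5AndDES");
            SecretKey key = factory.generateSecret(keySpec);

            // Salt and iteration count are carried in the parameter spec.
            PBEParameterSpec params = new PBEParameterSpec(salt, iterations);
            Cipher cipher = Cipher.getInstance("PBEWithMD5AndDES");
            cipher.init(Cipher.ENCRYPT_MODE, key, params);
            return cipher.doFinal(plain);
        }
    }

Decryption works analogously with Cipher.DECRYPT_MODE and the same salt and iteration count.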

3.3.3 Integration of MARS and TME.db

TME.db aims to integrate clinical data sets and results from high-throughput screening methods. Since we have developed two separate databases, with TME.db being responsible in particular for managing clinical data and data coming from FACS experiments, we have connected MARS to TME.db in order to integrate gene expression profiling data as well. The link between TME.db and MARS is based on Java Remote Method Invocation (RMI). Although Java RMI is limited to Java applications, it can be used for the current server and application configuration, where both applications use the same technology, are on the same subnet, and are behind a firewall. To add hybridizations to TME.db, users first have to insert the relevant hybridization data into MARS. After creating an experiment in TME.db, users can browse through their hybridizations stored in MARS and select them to be linked to the relevant TME experiment (figure 3.12). Furthermore, the integrated AAS, through its single sign-on feature, enables users to browse the microarray data in detail (a sketch of such an RMI link follows figure 3.12).


Figure 3.12: Integration of TME.db and MARS. (a) TME.db: click to add hybridizations; (b) TME.db: select corresponding hybridizations; (c) MARS: detailed view of hybridization UHC-2:CD3/CD28; (d) TME.db: list of hybridizations matching a specific experiment.
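A minimal sketch of such an RMI link is given below; the remote interface and registry name are hypothetical, since the thesis does not list the actual interface:

    import java.rmi.Naming;
    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.util.List;

    // Hypothetical remote interface exported by the MARS server.
    interface HybridizationService extends Remote {
        // Hybridizations visible to an already authenticated user.
        List findHybridizations(String username) throws RemoteException;
    }

    // TME.db side: look up the remote object and fetch the list;
    // both servers sit on the same subnet behind the firewall.
    public class MarsClient {
        public static void main(String[] args) throws Exception {
            HybridizationService mars = (HybridizationService)
                Naming.lookup("rmi://mars-server/HybridizationService");
            List hybridizations = mars.findHybridizations("someUser");
            System.out.println(hybridizations.size() + " hybridizations found");
        }
    }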

3.4 Analysis of Microarray Data

In order to evaluate the approach of using microarrays to classify patients suffering from cancer, a preliminary study using commercial arrays from Agilent Technologies has been initiated. Therefore, 18 hybridizations using RNA from lung cancer patients and healthy donors have been conducted (table 3.1). In addition to the untreated samples, samples from 6 healthy donors have been stimulated with a cocktail of CD3, CD28, IL-2, and/or IL-10 to induce or inhibit proliferation of T-cells (table 3.2).

Subjects                     Stimulation       Reference        Number of hybridizations
Normals                      Not stimulated    Jurkat T-cells   6
Normals                      CD3/CD28          Jurkat T-cells   3
Normals                      CD3/CD28/IL-10    Jurkat T-cells   2
Normals                      CD3/CD28/IL-2     Jurkat T-cells   1
Patients with lung cancer    Not stimulated    Jurkat T-cells   5
Jurkat T-cells               Not stimulated    Jurkat T-cells   1

Table 3.1: Overview of conducted microarray hybridizations.

All hybridizations have been analyzed using Agilent's Feature Extraction Software (Agilent Technologies). This software package has also been used for within-slide normalization using LOWESS [127], a method which aims to reduce intensity-dependent effects (e.g. dye effects). Afterwards, the quality of the hybridizations has been checked with ArrayNorm [118] using regression and MA-plots [51] (figure 3.13; the plotted quantities are given below the figure). Although ArrayNorm provides other possibilities to check the slide quality, some methods, like the Box-plot to detect spatial effects, could not be used since the provided slides basically consist of a single block (subgrid) and therefore this method fails. Finally, to remove array effects, ArrayNorm has been used to normalize between the slides. In order to verify the reproducibility of the slides, a hybridization has been replicated using RNA from the same samples. The result of the quality measurement revealed a very good correlation of the replicated slides, which is also confirmed by the result of the regression analysis with a value of R² = 0.9784 (figure 3.13(c)). To check the reproducibility of duplicated spots on the arrays, hierarchical cluster analysis (HCL) has been performed on 8 hybridizations (3 donors vs. 5 patients) containing 356 genes with 13 duplicates.

Stimulation      Effect
CD3/CD28         Activates the cells via their T-cell receptor (TCR). CD28 provides the costimulation which is necessary as a second signal to induce T-cell activation and proliferation. This stimulation mimics an antigen stimulation.
CD3/CD28/IL-2    The addition of IL-2 boosts the activation and induces the proliferation of the T-cells.
CD3/CD28/IL-10   IL-10 is an immunosuppressive cytokine which inhibits T-cell activation and proliferation induced by CD3/CD28.

Table 3.2: Effects on T-cells induced by stimulation with either CD3/CD28, CD3/CD28/IL-2, or CD3/CD28/IL-10.


Figure 3.13: Quality control using MA- and regression-plots. (a) MA-plot from slide 2838A02 (healthy donor vs. Jurkat); (b) MA-plot from slide 2839A02 (RNA from the same samples as used for slide 2838A02); (c) regression plot of 2 replicated slides.
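For reference, the quantities shown in an MA-plot and the LOWESS correction follow the cited literature [51, 127] (a textbook formulation, not thesis-specific notation). For the red and green channel intensities R and G of a spot,

    M = \log_2\frac{R}{G}, \qquad A = \frac{1}{2}\log_2(R \cdot G),

and within-slide LOWESS normalization replaces each log-ratio by its residual from the intensity-dependent trend c(A), fitted to the MA-plot by locally weighted regression:

    M_{\mathrm{norm}} = M - c(A).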

Duplicated genes reveal a common gene expression pattern and cluster together, which is an indicator of high-quality hybridizations (figure 3.14).

Figure 3.14: Comparison of duplicated spots using HCL: the results reveal consistency for the selected spots.

In the presented preliminary study, Correspondence Analysis (CA) [128] has been used to distinguish between different groups of patients and to elucidate differences in gene expression between unstimulated and in-vitro stimulated T-cells extracted from healthy donors.

CA reveals a clear separation between controls (green) and lung cancer patients (red) (figure 3.15(a)). The means of the experiments point in opposite directions, which is an indication of significant differences in gene regulation between the two groups under investigation. The second experiment tested the resulting differences between non-stimulated and in-vitro


stimulated samples from healthy donors. Again, CA has been applied to the data set to discrimi-

nate between the four groups: normals, normals stimulated with CD3/CD28, CD3/CD28/IL-10,

and CD3/CD28/IL-2 (figure 3.15(b)).

Figure 3.15: Correspondence Analysis reveals different groups of patients and detects different groups based on in-vitro stimulations. (a) The mean experiments of each group point almost exactly in opposite directions; (b) different genetic regulation based on in-vitro stimulations.

3.5 Analysis of Phenotypic Data

Currently, TME.db contains clinical as well as experimental data from more than 1,000 patients

and donors (table 3.3). Most of the experimental data represent phenotypic information gath-

ered from flow cytometry experiments where more than 250 molecular marker combinations

have been analyzed in order to unambiguously characterize the cells.

Starting from this unique combination of immunophenotypic parameters, studies to categorize

patients and to detect subtle differences between different sample materials as well as markers

which are differentially expressed in these groups have been initiated. For these studies Gene-

sis, a software suite initially developed for gene expression analysis, has been extensively used.

To show that it is possible to distinguish between lung and colon cancer patients based on

specific expression patterns of molecular markers corresponding to different T-cell populations,

data from colon and lung cancer patients have been analyzed.

Primary Cancer   Number of Patients
No               68
Breast           5
Colon            650
Lymphoma         2
Pleura           1
Lung             38
Rectum           266
Sum              1,030

Table 3.3: Overview of patients/donors stored in TME.db.

Although the means of the experiments in CA point in opposite directions, they are quite close to each other, which indicates that there are only slight differences between the two groups (figure 3.16(a)). ANOVA has been used to extract a list of significantly (p < 0.05) differentially regulated markers; the test statistic used for each marker is recalled below. Among these are especially T-cell markers which are expressed at higher levels in colon cancer patients than in lung cancer patients (figure 3.16(b)), indicating a higher immune activity in the group of colon cancer patients.
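For reference, the one-way ANOVA used here tests, for each marker, whether the mean expression differs between the k patient groups; with n_g samples in group g, N samples in total, group means \bar{x}_g, and overall mean \bar{x}, the test statistic is the standard F ratio (a textbook formulation, not thesis-specific notation):

    F = \frac{\sum_{g=1}^{k} n_g (\bar{x}_g - \bar{x})^2 / (k - 1)}{\sum_{g=1}^{k} \sum_{i=1}^{n_g} (x_{gi} - \bar{x}_g)^2 / (N - k)}

A marker is reported as differentially regulated when the p-value derived from F falls below the chosen threshold (here 0.05).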

Figure 3.16: Comparison of lung and colon cancer patients. (a) Correspondence Analysis reveals minor differences between lung and colon cancer patients; (b) significantly differentially regulated molecular markers in lung and colon cancer patients.

To compare the immune infiltrates in different sample materials, phenotypic data from 38 lung cancer patients has been analyzed. Using CA and Principal Component Analysis (PCA) over the experiments, significant differences between the 18 samples from tumor material and the 20 samples from pleural liquid have been detected (figure 3.17). Looking at the expression patterns of these two groups, remarkable observations can be made regarding T-cells. As an example, the marker CD62L is detected in cells from the pleural liquid but not in tumor cells, which indicates that T-cells are present in pleural liquid, but not in tumor. Furthermore, CD49d+, CD28+, CD127+, and CD120b+ cells, which are also markers for T-cells, and CD56+ cells, a marker for NK-cells, show almost the same pattern (figure 3.17(c)) as the CD62L+ ones, which substantiates the finding that there are active T-cells in pleural liquid.

Figure 3.17: Comparing the sample material gives information about T-cell activity. (a) PCA over the experiments reveals distinct groups corresponding to the sample material; (b) CA yields a similar result; (c) T-cell markers are expressed in pleural liquid but not in tumor samples.

A similar study has been conducted to analyze different sample materials from colon cancer

patients. Therefore, data from 62 colon cancer patients has been selected and CA as well as ANOVA (p < 0.01) have been applied to the data set (figure 3.18). The results are as follows:

Blood samples show different expression patterns compared to samples from tumor and adjacent non-tumoral tissue.

Figure 3.18: Comparing the sample materials of colon cancer patients shows distinct expression profiles. (a) CA reveals differences between the different sample materials from colon cancer patients; (b) ANOVA detects a number of markers differentially expressed in the different sample materials from colon cancer patients.

The analysis of the markers revealed that in tumor as well as in non-tumoral samples the immune system is responding against the cancer. The cells participating in the immune reaction are not the same, or at least do not bear the same markers, in the periphery and in the tissue. In order to detect differences between tumor and non-tumoral biopsy samples, ANOVA (p < 0.01) has been applied again to these two groups. Although a p-value of less than 0.01 is very restrictive, 4 markers have been detected as significantly different (figure 3.19). These are CD98 and HLA-DR, which are activation markers expressed on T-cells after they have been stimulated; GITR, which is expressed on a subpopulation of regulatory T-cells that may inhibit other T-cells; and CD3/CD56, marking a subpopulation of cells intermediate between T-cells and natural killer cells (NK-cells).

Figure 3.19: CD98, HLA-DR, GITR, and CD3/CD56 have been detected as significantly regulated between tumor and non-tumoral biopsy samples.


Chapter 4

Discussion

In this thesis we have developed a comprehensive and versatile bioinformatics platform for can-

cer immunogenomics which addresses the needs of whole system approaches in this new and

ambitious research area. In order to unravel the immune system complexity and to identify key

components involved in cancer aetiology, progression, and immuno-surveillance, the usage of

complex data management systems, which integrate public and proprietary databases, clinical

data sets, and results from high-throughput screening technologies, provides a new and promis-

ing approach. The presented platform addresses these requirements and integrates both clinical

and high-throughput experimental data. It consists of two databases: MARS, a MIAME compli-

ant microarray database, and TME.db, a database dedicated to the management of highly sensi-

tive patient related information and experimental data from large-scale flow cytometry analyses.

Both databases are Web based applications employing J2EE as their software architectural plat-

form, which not only eases the development but also facilitates the integration of resulting data.

J2EE technology is, besides Microsoft .NET, the leading enterprise development platform supporting 3-tier architectures and has been used due to its platform independence, free availability, and

the expertise of the participating developers. Compared to 2-tier architectures, employing a 3-

tier architecture brings about the following advantages: a) it is easier to modify or replace any

tier without affecting the other tiers (maintenance), b) separating the application and database

functionality leads to better load balancing and therefore supports an increasing number of users

or more demanding tasks, c) adequate security policies can be enforced within the server tiers

without hindering the clients, and d) client machines require moderate computing power and

memory since all business logic is performed in the application server. The presentation logic


has been developed using the Jakarta Struts framework which encourages application architec-

tures based on the MVC approach. Based on published standards and proven design patterns,

this framework provides the essential underpinnings for developing Web based applications.

The business logic, which is responsible for the work an application needs to do, is implemented

in the middle tier using Enterprise JavaBeans. These beans are deployed to the open-source ap-

plication server JBoss. JBoss is a full featured, J2EE compliant application server offering a

unique architecture built on a microkernel-based design that leverages the Java Management Extensions (JMX). The

common Enterprise Information System for both applications is Oracle, a well established and

performant relational Database Management System (DBMS). Since both software packages

do not use database specific code, the underlying DBMS can be exchanged depending on the

requirements or availability.

MARS as well as TME.db are extensively using Design Patterns, which are proven solutions

to recurring problems. Using design patterns has several major benefits. First, they provide a

way to solve issues related to software development using a proven solution which facilitates

the development of highly cohesive modules with minimal coupling. Second, design patterns

make communication between designers more efficient since developers can immediately pic-

ture the design in their heads when they refer to the name of the pattern used to solve a particular

issue. Third, the combination of these benefits positively influences the maintainability of the

developed software.
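As one example in this setting, a Data Access Object (DAO) decouples the business logic from the persistence mechanism; the sketch below is generic, and the interface and class names are illustrative rather than actual MARS or TME.db code:

    import java.util.Collections;
    import java.util.List;

    // Generic DAO pattern sketch: the business tier depends only on
    // the interface, so the persistence mechanism can be swapped
    // without touching business logic.
    interface PatientDao {
        List findByHospital(String hospitalId);
    }

    class JdbcPatientDao implements PatientDao {
        public List findByHospital(String hospitalId) {
            // JDBC access code would go here; omitted in this sketch.
            return Collections.EMPTY_LIST;
        }
    }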

However, the presented applications do not only have similarities; their specific purposes as well as their differences will be discussed in the next paragraphs.

MARS is a MIAME compliant microarray database that enables the storage, retrieval, and anal-

ysis of multi-color microarray experiment data coming from multiple users and institutions.

Although there are several other commercial as well as academic software packages available

that claim to provide most of the features required for our purposes, several drawbacks became

evident during their evaluation. Commercial packages lacked the possibility to extend them with

programs and features developed by the bioinformatics community whereas academic packages

were mostly written in programming languages like PHP or Perl and therefore lacked scalability.

Based on the outcome of the evaluation of already available microarray databases, we decided

to develop an independent system influenced by outstanding concepts of available packages and

based on the aforementioned 3-tier software architecture. MARS covers the complete microar-

ray workflow starting from microarray production to the concluding data analysis. Besides the


parameters necessary to comply with the proposed MIAME standard, the elaborated microarray

workflow suggests the integration of additional quality control parameters which are indispens-

able for high-quality microarray experiment data. To meet these requirements MARS has been

extended by MARS-QM [129], a powerful quality management system which stores parameters

generated by the standard quality control procedures conducted during the microarray produc-

tion as well as during the sample preparation, RNA extraction and hybridization processes.

MARS-QM is able to trace back all conducted quality control steps and therefore assures high-

quality data, or at least facilitates the understanding of lower-quality data.

MARS provides external data interfaces which allow the connection of third party applica-

tions either through Web services or MARSExplorer, a Java API for the MARS Web service that software developers can use to extend their Java programs with data access functionality. Us-

ing these interfaces, well established Java applications like ArrayNorm [118], Genesis [63], or the PathwayMapper [130] can, on the one hand, be connected via MARSExplorer; on the other hand, the Web service interface enables software packages like BioConductor [121] or Matlab [122], which are not based on Java, to establish a connection to MARS completely independently of the platform they use. Both interfaces provide the same methods, which allow users to browse through experiments, download raw data, and, to a certain extent, upload results after analysis has been performed. This approach differs from writing plug-ins in programming languages users may not be familiar with, as required by other software packages (e.g. BASE [111]) to provide high-level analysis features.

In order to share microarray experiment data with other institutions or to deposit data into repos-

itories like ArrayExpress [115] or GEO [120], MAGE-ML [119] support has been integrated into MARS. Like all time-consuming tasks, such as the upload of an array design, the MAGE-ML export

has been implemented as an asynchronous task using Message Driven Beans. This implemen-

tation approach results in the following benefits: first, it enables users to continue working without having to wait until these processes have finished; second, it facilitates the distribution of computationally intensive and memory-critical tasks and therefore makes it possible to use the server's full capacity.
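A minimal sketch of such a Message Driven Bean (EJB 2.x style) is shown below; the bean name and message contents are illustrative, not the actual MARS code:

    import javax.ejb.MessageDrivenBean;
    import javax.ejb.MessageDrivenContext;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    // The client only posts a JMS message to a queue and can continue
    // working; the container delivers the message to this bean, which
    // performs the long-running export on the server.
    public class ExportBean implements MessageDrivenBean, MessageListener {

        private MessageDrivenContext context;

        public void setMessageDrivenContext(MessageDrivenContext ctx) {
            this.context = ctx;
        }

        public void ejbCreate() { }

        public void ejbRemove() {
            this.context = null;
        }

        // Invoked by the container for each queued export request.
        public void onMessage(Message message) {
            try {
                if (message instanceof TextMessage) {
                    String experimentId = ((TextMessage) message).getText();
                    // the long-running MAGE-ML export would start here
                    System.out.println("exporting experiment " + experimentId);
                }
            } catch (Exception e) {
                context.setRollbackOnly(); // let the container redeliver
            }
        }
    }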

TME.db has been developed in order to provide means and data particularly relevant to tu-

mor biology, diagnosis management, and the treatment of cancer patients. The developed sys-

tem is able to record detailed information on patient related data as well as details on personal

problems, the medical history including cancer treatment and surgery, and cancer staging. Ad-


ditionally, experimental data gathered from phenotypical and functional cell analysis using flow

cytometry can be stored and available gene expression data can be linked using the TME.db-

MARS connection. In order to facilitate experiment management, TME.db has integrated a

simple Laboratory Information Management System (LIMS), which manages all details about

the sample materials as well as their storage. Furthermore, a detailed description of the material is provided and all treatments a sample passes through between biopsy and final data analysis can be

tracked. Although MARS also contains a LIMS to describe the samples, the implementation of

these features was essential to be able to run TME.db as an independent application beyond the

context of MARS. However, a link to the sample management of MARS should be established

in the near future. The feasibility has been successfully demonstrated by integrating microarray

experiment data (hybridizations) into TME.db using RMI. RMI, in contrast to SOAP, does not allow requests to cross network boundaries such as firewalls, and therefore its usage is only appropriate within a specified network environment. Nevertheless, RMI is much faster, since it does not have to bear the XML-based overhead of the SOAP protocol. To allow users

to mine the database in order to find exactly the piece of information they are looking for, so-

phisticated query mechanisms have been integrated, which allow the combination of almost all

fields available in TME.db.

Since TME.db is handling sensitive patient related information, special attention has been paid to security and, in this context, to the arising Ethical, Legal, and Social Implications (ELSI). In

order to avoid that specific patients could be traced back either exactly or probabilistically by

unauthorized people, well thought-out security mechanisms have been integrated. The TME.db

server is placed behind a firewall which only allows users with particular IP addresses to connect

via HTTPS. The underlying Authentication and Authorization System (AAS) manages au-

thentication using a username/password combination enforcing strong passwords. Moreover,

the directed usage of access controls protects sensitive data from being processed by authenti-

cated but unauthorized users.

A notable difference between MARS and TME.db is the way the applications have been

developed. For MARS, we first concentrated on the design of the data model. Based on this data

model, we started to model the relations of Enterprise Java Beans (EJB) and implemented the

data source as well as business logic. In contrast, TME.db has been developed using the Model

Driven Architecture (MDA) approach. After modeling the relations between the EJBs using

UML (Unified Modeling Language), the code generation templates for developing Enterprise


Java Applications have been applied to the model and the code has been generated. Only the code for executing business tasks has to be written by the application developers; the code responsible for the relationships as well as the data source logic has been generated automatically. Although

both ways to develop enterprise applications have several advantages as well as disadvantages,

MDA is better suited when the underlying technology may change, since it is then only necessary to adapt and revise the templates instead of the whole code.

Since no software suite is complete and satisfies all the needs, we are working on several new

features. In order to automatically annotate the microarrays, a connection from MARS to the large scale sequence annotation system Annotator [131] is currently in preparation. Further en-

hancements will include the storage and analysis of Affymetrix arrays; the implementation of a

quality control matrix to gain an overview of the quality of spotted arrays, biological samples,

and hybridizations; and the implementation of an experiment based molecule update to keep

molecule information up-to-date without affecting the initial information and experiments from

other users. TME.db will be enhanced by the possibility to store low density array RT-PCR (re-

verse transcriptase-polymerase chain reaction) data and to integrate information gathered from

tissue microarray experiments [132].

Preliminary microarray studies and analyses of phenotypic data gathered from flow cytome-

try experiments have successfully revealed the usability of this unique bioinformatics platform

for cancer immunogenomics. In general, it has been shown that these data can be used to dis-

tinguish between different groups of patients and to detect subtle differences between them.

Furthermore the analysis of the resulting data sets with sophisticated analysis tools allowed the

detection of markers which seem to be crucial in cancer diagnosis as well as treatment. Al-

though the results are very promising, further analysis has to be performed on specific markers to

verify the results.

In summary, this is the first time a platform dedicated to the management, storage, and analysis

of diverse data accumulating in immunogenomic experiments has been presented. The unique

integration of microarray and phenotypic flow cytometry data, clinical data sets, and proposed

analysis methods provides means to successfully accomplish immunogenomic studies and may

ultimately help to improve cancer diagnosis and treatment.


Bibliography

[1] Riddell SR. Progress in cancer vaccines by enhanced self-presentation. Proc Natl Acad Sci U S A, 98:8933–8935, 2001.

[2] S E Hiscox, M B Hallett, M C Puntis, T Nakamura, and W G Jiang. Expression of the HGF/SF receptor, c-met, and its ligand in human colorectal cancers. Cancer Invest, 15(6):513–521, 1997.

[3] Hess D, Li L, Martin M, Sakano S, Hill D, Strutt B, Thyssen S, Gray DA, and Bhatia M. Bone marrow-derived stem cells initiate pancreatic regeneration. Nat Biotechnol, 21:763–770, 2003.

[4] Lee CK, Klopp RG, Weindruch R, and Prolla TA. Gene expression profile of aging and its retardation by caloric restriction. Science, 285:1390–1393, 1999.

[5] American Cancer Society. Cancer facts and figures 2004. WWW, 2004. http://www.cancer.org.

[6] Janis Kuby, editor. Immunology. W.H. Freeman and Company, 2000.

[7] Liotta LA and Kohn EC. The microenvironment of the tumour-host interface. Nature, 411:375–379, 2001.

[8] Liotta L and Petricoin E. Molecular profiling of human cancer. Nat Rev Genet, 1:48–56, 2000.

[9] Schena M, Shalon D, Davis RW, and Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270:467–470, 1995.

[10] Velculescu VE, Zhang L, Zhou W, Vogelstein J, Basrai MA, Bassett DE Jr, Hieter P, Vogelstein B, and Kinzler KW. Characterization of the yeast transcriptome. Cell, 88:243–251, 1997.

[11] C M St Croix, B J Morgan, T J Wetter, and J A Dempsey. Fatiguing inspiratory muscle work causes reflex sympathetic activation in humans. J Physiol, 529 Pt 2:493–504, Dec 2000.

[12] Nevins JR, Huang ES, Dressman H, Pittman J, Huang AT, and West M. Towards integrated clinico-genomic models for personalized medicine: Combining gene expression signatures and clinical factors in breast cancer outcomes prediction. Hum Mol Genet, 2003.

[13] F Bertucci, R Houlgatte, C Nguyen, P Viens, B R Jordan, and D Birnbaum. Gene expression profiling of cancer by use of DNA arrays: how far from the clinic? Lancet Oncol, 2(11):674–682, Nov 2001.

[14] Steve Mohr, George D Leikauf, Gerard Keith, and Bertrand H Rihn. Microarrays as cancer keys: an array of possibilities. J Clin Oncol, 20(14):3165–3175, Jul 2002.


[15] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, and Lander ES. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.

[16] Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, and Staudt LM. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503–511, 2000.

[17] M E Garber, O G Troyanskaya, K Schluens, S Petersen, Z Thaesler, M Pacyna-Gengelbach, M van de Rijn, G D Rosen, C M Perou, R I Whyte, R B Altman, P O Brown, D Botstein, and I Petersen. Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci U S A, 98(24):13784–13789, Nov 2001.

[18] Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, and Hanash S. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med, 8:816–824, 2002.

[19] Francois Bertucci, Valery Nasser, Samuel Granjeaud, Francois Eisinger, Jose Adelaide, Rebecca Tagett, Beatrice Loriod, Aurelia Giaconia, Athmane Benziane, Elisabeth Devilard, Jocelyne Jacquemier, Patrice Viens, Catherine Nguyen, Daniel Birnbaum, and Remi Houlgatte. Gene expression profiles of poor-prognosis primary breast cancer correlate with survival. Hum Mol Genet, 11(8):863–872, Apr 2002.

[20] Dinesh Singh, Phillip G Febbo, Kenneth Ross, Donald G Jackson, Judith Manola, Christine Ladd, Pablo Tamayo, Andrew A Renshaw, Anthony V D'Amico, Jerome P Richie, Eric S Lander, Massimo Loda, Philip W Kantoff, Todd R Golub, and William R Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, Mar 2002.

[21] van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, and Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–536, 2002.

[22] J L Weaver. Introduction to flow cytometry. Methods, 21(3):199–201, Jul 2000. Editorial.

[23] E Engvall and P Perlman. Enzyme-linked immunosorbent assay (ELISA). Quantitative assay of immunoglobulin G. Immunochemistry, 8(9):871–874, Sep 1971.

[24] Glynne RJ, Ghandour G, and Goodnow CC. Genomic-scale gene expression analysis of lymphocyte growth, tolerance and malignancy. Curr Opin Immunol, 12:210–214, 2000.

[25] McCabe ER. Translational genomics in medical genetics. Genet Med, 4:468–471, 2002.

[26] Rosell R, Monzo M, O'Brate A, and Taron M. Translational oncogenomics: toward rational therapeutic decision-making. Curr Opin Oncol, 14:171–179, 2002.

[27] Diehn M, Sherlock G, Binkley G, Jin H, Matese JC, Hernandez-Boussard T, Rees CA, Cherry JM, Botstein D, Brown PO, and Alizadeh AA. Source: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res, 31:219–223, 2003.


[28] Forster J, Gombert AK, and Nielsen J. A functional genomics approach using metabolomics and in silico pathway analysis. Biotechnol Bioeng, 79:703–712, 2002.

[29] Molidor R, Sturn A, Maurer M, and Trajanoski Z. New trends in bioinformatics: from genome sequence to personalized medicine. Exp Gerontol, 38:1031–1036, 2003.

[30] Altman RB and Klein TE. Challenges for biomedical informatics and pharmacogenomics. Annu Rev Pharmacol Toxicol, 42:113–133, 2002.

[31] Collins FS, Green ED, Guttmacher AE, and Guyer MS. A vision for the future of genomics research. Nature, 422:835–847, 2003.

[32] Oosterhuis JW, Coebergh JW, and van Veen EB. Tumour banks: well-guarded treasures in the interest of patients. Nat Rev Cancer, 3:73–77, 2003.

[33] Sujansky W. Heterogeneous database integration in biomedicine. J Biomed Inform, 34:285–298, 2001.

[34] Stein LD. Integrating biological databases. Nat Rev Genet, 4:337–345, 2003.

[35] Stein L. Creating a bioinformatics nation. Nature, 417:119–120, 2002.

[36] Gardiner-Garden M and Littlejohn TG. A comparison of microarray databases. Brief Bioinform, 2:143–158, 2001.

[37] Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, and Vingron M. Minimum information about a microarray experiment (MIAME) - toward standards for microarray data. Nat Genet, 29:365–371, 2001.

[38] Sam Hanash. Building a foundation for the human proteome: the role of the Human Proteome Organization. J Proteome Res, 3(2):197–199, Mar 2004.

[39] Gene Ontology Consortium. Creating the gene ontology resource: design and implementation. Genome Res, 11:1425–1433, 2001.

[40] Liebman MN. Biomedical informatics: the future for drug development. Drug Discov Today, 7:S197–S203, 2002.

[41] Duggan DJ, Bittner M, Chen Y, Meltzer P, and Trent JM. Expression profiling using cDNA microarrays. Nat Genet, 21:10–14, 1999.

[42] Zarrinkar PP, Mainquist JK, Zamora M, Stern D, Welsh JB, Sapinoso LM, Hampton GM, and Lockhart DJ. Arrays of arrays for high-throughput gene expression profiling. Genome Res, 11:1256–1261, 2001.

[43] Lockhart DJ and Winzeler EA. Genomics, gene expression and DNA arrays. Nature, 405:827–836, 2000.

[44] Hegde P, Qi R, Abernathy K, Gay C, Dharap S, Gaspard R, Hughes JE, Snesrud E, Lee N, and Quackenbush J. A concise guide to cDNA microarray analysis. Biotechniques, 29:548–556, 2000.


[45] Burgess JK. Gene expression studies using microarrays. Clin Exp Pharmacol Physiol, 28:321–328, 2001.

[46] TIFF revision 6.0. WWW, June 1992. http://partners.adobe.com/asn/developer/pdfs/tn/TIFF6.pdf.

[47] Yuen T, Wurmbach E, Pfeffer RL, Ebersole BJ, and Sealfon SC. Accuracy and calibration of commercial oligonucleotide and custom cDNA microarrays. Nucleic Acids Res, 30:e48, 2002.

[48] Stears RL, Martinsky T, and Schena M. Trends in microarray analysis. Nat Med, 9:140–145, 2003.

[49] Michael Baum, Simone Bielau, Nicole Rittner, Kathrin Schmid, Kathrin Eggelbusch, Michael Dahms, Andrea Schlauersbach, Harald Tahedl, Markus Beier, Ramon Guimil, Matthias Scheffler, Carsten Hermann, Jorg-Michael Funk, Anke Wixmerten, Hans Rebscher, Matthias Honig, Claas Andreae, Daniel Buchner, Erich Moschel, Andreas Glathe, Evelyn Jager, Marc Thom, Andreas Greil, Felix Bestvater, Frank Obermeier, Josef Burgmaier, Klaus Thome, Sigrid Weichert, Silke Hein, Tim Binnewies, Volker Foitzik, Manfred Muller, Cord Friedrich Stahler, and Peer Friedrich Stahler. Validation of a novel, fully integrated and flexible microarray benchtop facility for gene expression profiling. Nucleic Acids Res, 31(23):e151, Dec 2003.

[50] Wu W, Hodges E, Redelius J, and Hoog C. A novel approach for evaluating the efficiency of siRNAs on protein levels in cultured cells. Nucleic Acids Res, 32:e17, 2004.

[51] Yang IV, Chen E, Hasseman JP, Liang W, Frank BC, Wang S, Sharov V, Saeed AI, White J, Li J, Lee NH, Yeatman TJ, and Quackenbush J. Within the fold: assessing differential expression measures and reproducibility in microarray assays. Genome Biol, 3:RESEARCH0062.1–RESEARCH0062.12, 2002.

[52] Leung YF and Cavalieri D. Fundamentals of cDNA microarray data analysis. Trends Genet, 19:649–659, 2003.

[53] Churchill GA. Fundamentals of experimental design for cDNA microarrays. Nat Genet, 32 Suppl:490–495, 2002.

[54] Hackl H, Cabo FS, Sturn A, Wolkenhauer O, and Trajanoski Z. Analysis of DNA microarray data. Curr Top Med Chem, 4:1357–1370, 2004.

[55] Quackenbush J. Microarray data normalization and transformation. Nat Genet, 32 Suppl:496–501, 2002.

[56] Kerr MK, Martin M, and Churchill GA. Analysis of variance for gene expression microarray data. J Comput Biol, 7:819–837, 2000.

[57] Kerr MK and Churchill GA. Statistical design and the analysis of gene expression microarray data. Genet Res, 77:123–128, 2001.

[58] M Schena, D Shalon, R Heller, A Chai, P O Brown, and R W Davis. Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proc Natl Acad Sci U S A, 93(20):10614–10619, Oct 1996.


[59] DeRisi JL, Iyer VR, and Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278:680–686, 1997.

[60] Slonim DK. From patterns to pathways: gene expression data analysis comes of age. Nat Genet, 32 Suppl:502–508, 2002.

[61] Motulsky Harvey. Intuitive Biostatistics. Oxford University Press, 1995.

[62] Park PJ, Pagano M, and Bonetti M. A nonparametric scoring algorithm for identifying informative genes from microarray data. Pac Symp Biocomput, 52–63, 2001.

[63] Sturn A, Quackenbush J, and Trajanoski Z. Genesis: cluster analysis of microarray data. Bioinformatics, 18:207–208, 2002.

[64] Eisen MB, Spellman PT, Brown PO, and Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A, 95:14863–14868, 1998.

[65] Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, and Church GM. Systematic determination of genetic network architecture. Nat Genet, 22:281–285, 1999.

[66] Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, and Golub TR. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A, 96:2907–2912, 1999.

[67] Raychaudhuri S, Stuart JM, and Altman RB. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput, 455–466, 2000.

[68] Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M Jr, and Haussler D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci U S A, 97:262–267, 2000.

[69] Harvey Lodish, editor. Molecular Cell Biology. W.H. Freeman and Company, 2000.

[70] D R Parks and L A Herzenberg. Fetal cells from maternal blood: their selection and prospects for use in prenatal diagnosis. Methods Cell Biol, 26:277–295, 1982.

[71] James A Strauchen. Immunophenotypic and molecular studies in the diagnosis and classification of malignant lymphoma. Cancer Invest, 22(1):138–148, 2004.

[72] Shaw S, Turni L, and Katz K. Protein Reviews on The Web. Immunology Today, 18:557–558, 1998.

[73] PROW: Protein reviews on the web. WWW, 2004. http://www.ncbi.nlm.nih.gov/prow/.

[74] Lyons A Bruce. Analysing cell division in vivo and in vitro using flow cytometric measurement of CFSE dye dilution. Journal of Immunological Methods, (243):147–154, 2000.

[75] Fowler Martin. Patterns of Enterprise Application Architecture. Addison-Wesley Signature Series, 2003.


[76] Java 2 platform enterprise edition specification, v1.4. WWW, November 2003. http://java.sun.com/j2ee/j2ee-14-fr-spec.pdf.

[77] Java 2 platform, standard edition (J2SE). WWW. http://java.sun.com/j2se/index.jsp.

[78] Java 2 platform enterprise edition application model. WWW, 2004. http://java.sun.com/j2ee/appmodel.html.

[79] Bodoff Stephanie. The J2EE Tutorial. Addison-Wesley Professional, 2004.

[80] Enterprise JavaBeans specification, version 2.1. WWW, November 2003. http://java.sun.com/products/ejb/docs.html.

[81] Monson-Haefel Richard. Enterprise JavaBeans. O'Reilly, 2001.

[82] Java message service API overview. WWW, 2004. http://java.sun.com/products/jms/overview.html.

[83] J2EE connector architecture overview. WWW, 2004. http://java.sun.com/j2ee/connector/overview.html.

[84] Java servlet technology overview. WWW, 2004. http://java.sun.com/products/servlet/overview.html.

[85] Maurer Michael. Design and Development of a Bioinformatics Platform for Large-Scale Gene Expression Profiling. PhD thesis, Institute for Genomics and Bioinformatics, Graz University of Technology, 2004.

[86] Struts. WWW, 2004. http://jakarta.apache.org/struts.

[87] Brown Simone, Burdick Robert, and Falkner Jason. Professional JSP. WROX Press, 2nd edition, 2001.

[88] Goodwill James. Mastering Jakarta Struts. Wiley Computing Publishing, 2002.

[89] Cavaness Chuck. Programming Jakarta Struts. O'Reilly, 2002.

[90] Thallinger GG, Trajanoski S, Stocker G, and Trajanoski Z. Information management systems for pharmacogenomics. Pharmacogenomics, 3:651–667, 2002.

[91] SOAP version 1.2 part 0: Primer. WWW, June 2003. http://www.w3.org/TR/2003/REC-soap12-part0-20030624/.

[92] Web services description language (WSDL) 1.1. WWW, March 2001. http://www.w3.org/TR/wsdl.

[93] UDDI technical white paper. WWW, September 2000. http://www.uddi.org/pubs/Iru_UDDI_Technical_White_Paper.pdf.

[94] Singh Inderjeet. Designing Web Services with the J2EE 1.4 Platform. Addison-Wesley, The Java Series, 2004.

[95] Java management extensions (JMX) overview. WWW, 2004. http://java.sun.com/products/jmx/overview.html.

[96] Schmelzle Oliver. Verwaltungsdienst. iX, (6):116–119, 2003.


[97] Java management extensions instrumentation and agent specification, v1.2. WWW, 2004. http://jcp.org/aboutJava/communityprocess/final/jsr003/index3.html.

[98] Codd E.M. The Relational Model for Data Base Management. Addison-Wesley, 1990.

[99] SQL tutorial. WWW, 2004. http://www.w3schools.com/sql/default.asp.

[100] Gamma Erich, Helm Richard, Johnson Ralph, and Vlissides John. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.

[101] Alur Deepak, Crupi John, and Malks Dan. Core J2EE Patterns: Best Practices and Design Strategies.

[102] Getting started in .NET. WWW. http://www.microsoft.com/net/.

[103] OMG model driven architecture. WWW, 2004. http://www.omg.org/mda.

[104] Stahl Thomas and Voelter Markus. Moderne Zeiten. iX, (3):100–103, 2003.

[105] AndroMDA - from UML to deployable components. WWW, 2004. http://www.andromda.org.

[106] UML resource page. WWW, 2004. http://www.uml.org.

[107] Zeller Dieter. Design and development of a user management system for molecular biology database systems. Master's thesis, Institute for Genomics and Bioinformatics, Graz University of Technology, 2003.

[108] JBoss. WWW, 2004. http://www.jboss.org.

[109] Apache Jakarta Tomcat. WWW, 2004. http://jakarta.apache.org/tomcat.

[110] Oracle. WWW, 2004. http://www.oracle.com.

[111] Saal LH, Troein C, Vallon-Christersson J, Gruvberger S, Borg A, and Peterson C. BioArray Software Environment (BASE): a platform for comprehensive management and analysis of microarray data. Genome Biol, 3:SOFTWARE0003.1–SOFTWARE0003.6, 2002.

[112] Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, Sturn A, Snuffin M, Rezantsev A, Popov D, Ryltsov A, Kostukovich E, Borisovsky I, Liu Z, Vinsavich A, Trush V, and Quackenbush J. TM4: a free, open-source system for microarray data management and analysis. Biotechniques, 34:374–378, 2003.

[113] C Stoeckert, A Pizarro, E Manduchi, M Gibson, B Brunk, J Crabtree, J Schug, S Shen-Orr, and G C Overton. A relational schema for both array-based and SAGE gene expression experiments. Bioinformatics, 17(4):300–308, Apr 2001.

[114] Manduchi E, Grant GR, He H, Liu J, Mailman MD, Pizarro AD, Whetzel PL, and Stoeckert CJ Jr. RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics, 20:452–459, 2004.


[115] Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, Oezcimen A, Rocca-Serra P, and Sansone SA. ArrayExpress - a public repository for microarray gene expression data at the EBI. Nucleic Acids Res, 31:68–71, 2003.

[116] Stoeckert CJ Jr, Causton HC, and Ball CA. Microarray databases: standards and ontologies. Nat Genet, 32 Suppl:469–473, 2002.

[117] T.R. Gruber. A translation approach to portable ontologies. Knowledge Acquisition, (5):199–220, 1993.

[118] Pieler R, Sanchez-Cabo F, Hackl H, Thallinger GG, and Trajanoski Z. ArrayNorm: comprehensive normalization and analysis of microarray data. Bioinformatics, 2004.

[119] Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball C, Lepage M, Swiatek M, Marks WL, Goncalves J, Markel S, Iordan D, Shojatalab M, Pizarro A, White J, Hubley R, Deutsch E, Senger M, Aronow BJ, Robinson A, Bassett D, Stoeckert CJ Jr, and Brazma A. Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol, 3:RESEARCH0046.1–RESEARCH0046.9, 2002.

[120] Edgar R, Domrachev M, and Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res, 30:207–210, 2002.

[121] Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, and Zhang J. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol, 5:R80, 2004.

[122] The MathWorks Inc, MA, USA. WWW. http://www.mathworks.com/.

[123] Thomas Truskaller. Data Integration into a Gene Expression Database. Master's thesis, TU-Graz, 2003.

[124] Leslie H Sobin. TNM: evolution and relation to other prognostic factors. Semin Surg Oncol, 21(1):3–7, 2003. Historical Article.

[125] O H Beahrs. Colorectal cancer staging as a prognostic feature. Cancer, 50(11 Suppl):2615–2617, Dec 1982.

[126] PKCS #5: Password-based cryptography standard. WWW, 2004. ftp://ftp.rsasecurity.com/pub/pkcs/pkcs-5v2/pkcs5v2-0.pdf.

[127] W. Cleveland. Robust locally weighted regression and smoothing scatterplots. J. Am. Stat. Assoc., 74:829–836, 1979.

[128] Fellenberg K, Hauser NC, Brors B, Neutzner A, Hoheisel JD, and Vingron M. Correspondence analysis applied to microarray data. Proc Natl Acad Sci U S A, 98:10781–10786, 2001.

[129] Christoph Thumser. Quality Control for Microarray Production. Master's thesis, TU-Graz, 2003.

[130] Trost E, Hackl H, Maurer M, and Trajanoski Z. Java editor for biological pathways. Bioinformatics, 19:786–787, 2003.


[131] Large scale sequence annotation. WWW, 2004. http://annotator.imp.univie.ac.at/.

[132] Kononen J, Bubendorf L, Kallioniemi A, Barlund M, Schraml P, Leighton S, Torhorst J, Mihatsch MJ, Sauter G, and Kallioniemi OP. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med, 4:844–847, 1998.


Acknowledgments

Major parts of this work were supported by the Austrian Academy of Sciences, the European

Science Foundation, and the GEN-AU:BIN, Bioinformatics Integration Network.

I would like to express my deepest gratitude to my mentor and supervisor Zlatko Trajanoski for

his encouragement, visions, and believing in me. I am deeply grateful to Jerome Galon who

gave me the opportunity to work together with his group at INSERM in Paris. His visions and

ideas had invaluable impact on this thesis. I am indebted to my colleagues and friends Michael

Maurer and Alexander Sturn for developing MARS with me and their invaluable support dur-

ing the whole work. Further thank goes to Bernhard Mlecnik whose contributions to TME.db

were of great value. Sincere thanks are given to all present and past members of the Institute

for Genomics and Bioinformatics for fruitful discussions, their support and friendship. Spe-

cial acknowledgments are dedicated to the people who have contributed to this work: Andreas

Prokesch, Anne Costes, Christoph Thumser, Gerhard Thallinger, Gernot Stocker, Hubert Hackl,

Jurgen Hartler, Marcel Scheideler, Matthieu Camus, Roland Pieler, and Thomas Truskaller.

I owe my loving thanks to my family for their unfailing support, my companion in life Sabine

for her love, encouragement and understanding and my little son Fabian for his encouraging

smile.


Publications

Journals

Molidor R, Sturn A, Maurer M, and Trajanoski Z. New Trends in Bioinformatics: From Genome Sequence to Personalized Medicine. Experimental Gerontology, 38(10):1031–1036, 2003

Book Chapters

Alexander Sturn, Michael Maurer, Robert Molidor, and Zlatko Trajanoski. Systems for Management of Pharmacogenomic Information. In: Pharmacogenomics Methods and Protocols, Humana Press, Totowa, USA, 2004, in press


Mini-Review

New trends in bioinformatics: from genome sequence

to personalized medicine

Robert Molidor, Alexander Sturn, Michael Maurer, Zlatko Trajanoski*

Institute of Biomedical Engineering and Christian Doppler Laboratory for Genomics and Bioinformatics,

Graz University of Technology, Krenngasse 37, Graz 8010, Austria

Received 21 May 2003; received in revised form 26 June 2003; accepted 30 June 2003

Abstract

Molecular medicine requires the integration and analysis of genomic, molecular, cellular, as well as clinical data and it thus offers a

remarkable set of challenges to bioinformatics. Bioinformatics nowadays has an essential role both in deciphering genomic, transcriptomic,

and proteomic data generated by high-throughput experimental technologies, and in organizing information gathered from traditional

biology and medicine. The evolution of bioinformatics, which started with sequence analysis and has led to high-throughput whole genome

or transcriptome annotation today, is now going to be directed towards recently emerging areas of integrative and translational genomics, and

ultimately personalized medicine.

Therefore considerable efforts are required to provide the necessary infrastructure for high-performance computing, sophisticated

algorithms, advanced data management capabilities, and, most importantly, well trained and educated personnel to design, maintain and use

these environments.

This review outlines the most promising trends in bioinformatics, which may play a major role in the pursuit of future biological

discoveries and medical applications.

© 2003 Elsevier Inc. All rights reserved.

Keywords: Bioinformatics; Genomics; Personalized medicine

1. Introduction

In the past decade bioinformatics or computational

biology has become an integral part of research and

development in biomedical sciences. In contemplating a

vision for the future of this new branch of life sciences, it is

appropriate to consider the remarkable path that has led to

today’s status of the field. When in the early 1980s methods

for DNA sequencing became widely available, molecular

sequence data expeditiously started to grow exponentially.

After the sequencing of the first microbial genome in 1995,

the genomes of more than 100 organisms have been

sequenced and large-scale genome sequencing projects

have evolved to routine, though still non-trivial, procedures

(Janssen et al., 2003; Kanehisa and Bork, 2003). The

imperative of efficient and powerful tools and databases

became obvious during the realization of the human genome

project, whose completion has been established several

years ahead of schedule. The accumulated data was stored in

the first genomic databases such as GenBank, European

Molecular Biology Laboratory Nucleotide Sequence Data-

base (EMBL), and DNA Data Bank of Japan (DDBJ) and

novel computational methods had to be developed for

further analysis of the collected data (e.g. sequence

similarity searches, functional and structural predictions).

One of the first breakthroughs in the area of bioinformatics

was the introduction of the rapid sequence database search

tool BLAST (Altschul et al., 1990), which nowadays has

become a valuable and indispensable tool in the everyday

life of biomedical research.

Automatic sequencing was the first forerunner and had a

major impact on high throughput generation of various

kinds of biological data such as single-nucleotide poly-

morphisms (SNPs) and expressed sequence tags (ESTs).

Subsequently, other novel high-throughput methods such as

serial analysis of gene expression (SAGE) (Velculescu et al.,

1995) and DNA microarrays (Shalon et al., 1996) have been


developed to analyze the transcriptional program of a cell,

tissue or organism at a genomic scale.

All these novel experimental procedures are associated

with information technology in a symbiotic relationship. It

is encouraging that the use of high throughput experimental

procedures in combination with computational analysis so

far has revealed a wealth of information about important

biological mechanisms. This review will deliver insight to

the current trends in bioinformatics that may help to bridge

the considerable gap between technical data production and

its use both by scientists, for biological discovery, and by physicians, in their daily routine.

2. From sequence to expression

The lifeblood of bioinformatics has been the handling

and presentation of nucleotide and protein sequences and

their annotation. With the advent of novel experimental

techniques for large-scale, genome-wide transcriptional

profiling via microarrays or gene chips, a new field of

gene expression data analysis emerged (Slonim, 2002). This

new momentum to the bioinformatics community has fueled

the hope of getting more insight into the processes

conducted in a cell, tissue or organism.

As more and more researchers adopted the microarray

technology, it soon became clear that data generation alone is not sufficient; the challenges lie in storage, normalization, analysis, visualization of results, and, most importantly, in extracting biologically meaningful

information about the investigated cellular processes.

Therefore, considerable progress has been made in the last

couple of years to handle and analyze the millions of data

points accumulated by state of the art microarray studies

with tens of thousands of sequences per slide and maybe

hundreds of slides (Brazma et al., 2003).

Several topics of the analytical pipeline, namely image

analysis, normalization, and gene expression data clustering

and classification have been addressed in numerous

publications (Baxevanis, 2003; Brazma et al., 2003;

Ermolaeva et al., 1998). Data interpretation, however,

has proliferated only recently and still leaves a lot of room for

new tools to extract knowledge from the increasing amount

of microarray data. A key challenge of bioinformatics in the

future will be to bridge this considerable gap between data

generation and its usability by scientists for incisive

biological discovery.

The evolution of microarray data production to ever-

larger and more complex data sets will enable bioinforma-

ticians to use this huge amount of information for

developing innovative approaches to reverse engineer

biological networks of molecular interactions, which may

unravel the contribution of specific genes and proteins in the

cellular context (D’haeseleer et al., 2000). These new

approaches of gene expression pattern analysis try to

uncover the properties of the transcriptional program by

analyzing relationships between individual genes. This will

be the beginning of an exciting journey towards the ‘holy

grail’ of computational biology: to generate knowledge and

principles from large-scale data, to computationally predict systems of higher complexity such as the interaction

networks in cellular processes and in the end to present an

accurate and complete representation of a cell or an

organism in silico.

The comparison of DNA sequences of entire genomes

already gives insights into evolutionary, biochemical, and

genetic pathways. Additionally, enabled by the increasing

amount of publicly available microarray studies, comparative

analysis of the transcriptome of different cell types,

treatments, tissues or even among two or more model

organisms promises to significantly enhance the fundamental

understanding of the universality as well as the specializ-

ation of molecular biological mechanisms. The objective is

to develop mathematical tools that are able to distinguish the

similar from the dissimilar among two or more large-scale

data sets.

Although new innovative procedures to analyze

genomic data are still desirable, one problem during the

analysis of gene expression data is not the lack of

algorithms and tools, but the multiplicity of practices

available to choose from. Moreover, these methods are

difficult to compare and each method has its own

implementation and frequently a different data format

and representation. This diversity of methods makes it

difficult and time consuming to compare results from

different analyses. Therefore standardized data exchange

and calculation platforms, which allow the straightforward

and efficient application of different algorithms to the data

one is interested in, are and will be highly welcomed by

the research community (Box 1).

3. Integrative genomics

Genes and gene products do not function independently.

They contribute to complex and interconnected pathways,

networks and molecular systems. The understanding of

these systems, their interactions, and their properties will

require information from several fields, like genomics,

proteomics, metabolomics or systematic phenotype profiles

at the cell and organism level (Collins et al., 2003).

Database technologies and computational methods have

to be improved to facilitate the integration and visualization

of these different data types, ranging from genomic data to

biological pathways (Diehn et al., 2003). The integration of

pathway information with gene expression studies for

instance has the potential to reveal differentially regulated

genes under certain physiological conditions in a specific

cellular component (Forster et al., 2002). Furthermore,

connecting protein specific databases to genomic databases

will be crucial to answer upcoming proteomic questions

(Boguski and McIntosh, 2003).


Sophisticated computational technologies have to be

developed to enable life scientists to establish relationships

between genotype and the corresponding biological func-

tions, which may yield new insights about physiological

processes in normal and disease states.

4. Translational genomics

Genomic research is now entering an era where emerging

data of sequencing projects and integrative genomics will

help investigators to ultimately unravel the genetic com-

ponents of common and complex diseases. The much

anticipated complete sequence of the human genome,

coupled with the emergence of the sequences of other

animal, plant, and microbial genomes, now provides us with

an incomparable source of information to address biological

and medical questions. However, this advance in our

knowledge is accompanied by the recognition that further pro-

gress in technology, information based systems for

integrating genetic studies, large population based research,

increased public awareness of ethical and legal issues, and

education are mandatory (Collins et al., 2003).

A relatively new field employing innovative advances

such as genome-wide array technology and the burgeoning

field of computational biology is aptly entitled ‘translational

research’. The objective is to provide the data and tools

necessary to identify genes that play a role in hereditary

susceptibility to disease and additionally to discover genetic

changes contributing to disease progression and resistance

to therapy (McCabe, 2002; Rosell et al., 2002). Therefore it

is crucial to integrate patient related data such as CT- and

MRI scans, mammography, ultrasound, and the correspond-

ing knowledge of their diagnostic parameters.

Achievements of this mission will be accelerated and

empowered through the refinements and breakthroughs in

research techniques that span biomedical and genomic

methodologies, as well as computational biology. This will

help to make a smooth translation of information from

bench to bedside and to better focus on the ongoing process of

disease in the body.

5. Personalized medicine

The 20th century has brought us a broad arsenal of

therapies against all major diseases. However, therapy often

fails to be curative and additionally may cause substantial

side effects. Moreover these drugs have, due to their

widespread use, revealed substantial inter-individual differ-

ences in therapeutic response. Evidence has emerged that a

substantial portion of the variability in drug response is

genetically determined and also age, sex, nutrition, and

environmental exposure are playing important contributory

roles. Thus there is a need to focus on effective therapies of

smaller patient subpopulations that demonstrate the same

disease phenotype, but are characterized by distinct genetic

profiles. Whether and to what extent this individualized, genetics-based approach to medicine results in improved, economically feasible therapy remains to be seen. However,

the realization of this will require new methods in biology,

informatics and analytical systems that provide an order-of-

magnitude increase in throughput, along with corresponding

decreases in operating costs, enhanced accuracy and

reduced complexity (Mancinelli et al., 2000; Collins et al.,

2003).

6. Challenges

The challenges are to capitalize on the immense potential

of bioinformatics to improve human health and well-being.

Although genome-based analysis methods are rapidly

permeating biomedical research, the challenges of establish-

ing robust paths from genomic information to improved

human health remain immense.

6.1. Data integration

The rapid expansion of biomedical knowledge,

reduction in computing costs, spread of internet access,

and the recent emergence of high throughput structural

and functional genomic technologies have led to a rapid

growth of electronically available data. Today, databases

all around the world contain biomedical data, ranging

from clinical data records for individual patients stored in

clinical information systems to the genetic structure of

various species stored in molecular biology databases

(http://nar.oupjournals.org/cgi/content/full/31/1/1/DC1).

The volume and availability of this kind of data has grown

through a largely decentralized process, which has allowed

organizations to meet specific or local needs without

requiring them to coordinate and standardize their

database implementations. This process has resulted in

diverse and heterogeneous database implementations,

making access and aggregation very difficult (Sujansky,

2001; Stein, 2003).

In molecular biology, the data that has to be managed covers a wide range of biological information.

The core data are collections of nucleic and amino acid

sequences and protein structures. There are also many

specialized databases covering topics like Comparative

Genomics, Gene Expression, Genetic and Physical Maps,

Metabolic Pathways and Cellular Regulation (Baxevanis,

2003). Although all of these resources are highly

informative individually, the collection of available

content would be more valuable if provided in a

unified and centralized context. The management and

integration of these heterogeneous data sources with

widely varying formats and different object semantics is a

difficult task. This issue can be handled only by

increasingly sophisticated electronic mechanisms to


store, manipulate, and communicate information. One

possibility to facilitate the cross-referencing of disparate

data sources is to introduce standardization of terms and

data formats. For this reason, several efforts are underway

to standardize relational data models and/or object

semantics (Stein, 2002) (Box 1).

6.2. High-performance computing

With the introduction of high throughput technologies

such as sequencing and microarrays the amount of data

that has to be managed, compared and analyzed

increased dramatically. Therefore, the analysis of large-

scale genomic and proteomic data in reasonable time

requires high-performance computing systems. The

impressive and steady improvements of computational

power contributed to the success of high throughput

biological technologies and the research based on them. This is depicted

by the correlation of the exponential increase of

GenBank entries and the number of transistors integrated

on a single chip (Fig. 1). To ensure the steady progress

of bioinformatics and its benefits, even more powerful systems have to be designed and implemented

(Thallinger et al., 2002).

6.3. Ethical, legal, and social implications (ELSI)

The study of pointed life-science questions and the

desire to collect and disseminate data pertaining to

biomedical research raise a number of important and

non-trivial issues in ethics and patient confidentiality. The

need to integrate information from various sources, such

as hospital discharge records and clinical questionnaires,

intensifies the problems related to this topic. Even if

anonymization is enforced, a specific person could be

traced back either exactly or probabilistically, due to the

amount of remaining information available (Altman and

Klein, 2002). Although the integration of additional

clinical information would have the potential to dramati-

cally improve human health, it is crucial to

ensure that the availability of clinical phenotypic data or

the like does under no circumstances lead to the loss of

study-subject confidentiality or privacy. Researchers have

to pay attention to these ELSI issues and should not view

them as impediments (Collins et al., 2003; Oosterhuis

et al., 2003).

6.4. Training and education

To be able to accomplish the diverse interdisciplinary

challenges, which genomics and bioinformatics are facing

nowadays and in the future, researchers with the

expertise to understand the biological systems and to

use the information efficiently are required. To widen the

success of bioinformatics not only bioinformaticians

themselves but also bioscientists and physicians using

the computational tools need profound skills in bio- and

computer sciences. To create and interpret results from bioinformatic approaches in a meaningful and responsible way, at least a fundamental understanding of the used technologies, algorithms, and methods is indispensable. Moreover, the interdisciplinary character of this field needs to be reinforced by the incorporation of mathematics and the theoretical foundations of physics and chemistry to detect the basic architectures of complex biological systems. Therefore, adequate training and education have to be provided for bioinformatics specialists in such diverse and interdisciplinary fields as computer sciences, biology, mathematics, chemistry, and physics (Collins et al., 2003).

Box 1. Standardization

Given the increasing availability of biomedical

information located at different sites and accessible

mostly over the internet, researchers require new

methods to integrate and exchange data. In recent years, the eXtensible Markup Language (XML) (http://www.w3.org/XML/) has emerged as a common standard for the exchange of data. XML consists of a

set of rules whereby new vocabularies (tags) may be

defined. These tags do not indicate how a document is

formatted, but instead provide semantic context to the

content of the document, as semantics require more

constraints on the logical relationships between data

items. For example, a tag for a SNP may only be located between the start and end tags of a coding region.
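As a purely illustrative sketch of such a constraint (the tag names coding_region and snp and their attributes are invented here and do not belong to any actual standard vocabulary), a conforming fragment could look as follows:

    <coding_region start="1042" end="2318">
      <snp position="1138" reference="A" variant="G"/>
    </coding_region>

A snp element appearing outside a coding_region element would violate the vocabulary and could be rejected by a validating parser.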

In the area of microarray databases for instance

(Gardiner-Garden and Littlejohn, 2001; Anderle et al.,

2003), the Microarray Gene Expression Data (MGED) society (http://www.mged.org) proposes MAGE, an object model, and MIAME, the minimum information about a microarray experiment (Brazma et al., 2001), a standard describing the information required to unambiguously interpret and verify microarray experiments. In adherence to MIAME, which several journals require for manuscript submission, the microarray gene expression markup language (MAGE-ML) was designed based on XML (Spellman et al., 2002).

The Human Proteome Organization (HUPO) is currently working to define community standards for

data representation in proteomics to facilitate data

comparison, exchange and verification. This organiz-

ation is working on standards for mass-spectrometry,

protein–protein interaction and on a general proteo-

mics format (http://psidev.sourceforge.net).

The BioPathways Consortium is developing a standard data exchange format to enable the sharing of

pathway information, such as signal transduction,

metabolic and gene regulatory pathways (http://www.

biopathways.org).

In addition, the Gene Ontology Consortium (http://

www.geneontology.org) provides a structured and

standardized vocabulary to describe gene products in

any organism (Gene Ontology Consortium, 2001).

In clinical settings, SNOMED (http://www.snomed.org) or ICD (http://www.icd.org) have been established for a standardized classification of diseases and health-related problems (Liebman, 2002).



Fig. 1. Comparison of base pairs and transistors: The number of base pairs in GenBank doubles every year (http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html), which correlates with the increasing packing density of transistors on a single chip (http://www.intel.com/pressroom/kits/quickreffam.htm). This emphasizes that the exponential growth of transistor integration on a chip, and consequently the rapid development of information processing technologies, has contributed to a great extent to the rapid growth of genomic data.

Fig. 2. Components of integrative and translational genomics, which are the building blocks of present and future bioinformatics applications. The heterogeneous character of bioinformatics is represented by diverse topics ranging from Genomics to Training and from High-Performance Computing to ethical, legal, and social implications (ELSI).


7. Conclusion

It is widely accepted that bioinformatics has led the way

to the post-genomic era and will become an essential part in

future molecular life-sciences. Nowadays, bioinformatics is

facing new challenges of integrative and translational

genomics, which will ultimately lead to personalized

medicine. The ongoing investigations in these areas attempt

to provide researchers with a markedly improved repertoire

of computational tools that facilitate the translation of the

accumulated information into biologically meaningful knowl-

edge. This virtual workbench will allow the functioning of

organisms in health and disease to be analyzed and

comprehended at an unprecedented level of molecular

detail. To accomplish this, considerable endeavors have to

be undertaken to provide the necessary powerful infrastruc-

ture for high-performance computing, sophisticated algor-

ithms, advanced data management capabilities, and, most importantly, well-trained personnel to design, maintain, and

use these environments (Fig. 2). The ultimate goal of this

new field should be to evolve biology from a qualitative into

a quantitative science such as mathematics and physics.

Although there are still significant challenges, bioinfor-

matics along with biological advances are expected to have

an increasing impact on various aspects of human health.

Acknowledgements

This work was supported by the Austrian Science Fund

(Grant SFB Biomembranes F718) and the bm:bwk, GEN-

AU:BIN, Bioinformatics Integration Network. Michael

Maurer and Robert Molidor were supported by a grant

from the Austrian Academy of Sciences.

References

Altman, R.B., Klein, T.E., 2002. Challenges for biomedical informatics and pharmacogenomics. Annu. Rev. Pharmacol. Toxicol. 42, 113–133.

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215 (3), 403–410.

Anderle, P., Duval, M., Draghici, S., Kuklin, A., Littlejohn, T.G., Medrano, J.F., Vilanova, D., Roberts, M.A., 2003. Gene expression databases and data mining. Biotechniques Suppl., 36–44.

Baxevanis, A.D., 2003. The molecular biology database collection: 2003

update. Nucleic Acids Res. 31, 1–12.

Boguski, M.S., McIntosh, M.W., 2003. Biomedical informatics for

proteomics. Nature 422, 233–237.

Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P.,

Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C.,

Gaasterland, T., Glenisson, P., Holstege, F.C., Kim, I.F., Markowitz, V.,

Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-

Kremer, S., Stewart, J., Taylor, R., Vilo, J., Vingron, M., 2001.

Minimum information about a microarray experiment (MIAME)-

toward standards for microarray data. Nat. Genet. 29, 365–371.

Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J.,

Abeygunawardena, N., Holloway, E., Kapushesky, M., Kemmeren,

P., Lara, G.G., Oezcimen, A., Rocca-Serra, P., Sansone, S.A., 2003.

ArrayExpress—a public repository for microarray gene expression data

at the EBI. Nucleic Acids Res. 31, 68–71.

Collins, F.S., Green, E.D., Guttmacher, A.E., Guyer, M.S., 2003. A vision

for the future of genomics research. Nature 422, 835–847.

D’haeseleer, P., Liang, S., Somogyi, R., 2000. Genetic network inference:

from co-expression clustering to reverse engineering. Bioinformatics

16, 707–726.

Diehn, M., Sherlock, G., Binkley, G., Jin, H., Matese, J.C., Hernandez-

Boussard, T., Rees, C.A., Cherry, J.M., Botstein, D., Brown, P.O.,

Alizadeh, A.A., 2003. SOURCE: a unified genomic resource of

functional annotations, ontologies, and gene expression data. Nucleic

Acids Res. 31, 219–223.

Ermolaeva, O., Rastogi, M., Pruitt, K.D., Schuler, G.D., Bittner, M.L.,

Chen, Y., Simon, R., Meltzer, P., Trent, J.M., Boguski, M.S., 1998.

Data management and analysis for gene expression arrays. Nat Genet.

20, 19–23.

Forster, J., Gombert, A.K., Nielsen, J., 2002. A functional genomics

approach using metabolomics and in silico pathway analysis.

Biotechnol. Bioeng. 79, 703–712.

Gardiner-Garden, M., Littlejohn, T.G., 2001. A comparison of microarray

databases. Brief Bioinform. 2, 143–158.

Gene Ontology Consortium, 2001. Creating the gene ontology resource:

design and implementation. Genome Res. 11, 1425–1433.

Janssen, P., Audit, B., Cases, I., Darzentas, N., Goldovsky, L., Kunin, V.,

Lopez-Bigas, N., Peregrin-Alvarez, J.M., Pereira-Leal, J.B., Tsoka, S.,

Ouzounis, C.A., 2003. Beyond 100 genomes. Genome Biol. 4,

402.

Kanehisa, M., Bork, P., 2003. Bioinformatics in the post-sequence era. Nat

Genet. 33 Suppl., 305–310.

Liebman, M.N., 2002. Biomedical informatics: the future for drug

development. Drug Discov. Today 7, 197–203.

Mancinelli, L., Cronin, M., Sadee, W., 2000. Pharmacogenomics: the

promise of personalized medicine. AAPS PharmSci. 2 (1), E4.

McCabe, E.R., 2002. Translational genomics in medical genetics. Genet

Med. 4, 468–471.

Oosterhuis, J.W., Coebergh, J.W., van Veen, E.B., 2003. Tumour banks:

well-guarded treasures in the interest of patients. Nat. Rev. Cancer 3,

73–77.

Rosell, R., Monzo, M., O’Brate, A., Taron, M., 2002. Translational

oncogenomics: toward rational therapeutic decision-making. Curr.

Opin. Oncol. 14, 171–179.

Shalon, D., Smith, S.J., Brown, P.O., 1996. A DNA microarray system for

analyzing complex DNA samples using two-color fluorescent probe

hybridization. Genome Res. 6, 639–645.

Slonim, D.K., 2002. From patterns to pathways: gene expression data

analysis comes of age. Nat. Genet. 32 Suppl., 502–508.

Spellman P.T., Miller M., Stewart J., Troup C., Sarkans U., Chervitz S.,

Bernhart D., Sherlock G., Ball C., Lepage M., Swiatek M., Marks W.L.,

Goncalves J., Markel S., Iordan D., Shojatalab M., Pizarro A., White J.,

Hubley R., Deutsch E., Senger M., Aronow B.J., Robinson A., Bassett

D., Stoeckert C.J., Jr., Brazma A., 2002. Design and implementation of

microarray gene expression markup language (MAGE-ML). Genome

Biol. 3, research0046.1–research0046.9.

Stein, L., 2002. Creating a bioinformatics nation. Nature 417, 119–120.

Stein, L.D., 2003. Integrating biological databases. Nat. Rev. Genet. 4,

337–345.

Sujansky, W., 2001. Heterogeneous database integration in biomedicine.

J. Biomed. Inform. 34, 285–298.

Thallinger, G.G., Trajanoski, S., Stocker, G., Trajanoski, Z., 2002.

Information management systems for pharmacogenomics. Pharmaco-

genomics 3, 651–667.

Velculescu, V.E., Zhang, L., Vogelstein, B., Kinzler, K.W., 1995. Serial

analysis of gene expression. Science 270, 484–487.


Systems for Management of Pharmacogenomic Information

Alexander Sturn, Michael Maurer, Robert Molidor, and Zlatko Trajanoski1

Institute for Genomics and Bioinformatics and

Christian Doppler Laboratory for Genomics and Bioinformatics,

Graz University of Technology,

Petersgasse 14, 8010 Graz, Austria

1 To whom correspondence should be addressed:

Zlatko Trajanoski, PhD.

Petersgasse 14

A-8010 Graz

Austria

Phone: +43-316-873-5332

Fax: +43-316-873-5340

Email: [email protected]


Abstract

Recent breakthroughs in biological research have been made possible by remarkable

advances in high performance computing and the establishment of a highly sophisticated

information technology infrastructure. This chapter gives an overview of the main and

most important technologies needed for the management of pharmacogenomic

information, namely database management systems, software, and hardware architectures. Since pharmacogenomics deals with a great amount of public and

proprietary data, the most prominent ways for easy storage, retrieval, analysis, and

exchange are presented. Processing these data requires the employment of sophisticated

software architectures. Several of the most recent practices useful for a pharmacogenomic

environment are explained. Multi-tiered application design and Web services are

discussed and described independent of the major enterprise development platforms.

Since life sciences are becoming increasingly quantitative and state-of-the-art software

architectures utilize a lot of system resources, this chapter presents the most recent and

powerful systems for parallel data processing and data storage. Finally, shared and

distributed memory systems and combinations of them as well as different storage

architectures such as Directly Attached Storage, Network Attached Storage, and Storage

Area Network are explained in detail.

Keywords

High performance computing, pharmacogenomic information infrastructure, database,

database management system, software architecture, hardware architecture, web service,

data warehouse, federated database system, parallel processing, data storage.


1. Introduction

There is no doubt that the sequencing and initial annotation of the human genome,

completed in April 2001, is one of the great scientific advancements in history [1, 2].

This breakthrough in biological research was made possible by advances in high

performance computing and the employment of a highly sophisticated information

technology infrastructure. High-speed computers are necessary to analyze the tens of

terabytes of raw sequence data and correctly order the 3.2 billion base pairs of DNA that

compose the human genome. The assembly and initial annotation is only the first step on

a long road toward understanding the human genome. Many companies, research institutes,

universities and government laboratories are now rapidly moving on to the next steps:

comparative genomics, functional genomics, proteomics, metabolomics, pathways,

systems biology and pharmacogenomics [3, 4]. The latter is the study of how an individual's

genetic inheritance affects the body's response to drugs. Thus it holds the promise that

drugs might one day be tailor-made for individuals and adapted to each person's own

genetic makeup. Environment, diet, age, lifestyle, and state of health all can influence a

person's response to medicines, but understanding an individual's genetic makeup is

thought to be the key to creating personalized drugs with greater efficacy and safety [5].

Researchers are beginning the quest to determine exactly how each gene and protein

functions and, more importantly, how they malfunction to trigger deadly illnesses such as

heart disease, cancer, Alzheimer’s and Parkinson’s diseases.

Important prerequisites for pharmacogenomics or personalized medicine will be achieved

by combining a person's clinical data sets with genome information management systems.

However, huge disparate data sources, like public or proprietary molecular biology


databases, laboratory management systems, and clinical information management

systems pose significant challenges for querying and transforming these data into valuable

knowledge [6]. The core data are collections of nucleic and amino acid sequences stored

in GenBank [7] and protein structures in the Protein Data Bank (PDB) [8]. Additionally

this core data is used to create secondary and integrated databases such as PROSITE [9]

and InterPro [10]. Furthermore, integrating data collected from high throughput genomic

technologies like sequencing, microarrays, SNP detection, and proteomics require the

nontrivial development of information management systems [11]. For their

establishment, increasingly powerful computers and capacious data storage systems are

mandatory. In the next paragraphs we will give an overview of the main and most

important technologies needed for the management of pharmacogenomic information,

namely database management systems, software, and hardware architectures.


2. Databases and Database Management Systems

Since pharmacogenomics deals with a great amount of public and proprietary data, there is a need to easily store, retrieve, and exchange it. The major problem is the integration of

the steadily increasing heterogeneous data sources.

The most prominent ways to manage and exchange bioinformatics data are:

• Field/value based flat files

• ASN.1 (Abstract Syntax Notation One) files

• XML files

• Relational databases

Field/value based flat files have been very commonly used in bioinformatics. Examples

are the flat file libraries from GenBank, European Molecular Biology Laboratory

Nucleotide Sequence Database (EMBL), DNA Data Bank of Japan (DDBJ), or Universal

Protein Resource (UniProt). These file types are a very limited solution, because they

lack referencing, vocabulary control, and constraints. Besides, concurrent access is manageable only at the file level; there is no inherent locking mechanism that detects when a file is being used or modified. However,

these file types are primarily used for reading purposes.
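A minimal Java sketch of reading such a field/value based flat file, assuming a simplified EMBL-like layout in which the first two characters of each line carry the field code (e.g. ID or DE) and the remainder carries the value; the file name is hypothetical:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class FlatFileReader {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader("entry.embl"));
            String line;
            while ((line = in.readLine()) != null) {
                if (line.length() < 2 || line.startsWith("//")) {
                    continue; // skip short lines and the entry terminator
                }
                String field = line.substring(0, 2);      // two-letter field code
                String value = line.substring(2).trim();  // rest of the line is the value
                if (field.equals("ID") || field.equals("DE")) {
                    System.out.println(field + " -> " + value);
                }
            }
            in.close();
        }
    }

The simplicity of this parsing also makes the limitations obvious: nothing in the file itself enforces which field codes exist or how records reference each other.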

ASN.1 is heavily used at the National Center for Biotechnology Information (NCBI) as a

format for exporting GenBank data and can be seen as a means for exchanging binary

data with a description of its structure. As with flat files, access concurrency is manageable only at the file level, there is no support for queries, and scalability is lacking. But since ASN.1 files convey a description of their structure, they provide the flexibility that the client side does not necessarily need to know the structure of the data in advance [12].


XML (eXtensible Markup Language) documents are an emerging way to interchange

data and consist of elements that are textual data structured by tags. Additionally XML

documents may include a Document Type Definition (DTD) that describes the structure

of the elements of an XML document. XML files are hence very flexible, human

readable, and provide an open framework for defining standard specifications. For

example the MGED (www.mged.org) and Gene Ontology Consortium

(www.geneontology.org) have adopted XML to provide and exchange data. The

weaknesses of XML are the file based locking mechanism and the large overhead of a

text based format caused by the recurrent content describing tags. Although XML

provides query mechanisms, it lacks scalability because it does not provide scalable

facilities such as indexing [13].
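A minimal Java sketch of reading such an XML document with the standard JAXP API and DTD validation switched on (the file name sequence.xml and the snp element are hypothetical):

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XmlReader {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // validate against the DTD referenced by the document
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new File("sequence.xml"));
            NodeList snps = doc.getElementsByTagName("snp");
            System.out.println("Found " + snps.getLength() + " SNP elements");
        }
    }

Note that the whole document is materialized in memory, which illustrates the scalability limits mentioned above.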

A relational database management system (DBMS) is a collection of programs that

enables users to store, modify, and extract information from a relational database. Such a

relational database has a much more logical structure in the way data is stored. Tables are

used to represent real-world objects, with each field acting as an attribute. The set of

rules for constructing queries is known as a query language. Different DBMSs support

different query languages, although there is a semi-standardized query language called

SQL (structured query language). One major advantage of the relational model is that if a

database is designed efficiently according to Codd's rules [14], there should be no

duplication of any data, which helps to maintain database integrity. DBMSs also

provide powerful locking mechanisms to allow parallel reading and writing without data

corruption.
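A minimal JDBC sketch of querying such a relational database with SQL (the driver class, connection URL, credentials, and the gene table with its symbol and chromosome columns are all hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class GeneQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.postgresql.Driver"); // load a JDBC driver (assumed)
            Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/genome", "user", "password");
            // parameterized SQL query against a hypothetical gene table
            PreparedStatement stmt = con.prepareStatement(
                    "SELECT symbol, chromosome FROM gene WHERE symbol = ?");
            stmt.setString(1, "TP53");
            ResultSet rs = stmt.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getString("symbol") + " on chromosome "
                        + rs.getString("chromosome"));
            }
            rs.close();
            stmt.close();
            con.close();
        }
    }

Here the DBMS, not the application, takes care of concurrent access, indexing, and integrity constraints.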

Needless to say, there are other ways to exchange data like the Common Object Request

Broker Architecture (CORBA) [15]. This standard provides an intermediary object-


oriented layer which handles access to the data between server and client. Another

recently emerging way to exchange data are web services [16] which will be described

later.

2.1 Data Warehouse and Federated Database System

Genomic management systems make it possible to query data assembled from different

heterogeneous data sources. They are based on two different approaches:

• Data warehouse

• Federated database system

A data warehouse is a collection of data specifically structured for querying and reporting

[17]. Therefore, data has to be imported at regular intervals from the sources of interest. These data constitute a centralized repository, which applications can query efficiently to create reports.

Implemented data marts duplicate content in the data warehouse and allow faster

responses due to much higher granularity of the information. The drawbacks of a data

warehouse are that the timeliness of the content depends on the update interval of the

external data sources. These updates can be very time consuming and may result in higher

storage requirements and operating costs.

Federated database systems overcome these downsides by directly accessing external

data through federated database servers [18]. Integration of external data can be complete

(all data can be accessed) or partial (only information needed is available through the

server). Shortcomings of federated databases are that queries spanning different data

sources at different locations tend to be slow. Due to different query styles, dialects, and

data formats, federated database servers are quite complex.


The Sequence Retrieval System (SRS) [19] initially developed at EMBL and EBI uses an

interesting approach by combining the features of data warehouses and federated

database systems. On the one hand, SRS heavily indexes locally stored genomic flat-file databases; on the other hand, it allows querying of database management systems at different sites. An example of a federated approach is the Mouse Federated Database of the Comparative Mouse Genomics Centers Consortium (http://www.niehs.nih.gov/cmgcc/dbmouse.htm).
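The principle of a federation layer can be sketched in a few lines of Java: the same query is dispatched to several independent database servers and the partial results are merged (the connection URLs, credentials, and the shared gene table layout are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class FederatedQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.postgresql.Driver"); // load a JDBC driver (assumed)
            String[] sites = { "jdbc:postgresql://siteA/genome",
                               "jdbc:postgresql://siteB/genome" };
            List symbols = new ArrayList();
            for (int i = 0; i < sites.length; i++) {
                // dispatch the same query to every member of the federation
                Connection con = DriverManager.getConnection(sites[i], "user", "password");
                Statement stmt = con.createStatement();
                ResultSet rs = stmt.executeQuery("SELECT symbol FROM gene");
                while (rs.next()) {
                    symbols.add(rs.getString("symbol")); // merge the partial results
                }
                con.close();
            }
            System.out.println("Merged " + symbols.size() + " rows from "
                    + sites.length + " sites");
        }
    }

The sequential loop also makes the stated shortcoming visible: every additional remote site adds its full network round trip to the query time.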

3. Software Architecture

To meet the requirements of pharmacogenomic data processing systems, a sophisticated

software architecture has to be employed. Less complex tasks like microarray image

analysis or gene expression clustering can be performed on a commonly used

workstation. In this case applications are installed locally on a client machine where all

computational tasks are performed. Required databases are either installed locally or can

be accessed via the local area network (LAN) or the Internet. This kind of direct client-

server access is characteristic of two-tier systems (Figure 1). In a two-tier architecture

the application uses the data model stored in the enterprise information system (EIS), but

does not create a logical model on top of it. All the business logic is packed into the client

application and therefore increased workstation performance is required as soon as the

applications get more complex or computationally intensive. Furthermore,

applications and database clients have to be deployed and kept up-to-date in order to

adapt to new interfaces on the server side or to add new business logic to the system.

Although there is a technology provided by Sun Microsystems called Java Web Start to


automate this cumbersome task, only a few software vendors are supporting it. In general,

two-tier software application design is ideal for prototyping, for applications known to

have a short life time, or for systems where the Application Programming Interfaces

(APIs) will not change. Typically, this approach is used for small applications where

development costs as well as development time are intended to be low.

Most of the drawbacks of two-tier architectures can be avoided by moving to a three-tier

architecture (Figure 2) with an application server as central component. In a three-tier

architecture the separation of presentation, business, and data source logic becomes the

principal concept [20]. Presentation logic is about how to handle the interaction between

the user and the software. This can be as simple as a command-line or text-based menu

system, a client graphical user interface (GUI), or an HTML-based browser user interface.

The primary responsibility of this layer is to display information to the user and to

interpret commands from the user into actions upon the business and data source logic.

The business logic contains what an application needs to do for the domain it is working

with. It involves calculations based on inputs and stored data, validation of data coming

from the presentation layer, and figuring out exactly what data source logic to dispatch

depending on commands received from the presentation layer. The data source logic or

EIS is about communicating with other systems that carry out tasks on behalf of the

application, like transaction monitors or messaging systems. But for most applications the

biggest piece of data source logic is a database, which is primarily responsible for storing

persistent data. The usage of a three-tier architecture leads to the following advantages:

• easier to modify or replace any tier without affecting the other tiers

(maintenance)


• separating the application and database functionality leads to better load

balancing and therefore supports an increasing number of users or more

demanding tasks

• adequate security policies can be enforced within the server tiers without

hindering the clients
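The separation of concerns can be sketched in a few lines of Java (all class and method names below are invented for illustration): the presentation tier talks only to the business logic, which delegates persistence to the data source tier.

    // Data source logic: the only place that knows how data is stored.
    interface GeneDao {
        String findSequence(String symbol);
    }

    // Business logic: validation and domain rules, independent of storage and UI.
    class GeneService {
        private final GeneDao dao;

        GeneService(GeneDao dao) {
            this.dao = dao;
        }

        String getSequence(String symbol) {
            if (symbol == null || symbol.length() == 0) {
                throw new IllegalArgumentException("gene symbol required");
            }
            return dao.findSequence(symbol);
        }
    }

    // Presentation logic: a trivial text-based client.
    public class GeneClient {
        public static void main(String[] args) {
            GeneDao stub = new GeneDao() { // in-memory stand-in for the database tier
                public String findSequence(String symbol) {
                    return "ATGGCG...";
                }
            };
            System.out.println(new GeneService(stub).getSequence("TP53"));
        }
    }

Because each tier depends only on the interface of the tier below, the in-memory stub can later be replaced by a real database implementation without touching the presentation or business code.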

The two major enterprise development platforms Java 2 Enterprise Edition (J2EE) and

Microsoft .Net are supporting this kind of software architecture. They can be seen as a

stack of common services, like relational database access, messaging, enterprise

components, or support for web services, that each platform provides to their

applications. With this in mind, the question of which platform to use can be answered based on the expertise of the team members, their preferences, and the existing hardware and software infrastructure.

The next step in the evolution of distributed systems is web services. The concept behind them is to build applications not as monolithic systems, but as an aggregation of smaller

systems that work together towards a common purpose. Web services are self-contained,

self-describing, modular applications that can be published, located, and invoked across

the Web [21]. Web services communicate using HTTP and XML and interact with any

other web service using standards like Simple Object Access Protocol (SOAP), Web

Service Description Language (WSDL), and Universal Description Discovery and

Integration (UDDI) services, which are supported by major software suppliers. Web

services are platform independent and can be produced or consumed regardless of the

underlying programming language. The main limitations of web services are the network

speed and round trip time latency. An additional limitation is the use of SOAP as the


protocol, since it is based on XML and HTTP, which degrades performance compared to

other protocols like CORBA.
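Stripped of the supporting frameworks, a web service invocation is simply an XML envelope posted over HTTP, as the following Java sketch using only the standard library illustrates (the endpoint URL and the getSequence operation are hypothetical):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SoapCall {
        public static void main(String[] args) throws Exception {
            String envelope =
                "<?xml version=\"1.0\"?>"
              + "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
              + "<soap:Body><getSequence><symbol>TP53</symbol></getSequence></soap:Body>"
              + "</soap:Envelope>";
            URL url = new URL("http://example.org/services/GeneService");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            OutputStream out = con.getOutputStream();
            out.write(envelope.getBytes("UTF-8")); // send the SOAP request
            out.close();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(con.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // print the SOAP response
            }
            in.close();
        }
    }

The textual envelope and the extra HTTP round trip are exactly where the performance penalty compared to binary protocols such as CORBA originates.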

4. Hardware

Life science is becoming increasingly quantitative as new technologies facilitate

collection and analysis of vast amounts of data ranging from complete genomic

sequences of organisms to three-dimensional protein structure and complete biological

pathways. As a consequence, biomathematics, biostatistics and computational science are

crucial technologies for the study of complex models of biological processes. The quest

for more insight into molecular processes in an organism poses significant challenges for the data analysis and storage infrastructure. Due to the vast amount of available

information, data analysis on genomic or proteomic scale becomes impractical or even

impossible to perform on commonly used workstations. Computer architecture, CPU

performance, amount of addressable and available memory, and storage space are the

limiting factors. Today, high performance computing has become the third leg of

traditional scientific research, along with theory and experimentation. Advances in

pharmacogenomics are inextricably tied to advances in high-performance computing.

4.1 Parallel Processing Systems

The analysis of the enormous amount of available data requires parallel methods and

architectures to solve the computational tasks of pharmacogenomic applications in

reasonable time [22]. State of the art technology comprises three different approaches to

parallel computing:


• Shared memory systems

• Distributed memory systems

• Combination of both systems

4.1.1 Shared Memory Systems

In shared memory systems multiple processors are able to access a large central memory

(e.g. 16, 32, or 64 GBytes) directly through a very fast bus system (Figure 3). This

architecture enables all processors to solve numerical problems sharing the same dataset

at the same time. The communication between processors is performed using the shared

memory pool with efficient synchronization mechanisms, making these systems very suitable for programs with rich inter-process communication. Limiting factors are the relatively low number of processors that can be combined and the high costs.
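The programming model corresponds to ordinary multithreading, as in this Java sketch in which several threads sum over one shared array and synchronize only when combining their partial results (the array contents are arbitrary illustration data):

    public class SharedMemorySum {
        static double[] data = new double[1000000]; // the shared central memory
        static double total = 0;

        public static void main(String[] args) throws InterruptedException {
            java.util.Arrays.fill(data, 1.0);
            int nThreads = 4;
            int chunk = data.length / nThreads;
            Thread[] workers = new Thread[nThreads];
            for (int t = 0; t < nThreads; t++) {
                final int from = t * chunk;
                final int to = (t == nThreads - 1) ? data.length : from + chunk;
                workers[t] = new Thread(new Runnable() {
                    public void run() {
                        double partial = 0;
                        for (int i = from; i < to; i++) {
                            partial += data[i]; // all threads read the same dataset
                        }
                        synchronized (SharedMemorySum.class) {
                            total += partial; // synchronized update of the shared result
                        }
                    }
                });
                workers[t].start();
            }
            for (int t = 0; t < nThreads; t++) {
                workers[t].join();
            }
            System.out.println("Sum = " + total);
        }
    }

On a shared memory machine all workers operate on the same physical memory, so no data has to be copied between them.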

4.1.2 Distributed Memory Systems

In general, these systems consist of clusters of computers, so-called nodes, which are

connected via a high-performance communication network (Figure 4). Using commodity

state-of-the-art calculation nodes and network technology, these systems provide a very

cost-efficient alternative to shared memory systems for divisible, numerically intensive computational problems that have a low communication/calculation ratio. On

the contrary, problems with high inter-processor communication demands can lead to

network congestion, which decreases the overall system performance. If more

performance is needed, this architecture can easily be extended by attaching additional

nodes to the communication network.


4.1.3 Grid Computing

Grid computing is an emerging technology, poised to help the life science community

manage their growing need for computational resources. A compute grid is established by

combining diverse heterogeneous high performance computing systems, specialized

peripheral hardware, PCs, storage, applications, services, and other resources placed over

various locations into a virtual computing environment. For every numerical problem the

appropriate computing facility in a world wide resource pool can be harnessed to

contribute to its solution. A computing grid differs from the earlier described cluster

topology mainly by the fact that there is no central resource management system. In a

grid every node can have its own resource management system and distribution policy.

Grid technologies promise to change the way complex life science problems are tackled

and help to make better use of existing computational resources [23]. Soon, a life scientist

will look at the grid and see essentially one large virtual computer resource built upon

open protocols with everything shared: applications, data, processing power, storage, etc.,

all through a network.

4.1.4 Partitioning

In order to use the parallel features of a high performance computing facility, the

software has to meet parallel demands, too. A numerical problem that has to be solved in

parallel must be divided into subproblems that can be subsequently delegated to different

processors. This partitioning procedure can be done either with so-called domain

decomposition (Figure 5) or functional decomposition (Figure 6).

The term domain decomposition describes the approach to partition the input data and to

process the same calculation on each available processor. Most parallel implementations of algorithms are based on this approach, dividing the genomic databases into pieces and calculating, e.g., the sequence alignment of a given sequence on a subpart of the database. The second and simplest way to implement domain decomposition on a

parallel computing system is to take sequentially programmed applications and execute

them on different nodes with different parameters. An example is to run the well-known

BLAST [24] with different sequences against one database by giving every node another

sequence to calculate. This form of application parallelization is called swarming and

does not need any adaptation of existing programs.
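A schematic Java example of domain decomposition, in which a plain substring test stands in for a real alignment algorithm such as BLAST and the miniature sequence database is invented for illustration: the database is split into pieces and every worker searches only its own piece.

    public class DomainDecomposition {
        public static void main(String[] args) throws InterruptedException {
            final String query = "GATTACA";
            final String[] database = { "TTGATTACAA", "CCCGGG",
                                        "GATTACAGATTACA", "AAAA" };
            int nWorkers = 2;
            final int[] hits = new int[nWorkers]; // one result slot per worker
            Thread[] workers = new Thread[nWorkers];
            int chunk = (database.length + nWorkers - 1) / nWorkers;
            for (int w = 0; w < nWorkers; w++) {
                final int id = w;
                final int from = w * chunk;
                final int to = Math.min(database.length, from + chunk);
                workers[w] = new Thread(new Runnable() {
                    public void run() {
                        // each worker scans only its own partition of the database
                        for (int i = from; i < to; i++) {
                            if (database[i].indexOf(query) >= 0) {
                                hits[id]++;
                            }
                        }
                    }
                });
                workers[w].start();
            }
            int total = 0;
            for (int w = 0; w < nWorkers; w++) {
                workers[w].join();
                total += hits[w]; // collect the partial results
            }
            System.out.println(total + " database sequences contain the query");
        }
    }

The same partitioning logic carries over directly from threads to cluster nodes; only the collection of the partial results then requires communication over the network.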

On the other hand, functional decomposition is based on the decomposition of the

computation process. This can be done by discovering disjoint functional units in a

program or algorithm and sending these subtasks to different processors (Figure 6).

Finally, in some parallel implementations combinations of both techniques are used, so that functionally decomposed units calculate domain-parallelized sub-tasks.

4.2 Data Storage

Drug discovery related data storage and information management requirements are

doubling in size every six to eight months, more than twice as fast as Moore’s Law

predictions for microprocessor transistor counts. For life science organizations, data is

necessary, but not sufficient for organizational success. They must generate information: meaningful, actionable, organized, and reusable data. Data must be stored, protected,

secured, organized, distributed, and audited, all without interruption.

State of the art storage architecture comprises the following solutions:

• Directly attached storage (DAS)


• Network attached storage (NAS)

• Storage area networks (SAN)

• Internet SCSI (iSCSI)

4.2.1 Directly Attached Storage

This historically first and very straightforward method can be seen today in every PC:

hard disks, floppy disks, CD-ROM or DVDs are attached directly to the main host using

short internal cables. Although in the mainframe arena storage devices, hard disks or tape

drives are separate boxes connected to a host, this configuration is from a functional

perspective equivalent to standard PC technology. DAS is optimized for single, isolated

processor systems and small data volumes delivering good performance at low initial

costs.

4.2.2 Network Attached Storage

NAS is defined as storage elements which are connected to a network providing file

access services to computer systems. These devices are attached directly to the existing

local area network (LAN) using standard TCP/IP protocols. NAS systems have intelligent

controllers built in, which are actually small servers with stripped-down operating systems, to

exploit LAN topology and grant access to any user running any operating system.

Integrated NAS appliances are discrete pooled disk storage subsystems, optimized for

ease-of-management and file sharing, using lower-cost, IP-based networks.


4.2.3 Storage Area Networks

A SAN is defined as a specialized, dedicated high-speed network whose primary purpose

is the transfer of data between and among computer systems and storage elements. Fibre

Channel is the de facto SAN standard network protocol, although other network

standards like iSCSI could be used. SAN is a robust storage infrastructure, optimized for

high performance and enterprise-wide scalability.

4.2.4 Internet SCSI (iSCSI)

SCSI is a collection of standards which define I/O buses primarily intended for

connecting storage subsystems or devices to hosts through host bus adapters. iSCSI is a new, emerging technology based on the idea of encapsulating SCSI commands in TCP/IP (the most widely used protocol to establish a connection between hosts

and exchange data) packages and sending them through standard IP based networks.

With this approach iSCSI storage elements can exist anywhere on the LAN and any

server talking the iSCSI protocol can access them.


5. Conclusion

A pharmacogenomic data management system has to combine public and proprietary

genomic databases, clinical data sets, and results from high-throughput screening

technologies. Currently, the most important publicly available biological databases require disk space on the order of one Terabyte (1000 Gigabytes). Considering the

exponential growth of data, it can be expected that the storage requirements for

proteomics will claim Petabytes (1000 Terabyte). Even more, systems for personalized

medicine will be in the range of Exabytes (1000 Petabytes). Assuming that storage capacity doubles every year, and thus grows roughly a thousandfold (2^10 = 1024) within a decade, it is imaginable that in ten years working with Petabytes will be a standard procedure in many institutions. To facilitate the management, handling, and

processing of this vast amount of data, such systems should comprise data mining tools

embedded in a high performance computing environment using parallel processing

systems, sophisticated storage technologies, network technologies, database and database

management systems, and application services. Integration of patient information

management systems with genomic databases as well as other laboratory and patient-

relevant data will represent significant challenges for designers and administrators of

pharmacogenomic information management systems. Unfortunately, the lack of

international as well as national standards in clinical information systems will require the development of region-specific systems. Additionally, all arising security issues concerning the sensitivity of certain types of information have to be solved in a proper manner. To address all these issues, considerable endeavors have to be undertaken to provide the necessary powerful infrastructure to fully exploit the promises

of the postgenomic era.


Acknowledgments

The authors express their appreciation to the staff of the Institute for Genomics and

Bioinformatics for valuable comments and contributions. This work was supported by

bm:bwk, GEN-AU:BIN, Bioinformatics Integration Network.


References

1. Lander ES et al. Initial sequencing and analysis of the human genome. Nature. 409: 860-921 (2001)

2. Venter JC et al. The sequence of the human genome. Science. 291: 1304-1351 (2001)

3. Collins FS, Green ED, Guttmacher AE, Guyer MS. A vision for the future of genomics research. Nature. 422: 835-847 (2003)

4. Forster J, Gombert AK, Nielsen J. A functional genomics approach using metabolomics and in silico pathway analysis. Biotechnol Bioeng. 79: 703-712 (2002)

5. Mancinelli L, Cronin M, Sadee W. Pharmacogenomics: the promise of personalized medicine. AAPS PharmSci. 2(1): E4 (2000)

6. Boguski MS, McIntosh MW. Biomedical informatics for proteomics. Nature. 422: 233-237 (2003)

7. Benson DA, Boguski MS, Lipman DJ, Ostell J. GenBank. Nucleic Acids Res. 25: 1-6 (1997)

8. Kanehisa M, Bork P. Bioinformatics in the post-sequence era. Nat Genet. 33 Suppl: 305-310 (2003)

9. Falquet L, Pagni M, Bucher P, Hulo N, Sigrist CJ, Hofmann K, Bairoch A. The PROSITE database, its status in 2002. Nucleic Acids Res. 30: 235-238 (2002)

10. Mulder NJ et al. The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res. 31: 315-318 (2003)

11. Stein L. Creating a bioinformatics nation. Nature. 417: 119-120 (2002)

12. Steedman D. ASN.1: The Tutorial and Reference. Technology Appraisals, Twickenham, UK (1993)


13. Achard F, Vaysseix G, Barillot E. XML, bioinformatics and data integration. Bioinformatics. 17: 115-125 (2001)

14. Codd EF. The Relational Model for Database Management: Version 2. Addison-Wesley, Reading, MA, USA (1990)

15. Hu J, Mungall C, Nicholson D, Archibald A. Design and implementation of a CORBA-based genome mapping system prototype. Bioinformatics. 14: 112-120 (1998)

16. Stein LD. Integrating biological databases. Nat Rev Genet. 4: 337-345 (2003)

17. Kimball R. The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. John Wiley & Sons, New York, USA (1996)

18. Sheth AP, Larson JA. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys. 22: 183-236 (1990)

19. Zdobnov EM, Lopez R, Apweiler R, Etzold T. The EBI SRS server - recent developments. Bioinformatics. 18: 368-373 (2002)

20. Fowler M et al. Patterns of Enterprise Application Architecture. Addison-Wesley (2002)

21. Thallinger GG, Trajanoski S, Stocker G, Trajanoski Z. Information management systems for pharmacogenomics. Pharmacogenomics. 3: 651-667 (2002)

22. Buyya R. High Performance Cluster Computing: Architectures and Systems (Vols. 1 & 2). Prentice Hall, NJ, USA (1999)

23. Avery P. Data Grids: a new computational infrastructure for data-intensive science. Philos Transact Ser A Math Phys Eng Sci. 360: 1191-1209 (2002)


24. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 215: 403-410 (1990)


Figure 1: Two-Tier Architecture

In a two-tier architecture the application logic is implemented in the application client, which directly connects to the Enterprise Information System (Database).
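A minimal Java sketch of this pattern, assuming a JDBC driver on the classpath and a hypothetical database URL, table, and credentials: the client itself opens the database connection and runs the query, so presentation, logic, and data access all live in one program.

import java.sql.*;

public class TwoTierClient {
    public static void main(String[] args) throws SQLException {
        // The client connects straight to the database; URL, credentials,
        // and the "experiment" table are placeholders for illustration.
        String url = "jdbc:postgresql://dbhost:5432/marsdb";
        try (Connection con = DriverManager.getConnection(url, "user", "secret");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT name FROM experiment")) {
            while (rs.next()) {
                System.out.println(rs.getString("name")); // application logic in the client
            }
        }
    }
}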


Figure 2: Three-Tier Architecture

A three-tier architecture enforces the separation of the presentation, business, and data tiers. This architecture is intended to allow any of the three tiers to be upgraded or replaced independently as requirements change.
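The separation can be made concrete with a small Java sketch (interface and class names are our own, chosen for illustration): the presentation tier sees only the business interface, and the business tier sees only the data-access interface, so any tier can be swapped out without touching the others.

import java.util.List;

// Data tier: hides how and where experiments are stored.
interface ExperimentDao {
    List<String> findAllNames();
}

// Business tier: application logic, independent of storage and UI.
class ExperimentService {
    private final ExperimentDao dao;
    ExperimentService(ExperimentDao dao) { this.dao = dao; }
    List<String> listExperiments() {
        return dao.findAllNames(); // validation, authorization, etc. would go here
    }
}

// Presentation tier: talks only to the business tier.
public class ThreeTierDemo {
    public static void main(String[] args) {
        ExperimentDao stubDao = () -> List.of("exp-1", "exp-2"); // in-memory stand-in for a database
        ExperimentService service = new ExperimentService(stubDao);
        service.listExperiments().forEach(System.out::println);
    }
}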


Figure 3: Shared Memory Systems

A shared memory system consists of multiple processors that are able to access a large central memory directly through a very fast bus system.
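In software terms this corresponds to the threading model: several threads read the same data structure directly, with no data transfer between processors. A minimal Java sketch (array size and thread count chosen arbitrarily):

import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

public class SharedMemoryDemo {
    public static void main(String[] args) throws InterruptedException {
        // One large array in central memory, directly visible to all threads.
        int[] data = IntStream.rangeClosed(1, 1_000_000).toArray();
        AtomicLong total = new AtomicLong();
        int nThreads = 4;
        int chunk = data.length / nThreads;
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int from = t * chunk;
            final int to = (t == nThreads - 1) ? data.length : from + chunk;
            workers[t] = new Thread(() -> {
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i]; // direct shared-memory access
                total.addAndGet(sum);
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        System.out.println("Sum = " + total.get()); // 500000500000
    }
}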


Figure 4: Distributed Memory Systems

In a distributed memory architecture the various computing devices (e.g. PCs) have their own local memory and perform calculations on distributed problems. Input data and results are exchanged via a high-performance inter-process communication network.
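In contrast to the shared memory sketch above, nodes in a distributed memory system can exchange data only by sending messages. The following Java sketch simulates a master and a worker "node" inside one process, communicating over a TCP socket; in a real cluster the worker would run on another machine, and a message-passing library such as MPI would replace this hand-written exchange.

import java.io.*;
import java.net.*;

public class DistributedMemoryDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) { // worker listens on a free port
            Thread worker = new Thread(() -> {
                try (Socket s = server.accept();
                     DataInputStream in = new DataInputStream(s.getInputStream());
                     DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
                    int n = in.readInt();                // input data arrives as messages
                    long sum = 0;
                    for (int i = 0; i < n; i++) sum += in.readInt();
                    out.writeLong(sum);                  // result is sent back as a message
                } catch (IOException e) { throw new UncheckedIOException(e); }
            });
            worker.start();

            // Master: distributes the input data and collects the result.
            try (Socket s = new Socket("localhost", server.getLocalPort());
                 DataOutputStream out = new DataOutputStream(s.getOutputStream());
                 DataInputStream in = new DataInputStream(s.getInputStream())) {
                int[] data = {1, 2, 3, 4, 5};
                out.writeInt(data.length);
                for (int v : data) out.writeInt(v);
                out.flush();
                System.out.println("Partial sum from worker: " + in.readLong()); // 15
            }
            worker.join();
        }
    }
}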


Figure 5: Domain Decomposition

Domain or data decomposition is a computational paradigm in which the data to be processed is partitioned and distributed across different nodes, each applying the same operation to its part.
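A compact Java illustration (the data values are arbitrary): one transformation is applied to every element, and the parallel stream machinery partitions the array across the available processors, which is the domain decomposition pattern on a single shared-memory machine.

import java.util.Arrays;

public class DomainDecompositionDemo {
    public static void main(String[] args) {
        double[] intensities = {1.2, 3.4, 2.2, 5.6, 4.1, 0.9};
        // One operation (a log2 transformation, common for microarray
        // ratios), distributed over slices of the data by the runtime.
        double[] transformed = Arrays.stream(intensities)
                                     .parallel()
                                     .map(v -> Math.log(v) / Math.log(2))
                                     .toArray();
        System.out.println(Arrays.toString(transformed));
    }
}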


Figure 6: Functional Decomposition

Functional decomposition divides the computational problem into functional units, which are distributed onto different worker nodes, each processing the same data.
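Mirroring the previous sketch, the following Java example (again with arbitrary data) shows functional decomposition: three different functions run concurrently, each over the same data set.

import java.util.concurrent.*;
import java.util.stream.DoubleStream;

public class FunctionalDecompositionDemo {
    public static void main(String[] args) throws Exception {
        double[] data = {1.2, 3.4, 2.2, 5.6, 4.1, 0.9};
        ExecutorService pool = Executors.newFixedThreadPool(3);
        // Each worker applies a *different* function to the *same* data.
        Future<Double> min  = pool.submit(() -> DoubleStream.of(data).min().orElseThrow());
        Future<Double> max  = pool.submit(() -> DoubleStream.of(data).max().orElseThrow());
        Future<Double> mean = pool.submit(() -> DoubleStream.of(data).average().orElseThrow());
        System.out.printf("min=%.2f max=%.2f mean=%.2f%n", min.get(), max.get(), mean.get());
        pool.shutdown();
    }
}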