Collaborative Bioinformatics: Data Warehouses for Targeted Experimental Results



JOURNAL OF INTERFERON AND CYTOKINE RESEARCH 18:799-802 (1998)
Mary Ann Liebert, Inc.
SPECIAL REPORT

JAMES M. SORACE¹,² and KIP CANFIELD¹

ABSTRACT

Current functional bioinformatics approaches are handicapped by the inability to store functional data at all or by a scattering of data across heterogeneous databases that are difficult to link and query. The Cellular Response Database (CRD) (http://LHI5.umbc.edu/crd) is designed to store and retrieve data concerning changes in in vitro cellular functions associated with stimuli, such as cytokines and drugs. The database can store a broad range of data, including protein or mRNA expression, as well as functional cellular data, such as apoptosis or adherence. This unique ability to store heterogeneous data using a single data model will minimize difficulties associated with searching multiple databases. Authors with articles accepted by participating journals are invited to submit data to the CRD. Submission instructions are outlined, along with a review of the CRD's development.

INTRODUCTION

The central opportunity in bioinformatics is to determine how to store information so that future discoveries can be made. To date, published articles are the principal form in which scientific information is exchanged. Keyword search engines, such as PubMed, allow users to find articles based on Boolean combinations of MeSH headings, author, or keyword string searches. More recently, interfaces, such as Entrez, allow users to cross-reference manuscripts with GenBank sequence entries. As biology moves into the postgenome era, there is a need to develop better systems for the storage, retrieval, and interpretation of biological information.

To address some of these issues, we have proposed the Cellular Response Database (CRD) (http://LHI5.umbc.edu/crd).(1) This database supports queries linking a test cell population's expression of proteins and other traits to the experimental conditions in which they were measured. This is an advantage over the search strategies outlined above. For example, keyword searches do not enable a user to clearly specify the context in which a cytokine is used. Thus, does the query "IFN-gamma upregulated" retrieve the genes that IFN-γ upregulates, or does it retrieve conditions in which IFN-γ is itself increased? The CRD is designed to specifically address this issue, as well as provide a framework for other more complex information operations. This article familiarizes readers with the background of the CRD and gives detailed directions for data entry. A goal of this effort is to develop an on-line group of collaborators interested in developing bioinformatics tools to address complex issues of cytokine regulation.

CRD OVERVIEW

Central to the CRD is the concept of an agent (Fig. 1). For the purpose of this article, the term "agent" is limited to soluble molecules (e.g., drugs, cytokines, hormones, or other biomolecules) that are added to the culture supernatant of a test cell population. An example is to use lipopolysaccharide (LPS) as an agent to induce tissue factor (a procoagulant molecule) production in human macrophages. In this case, LPS is an agent, human macrophages represent the test cell population, and tissue factor represents the target gene/protein.

FIG. 1. Experimental notation for the cellular response database. Agents A1 to An are added to identical test cell populations at time (T) and concentrations (Cx). The test agent At (bold arrow) is added only to the experimental group. The control agent Ac (dashed arrow) may be added to the control group. A biologic response between the groups is then measured. (From ref. 1, with permission.) [Diagram not reproduced; its two panels are labeled "Experimental Group" and "Control Group".]

¹Departments of Information Systems, University of Maryland, Baltimore County, and ²The Department of Pathology and Laboratory Service, Baltimore VA Medical Center, Baltimore, MD 21201.

However, to model more complicated experiments, the concept of an agent needs to be further defined. Suppose a scientist is testing a drug to determine if it inhibits the production of tissue factor by LPS. Two agents would be used, the drug and LPS, but their roles are different. LPS would be applied to both the experimental and control groups of human macrophages. The drug would be applied only to the experimental population, and the difference in tissue factor production between the two groups would be determined. To model these more complex experimental designs, the CRD recognizes three types of agents.

1. Test agents: These are added only to the test group of the experiment. In the second example given in the preceding paragraph, the experimental drug is the test agent.

2. Additional agents: These are added under identical conditions to both the test and control group of cells. In the second example in the preceding paragraph, LPS is an additional agent.

3. Control agents: This represents a group of agents that are added only to the control population. Representative examples include control oligonucleotides in antisense experiments and control monoclonal antibodies (mAb) if the test agent is also an mAb.

An experiment must have a single test agent. It may, however, have many additional agents, one of which can be a control agent. In the second example given above, the researcher could adequately describe the experimental findings (including relevant controls) by entering the following three assays into the database.

Assay 1: Contains the data regarding changes in tissue factor production in human macrophages with LPS as the test agent.

Assay 2: Contains the data regarding changes in tissue factor production in human macrophages with the experimental drug as the test agent.

Assay 3: Contains the data regarding changes in tissue factor production in human macrophages with the experimental drug as the test agent and LPS as an additional agent.

Reference 1 contains additional examples of this notation and a detailed description of the relational CRD data model. This reference is available via the Internet.
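The agent notation and the three assays above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (AgentRole, AgentUse, Assay) and the validate method are hypothetical and are not taken from the CRD data model described in reference 1. The only rule enforced is the one stated in the text, that an assay has a single test agent.

```python
from dataclasses import dataclass, field
from enum import Enum


class AgentRole(Enum):
    """The three agent types described in the text (names are hypothetical)."""
    TEST = "test"              # added only to the experimental group
    ADDITIONAL = "additional"  # added identically to both groups
    CONTROL = "control"        # added only to the control group


@dataclass(frozen=True)
class AgentUse:
    name: str
    role: AgentRole


@dataclass
class Assay:
    cell_population: str
    target: str
    agents: list[AgentUse] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError unless the assay has exactly one test agent."""
        n_test = sum(1 for a in self.agents if a.role is AgentRole.TEST)
        if n_test != 1:
            raise ValueError(f"expected exactly one test agent, found {n_test}")

    def test_agent(self) -> str:
        self.validate()
        return next(a.name for a in self.agents if a.role is AgentRole.TEST)


CELLS = "human macrophages"
TARGET = "tissue factor"

# Assays 1 to 3 from the text, in order.
assays = [
    Assay(CELLS, TARGET, [AgentUse("LPS", AgentRole.TEST)]),
    Assay(CELLS, TARGET, [AgentUse("experimental drug", AgentRole.TEST)]),
    Assay(CELLS, TARGET, [AgentUse("experimental drug", AgentRole.TEST),
                          AgentUse("LPS", AgentRole.ADDITIONAL)]),
]

for assay in assays:
    assay.validate()
```

Keeping the role on each agent use, rather than on the agent itself, mirrors the point made above: LPS is the test agent in Assay 1 but an additional agent in Assay 3.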

In addition to the agent notation, the CRD supports the following features:

1. It is not limited to a specific type of test cell population or specific category of agents.

2. This database supports a wide variety of user-defined assays. For example, a change in a specific gene's expression (e.g., tissue factor) may be measured by methods that detect changes in mRNA levels, protein levels, or activity levels. In addition, general cellular properties, such as viability, adherence, or proliferation, may be entered.

3. The CRD allows the pattern of response (i.e., upregulation or downregulation) of the assay to be queried.

4. It can also support the quantitative entry of numerical data from a broad variety of detection systems. Single point, dose-response, and kinetic experiments can be supported.

These features allow the system to store the relationship between changes in the level of expression of target mRNAs/proteins and changes of cellular functional activity. As biologic research shifts from the reduction of biologic pathways to understanding their interrelationships, this type of ability will become increasingly important.
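To make the distinction between a pattern of response and the underlying quantitative data concrete, the following sketch classifies a set of data points as single point, dose-response, or kinetic, and derives an up/down/unchanged label from a test-versus-control ratio. Everything here is an assumption made for illustration: the DataPoint fields, the 10% tolerance, and the labels "DOSE_RESPONSE" and "MIXED" are hypothetical, and in the CRD the pattern of change is entered by the submitter rather than computed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataPoint:
    response_test: float                   # response in the experimental group
    response_control: float                # response in the control group
    time_h: Optional[float] = None         # time of measurement, in hours
    concentration: Optional[float] = None  # test agent concentration


def data_set_type(points: list[DataPoint]) -> str:
    """Label a data set by which experimental variable changes across points."""
    if len(points) == 1:
        return "SINGLEPOINT"
    times = {p.time_h for p in points}
    concentrations = {p.concentration for p in points}
    if len(times) > 1 and len(concentrations) <= 1:
        return "KINETIC"
    if len(concentrations) > 1 and len(times) <= 1:
        return "DOSE_RESPONSE"
    return "MIXED"


def pattern_of_change(point: DataPoint, tolerance: float = 0.10) -> str:
    """Compare test to control; the tolerance is an arbitrary illustrative cutoff."""
    if point.response_control == 0:
        raise ValueError("control response must be nonzero to form a ratio")
    ratio = point.response_test / point.response_control
    if ratio > 1 + tolerance:
        return "upregulated"
    if ratio < 1 - tolerance:
        return "downregulated"
    return "unchanged"


# Placeholder numbers, not experimental results.
kinetic = [DataPoint(120.0, 100.0, time_h=t, concentration=10.0) for t in (1.0, 4.0, 24.0)]
dose = [DataPoint(120.0, 100.0, time_h=24.0, concentration=c) for c in (1.0, 10.0, 100.0)]
```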

GENERAL SUBMISSION INSTRUCTIONS

To be useful, a database must be integrated with a simple set of data entry tools so that a critical mass of information can be entered. This section gives an overview of data entry into the CRD. Investigators from participating journals are invited to enter data from accepted articles. This will allow testing and refinement of the data entry process. A complete description of each data entry form is available at the website. A useful rule is that the level of data entry should correspond to the tables, figures, and their legends describing the experiment in the corresponding report.

The CRD contains two types of tables: library tables and experimental result tables. The library tables contain data on all agents (e.g., cytokines and drugs), test cell populations, target proteins/genes, and the methods used to measure the test cell population's response. Data can be entered into these tables using the corresponding library forms. Once the appropriate information is added to the library tables, it is always available for future use and need not be entered again. Before entering data into the library tables, the user should first consult the website's library query menu, both to see the type of data called for in data entry and to determine whether the information already exists. If an accurate match can be found, the corresponding library form need not be submitted. These queries include Agent Library, Test Cell Library, Target Gene/Protein Library, Reference Library, and Measurement Method Library.
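The "query the library first, add only if no match exists" rule can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the CRD schema: the table and column names are assumptions (loosely modeled on the field labels visible in Figure 2), the helper get_or_add is hypothetical, and only the library side is shown, not the experimental result tables.

```python
import sqlite3

# Hypothetical library tables and columns; the real CRD model is in reference 1.
LIBRARY_COLUMNS = {
    "agent_lib": ("name", "category", "species", "supplier"),
    "test_cell_lib": ("name", "cell_type", "species"),
    "target_lib": ("name", "species", "genbank_id"),
    "method_lib": ("classification", "name", "detects", "response_units"),
    "reference_lib": ("author", "journal", "volume", "page", "pub_date"),
}


def create_library_tables(conn: sqlite3.Connection) -> None:
    """Create one table per library, each with an integer key and text columns."""
    for table, cols in LIBRARY_COLUMNS.items():
        col_sql = ", ".join(f"{c} TEXT" for c in cols)
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, {col_sql})")


def get_or_add(conn: sqlite3.Connection, table: str, **fields: str) -> int:
    """Return the id of a matching library entry, inserting it only if absent."""
    cols = LIBRARY_COLUMNS[table]  # raises KeyError for an unknown table
    unknown = set(fields) - set(cols)
    if unknown:
        raise ValueError(f"unknown fields for {table}: {sorted(unknown)}")
    values = [fields.get(c) for c in cols]
    # "IS" is SQLite's NULL-safe equality, so omitted fields match stored NULLs.
    where = " AND ".join(f"{c} IS ?" for c in cols)
    row = conn.execute(f"SELECT id FROM {table} WHERE {where}", values).fetchone()
    if row is not None:
        return row[0]
    placeholders = ", ".join("?" for _ in cols)
    cur = conn.execute(
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})", values
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
create_library_tables(conn)
first_id = get_or_add(conn, "agent_lib", name="TGF-beta1", category="cytokine", species="human")
second_id = get_or_add(conn, "agent_lib", name="TGF-beta1", category="cytokine", species="human")
agent_count = conn.execute("SELECT COUNT(*) FROM agent_lib").fetchone()[0]
```

The second call finds the existing entry, so the library holds a single TGF-beta1 row and both calls return the same id.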

Once the libraries are complete, adding a new assay is a very simple process. First, enter the general experimental design using the "Add New Assay: Experimental Design Form." This step is required. Next, if the experiment uses additional agents, this information must be entered using the "Add New Assay: Non-Test/Control Agents Form." Finally, additional data, such as the time points, test agent concentration, and quantitative response values for each experimental point, can be entered using the "Add New Assay: Quantitative Data Form." Use of this form is not required but is highly recommended.
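The three-form sequence just described amounts to a small checklist, which the sketch below expresses as code. The Submission class and check_submission function are hypothetical names introduced only for illustration; the CRD itself collects this information through web forms.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Submission:
    design: Optional[dict] = None        # Experimental Design Form (required)
    uses_additional_agents: bool = False
    non_test_agents: list = field(default_factory=list)      # Non-Test/Control Agents Form
    quantitative_points: list = field(default_factory=list)  # Quantitative Data Form


def check_submission(sub: Submission) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) according to the submission rules in the text."""
    errors, warnings = [], []
    if sub.design is None:
        errors.append("Experimental Design Form is required")
    if sub.uses_additional_agents and not sub.non_test_agents:
        errors.append("Non-Test/Control Agents Form is required when additional agents are used")
    if not sub.quantitative_points:
        warnings.append("Quantitative Data Form is optional but highly recommended")
    return errors, warnings


incomplete = Submission(design={"test_agent": "experimental drug"}, uses_additional_agents=True)
errors, warnings = check_submission(incomplete)
```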

EXAMPLE DATA FLOWS

A journal such as this receives many submissions that report the results of experiments similar in design to those described in Figure 1. A researcher who wishes to become familiar with specific results must deal with the information in the published article format. This is an efficient mechanism for learning the details of experimental results but is not efficient for summarizing trends across experimental efforts. This section gives an example of the data flows and programs needed to create data warehouses(2,3) that allow researchers to issue focused queries about specific experimental results. Once the trends are established, the researcher may have to go back to the articles for important details. The example given here is conceptually useful for creating other data warehouses in similarly structured areas of experimental effort.

A researcher who has an article accepted for publication would be required to submit a subset of the results using a system such as the CRD. This is a simple task that should not take more than 20 minutes of the researcher's time. All forms are web based, and the databases are maintained at the repository site. For example, an article reported the loss of transforming growth factor-β1 (TGF-β1)-induced growth inhibition in human vascular smooth muscle cells derived from atherosclerotic lesions.(4) A screen shot of a major form used in entering this example is shown in Figure 2. In a fascinating follow-up article,(5) the same group of investigators demonstrated that this loss in responsiveness was caused by genetic instability of the A10 microsatellite region of the type II receptor for TGF-β1. This mutation has also been described in malignant cell lines, which are also unresponsive to TGF-β1.

FIG. 2. The "Add New Assay: Experimental Design Form" is shown. The information captured in this screen shot is part of the experimental design found in Figure 1 in reference 4.

[Screen shot not reproduced. Legible form fields: Cell Population: Vascular Smooth Muscle Atherosclerosis (vascular smooth muscle, human). Test Agent: TGF-Beta1 (cytokine, human). Control Agent: null. Agent Concentration Units: ng/ml. Method of Assay: Cellular Function, [H3] Thymidine Incorporation, DNA Synthesis, DNA Synthesis % Control. Target Gene/Protein: null. Pattern of Change: basal level, expression unchanged. Data Set Type: SINGLEPOINT or KINETIC. Date: 1/6/98. Reference: McCaffrey TA, J. Clin. Invest. 96, 2667, 12/95. The form's help text lists the library field values for Method of Assay (Classification, Name, Detects, Response Units, Supplier, Cat. #), Target Gene/Protein (Name, Species, GenBank ID), and Reference (Author, Journal, Vol., Page, Date).]


After many experiments are entered in this way, researchers can access this database to

1. Find all cell types tested against TGF-β1

2. Find all patterns of response for TGF-β1 growth inhibition in the cell types noted

3. For all cell types that lacked TGF-β1-induced growth inhibition, list all target genes/proteins for which there are additional data.
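The three queries above can be sketched against a toy table using Python's built-in sqlite3 module. This is an illustration under stated assumptions, not the CRD's query interface: the single flat assay table, its column names, and the function names are hypothetical, and apart from the first row (which paraphrases the finding of reference 4) the rows are invented placeholders rather than experimental results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE assay (
        id INTEGER PRIMARY KEY,
        cell_type TEXT NOT NULL,
        test_agent TEXT NOT NULL,
        method TEXT NOT NULL,
        target TEXT,            -- NULL for whole-cell functional assays
        pattern TEXT NOT NULL   -- 'upregulated', 'downregulated', or 'unchanged'
    )
    """
)
rows = [
    # Paraphrases reference 4: TGF-beta1 growth inhibition is lost in lesion cells.
    ("VSMC-lesion", "TGF-beta1", "growth", None, "unchanged"),
    # Invented placeholder rows, not real results.
    ("cell-B", "TGF-beta1", "growth", None, "downregulated"),
    ("VSMC-lesion", "agent-Y", "protein level", "target-X", "upregulated"),
    ("cell-C", "agent-Y", "protein level", "target-X", "downregulated"),
]
conn.executemany(
    "INSERT INTO assay (cell_type, test_agent, method, target, pattern) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)


def cell_types_tested(agent: str) -> list[str]:
    """Query 1: all cell types tested against the given agent."""
    sql = "SELECT DISTINCT cell_type FROM assay WHERE test_agent = ?"
    return [r[0] for r in conn.execute(sql, (agent,))]


def growth_patterns(agent: str) -> dict[str, str]:
    """Query 2: pattern of growth response per cell type for the given agent."""
    sql = "SELECT cell_type, pattern FROM assay WHERE test_agent = ? AND method = 'growth'"
    return dict(conn.execute(sql, (agent,)).fetchall())


def targets_in_unresponsive_cells(agent: str) -> list[tuple[str, str]]:
    """Query 3: targets with data in cell types whose growth the agent leaves unchanged."""
    sql = """
        SELECT DISTINCT cell_type, target FROM assay
        WHERE target IS NOT NULL
          AND cell_type IN (
              SELECT cell_type FROM assay
              WHERE test_agent = ? AND method = 'growth' AND pattern = 'unchanged'
          )
    """
    return conn.execute(sql, (agent,)).fetchall()
```

Because the test agent is a dedicated column, the query "cell types tested against TGF-beta1" cannot be confused with "conditions that change TGF-beta1", which is the ambiguity of keyword search raised in the Introduction.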

No computer technical skills are required by any participants except for those who maintain the database. The queries are simple and form based. Users need only to access the data entry and query forms via the Internet. No software is required other than a standard web browser. Initially, data will be stored in a centralized CRD. In the future, there could be repositories maintained for specific areas of interest (by universities or other organizations), and journal editors in those areas of interest would manage the data input. By standardizing on the CRD data model, a researcher could design queries that would be valid regardless of the repository's site.

DISCUSSION

Biologic systems contain complex patterns of control. At the center of these systems are the structure and activity of the genes encoding proteins that participate in cellular functions. These activities are in turn regulated by protein-protein and protein-ligand interactions. By allowing correlation of mRNA and protein expression with functional assays, such as cellular adherence, the CRD will support knowledge acquisition. This will become especially important as research shifts from ordering events of isolated biologic pathways to analyzing interrelationships among pathways. The developmental biology community is a leader in trying to correlate the changes of gene expression with functionally complex events, such as embryogenesis(6,7) and neurologic development.(8) The cytokine network, with its pivotal role in disease and normal biology, also represents an opportunity for this type of effort. In the CRD, the functional correlate is not the developmental progression of an organ but the response of a defined cell population to the in vitro experimental design of the investigator. These data will complement the in vivo data sets from such efforts.

We anticipate growth of the CRD in several areas. First, the data entry tools, presented here, will need to be tested and refined. This is a necessary requirement if the database is to be populated to a useful extent. Second, the CRD will need to be extended to allow multiple molecular targets to be measured per assay. An example would be DNA array technology to establish patterns of mRNA expression. Next, the development of data dictionaries and notations that improve data input, especially of functional assay data, will be a priority. Finally, the ability to treat transfected genes, and mutations of genes, as a test agent will further extend the utility of this database.

The CRD is very efficient and effective for data reporting of a specific but important type of experimental design. The kinds of data warehouses that result will be of great use to researchers who need trend and summary information from a large group of experiments. For data entry of other types of experiments, the CRD data model may not hold. We are working on a client-based Extensible Markup Language (XML) model for a broader range of experiments that can be transmitted to the repository site for storage and export to a data warehouse.(3) The XML objects are plain text and are easily e-mailed or otherwise transferred to the repository.(9) Once at the repository, because they are highly structured objects, they can be viewed with specialized viewers or exported to specialized databases, such as the CRD.
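As a rough illustration of why plain-text XML objects are convenient to transmit and re-import, the sketch below serializes one assay description to XML and parses it back using Python's standard xml.etree.ElementTree module. The element names (assay, cell_population, and so on) are hypothetical and do not represent the XML model under development; the field values are taken from the Figure 2 example.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, chosen only for this example.
SCALAR_FIELDS = ("cell_population", "test_agent", "method_of_assay", "pattern_of_change")


def assay_to_xml(assay: dict) -> str:
    """Serialize an assay description to a plain-text XML string."""
    root = ET.Element("assay")
    for name in SCALAR_FIELDS:
        ET.SubElement(root, name).text = assay[name]
    agents = ET.SubElement(root, "additional_agents")
    for agent in assay.get("additional_agents", []):
        ET.SubElement(agents, "agent").text = agent
    return ET.tostring(root, encoding="unicode")


def xml_to_assay(text: str) -> dict:
    """Parse the XML produced by assay_to_xml back into a dictionary."""
    root = ET.fromstring(text)
    assay = {name: root.findtext(name) for name in SCALAR_FIELDS}
    assay["additional_agents"] = [a.text for a in root.iter("agent")]
    return assay


example = {
    "cell_population": "Vascular Smooth Muscle Atherosclerosis, human",
    "test_agent": "TGF-Beta1",
    "method_of_assay": "[H3] Thymidine Incorporation",
    "pattern_of_change": "expression unchanged",
    "additional_agents": [],
}
xml_text = assay_to_xml(example)
restored = xml_to_assay(xml_text)
```

The string in xml_text is ordinary text that could be e-mailed or posted to a repository, and parsing it recovers the same structured record.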

REFERENCES

1. SORACE, J.M., CANFIELD, K., and RUSSELL, S. (1997). Functional bioinformatics: the cellular response database. Frontiers Biosci. 2, a31-36. http://www.bioscience.org/1997/v2/a/sorace/list.htm

2. INMAN, B. (1998). Wherefore warehouse? Byte Magazine 23, 88NA1-88NA10.

3. KIMBALL, R. (1996). The Data Warehouse Toolkit. New York: John Wiley and Sons.

4. McCAFFREY, T.A., CONSIGLI, S., DU, B., FALCONE, D.J., SANBORN, T.A., SPOKOJNY, A.M., and BUSCH, H.L. (1995). Decreased type II/type I TGF-β receptor ratio in cells derived from human atherosclerotic lesions. J. Clin. Invest. 96, 2667-2675.

5. McCAFFREY, T.A., DU, B., CONSIGLI, S., SZABO, P., BRAY, P.J., HARTNER, L., WEKSLER, B.B., SANBORN, T.A., BERGMAN, G., and BUSCH, H.L. (1997). Genomic instability in the type II TGF-β1 receptor gene in atherosclerotic and restenotic vascular cells. J. Clin. Invest. 100, 2182-2188.

6. RINGWALD, M., DAVIS, G.L., SMITH, A.G., TREPANIER, L.E., BEGLEY, D.A., RICHARDSON, J.E., and EPPIG, J.T. (1997). The mouse gene expression database GXD. Semin. Cell Dev. Biol. 8, 489-497.

7. DAVIDSON, D., BARD, J., BRUNE, R., BURGER, A., DUBREUIL, C., HILL, W., KAUFMAN, M., QUINN, J., STARK, M., and BALDOCK, R. (1997). The mouse atlas and graphical gene-expression database. Semin. Cell Dev. Biol. 8, 509-517.

8. WEN, X., FUHRMAN, S., MICHAELS, G.S., CARR, D.B., SMITH, S., BARKER, J.L., and SOMOGYI, R. (1998). Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci. USA 95, 334-339.

9. MACE, S., FLOHR, U., DOBSON, R., and GRAHAM, T. (1998). Weaving a better web. Byte Magazine 23, 58-68.

Address reprint requests to:
Dr. James Sorace
Chief, Blood Bank and Hematology Laboratories
Baltimore VA Medical Center
10 N. Greene St.
Baltimore, MD 21201

Fax: (410) 605-7911
E-mail: jmsorace@ix.netcom.com

Received 20 April 1998/Accepted 8 May 1998