HAVING A BLAST: ANALYZING GENE SEQUENCE DATA WITH BLASTQUEST – WHERE DO WE GO FROM HERE?

William G. Farmerie1, Joachim Hammer2, Li Liu1, and Markus Schneider2

University of Florida, Gainesville, FL 32611, U.S.A.

Abstract

In this paper, we pursue two main goals. First, we describe a new, user-driven tool, called BlastQuest, for managing BLAST query results. BlastQuest provides interactive, Web-enabled query, analysis, and visualization facilities beyond what is possible with current BLAST interfaces. Specifically, the BLAST results, which are in XML format, are extracted, structured, and stored persistently in a relational database to support a series of built-in analysis operations that can be used to select, filter, and order data from multiple BLAST results efficiently and without referring to the original result files. In addition, users have the option to interact with the BLAST data through a mask-oriented, non-SQL query interface.

Despite BlastQuest’s recognized benefits for biologists, its functionality is limited in several important ways. The second goal of this paper is to analyze these shortcomings and describe a new concept based on two main pillars: (1) a Genomics Algebra, which provides an extensible set of high-level genomic data types (GDTs) together with a comprehensive collection of appropriate genomic functions, and (2) a Unifying Database, which allows us to integrate and manage the semi-structured contents of publicly available genomic repositories and to transfer these data into GDT values.

1. Introduction

Biologists are nowadays confronted with two main problems: the exponentially growing volume of biological data of high variety, heterogeneity, and semi-structured nature, and the increasing complexity of biological applications and methods, which suffer from an inherent lack of biological knowledge. As a result, many very important challenges in biology and genomics are now challenges in computing, especially in advanced information management and algorithm design.

The currently most widely used and accepted tool for conducting similarity searches on gene

sequences is BLAST (Basic Local Alignment Search Tool) [1]. BLAST comprises a set of similarity

search programs that employ heuristic algorithms and techniques to detect relationships between gene

sequences and rank the computed ‘hits’ statistically. An essential problem for the biologist is currently the

processing and evaluation of BLAST query results, since a BLAST search yields its result exclusively in

a textual format (e.g., ASCII, HTML, XML). This format has the benefit of being application-neutral but

at the same time impedes its direct analysis. In this paper, we describe a new powerful tool, called

BlastQuest, for managing BLAST results stemming from multiple individual queries. This tool provides the biologist with interactive and Web-enabled query, analysis, and visualization facilities beyond what is possible with current BLAST interfaces. In particular, BLAST results from multiple queries are imported, structured, and stored in a relational database to support a series of built-in analysis operations that can be used to select, filter, group, and order these data efficiently and without referring to the original BLAST result files. In addition, users have the option to interact with the data through a user-tailored, screen-mask-oriented, non-SQL query interface that is based, at a deeper, hidden level, on a well-defined subset of SQL.

1 Affiliation: ICBR Molecular Services Division: DNA Sequencing Core, Interdisciplinary Center for Biotechnology Research.

2 Affiliation: Department of Computer & Information Science & Engineering.

Section 2 elaborates on the current, main challenges in genomics and emphasizes the need for tools

capable of processing BLAST results. Section 3 describes the main features and data analysis options of

the BlastQuest user interface. In Section 4, we view our BlastQuest system from the system architecture

perspective. Section 5 considers desired improvements to BlastQuest. Specifically, we illustrate why new,

sophisticated concepts, tools, and non-standard database technology, which altogether should lead us far

beyond BLAST technology, are indispensable in order to advance biological and genomic research and

progress. We conclude the section with a description of a new data model, language, and architecture for

integrating, processing and querying genomic information. Section 6 summarizes the paper.

2. The Challenge of Genomics and Its Effect on Computer Science

Genomics is a biological discipline focused on understanding living organisms at the level of the whole

genome. It goes beyond a gene-by-gene approach and instead takes a global view of the complete genetic

system. Genomic scientists examine the full catalogue of genes, the processes that control them, gene inter-relationships and inter-dependencies, and how the organism responds to changes in environment through

the expression of genetic information. In order to illustrate the challenges faced by scientists in this field,

we first review the most important concepts underlying gene sequencing.

2.1. Gene Sequencing

DNA is an information storage macromolecule that encodes all of the heritable information passed from

generation to generation of living organisms. In biological systems, genetic information flows from DNA

(genes) to proteins, which are the molecules responsible for mediating or catalysing biological processes.

In other words, inherited information is selectively converted into active biomolecules in response to

changing environmental conditions or demands. The molecular information pathway from gene to protein

goes through an intermediate class of molecules known as messenger RNA (mRNA). The synthesis of

mRNA is known as transcription, and the conversion of mRNA into protein is a process known as

translation. Both transcription and translation are important regulatory steps used to control which

genetic information is expressed, and when and where protein molecules will be made by the cell. The

constellation of mRNA molecules in a cell at any moment represents the expressed genome. The

expressed genome is also referred to as the transcriptome. Identifying all the genes present in the

transcriptome effectively infers the proteins being utilized by the cell (also known as the proteome) and

essentially defines the current biochemical process of the cell. While characterizing the global cellular

proteome would be most direct and informative, this is not possible using currently available technology.

Instead, genomics scientists use high-throughput DNA sequencing to characterize the genome and the

transcriptome. Genome sequencing involves determining the nucleotide sequence of extensive

chromosomal regions or in some cases a complete nucleotide sequence of the whole genome.

Characterization of the transcriptome on the other hand involves full or partial sequence characterization

of mRNA molecules. Partial sequences of mRNA molecules are known as Expressed Sequence Tags

(EST) sequences. While the process of DNA sequencing is routine, nucleotide sequences do not directly

reveal their biological meaning or function. The possible biological function of a gene sequence must be

determined either through direct empirical experimentation or, more often, through inference of gene

function using nucleotide sequence homology searches of gene databases such as GenBank [5].

2.2. Gene Homology Searches

Gene homology searches most often use the BLAST algorithm [1]. The BLAST search engine takes a

query nucleotide sequence and searches it against the database for entries matching the query. The

BLAST algorithm calculates statistical scores (bit scores and e-values) making real sequence homology

matches easier to distinguish from matches that might happen by chance. The BLAST result also contains a short text string summarizing the biological properties of the database match, and unique identification numbers, namely the GI Number (a unique ID for GenBank records) and the Accession Number, which link the matched sequence back to the GenBank database and to additional

information stored in the full database record. Each nucleotide query sequence submitted to the BLAST

search engine returns anywhere from zero (no matching homologous sequence) to hundreds of matching

database records. Results of BLAST searches are usually interpreted by reviewing the text output.
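To make the relationship between the two statistical scores concrete, the following minimal sketch applies the standard relation E = m · n · 2^(−S′), where S′ is the bit score, m the query length, and n the total database length. The function name and the example lengths are illustrative assumptions; actual BLAST programs use edge-corrected effective lengths, so reported e-values will differ somewhat from this approximation.

```python
import math


def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance matches scoring at least `bit_score`.

    Implements E = m * n * 2**(-S'), with S' the bit score, m the query
    length, and n the total length of the searched database.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical example: a 500-nucleotide query against 10**9 database residues.
for bits in (30, 50, 100):
    print(bits, expect_value(bits, 500, 10**9))
```

Each additional bit of score halves the expected number of chance matches, which is why high bit scores correspond to very small e-values.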

However, large-scale genomics projects often generate tens of thousands of nucleotide sequences

and the prospect of manually manipulating, summarizing, and interpreting the thousands of BLAST

output files is impractical at best. Scientists facing this informatics challenge may become discouraged or

might overlook important information because they simply cannot find it. Clearly, methods or tools are

needed to help manage the process of identifying and evaluating unknown nucleotide sequences and the

sometimes-overwhelming information obtained in large-scale nucleotide sequence homology searches.

2.3. BlastQuest as an Answer to Tool Requirements from a Biologist’s Perspective

The Biotechnology Program at the University of Florida through the Interdisciplinary Center for

Biotechnology Research (ICBR) operates a Genomics Core facility with the personnel and equipment to

carry out large-scale DNA sequencing projects. The ICBR Genomics Core provides DNA sequencing

services to campus researchers as well as biological scientists and their collaborators located throughout

the world. A typical DNA sequencing project involves collections of several hundred to tens of thousands

of DNA sequences. As outlined in Section 2.2, nucleotide sequence homology searches are frequently the

first step toward identifying the biological function of unknown nucleotide sequences. Most university-

based investigators lack the computational expertise and infrastructure to initiate and manage BLAST

homology searches on the hundreds or thousands of nucleotide sequences generated by their projects.

Biological scientists want to gain insight from their data without first having to overcome the

management of their data.

With this in mind, there has been a clear need to build a centralized system to manage BLAST

results. The BlastQuest project was initiated to help with the challenge of managing BLAST results and

make this information available in a web-based interface accessible to client researchers located anywhere

with internet access. It began with several modest goals, foremost the delivery of a web-based tool for

viewing, searching, filtering, and summarizing large numbers of BLAST results files. Our solution began

with asking our user community for ideas about the types of analysis they would like to perform. The

result of these interviews drove our initial list of functional requirements for the BlastQuest system:

• A BLAST results viewing tool accessible to research groups at remote locations. Users should have access to their BLAST results from anywhere on the Web, including the ability to share results with colleagues in other locations.

• Selective browsing of BLAST homology search results. As a first step, biologists want a broad overview of the possible biological functions of the many gene sequences represented in their DNA sequence data. The ability to reduce and summarize BLAST data to only the most significant results is initially very informative.

• Search capability on a variety of criteria, such as text terms on biological properties or gene functions. As biological scientists identify their most interesting gene sequences, they need a way to focus on and retrieve only those search results related to the precise topic of interest.

• Selective data filtering on various BLAST statistical criteria such as e-value or bit score. These statistical parameters help discriminate between real sequence homology matches and matches that might happen by chance. There are no hard limits to the significance of these statistical parameters. The user will choose parameters giving either a more relaxed or a more restricted view as needed.

• Selective data grouping on criteria such as GI number, or a defined number of top-scoring results. For example, viewing the three statistically best-scoring results for each query sequence is a convenient way to summarize and browse BLAST results for many query sequences. Grouping query sequences by GI number collects all of the query sequences having sequence homology matches with the same sequences from the database. Two or more query sequences sharing the same database homology match imply that the query sequences are related to each other and suggest that additional analysis of the relationship is warranted.

• Privacy-constrained sharing of results among the scientists. DNA sequence data is often proprietary and may constitute intellectual property. Such data should not be made public until properly protected.

• A convenient interface for getting queries into and BLAST results out of the system. The interface must be attractive and logically implemented so that users will be able to find and use the tools the system provides.
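The grouping requirement can be illustrated with a small sketch that collects, for each GI number, the query sequences that matched it and keeps only the GI numbers shared by two or more queries. The data and the function names are hypothetical and serve only to illustrate the idea; they are not taken from BlastQuest.

```python
from collections import defaultdict

# Hypothetical (query_id, gi_number) pairs, one pair per BLAST hit.
hits = [
    ("q1", 111), ("q2", 111), ("q2", 222), ("q3", 333), ("q4", 222),
]


def group_queries_by_gi(pairs):
    """Map each GI number to the sorted list of query sequences that hit it."""
    groups = defaultdict(set)
    for query_id, gi in pairs:
        groups[gi].add(query_id)
    return {gi: sorted(queries) for gi, queries in groups.items()}


def shared_hits(pairs):
    """Keep only GI numbers matched by two or more query sequences."""
    return {gi: qs for gi, qs in group_queries_by_gi(pairs).items() if len(qs) > 1}


related = shared_hits(hits)
```

In this example, `related` pairs q1 with q2 (through GI 111) and q2 with q4 (through GI 222), which are the candidates for further analysis of their relationship.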

To the best of our knowledge, the functionalities of WebBLAST 2.0 [3] and the Ontario Center for

Genomic Computing OCGC BLAST [2] match some of our requirements but fall short in several

important aspects. For example, there is no provision in WebBLAST for applying global filtering and

grouping operations, or a mechanism for searching all BLAST results on user-supplied text terms. The

OCGC BLAST results manager appears closest to BlastQuest in functionality, allowing selected viewing

and data filtering on up to five criteria. However, OCGC BLAST is not available to genomics scientists

outside of the Province of Ontario, Canada. The BlastQuest Project is designed to meet our immediate, specific requirements and, most importantly, to provide a platform that we can freely modify to test our notions

of Genomics Algebra (Section 5.2), an advanced query language for biological information.

3. BlastQuest User Interface

BlastQuest simplifies large-scale analysis in gene sequencing projects by providing scientists with a

means to filter, summarize, sort, group, and search BLAST output data. BlastQuest extracts gene data

from XML files, which are returned as the result of homology searches from BLAST engines, and stores

them in an underlying relational database. This allows the user to benefit from well-known relational

concepts like transactions, controlled sharing, and query optimization. Finally, BlastQuest also allows

users to perform homology searches of their proprietary sequence data against public domain data, such as the NCBI databases.

The most frequently used user operations are hard-wired in the user interface and accessible via

command buttons. Their execution rests on SQL that is hidden from the user. To enable data analysis that

is not directly supported by the built-in user interface operations, BlastQuest offers a more flexible, mask-oriented, and especially non-SQL query interface since biologists object to SQL due to its complexity and

low-level abstraction. This interface essentially allows the user to construct complex boolean expressions

as selection conditions which include logical operators and substring search predicates. The underlying

query execution is based on parameterized SQL queries, which are instantiated and automatically

translated into executable SQL code by the DBMS.
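The idea of a parameterized query can be sketched as follows: the SQL template is fixed in advance, and only the user-supplied value is bound at execution time. This is a minimal illustration under stated assumptions. It uses Python's sqlite3 module as a stand-in for the MySQL backend, and the table and column names (hit, query_id, hit_def, evalue) are hypothetical rather than the actual BlastQuest schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hit (query_id TEXT, hit_def TEXT, evalue REAL)")
conn.executemany(
    "INSERT INTO hit VALUES (?, ?, ?)",
    [
        ("q1", "reverse transcriptase", 1e-40),
        ("q1", "hypothetical protein", 0.5),
        ("q2", "actin", 1e-12),
    ],
)

# The query text never changes; the '?' placeholder is bound by the DBMS.
EVALUE_FILTER = (
    "SELECT query_id, hit_def, evalue FROM hit WHERE evalue < ? ORDER BY evalue"
)


def hits_below(threshold):
    """Instantiate the template with a user-chosen expect value threshold."""
    return conn.execute(EVALUE_FILTER, (threshold,)).fetchall()


rows = hits_below(1e-5)
```

Because the user's input is passed as a bound parameter and never spliced into the SQL text, the interface can hide SQL from the user without exposing the database to malformed input.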

Another interesting feature of BlastQuest is that it can be linked to the so-called SMART (Simple

Modular Architecture Research Tool [6]). The integration of BlastQuest output into SMART is in direct

response to the desire by scientists for new tools and interfaces capable of accessing and integrating

external resources into one system. In Section 5, we describe our plans to develop a Genomics Algebra

query software that operates on a unifying database whose contents can include data from existing

genomics repositories.

Finally, BlastQuest makes it possible to manage BLAST data on a per-project or per-user basis using the security features of the underlying database, while at the same time allowing controlled sharing of this data in order to support collaboration. A startup page facilitates the extraction of gene data from original,

external BLAST files into a MySQL database. Due to the large volume of data, a simple page-by-page

viewing is not helpful to the user but selection mechanisms are needed to find the data of interest. The

overall strategy is to apply a sequence of consecutive operations on the data to gradually approach the

data of interest. In the following we describe the main user interface features for doing this.

The first feature is to let BlastQuest create a summary page for selected sequence segments.

Users require this high-level summarization of their sequences because the volume of BLAST output data for large-scale sequencing projects is well beyond simple page-by-page viewing. This summary page gives

an abbreviated overview of each query sequence with possible function. For each query DNA sequence,

only the sequence database match with the best statistical score calculated by BLAST is displayed with a

summary of important biological information, usually text terms describing a gene or protein name, and

sometimes including possible biological functions. The summary page also contains, for each matching

sequence, the GenBank sequence ID, gene definition, and expect value.

The second feature is user-controlled selection. Unfortunately, the statistically calculated ranking

of matching sequences provided by BLAST does not necessarily correspond to the biological knowledge

and experience of the user. The user may apply their biological knowledge or insight to tag a different

result as better for expressing the possible function of the query sequence. By manually selecting a

specific query result, the user can get additional information such as the percentage of identity, or

alignment of the query sequence and the matching sequence. Even a detailed display of sequence

alignments is available, which is identical to the free-text formatted BLAST result to which most BLAST

users are accustomed.

The third feature is related to built-in selection facilities, which can be activated by a mouse-click

and operate on all query sequences and their query results. Examples are the displays of hits with expect

values less than a particular threshold by selecting from a pull-down menu (e.g., shown later in Figure 2),

or restricting the display to the best n database matches for each query sequence. All filtering facilities

together give researchers the ability to adjust their analysis process to the particular research focus,

project status, and prior knowledge of query sequences, to reduce the original BLAST result to a

manageable size, and especially to remove results of low quality.

The fourth feature comprises ordering and grouping functions. These help the user to discover

relationships among genes or expression patterns. For example, when more than one sequence or contig is derived from different regions of the same mRNA or gene, grouping on GI number will cluster these related sequences, identifying them for further analysis of their relationship. One of the new

features we are implementing is grouping sequences on UniGene ID. This is an additional step to identify

EST sequences that come from gene orthologs or gene paralogs3. Another example is that biologists

sometimes want to know which sequences have their functions well resolved by a BLAST search, and which

have not. By ordering query sequences by the expect values of top scoring BLAST hits, users identify

sequences with high-quality hits, sequences with only low-quality hits, or even sequences having no hit.

This step rapidly classifies sequences for different types of additional analysis. For example, if the user

asks for grouping on GI number or query sequence, related sequences and their BLAST results are

grouped together rather than appear randomly or out of context. This is also a proven method to identify

EST sequences that come from different regions of the same mRNA, gene orthologs, or gene paralogs4.
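The ordering step described above can be sketched as follows. Query sequences are sorted by the expect value of their best hit and then assigned to one of three classes. The cutoff of 1e-10 is an arbitrary illustrative assumption, since the appropriate threshold depends on the project, and the function and class names are hypothetical.

```python
import math


def classify_queries(best_evalue, cutoff=1e-10):
    """Order queries by best expect value and split them into three classes.

    `best_evalue` maps a query id to the expect value of its top-scoring
    hit, or to None when the BLAST search returned no hit at all.
    """
    def sort_key(query_id):
        evalue = best_evalue[query_id]
        return math.inf if evalue is None else evalue  # no-hit queries go last

    classes = {"high_quality": [], "low_quality": [], "no_hit": []}
    for query_id in sorted(best_evalue, key=sort_key):
        evalue = best_evalue[query_id]
        if evalue is None:
            classes["no_hit"].append(query_id)
        elif evalue <= cutoff:
            classes["high_quality"].append(query_id)
        else:
            classes["low_quality"].append(query_id)
    return classes


example = classify_queries({"q1": 2e-50, "q2": 0.3, "q3": None, "q4": 1e-20})
```

Here q1 and q4 fall into the high-quality class, q2 into the low-quality class, and q3 into the no-hit class.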

The fifth feature enables user-defined, mask-oriented, non-SQL queries. This feature refers to the

problem that the built-in functionality of BlastQuest is sometimes insufficient for specific analysis tasks.

For example, if a user wants to find out which sequences are homologous to genes with reverse

transcriptase function, which is not hypothetical but is proved by empirical data, BlastQuest does not have

built-in selection facilities for this specific query. In fact, we recognize that it is impossible to exhaust

such ad hoc queries. On the other hand, we cannot expect our users to learn SQL in order to query their

results. To solve this problem, BlastQuest provides a Web page which allows the user to interactively and

textually construct complex boolean filter expressions which may include logical operators like “AND”

and “OR” as well as substring search predicates like “Contains” or “Not Contains.” A search field (like

“Hit Definition” in Figure 1) against which the Boolean expression is evaluated can be selected from a drop-down menu. Figure 1 shows two textual representations of the same Boolean expression under construction. The second representation expresses the condition in a form closer to natural language. The first representation is a test mode that translates the ‘natural language’ condition into SQL; in a later version the SQL test mode will disappear. The construction of the Boolean expression, and hence of the query, is completed by clicking the “Commit” button. BlastQuest then assembles the SQL query, sends it to the MySQL driver, receives the results, and displays them. In the example in Figure 1, the user is specifying a query that will produce matches containing the word ‘reverse’ but not ‘hypothetical’.

3 Gene orthologs are genes that are derived by divergent evolution, such as the α-hemoglobin gene from human and from mouse. Gene paralogs are genes that are duplications, such as α-hemoglobin and β-hemoglobin.

4 See footnote 3.

Figure 1: User-defined query construction tool.
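One plausible way to translate such a mask-built condition into SQL is sketched below for the ‘reverse’ but not ‘hypothetical’ example. This is an illustration under stated assumptions, not the BlastQuest implementation. It uses sqlite3 in place of MySQL, and the column name hit_def, the whitelist dictionaries, and the function build_filter are hypothetical. Conditions are joined as a flat list without parentheses, so the usual SQL precedence of AND over OR applies.

```python
import sqlite3

# Hypothetical whitelists: mask labels on the left, SQL fragments on the right.
FIELDS = {"Hit Definition": "hit_def"}
PREDICATES = {"Contains": "LIKE", "Not Contains": "NOT LIKE"}
CONNECTIVES = ("AND", "OR")


def build_filter(conditions):
    """Translate mask entries into a WHERE clause plus bound parameters.

    `conditions` is a list of (connective, field, predicate, term) tuples;
    the connective of the first entry is ignored.
    """
    parts, params = [], []
    for index, (connective, field, predicate, term) in enumerate(conditions):
        clause = f"{FIELDS[field]} {PREDICATES[predicate]} ?"
        if index > 0:
            if connective not in CONNECTIVES:
                raise ValueError(f"unsupported connective: {connective!r}")
            clause = f"{connective} {clause}"
        parts.append(clause)
        params.append(f"%{term}%")  # substring match on the search term
    return " ".join(parts), params


where, params = build_filter([
    (None, "Hit Definition", "Contains", "reverse"),
    ("AND", "Hit Definition", "Not Contains", "hypothetical"),
])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hit (hit_def TEXT)")
conn.executemany("INSERT INTO hit VALUES (?)", [
    ("reverse transcriptase",),
    ("hypothetical reverse transcriptase",),
    ("actin",),
])
matches = conn.execute(f"SELECT hit_def FROM hit WHERE {where}", params).fetchall()
```

Only whitelisted column names and operators are interpolated into the SQL text, while the user's search terms are always bound as parameters.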

The sixth feature to be mentioned is interoperability between BlastQuest and other biological

information systems. Creating links to other systems in order to make use of their specific functionality

becomes more and more important for the biologist. In the context of BlastQuest, after having examined

the query sequences and their probable identities, we wish to derive the protein sequences encoded by the

nucleotide sequence. Rather than translate the nucleotide sequence directly, BlastQuest takes the ‘best’

match, which represents a homologous gene closely related to the unknown query sequence, and retrieves

the corresponding protein sequence as translated by BLAST. After grouping search results by query

sequence (e.g., the best five statistical matches) the user is presented with the screen shown in the top half

of Figure 2. Next, the user checks the ‘amino conversion’ box at the top right of the screen, and the check

box adjacent to the query sequence they wish to translate into an amino acid sequence. When the user

clicks the ‘Details’ button, the ‘Sequence Analysis’ screen shown in the bottom half of Figure 2 appears.

The user may submit the derived protein sequence to the SMART protein analysis Web site by simply

clicking on the amino acid sequence. Results of the SMART analysis will appear in the browser window.

Figure 2: Filtering and grouping BLAST results on a project basis.

The seventh and final major feature is the capability to perform BLAST searches against the users’

own sequence database. This functionality gives the user the ability to query their own sequence data with

a specific nucleotide or protein sequence. If a user obtains an interesting sequence from other resources,

an internal BLAST search helps to find out whether s/he owns similar sequences. If similar sequences are found, the corresponding clone is identified and retrieved from the user’s clone bank, where it may be used

for further experiments. In the example shown in Figure 3, the user pasted the query sequence into the top

text area. The interface also allows input of a sequence file location for uploading. From drop-down

menus, the user chooses from among different BLAST programs and different local target databases that

s/he owns or for which s/he has “guest” privilege. BlastQuest also offers a choice of homology matrix via a drop-down menu. However, BlastQuest does not allow users to specify other parameters, such as expect value threshold, word size, gap-open penalty, and gap-extension penalty. The rationale behind this is that we allow retrieval of all hits with expect values less than 10. Users may filter, order,

group or perform other manipulations later by using the built-in functionalities of the BlastQuest system.

After the user clicks the “BLAST” button, the query sequence is submitted with the selected parameters. For an individual BLAST query, the result is displayed in HTML format. If the user has “owner” privilege, s/he can choose either to parse and store this BLAST output persistently in the MySQL database or to delete it when the session ends. For batch queries, BLAST results are parsed and automatically stored in the MySQL database for later examination and analysis.

Figure 3: Internal BLAST search against user databases.

All operations described here can be combined to analyze data generated in a larger project. For

example, one may use BlastQuest to retrieve hits with expect value lower than 0.05, followed by grouping

on gene ID, and only display the top five matching hits per GI number (as illustrated in Figure 2).
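A combination of this kind might look as follows in a minimal sketch: hits are filtered by expect value in SQL, ordered by GI number and expect value, and then truncated to the best five per GI number in application code. The sketch uses sqlite3 as a stand-in for MySQL, and the table layout, sample data, and function name are hypothetical.

```python
import sqlite3
from itertools import groupby, islice

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hit (query_id TEXT, gi INTEGER, evalue REAL)")
# Seven significant hits on GI 111, with expect values 1e-2 down to 1e-8.
rows = [(f"q{i}", 111, 10.0 ** -(i + 2)) for i in range(7)]
rows += [("q7", 222, 1e-3), ("q8", 222, 0.2), ("q9", 333, 5.0)]
conn.executemany("INSERT INTO hit VALUES (?, ?, ?)", rows)


def top_hits_per_gi(max_evalue=0.05, n=5):
    """Filter by expect value, group by GI number, keep the n best per group."""
    cursor = conn.execute(
        "SELECT gi, query_id, evalue FROM hit WHERE evalue < ? ORDER BY gi, evalue",
        (max_evalue,),
    )
    return {
        gi: list(islice(group, n))
        for gi, group in groupby(cursor, key=lambda row: row[0])
    }


summary = top_hits_per_gi()
```

The per-group truncation is done outside SQL here so that the sketch does not depend on window functions, which not every DBMS version provides.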

4. Architectural Overview of the BlastQuest System

Figure 4 depicts a conceptual overview of the 3-tiered BlastQuest system architecture. Tier 1 contains the

database backend, which is implemented using the MySQL5 RDBMS. Since BlastQuest is mainly a

proof-of-concept prototype rather than a production-strength system, our choice for a DBMS was

governed by availability of source code and platform compatibility rather than performance and richness

in features. The database backend stores and manages BLAST and PHRAP (Phragment Assembly

Program) [4] results, which are represented as XML and ACE6 (ArChivE) documents and whose structure

has been mapped into the relations Query, Assembly, Hit, and Query_Hit shown later in Figure 5.

5 See http://www.mysql.com/.

6 See http://bozeman.mbt.washington.edu/phrap.docs/phrap.html for an example and documentation on the format.

Figure 4: Conceptual overview of the BlastQuest system architecture.

For each query sequence submitted to the BLAST server (shown in the upper right-hand corner of Fig. 4),

the relation Hit stores detailed hit information, such as hit definition, expect value, bit score, pairwise

alignments, and so forth. For queries that do not produce a match in the homology search, the fields are

marked as NULL. From a biological point of view, sequences with no homologous sequence match often

lead to new genes and are analyzed in a different manner (outside of BlastQuest). In addition, the

homology search criteria for each BLAST search, such as the BLAST program name, database name,

matrix, and date, are stored in the Query_Hit table. These parameters are important to users because, for the

same query sequence, BLAST generates different results based on different criteria. For example,

BLASTN results and BLASTX results may indicate different functions for the same query sequence. In

addition, the same BLAST search on different days may generate different hits since BlastQuest’s

BLAST server is regularly updated with the latest version of the NCBI data files. The MySQL database

also stores information about how related gene segments are assembled into single consensus DNA

sequences by PHRAP, which is external to BlastQuest and invoked before the DNA sequence results are

submitted to BLAST. PHRAP outputs its results in an ACE file, which is mapped into the relation

Assembly. If the user considers the results of the BLAST search interesting, s/he may want to extract

the physical clones from which the specific query sequences are generated or assembled. This is possible

by joining the Assembly and Query tables via the “qid” foreign key to retrieve all segments and

corresponding clone names that are clustered into a specific query sequence.
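The join described above can be sketched as follows. Only the relation names Query and Assembly and the key qid are taken from the text. The remaining columns (name, segment, clone_name) and the sample rows are hypothetical, and sqlite3 again stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Query (qid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Assembly (
    segment    TEXT,
    clone_name TEXT,
    qid        INTEGER REFERENCES Query(qid)
);
INSERT INTO Query VALUES (1, 'contig_001'), (2, 'contig_002');
INSERT INTO Assembly VALUES
    ('seg_a', 'clone_17', 1),
    ('seg_b', 'clone_42', 1),
    ('seg_c', 'clone_08', 2);
""")


def clones_for_query(qid):
    """Return the segments and clone names assembled into one query sequence."""
    return conn.execute(
        "SELECT a.segment, a.clone_name "
        "FROM Assembly a JOIN Query q ON a.qid = q.qid "
        "WHERE q.qid = ? ORDER BY a.segment",
        (qid,),
    ).fetchall()
```

For example, `clones_for_query(1)` lists the two segments, with their clone names, that were clustered into the first consensus sequence.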

Figure 5: Relational Schema of the BlastQuest MySQL database.

The database also maintains information about users and their corresponding gene sequencing

projects, which are stored in the three remaining relations, User, Project, and User_Proj. The

relation User_Proj represents the many-many relationship between scientists and the projects to which

they belong. Since all sequence data is organized by project (using the PID foreign key in relation

Query), BlastQuest provides control over which user has access to which data.

Tier 2 contains the multi-threaded BlastQuest application program, which is divided into five

modules: the BLAST Server, which is used to conduct BLAST searches against NCBI data as well as internal data owned by the users; the client interface module, which handles communication with the Web clients in Tier 3; the two loader modules for extracting and loading data from the XML and ACE input files into the database; and the SQL constructor for assembling the queries to be sent to the database. The BLAST Server is downloadable freeware from NCBI. The client interface module is implemented as a series of JavaServer Pages (JSPs) that execute inside a Tomcat server. The remaining three modules are

implemented as Java classes. We briefly highlight the functionality of each module.

BlastQuest maintains a local version of NCBI’s “NR” database, which is updated monthly with

new releases and can be searched with a local copy of the BLAST server (labeled “Blastable Data” in

Figure 4). In addition to public domain data, this local BLAST database also contains blastable data from

each user’s proprietary query sequences. The conversion of query sequence data into blastable data is

done using the “formatdb” program provided with NCBI’s BLAST search engine.

The XML loader parses each BLAST result file into a Document Object Model (DOM)

representation using the Xerces Java Parser 1.4.4. The XML loader then extracts the relevant data items

needed to populate the Hit and Query_Hit tables. Specifically, the loader module contains several

12

Underlined attributes denote the unique identifiers (primary keys) for each relation. Attributes with a superscript are foreign keys (superscript denotes the referenced relation).

classes whose data structures correspond to the tables in the database schema. When the loader collects

data from an XML file, it populates the appropriate class objects with the extracted values. At the end, the

objects are passed to the SQL constructor, which creates the SQL commands to insert the values into the

relational database. The ACE loader works in a similar fashion. However, since there was no standard

ACE parser available, we created our own. Our event-based parser detects the presence of certain

keywords in the ACE input file and extracts the information associated with that keyword. Other, more

efficient loading options are possible, for example, by using the bulk loading utilities of the DBMS.

However, by making our loader modules part of the Web-based middleware, users can load BLAST

results into their BlastQuest accounts from anywhere on the Web.
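The loading path described above (DOM parse, populate loader objects, hand them to the SQL constructor) can be sketched roughly as follows. This is a minimal illustration in Python rather than the Java/Xerces implementation used in BlastQuest. The element names Hit, Hit_id, Hit_def, and Hit_len follow NCBI's BLAST XML output, but the sample values, the three-column hit table, and the use of SQLite in place of MySQL are assumptions made only so the example runs on its own.

```python
import sqlite3
import xml.dom.minidom
from dataclasses import dataclass

# A tiny fragment in the style of NCBI BLAST XML output; the values are made up.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations><Iteration><Iteration_hits>
    <Hit>
      <Hit_id>gi|0000001</Hit_id>
      <Hit_def>hypothetical protein A</Hit_def>
      <Hit_len>120</Hit_len>
    </Hit>
    <Hit>
      <Hit_id>gi|0000002</Hit_id>
      <Hit_def>hypothetical protein B</Hit_def>
      <Hit_len>95</Hit_len>
    </Hit>
  </Iteration_hits></Iteration></BlastOutput_iterations>
</BlastOutput>"""


@dataclass
class Hit:
    """Loader class mirroring an assumed hit table (hit_id, definition, length)."""
    hit_id: str
    definition: str
    length: int


def text_of(parent, tag):
    """Return the text content of the first <tag> element below parent."""
    node = parent.getElementsByTagName(tag)[0]
    return "".join(c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE)


def parse_hits(xml_text):
    """Build a DOM tree and populate one Hit object per <Hit> element."""
    dom = xml.dom.minidom.parseString(xml_text)
    return [
        Hit(text_of(h, "Hit_id"), text_of(h, "Hit_def"), int(text_of(h, "Hit_len")))
        for h in dom.getElementsByTagName("Hit")
    ]


def load_hits(conn, hits):
    """Play the role of the SQL constructor: issue parameterized INSERT commands."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hit "
        "(hit_id TEXT PRIMARY KEY, definition TEXT, length INTEGER)"
    )
    conn.executemany(
        "INSERT INTO hit (hit_id, definition, length) VALUES (?, ?, ?)",
        [(h.hit_id, h.definition, h.length) for h in hits],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
load_hits(conn, parse_hits(SAMPLE_XML))
```

Keeping the parse step and the insert step separate mirrors the division between the loader modules and the SQL constructor: only load_hits needs to know anything about the database.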

The SQL constructor is the gateway between the database and the middleware. It connects to the

MySQL relational database engine via a JDBC driver; JDBC is part of the Java Development Kit and is

supported as part of MySQL’s call-level interface. In addition to creating the SQL load commands, the

SQL constructor translates commands from the user interface into SQL queries, which can be executed by

MySQL. Analogously, it processes the resulting record sets and creates the Java objects that are used by

the client interface to generate the Web pages.
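To illustrate the translation step, the following sketch turns the fields of a filled-in query mask into a parameterized SELECT statement. It is written in Python for brevity and is not the BlastQuest code. The table name query_hit, the column names, and the comparison operator attached to each column are hypothetical.

```python
# Columns the query mask may filter on, with the comparison each one uses.
# These names are assumptions for illustration, not BlastQuest's actual schema.
ALLOWED = {"e_value": "<=", "score": ">=", "definition": "LIKE"}


def build_query(mask, order_by=None):
    """Translate filled-in mask fields into a parameterized SELECT statement."""
    clauses, params = [], []
    for column, value in mask.items():
        if column not in ALLOWED:
            raise ValueError(f"unknown mask field: {column}")
        # Only whitelisted column names reach the SQL text; values stay parameters.
        clauses.append(f"{column} {ALLOWED[column]} ?")
        params.append(value)
    sql = "SELECT * FROM query_hit"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if order_by is not None:
        if order_by not in ALLOWED:
            raise ValueError(f"unknown sort column: {order_by}")
        sql += f" ORDER BY {order_by}"
    return sql, params


sql, params = build_query({"e_value": 1e-5, "definition": "%kinase%"}, order_by="score")
```

Because user-supplied values are passed as parameters and never spliced into the SQL text, a non-SQL mask interface of this kind does not expose the database to injected query fragments.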

Another important function of the SQL constructor is the management of connections to the

relational database engine. Each time a query is to be processed, a database connection is required.

Establishing a JDBC connection with a relational database is relatively time-consuming and may take as

much as two seconds (see footnote 7). Since it is anticipated that queries from multiple clients will need to be

simultaneously supported, there is a need to manage the limited number of database connections

carefully. Because of the overhead required to establish a new database connection, the SQL constructor

maintains a “pool” of open database connections. Each time a client needs to process a query, it requests

a connection from the pool, eliminating in most cases the need to re-establish and wait for a database

connection. It is important to note that BlastQuest supports only read queries, avoiding the need to

support updates and the overhead of transaction management inside the MySQL database.
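The pooling idea can be sketched in a few lines. The version below is a simplified Python illustration under stated assumptions, not the SQL constructor itself: the class name ConnectionPool, the pool size, and the use of SQLite in place of a MySQL/JDBC connection are all choices made so the example runs without a database server.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Keep a fixed number of open connections and hand them out on request."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())  # pay the connection cost once, up front

    @contextmanager
    def connection(self, timeout=None):
        # Blocks while every pooled connection is in use by another client.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return the connection instead of closing it


# SQLite stands in for the MySQL connection here (an assumption of this sketch).
pool = ConnectionPool(
    lambda: sqlite3.connect(":memory:", check_same_thread=False), size=2
)

with pool.connection() as conn:
    answer = conn.execute("SELECT 1 + 1").fetchone()[0]
```

A blocking queue gives the two properties the text asks for: a request normally receives an already open connection, and the number of simultaneous connections never exceeds the pool size.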

Tier-3 is a (thin) client interface, which is implemented as dynamic Web pages displayed inside a

Web browser. Client-side processing is limited to validation of user input, submitting requests to the

BlastQuest application and displaying HTML results.

5. Planned Improvements

The BlastQuest system described above has been successfully employed and tested by scientists in the

gene-sequencing lab at ICBR and by several of their clients for over six months; the feedback from the

7 While two seconds does not appear to be a long time, consider that most commercial database engines can process queries against tens of thousands of tuples in less than one second!

clients has been positive. However, we also received important feedback regarding the limitations of the

current system. For example, there is a desire for additional, more sophisticated analysis functionality and

the ability to integrate data from external repositories.

5.1. User Requirements

As a starting point for the development of a more sophisticated management system for genomics data,

we have identified all of the biological needs that are currently not supported in BlastQuest (see footnote 8). In the

interest of space, we provide the readers with an overview of the most important ones:

1. The ability to query, search and analyze data from external genomics repositories (in addition

to those accessible through BLAST). An extension of this is the ability to integrate related

results from multiple repositories in a meaningful manner, for example, to fill in missing values

or correct inconsistencies that exist across different repositories.

2. A representation of the genomics data that is semantically richer than the current textual

representation provided by BLAST and most other repositories. For example, BLAST query

results are more or less collections of textual strings and numerical values and are not expressed

in biological terms such as genes, proteins, and nucleotide sequences. As a result, BLAST and

BlastQuest operations are limited to basic string manipulation (e.g., shortest common substring)

rather than high-level, gene-specific operations such as transcribe, translate, etc.

3. Integration of new specialty evaluation functions. Being able to evaluate data from BLAST

results as well as self-generated data with publicly available methods alone is insufficient. Thus, it

must be possible to create, use, and integrate user-defined functions that are capable of

operating on both kinds of data into the analysis interface of the tool. However, this requires an

extensible database management system, query language, and user interface, which is currently

not part of BlastQuest.

4. The ability to create and store new knowledge. A biologist generates new biological data from

their own research or experimental work, for example, by analyzing BLAST results. Hence,

scientists have expressed a strong desire to store and manage this newly created knowledge

together with the source data. For example, there is a need to annotate data in BLAST results

and to store the annotations persistently so that they can be re-used (e.g., by linking a record in

a new BLAST result to an existing annotation in the repository).

8 In fact, based on a survey of the related literature, we have found that most of the existing integration and management systems for genomics data such as K2/KLEISLI (http://db.cis.upenn.edu/K2/), Tambis (http://imgproj.cs.man.ac.uk/tambis/index.html), SRS (http://srs.ebi.ac.uk), etc. only support some of the functionality described in this list.

5. Support for controlled collaboration among multiple scientists. It is of great value for scientists

to share some of their findings in a controlled manner with colleagues. For example, it

should be possible among the users of a genomics repository system to grant write access to

some of the annotations but read-only or no access to others.

6. The ability to connect DNA sequence identities inferred from BLAST results with gene-

associated biological functions described through the efforts of the Gene Ontology (GO)

Consortium [7]. This type of cross-referencing is the best way to describe the functionality of a

newly discovered gene. This functionality will help biologists to annotate and catalog the

genes by universally accepted GO IDs and hence help them to discover new genes.

5.2. The New Approach

Based on the previous list of needs, which illustrates the complexity of the information-related challenges

that confront biologists and computer scientists, we decided to redesign our current system from the

ground up. For example, providing users with a semantically rich representation of the genomics data as

well as support for specialty functions (needs 2 and 3 above) requires the design of a new data type

system and operations, which must be integrated with the underlying database management system for

efficient query processing and persistence. As another example, access to multiple genomics repositories

(need 1) requires the ability to extract, translate, and reconcile heterogeneous data from multiple sources

and store the integrated result using a global schema, which has been constructed either from the local

schemas of the sources or based on general knowledge of the domain.

In response to our requirements analysis, we are developing a new genomics integration and

management system that is based on two fundamental pillars: (1) A Genomics Algebra software system to

provide an extensible set of high-level genomic data types (GDTs) (e.g., gene, protein, nucleotide)

together with a comprehensive collection of appropriate genomic functions (e.g., translate, transcribe,

decode). (2) A Unifying Database to manage the semi-structured contents of publicly available genomic

repositories and to transfer these data into GDT values. These values then serve as arguments of

Genomics Algebra operations, which can be embedded into a DBMS query language.

5.3. Genomics Ontology

A precondition for a successful construction of our Genomics Algebra is the design of an ontology for

molecular biology and bioinformatics. By ontology, we are referring to “a specification of a

conceptualisation.” That is, in general, an ontology is a description of the concepts and relationships that

define an application domain. Applied to bioinformatics, an ontology is a “controlled vocabulary for the

description of the molecular functions, biological processes and cellular components of gene products”.

An obstacle to its unique definition is that the multitude of heterogeneous and autonomous genomic

repositories has induced terminological differences (synonyms, aliases, formulae), syntactic differences

(file structure, separators, spelling) and semantic differences (intra- and interdisciplinary homonyms). The

consequence is that data integration is impeded by different meanings of identically named categories,

overlapping meanings of different categories, and conflicting meanings of different categories. Naming

conventions of data objects, object identifier codes, and record labels differ between databases and do not

follow a unified scheme. Even the meaning of important high-level concepts (e.g., the notion of gene or

protein function) that are fundamental to molecular biology is ambiguous.

If the user queries a database with such an ambiguous term, until now (s)he has full responsibility

to verify the semantic congruence between what (s)he asked for and what the database returned. An

ontology helps here to establish a standardized, formally and coherently defined nomenclature in

molecular biology. Each technical term has to be associated with a unique semantics that should be

accepted by the biological community. If this is not possible, because different meanings or

interpretations are attached to the same term but in different biological contexts, then the only solution is

to coin a new, appropriate, and unique term for each context. Uniqueness of a term is an essential

requirement to be able to map concepts into the Genomics Algebra.

Consequently, one of our current research efforts and challenges is to develop a comprehensive

ontology, which defines the terminology, data objects and operations including their semantics that

underlie genome sequencing. Besides proposing such a genomic ontology, a main challenge is to find or

even devise an appropriate formalism for its unique specification.

5.4. Genomics Algebra

The Genomics Algebra is the derived, formal, and executable instantiation of the resulting genomic

ontology. Entity types and functions in the ontology are represented directly using the appropriate data

types and operations supported by our Genomics Algebra. This algebra has to fulfill two main tasks.

First, it has to serve as an interface between biologists, who use this interface, and computer scientists,

who implement it. An essential feature of the algebra is that it incorporates high-level biological

terminology and concepts. Hence, it is not based on the low-level concepts provided by database

technology. Second, as a result, this high-level, domain-specific algebra will greatly facilitate the

interactions of biologists with the genomic information stored in our Unifying Database (see Section 5.5), which

incorporates at least the knowledge of the genome repositories. To our knowledge, no such algebra

currently exists in the field of bioinformatics.

Our Genomics Algebra is a domain-specific algebra incorporating a type system for biological data.

Its sorts, operators, domains, and functions will be derived from the genomic ontology developed in the

first step. The sorts are called genomic data types (GDTs) and the operators genomic operations. To

illustrate the concept, we assume the following, very simplified signature:

sorts
    gene, primaryTranscript, mRNA, protein

ops
    transcribe: gene → primaryTranscript
    splice: primaryTranscript → mRNA
    translate: mRNA → protein

This “mini algebra” contains four sorts, or genomic data types, for genes, primary transcripts, messenger

RNA, and proteins, as well as three operators: transcribe, which for a given gene returns its primary

transcript; splice, which for a given primary transcript identifies its messenger RNA; and translate, which

for a given messenger RNA determines the corresponding protein. We can assume that these sorts and

operators have been derived from our genomic ontology. Hence, the high-level nomenclature of our

genomic ontology is directly reflected in our algebra. The algebra now allows us to (at least) syntactically

combine different operations by (function) composition. For instance, given a gene g, we can

syntactically construct the term translate(splice(transcribe(g))), which yields the protein determined by g.
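To make the composition translate(splice(transcribe(g))) concrete, the simplified signature can be sketched as typed functions. This is an illustration in Python under strong simplifying assumptions and not the Genomics Algebra implementation: a gene is modeled as a coding-strand DNA string with known exon intervals, splicing simply concatenates the exons, translation starts at the first base and stops at the first stop codon, and the example sequence is made up. The codon table is the standard genetic code.

```python
from dataclasses import dataclass
from itertools import product
from typing import Tuple

# Standard genetic code, built from the conventional UCAG codon ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


@dataclass(frozen=True)
class Gene:
    sequence: str                       # coding-strand DNA (assumption of this sketch)
    exons: Tuple[Tuple[int, int], ...]  # half-open (start, end) intervals


@dataclass(frozen=True)
class PrimaryTranscript:
    sequence: str
    exons: Tuple[Tuple[int, int], ...]


@dataclass(frozen=True)
class MRNA:
    sequence: str


@dataclass(frozen=True)
class Protein:
    residues: str


def transcribe(g: Gene) -> PrimaryTranscript:
    """gene -> primaryTranscript: rewrite the coding strand in the RNA alphabet."""
    return PrimaryTranscript(g.sequence.replace("T", "U"), g.exons)


def splice(t: PrimaryTranscript) -> MRNA:
    """primaryTranscript -> mRNA: keep the exons and drop everything between them."""
    return MRNA("".join(t.sequence[start:end] for start, end in t.exons))


def translate(m: MRNA) -> Protein:
    """mRNA -> protein: read codons from position 0 until a stop codon."""
    residues = []
    for i in range(0, len(m.sequence) - 2, 3):
        amino_acid = CODON_TABLE[m.sequence[i:i + 3]]
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return Protein("".join(residues))


# Made-up gene: exon "ATGGCC", intron "GTAAGT", exon "TTTTAA".
g = Gene("ATGGCCGTAAGTTTTTAA", exons=((0, 6), (12, 18)))
p = translate(splice(transcribe(g)))
```

Because each operator accepts exactly the sort produced by the previous one, a term such as translate(transcribe(g)) would be rejected by a static type checker, which is the kind of syntactic guarantee the signature is meant to provide.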

The Genomics Algebra develops its full expressiveness and usability if it is designed and integrated

as a collection of abstract data types (ADTs) into the type system and query language of a database

system (Section 5.6). ADTs encapsulate their implementation and thus hide it from the user or another

software component like the DBMS. From a modeling perspective, the DBMS data model and the

application-specific algebra or type system are separated. This enables the software developer to focus on

the application-specific aspects embedded in the algebra. Consequently, this procedure supports

modularity and conceptual clarity and permits the reusability of an algebra for different DBMS data

models. It requires extensibility mechanisms at the type system level in particular and at all levels of the

architecture of a DBMS in general, starting from user interface extensions down to new, external

representation and index structures. From an implementation point of view, ADTs support modularity,

information hiding, and the exchange of implementations without changing the interface.

5.5. Unifying Database

The Unifying Database is the second pillar of our integrating approach. By Unifying Database, we are

referring to a data warehouse, which integrates data from multiple genomic repositories. We have chosen

the data warehousing approach to take advantage of the many benefits it provides, including superior

query processing performance in multi-source environments, the ability to maintain and annotate

extracted source data after it has been cleansed, reconciled and corrected, and the option to preserve

historical data from those repositories that do not archive their contents. Equally important, the Unifying

Database is also the persistent storage manager for the Genomics Algebra.

The component most visible to the user is the integrated schema. We distinguish between the

portions of the schema that house the restructured and integrated external data (i.e., the entities that store

the genomic data brought in from the sources), which are publicly available to every user, and those

that contain the user data (i.e., the entities that store user-created data including annotations), which may

be private. The schema containing the external data is read-only to facilitate maintenance of the

warehouse; user-owned entities are updateable by their owners. Separating user and public space

provides privacy but does not exclude sharing of data between users, which can be controlled via the

standard database access control mechanism. Since all information is integrated in one database using the

same formats and representation, cross-referencing, linking, and querying can be done using the

declarative database language provided by the underlying database management system (DBMS), which

has been extended by powerful operations specific to the characteristics of the genomic data. However,

users do not interact directly with the database language; instead, they use the commands and operations

provided by the Genomics Algebra, which may be embedded in a graphical user interface.

Conceptually, the Unifying Database may be implemented using any DBMS as long as it is

extensible. By extensible we are referring to the capability to extend the type system and query language

of the database with user-defined data types. For example, all of the object-relational and most object-

based database management systems are extensible. We believe our integration of the Genomics Algebra

with the Unifying Database represents a dramatic improvement over current technologies (e.g., a query-

driven integration system connected to BLAST sources) and will cause a fundamental change in the way

biologists will conduct sequence analysis.

5.6. Interaction Between Genomics Algebra and Unifying Database

A conceptual overview of the high-level architecture that integrates the Genomics Algebra with the

Unifying Database is shown in Figure 6. The Unifying Database is managed by the DBMS and contains

the genomic data, which either comes from the external sources or is user-generated. The link between the

Genomics Algebra and the Unifying Database is established through the DBMS-specific adapter.

Extracting and integrating data from the external sources is the job of the extract-transform-load (ETL)

tool shown on the right-hand side of Figure 6. User-friendly access to the functionality of the Genomics

Algebra is provided by the GUI component depicted in the top center.

The adapter provides a DBMS-specific coupling mechanism between the ADTs together with their

operations in the Genomics Algebra and the DBMS managing the Unifying Database. The ADTs are

plugged into the adapter by using the user-defined data type (UDT) mechanism of the DBMS. UDTs

provide the ability to efficiently define and use new data types in a database context without having to re-

architect the DBMS. The adapter is registered with the database management system at which point the

UDTs become add-ons to the type system of the underlying database.
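As a rough analogy for this coupling, the sketch below registers a toy genomic data type and one operation with a database engine so that both become usable inside ordinary queries. It uses Python's built-in sqlite3 module, whose adapter, converter, and user-defined function hooks are far simpler than the UDT mechanisms of an extensible DBMS. The MRNA class, the gc_content operation, and the transcript table are hypothetical.

```python
import sqlite3


class MRNA:
    """Toy genomic data type wrapping a messenger RNA sequence."""

    def __init__(self, sequence):
        self.sequence = sequence


def gc_content(sequence):
    """Hypothetical genomic operation: fraction of G and C bases in a sequence."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# "Adapter": tell the database driver how to store and restore MRNA values.
sqlite3.register_adapter(MRNA, lambda m: m.sequence)
sqlite3.register_converter("MRNA", lambda raw: MRNA(raw.decode("ascii")))

conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
# Expose the operation to the query language as a user-defined SQL function.
conn.create_function("gc_content", 1, gc_content)

conn.execute("CREATE TABLE transcript (name TEXT, seq MRNA)")
conn.execute("INSERT INTO transcript VALUES (?, ?)", ("t1", MRNA("AUGGCC")))
conn.execute("INSERT INTO transcript VALUES (?, ?)", ("t2", MRNA("AUUUUA")))

# The new operation can now appear inside an ordinary declarative query.
names = [row[0] for row in conn.execute(
    "SELECT name FROM transcript WHERE gc_content(seq) > 0.5")]
# Values read back from a column declared as MRNA arrive as MRNA objects again.
stored = conn.execute("SELECT seq FROM transcript WHERE name = 't1'").fetchone()[0]
```

The registration calls are the only place where the storage format of the type is known, which is the role the text assigns to the DBMS-specific adapter.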

Figure 6: Integration of the Genomics Algebra with the Unifying Database through a DBMS-specific adapter.

The component responsible for loading the Unifying Database and making sure its contents are up-

to-date is referred to as ETL (Extract-Transform-Load). In our system architecture, ETL comprises four

separate activities:

1. Monitoring the data sources and detecting changes to their contents. This is done by the source

monitors.

2. Extracting relevant new or changed data from the sources and restructuring the data into the

corresponding types provided by the Genomics Algebra. This is done by the source wrappers.

3. Merging related data items and removing inconsistencies before the data is loaded into the

Unifying Database. This is done by the warehouse integrator.

4. Loading the cleaned and integrated data into the Unifying Database. This is done by the loader.
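The four activities can be pictured as a small pipeline. The following Python sketch is only an illustration of the division of labor. The tab-separated source format, the digest-based change detection, the merge rule (fill in missing values and keep the first value seen when sources conflict), and the dictionary standing in for the Unifying Database are all assumptions, as are the two sample sources.

```python
import hashlib


class SourceMonitor:
    """Activity 1: detect changes by comparing content digests between polls."""

    def __init__(self):
        self._digests = {}

    def changed(self, source_name, raw_text):
        digest = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()
        is_new = self._digests.get(source_name) != digest
        self._digests[source_name] = digest
        return is_new


def wrap(raw_text):
    """Activity 2: restructure 'accession<TAB>organism<TAB>sequence' lines."""
    records = []
    for line in raw_text.splitlines():
        if not line:
            continue
        accession, organism, sequence = line.split("\t")
        records.append({"accession": accession,
                        "organism": organism or None,
                        "sequence": sequence or None})
    return records


def integrate(record_lists):
    """Activity 3: merge records sharing an accession, filling in missing values."""
    merged = {}
    for records in record_lists:
        for rec in records:
            target = merged.setdefault(rec["accession"], dict(rec))
            for key, value in rec.items():
                if target.get(key) is None:
                    target[key] = value
    return merged


def load(warehouse, merged):
    """Activity 4: write the cleaned records into the warehouse (a dict here)."""
    warehouse.update(merged)


# Two made-up sources that describe the same entry with complementary gaps.
sources = {"repoA": "X001\tHomo sapiens\t\n", "repoB": "X001\t\tATGGCC\n"}
monitor, warehouse = SourceMonitor(), {}
changed = [wrap(text) for name, text in sources.items() if monitor.changed(name, text)]
load(warehouse, integrate(changed))
```

In the architecture described here, the wrapper would emit GDT values rather than dictionaries and the loader would go through the DBMS-specific adapter; the sketch leaves both out to stay self-contained.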

As we can see from Figure 6, the ETL component interfaces with a DBMS-specific adapter instead of the

DBMS directly. This adapter, which implements the interface between database engine and Genomics

Algebra, is the only component that has knowledge about the types and operations of the Genomics

Algebra as well as how they are implemented and stored in the DBMS.

6. Conclusion

In this paper we have described BlastQuest, a Web-based and interactive tool for importing and

persistently storing genomic data from multiple BLAST queries in a relational database, applying DBMS

functionality for processing and querying these data, and visualizing them appropriately. Limitations of

the underlying concept, which will inevitably be reached even with some meaningful improvements,

require new concepts and advanced tools.

The Genomics Algebra is a promising approach in this direction. We believe our new approach will

cause a fundamental change in the way biologists analyze genomic data. No longer will biologists be

forced to interact with hundreds of independent data repositories each with their own interface. Instead,

biologists will work with a unified database through a single user interface specifically designed for

biologists. Our high-level Genomics Algebra will allow biologists to pose questions using biological

terms, not SQL statements. Managing user data will also become much simpler for biologists, since

their data can also be stored in the Unifying Database and they will no longer have to prepare a custom

database for each data collection. Biologists should, and indeed want to, invest their time being biologists,

not computer scientists.

From a computer science perspective, our project leverages and extends the benefits and

possibilities of current database technology. In particular, we demonstrate the elegance and expressive

power of modeling and integrating non-standard and extremely complex data by the concept of abstract

data types into databases and query languages. In addition, our approach is independent of a specific

underlying DBMS data model. That is, the Genomics Algebra can be embedded in a relational, object-

relational, or object-oriented DBMS as long as it is equipped with the appropriate extensibility

mechanisms. In addition, we believe we will gain valuable knowledge about the design and

implementation of new, sophisticated data structures and efficient algorithms in the non-standard

application field of biology and bioinformatics.

References
