CIPRES DB: CIPRES Database Focus Group, NSF Site Visit, June 28, 2006, San Diego
CIPRES DB
1
CIPRES Database Focus Group
NSF Site Visit
June 28, 2006
San Diego
Senior Personnel
• Susan Davidson, University of Pennsylvania
• Michael Donoghue, Yale University
• Mark Miller, San Diego Supercomputer Center
• Dan Miranker, UT Austin
• Brent Mishler, UC Berkeley
• William H. Piel, Yale University (TreeBASE II lead)
• Val Tannen, University of Pennsylvania (database focus lead)
Other (Partially) Funded Personnel
• Lucie Chan, Senior Software Developer, San Diego Supercomputer Center
• Shirley Cohen, Database Developer, then PhD Student, UT Austin, then University of Pennsylvania
• Sarah Cohen-Boulakia, Post-Doc, University of Pennsylvania (not funded by CIPRES)
• Jin Ruan, Senior Software Developer, San Diego Supercomputer Center (TreeBASE II Software Lead)
• Yifeng Zheng, PhD student, University of Pennsylvania
Goals of the Database Focus
• The major objective is the development of TreeBASE II
• In addition, this focus has supported related research on
– storage/querying of the large phylogenetic trees constructed in
  • the Simulation Focus (Davidson, Kim, Zheng)
  • the Algorithms Focus of the project (Moret, Hunt, Warnow)
– data provenance in phyloinformatics workflows (Davidson, Cohen, Cohen-Boulakia)
– phylogenetic database extensions using a metric ordering to support molecular data (Miranker)
– genome-scale phylogenetics (Piel)
– searching large collections of trees for topological patterns (Piel)
The current TreeBASE (I)
• A major data resource, more than 10 years old, for biological and biomedical research
– submissions must be published in a peer-reviewed scientific journal before being published in TreeBASE
• Has been searched from over 60,000 distinct IP addresses
• Has accepted over 1,300 submissions that map to over
– 3,700 trees and
– 60,000 distinct taxa
• But the capabilities of the current database are being overtaken by demands.
• CIPRES is developing TreeBASE II as a robust, scalable, and versatile re-design and re-engineering of TreeBASE I.
TreeBASE I Audience
Researchers from
– traditional systematics backgrounds and
– molecular biology backgrounds
who are concentrating on a series of focused experiments in the lab.
These users include those who periodically seek online representations of individual phylogenies for research and educational purposes.
Additional TreeBASE II Audiences (1)
Researchers that want to run meta-analyses on large collections of trees. Examples:
• identifying patterns in trees that result from one type of analysis over another
• visualizing large collections of trees
• studying collaborative networks among phylogeneticists
Additional TreeBASE II Audiences (2)
Phyloinformaticians who seek to make large-scale inference using synthetic methods applied to large collections of trees. Examples:
• assemble a supertree for a large branch of the Tree of Life
• mine data in search of conflicting phylogenetic signals
• examine the evolution of genes and genomes in a comparative context
Additional TreeBASE II Audiences (3)
Bioinformaticians who conduct simulation studies.
Frequently, simulation studies use simple models, such as Kimura 2-parameter and Jukes-Cantor, that are not believed to be biologically realistic.
Finding realistic evolutionary models, using real data, and carrying out simulation studies are some of the main goals of this group.
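For reference, the two simple models named above have closed-form substitution probabilities. The sketch below uses the standard textbook formulas; the function and parameter names are illustrative and do not come from any CIPRES tool. Jukes-Cantor treats all substitutions alike, while Kimura 2-parameter gives transitions (A<->G, C<->T) a rate alpha and each transversion a rate beta.

```python
import math

BASES = "ACGT"
# Transitions stay within purines (A<->G) or within pyrimidines (C<->T).
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}


def k2p_probability(src, dst, alpha, beta, t):
    """P(dst at time t | src at time 0) under Kimura 2-parameter.

    alpha is the transition rate; beta is the rate of each of the
    two possible transversions.
    """
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    if src == dst:
        return 0.25 + 0.25 * e1 + 0.5 * e2
    if TRANSITION_PARTNER[src] == dst:
        return 0.25 + 0.25 * e1 - 0.5 * e2
    return 0.25 - 0.25 * e1  # one specific transversion


def jc69_probability(src, dst, d):
    """Jukes-Cantor probability; d is expected substitutions per site."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if src == dst else 0.25 - 0.25 * e
```

Setting alpha equal to beta in the Kimura model reproduces Jukes-Cantor with d = 3 * beta * t, which is one way to see how few free parameters these models have.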
Value Added by TreeBASE II
• A phylogenetic query language to allow "power users" to run complex phyloinformatic queries, including on tree topology.
• A robust service layer and LSIDs to allow external tools and services to interface with the database.
• Storage of LSIDs and foreign handles to better integrate with external data services (morphological characters, gene names, taxon names, and museum specimen IDs).
• Taxonomic intelligence for leaf and node labels.
• Ability to store geographic coordinates to support phylogeographic data visualization and analysis.
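For readers unfamiliar with LSIDs (Life Science Identifiers): they are URNs of the form urn:lsid:authority:namespace:object, with an optional trailing revision. The sketch below shows how an external tool might split one into its parts. The identifier used in the usage note is made up for illustration and is not a real TreeBASE II handle.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID URN into authority, namespace, object and revision."""
    parts = text.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(parts[2:5])  # the three mandatory components are non-empty
    )
    if not well_formed:
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)
```

For example, parse_lsid("urn:lsid:example.org:tree:1234") yields authority "example.org", namespace "tree", object "1234" and no revision.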
Collected Use Cases: Query Examples
Given a set of taxa and a character matrix, find the characters for which the taxa have the same state.
Given a set of taxa and a set of trees, find all trees for which the subtree determined by the taxa (as leaves) is the same.
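Both use cases can be sketched on small in-memory data. This is a minimal illustration, not the TreeBASE II query engine. It assumes a character matrix stored as a mapping from taxon name to a string of states, and rooted trees written as nested tuples with taxon names at the leaves; all function names are hypothetical.

```python
def constant_characters(matrix, taxa):
    """Indices of characters on which all the given taxa share one state."""
    rows = [matrix[t] for t in taxa]
    return [i for i, states in enumerate(zip(*rows)) if len(set(states)) == 1]


def leaves(tree):
    """Set of taxon names at the leaves of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def induced_subtree(tree, taxa):
    """Canonical form of the subtree spanned by `taxa` (None if empty).

    Leaves outside `taxa` are pruned and nodes left with a single child
    are suppressed; frozensets make the result independent of child order.
    """
    if isinstance(tree, str):
        return tree if tree in taxa else None
    kept = [s for s in (induced_subtree(c, taxa) for c in tree) if s is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else frozenset(kept)


def group_by_induced_subtree(trees, taxa):
    """Group ids of trees containing all `taxa` by their induced subtree."""
    taxa = set(taxa)
    groups = {}
    for tree_id, tree in trees.items():
        if taxa <= leaves(tree):
            groups.setdefault(induced_subtree(tree, taxa), []).append(tree_id)
    return list(groups.values())
```

With trees t1 = (("A", "B"), ("C", "D")), t2 = ((("A", "B"), "E"), "C") and t3 = (("A", "C"), "B"), the taxa A, B, C induce the same subtree in t1 and t2 but a different one in t3, so the trees fall into two groups.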
TreeBASE II Capabilities: Submission
• Friendlier interface, more features semi-automated
• Support for entering additional (currently non-NEXUS) data such as specimen IDs
• Automated annotations (e.g., communication with other sources to retrieve GenBank accession number sequence)
• Better error checking (e.g., matching taxon labels between trees and character matrices)
• Assistance features will be opt-in and can be turned off by the user
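The taxon-label check mentioned above can be pictured as a set comparison. The sketch below is an assumption about how such a check might work, not the TreeBASE II implementation: it treats underscores as spaces and ignores case before comparing, and both the normalisation rule and the function names are illustrative.

```python
def normalize_label(label):
    """Illustrative normalisation: underscores as spaces, collapse
    whitespace, ignore case."""
    return " ".join(label.replace("_", " ").split()).casefold()


def label_mismatches(tree_labels, matrix_labels):
    """Report normalised labels present on one side but not the other."""
    tree_set = {normalize_label(x) for x in tree_labels}
    matrix_set = {normalize_label(x) for x in matrix_labels}
    return {
        "only_in_tree": sorted(tree_set - matrix_set),
        "only_in_matrix": sorted(matrix_set - tree_set),
    }
```

A submission would pass the check when both lists in the result are empty.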
TreeBASE II Capabilities: Curation
• Support for interaction with the publication process:
– In conjunction with journal submission, study data is submitted to TreeBASE
– It is not made visible to search/query users but reviewers or journal editors can examine it (anonymous access)
– If and when the journal submission is accepted, the study data is made visible to search/query users
• Support for TreeBASE II editors, examples:
– to correct author, citation, or other metadata
– to correct the taxon names (alignment between trees and character matrices or with taxonomic services)
– to remove orphan data
An interface with access to taxonomic services such as uBio (www.ubio.org) or the Glasgow Name Server (taxonomy.zoology.gla.ac.uk/rod/rod.html) will be provided to facilitate both submission support and curation capability.
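The publication workflow in the first bullet amounts to a two-state visibility rule. The sketch below is one way to express it; the state and role names are hypothetical and are not taken from the TreeBASE II design.

```python
from enum import Enum


class Visibility(Enum):
    IN_REVIEW = "in_review"   # submitted alongside the journal manuscript
    PUBLISHED = "published"   # journal submission accepted


class Role(Enum):
    SEARCH_USER = "search_user"
    REVIEWER = "reviewer"
    JOURNAL_EDITOR = "journal_editor"


def can_view(state, role):
    """Published data is open to all; data in review only to reviewers
    and journal editors."""
    if state is Visibility.PUBLISHED:
        return True
    return role in (Role.REVIEWER, Role.JOURNAL_EDITOR)


def on_journal_decision(state, accepted):
    """Study data becomes visible only if and when the paper is accepted."""
    if state is Visibility.IN_REVIEW and accepted:
        return Visibility.PUBLISHED
    return state
```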
TreeBASE II Capabilities: Search (1)
2-step configurable GUI retrieving sets of studies, matrices, or trees.
– Step 1: choose search criteria
– Step 2: choose search
• Study Search By:
– Disjunction of conjunctions of author last names
– Citation title matches given keyword(s)
– Name matches keyword
– Contains analysis/analysis step such that:
  • Name matches given keyword(s)
  • Uses given algorithm
  • Uses given software package
  • Input and/or output data contains given set of taxa
  • Input and/or output data contains tree that matches given tree pattern
  • Input and/or output data contains matrices satisfying given search criteria (same as below)
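The "disjunction of conjunctions" criterion for author last names means a study matches if, for at least one group of names, every name in that group is among its authors. A minimal sketch, assuming the query is given as a list of name groups and that matching ignores case (both are assumptions for illustration):

```python
def matches_author_query(study_authors, query):
    """True if any group in `query` is fully contained in the author list.

    `query` is a list of groups: names are ANDed within a group and
    groups are ORed together.
    """
    names = {a.casefold() for a in study_authors}
    return any(all(n.casefold() in names for n in group) for group in query)
```

For example, the query [["Smith", "Jones"], ["Lee"]] reads as (Smith AND Jones) OR Lee.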
TreeBASE II Capabilities: Search (2)
• Tree Search By:
– Tree id number
– Appears in a study satisfying given search criteria (same as above)
– Appears in an analysis/analysis step satisfying given search criteria (same as above)
– Contains given set of taxa
– Matches given tree pattern
• Matrix Search By:
– Uses given set of taxa
– Uses given set of character names
– Is a sequence matrix that uses a certain kind of biomolecular information
– Contains given specimen(s)
TreeBASE II Capabilities: Bulk Queries
XML-based query interface for tools that interoperate with TreeBASE II
• Input: domain-specific query language
– based on the TreeBASE Domain Model
– related semantically to a simple subset of SQL or ODMG/OQL
– XML-based syntax
• TreeBASE XML format for query output
– Nexus data
– additional data in TreeBASE II
• For the CIPRES tool, which is CORBA-based, we will use an IDL-to-XML bridge
• Interactive (sophisticated) users can also submit prepared queries
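To give a feel for what an XML-syntax query over the Domain Model might look like, here is a purely hypothetical sketch of a client building one. The slides do not specify the query syntax, so every element and attribute name below (query, select, where, containsTaxa, taxon, usesAlgorithm) is invented for illustration. The intent is roughly a SQL-like "select trees where the tree contains these taxa".

```python
import xml.etree.ElementTree as ET


def build_tree_query(taxa, algorithm=None):
    """Serialise a hypothetical 'find trees' query as an XML string."""
    query = ET.Element("query", {"select": "Tree"})   # invented vocabulary
    where = ET.SubElement(query, "where")
    contains = ET.SubElement(where, "containsTaxa")
    for name in taxa:
        ET.SubElement(contains, "taxon").text = name
    if algorithm is not None:
        ET.SubElement(where, "usesAlgorithm").text = algorithm
    return ET.tostring(query, encoding="unicode")
```

A tool would send the resulting string to the bulk query interface and receive TreeBASE XML in return; that exchange is not modelled here.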
TreeBASE II Domain Model
A detailed object-oriented Domain Model was designed for TreeBASE II (EER diagrams were manually derived from the Domain Model)
A very partial and simplified view:
[Figure: simplified Domain Model diagram. Boxes: Study, Data, Matrix, Tree, Taxon, MatrixRow, RowSegment, Specimen; associations annotated with multiplicity 1.]
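The entity names in the simplified view can be read as a small object model. The sketch below is a guess at how they might relate: the slide names the classes, but the associations and every field chosen here (a study owning matrices and trees, a matrix row referring to one taxon, a row segment optionally referring to one specimen) are assumptions. It is plain Python, not the project's Hibernate-mapped model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Taxon:
    name: str


@dataclass
class Specimen:
    identifier: str


@dataclass
class RowSegment:
    states: str
    specimen: Optional[Specimen] = None  # assumed link to a voucher


@dataclass
class MatrixRow:
    taxon: Taxon                         # assumed: one taxon per row
    segments: List[RowSegment] = field(default_factory=list)


@dataclass
class Matrix:
    title: str
    rows: List[MatrixRow] = field(default_factory=list)


@dataclass
class Tree:
    label: str
    newick: str


@dataclass
class Study:
    citation: str
    matrices: List[Matrix] = field(default_factory=list)
    trees: List[Tree] = field(default_factory=list)


def taxa_in_study(study):
    """Names of the taxa used by the rows of a study's matrices, sorted."""
    return sorted({row.taxon.name for m in study.matrices for row in m.rows})
```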
Technologies used in TreeBASE II development
• Open source
• Proven technologies and best practices
• Hibernate to generate the SQL schema from the Domain Model
• Hibernate, based on the Domain Model, to program any database access
• Tomcat Web container and one of SDSC's Web farms
• Spring framework as an application container to manage transactions
Status and Future Plans
• Requirements and use case collection is complete
• The architectural design is complete
• Currently working on detailed design and coding, including GUI work and loading data from TreeBASE I (some is ready)
• A demo will be performed during the site visit
• TreeBASE I data will be loaded by August 2006
• Elements of the interactive user interface will be beta released and end-user tested throughout Fall 2006
• New submissions accepted starting February 2007
• Links to taxonomic services developed in Spring 2007
• Bulk query API, including CIPRES tool interface, developed in 2007
• Available as Web service at end of 2007