(near term) Develop Database Requirements to Yield Schema and Interfaces
MoBIoS: Database Management for Data in Metric Spaces
Daniel P. Miranker
Univ. of Texas
What we know for sure: Exploit Commodity Architecture
DB
Curating New Content
Computing Grid
WebApp Server
External Data/DB Sources
Users
Repository Schema and Interface Definitions
Issue:
• Database organization and data interchange should be addressed simultaneously
• Once established, difficult to change
Best to get this right the first time.
What we know for sure:
DB Schema
Curating New Content
Computing Grid
WebApp Server
1. Data transfer: XML & Nexus files
2. Curate: (manage quality)
Users
Both 1 & 2 impact the schema (data provenance)
XML and Bioinformatics
• Taxonomic Markup Language (TML)
• PhyloML
• BEAST: Bayesian Evolutionary Analysis Sampling Trees
• AGAVE: Architecture for Genomic Annotation, Visualization and Exchange
Answers Start with a Requirements Analysis
• Who
• What
• Why
• How
“Use cases”: specific examples of what is to be accomplished
A Head Start
Requirements of Phylogenetic Databases (with Nakhleh, Barbancon, Piel & Donoghue) [BIBE ’03]
• Did a requirements analysis
• Proof of concept for a correctly normalized database schema
1 evolutionary (tree)-edge = 1 row in the database
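The "one edge = one row" idea can be sketched with an in-memory SQLite table. This is only an illustration: the table name, column names, and the sample tree ((A,B),C) are assumptions made here, not the schema from the BIBE ’03 paper.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not taken from the BIBE '03 paper.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE edge (
        tree_id   INTEGER NOT NULL,
        parent_id INTEGER NOT NULL,
        child_id  INTEGER NOT NULL,
        length    REAL,
        PRIMARY KEY (tree_id, child_id)
    )
""")

# The tree ((A,B),C): node 1 is the root, node 2 is the ancestor of A and B.
# Leaves: 3 = A, 4 = B, 5 = C. Each tuple is one edge, hence one row.
edges = [
    (1, 1, 2, 0.5),
    (1, 1, 5, 1.2),
    (1, 2, 3, 0.3),
    (1, 2, 4, 0.4),
]
conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?)", edges)


def descendants(conn, tree_id, node_id):
    """Return every node below node_id by walking edge rows recursively."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(n) AS (
            SELECT child_id FROM edge WHERE tree_id = ? AND parent_id = ?
            UNION ALL
            SELECT e.child_id FROM edge e JOIN sub ON e.parent_id = sub.n
            WHERE e.tree_id = ?
        )
        SELECT n FROM sub ORDER BY n
        """,
        (tree_id, node_id, tree_id),
    ).fetchall()
    return [r[0] for r in rows]


print(descendants(conn, 1, 2))  # nodes under the ancestor of A and B
```

With edges as rows, clade queries become ordinary (recursive) SQL rather than parsing of a stored tree string.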
Who is interested in using Phylogenies?
• Casual Users
• Visualization
• Study Development
• Super-tree algorithms
• Simulation Studies
• Parameter Derivation
• Comparative Genomics
Super-Tree Algorithms Use-Cases
Construct phylogenies by assembling existing studies
Collect those studies by:
• Determine minimum spanning clade for a set of taxa
• Find all phylogenies sufficiently similar to a given phylogeny
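The first use case, the minimum spanning clade, can be sketched as follows. The sketch assumes the tree is held as a child-to-parent map (which an edge-per-row table would yield directly); the function names and the sample tree (((A,B),C),D) are hypothetical.

```python
def path_to_root(parent, node):
    """Return the list of nodes from node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def minimum_spanning_clade(parent, taxa):
    """Return (clade root, set of leaves) for the smallest clade containing taxa."""
    taxa = list(taxa)
    # Keep only the ancestors shared by every taxon; order stays leaf -> root.
    common = path_to_root(parent, taxa[0])
    for t in taxa[1:]:
        ancestors = set(path_to_root(parent, t))
        common = [n for n in common if n in ancestors]
    mrca = common[0]  # deepest shared ancestor

    # Invert the map so the subtree under the clade root can be walked.
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    stack, leaves = [mrca], set()
    while stack:
        n = stack.pop()
        kids = children.get(n, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.add(n)
    return mrca, leaves


# Hypothetical tree (((A,B),C),D) as a child -> parent map.
parent = {"A": "n2", "B": "n2", "n2": "n1", "C": "n1", "n1": "root", "D": "root"}
print(minimum_spanning_clade(parent, ["A", "C"]))
```

Asking for A and C returns the clade rooted at n1, which also pulls in B; that is the extra taxon a super-tree study would need to account for.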
Requirements of Phylogenetic Databases
The MoBIoS Project
Molecular Biological Information System
Daniel P. Miranker
University of Texas
MoBIoS – A Simple Idea
Organize the Storage Manager Around Metric Space Indexing
• Relational Databases: B+ trees, 1 dimensional
• Spatial Databases: R & K-D trees, 2 & 3 dimensions
• Metric Databases: VP, M & GNAT trees, no dimensions or very high dimensions
Biological queries conducted with sequential scans.
• Sequence (BLAST)
• Phylogenies (Tree of Life)
• Mass Spectra (Proteomics)
• Ligand Docking (Rational Drug Design)
Metric Space is
• a pair, M = (D, d), where
• D is a set of points
• d is a [metric] distance function with the following properties:
– d(x, y) = d(y, x) (symmetry)
– d(x, y) > 0 for x ≠ y, d(x, x) = 0 (non-negativity)
– d(x, y) <= d(x, z) + d(z, y) (triangle inequality)
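The properties above are what make indexing possible: the triangle inequality lets a search discard points without computing their distance to the query. The sketch below checks the three properties on a small sample and then runs a single-pivot range query. Hamming distance on toy sequences stands in for a biological metric, and all names here are illustrative assumptions, not MoBIoS code.

```python
from itertools import product


def hamming(a, b):
    """Hamming distance between equal-length strings (a standard metric)."""
    return sum(x != y for x, y in zip(a, b))


def is_metric_on(points, d):
    """Exhaustively check symmetry, non-negativity and the triangle inequality."""
    for x, y, z in product(points, repeat=3):
        if d(x, y) != d(y, x):
            return False
        if d(x, x) != 0 or (x != y and d(x, y) <= 0):
            return False
        if d(x, y) > d(x, z) + d(z, y):
            return False
    return True


def range_query(points, pivot_dist, pivot, q, r, d):
    """Return points within r of q and the number of distances actually computed."""
    dq = d(q, pivot)
    hits, evaluated = [], 0
    for x in points:
        # Triangle inequality gives |d(q,p) - d(p,x)| <= d(q,x),
        # so a gap larger than r rules x out without computing d(q, x).
        if abs(dq - pivot_dist[x]) > r:
            continue
        evaluated += 1
        if d(q, x) <= r:
            hits.append(x)
    return hits, evaluated


seqs = ["AAAA", "AAAT", "AATT", "TTTT", "GGGG", "GGGC"]
pivot = "AAAA"
pivot_dist = {s: hamming(pivot, s) for s in seqs}  # precomputed at index-build time

print(is_metric_on(seqs, hamming))
print(range_query(seqs, pivot_dist, pivot, "AAAC", 1, hamming))
```

In this toy run only three of the six stored sequences need a distance computation. Tree indexes such as VP trees apply the same pruning recursively with many pivots.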
Can Biology Be Modeled by Metrics?
• Already metrics re:
– Phylogenetic trees
– Ligand docking
• First Biologically Effective Metric Model of Amino Acid Substitution [Xu & Miranker 03]. In effect, precisely the phylogenetic relationships among sequences are exploited to form a database index.
• Metrics for proteomic mass-spectra underway
MoBIoS Architecture (Molecular Biological Information System)
phylogenies
First Application (with Randy Linder)
Compared:
{entire Arabidopsis genome} x {“entire” Rice genome}
To determine conserved pairs of primer pairs, in O(m log n).
Will repeat the study again soon, faster.
When biological data is put into an RDBMS
• Primary data is stored in text or blob fields
– Annotations may be relational
• Data retrieval
– Filter DB, sequential dump, O(n), to utilities
• E.g. BLAST, TreeBASE, Sequest
Organism   Function   Sequence (BLOB)
Yeast      membrane   AACCGGTTT
Yeast      mitosis    TATCGAAA
E. coli    membrane   AGGCCTA
Homework: Due tomorrow morning
1. Who are you (generically)?
2. Use case involving the database
Don’t know: A General Web Service
DB Schema
Curating New Content
Computing Grid
WebApp Server
ToL Infrastructure @ SDSC
Computing Grid