NeXML - phylogenetic data as XML
NeXML: A future data exchange standard for phylogenetics
Rutger Vos
Introduction (1/7) The problem
Increased automation in evolutionary informatics is hampered by poorly defined “standards”
Outline
• Introduction: The problem, EvoInfo interests, This subproject, Nexus issues, Parsing, Extensibility, XML goodies
• Design: Principles, Re-use, Patterns, Inheritance, References
• Implementation: Approach, ERD, Inheritance, Anatomy, Characters, Trees
• Current status: Schema blocks, Parsers & writers, Experiments, To do
• Resources
Introduction (2/7) EvoInfo.nescent.org interests
Addressing interoperability problems by coding our way out of them:
• Syntax: Nexml
• Semantics: CDAO
• Transport: PhyloWS
Introduction (3/7) This subproject’s mission
• To create a file format like nexus*, but:
  o Fix (some) problems with nexus
  o Give access to data at a higher level
  o Be extensible
  o Expose data to xml goodies
*Maddison, Swofford and Maddison, 1997. NEXUS: An Extensible File Format for Systematic Information. Syst. Biol. 46(4):590-621
Introduction (4/7) Nexus issues
• Hard/impossible to validate
• No explicit versions
  o Nothing ever deprecated
• No public extensions
  o Leads to hacks such as ‘mixed’ data, ‘hot comments’
  o Phylogenetics post-’80s in private blocks
https://www.nescent.org/wg_evoinfo/NEXUS_Problems
Introduction (5/7) Parsing plain text versus parsing XML
• Processing nexus data involves lexing + parsing + processing
• With XML you can choose an existing parser library and process the data as a structure, which hides tokenization issues
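The contrast can be sketched in Python. This is a minimal illustration, not part of the original slides: the nexus-like statement and the XML fragment are toy inputs, and the function names are hypothetical. The plain-text route needs a hand-written lexer that knows about quoting and underscores, while the XML route leaves tokenization to an off-the-shelf parser (here the standard library's xml.etree.ElementTree).

```python
import re
import xml.etree.ElementTree as ET

# Toy nexus-like statement (assumed input): labels may be quoted and contain spaces.
NEXUS_TEXT = "TAXLABELS 'Homo sapiens' Pan_troglodytes;"

# Toy XML carrying the same labels (element and attribute names are illustrative).
XML_TEXT = """
<otus id="tax1">
  <otu id="t1" label="Homo sapiens"/>
  <otu id="t2" label="Pan troglodytes"/>
</otus>
"""

def lex_nexus_labels(text):
    """Hand-rolled lexer: split into quoted strings, bare words and ';'."""
    tokens = re.findall(r"'[^']*'|[^\s;]+|;", text)
    labels = []
    for tok in tokens[1:]:              # skip the TAXLABELS keyword
        if tok == ";":
            break
        if tok.startswith("'"):
            labels.append(tok.strip("'"))
        else:
            # In nexus, an underscore in a bare word stands for a space.
            labels.append(tok.replace("_", " "))
    return labels

def xml_labels(text):
    """The parser library handles tokenization; we only walk the structure."""
    root = ET.fromstring(text)
    return [otu.get("label") for otu in root.findall("otu")]
```

Both functions return the same two labels, but only the first one had to encode quoting rules by hand.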
Introduction (6/7) Extensibility
• An ‘extensible’ file format should provide the ability to:
  o Define new data types that implement described ‘interfaces’
  o Attach typed data structures to core types
  o Attach custom XML
Introduction (7/7) XML goodies
• Large stack of off-the-shelf tools:
  o XML parser libraries
  o Web service toolkits
  o Native XML databases
  o Editors / IDEs
  o Serialization / data binding tools
Design (1/5) Design principles
• Re-use of prior art
• Follow design patterns
• Referencing
• Verbose and compact representations
Design (2/5) Re-use of prior art
• Generic key/value attachments using RDFa
• Trees and networks following graphml
• General file structure following nexus concepts, i.e. blocks that reference each other
Design (3/5) XML design patterns
• http://www.xmlpatterns.com
• “Declare before use”
• “Metadata first”
• “Venetian blinds”
• Abstract inheritance through extension, concrete inheritance through restriction
Design (4/5) Inheritance
• Base (optional base/lang/href attributes)
• Annotated (optional dict elements) extends Base
• Labelled (optional label attribute) extends Annotated
• IDTagged (required id attribute) extends Labelled
• AbstractElement (in root schema) extends IDTagged
• ConcreteElement (in instance document) restricts AbstractElement
Design (5/5) Referencing
• Elements sometimes refer to other elements, much like in nexus
• In nexml, an element refers to another element through an attribute named after the referenced element, whose value is that element's id:
<otu id="t1"/>
<!-- referenced later: -->
<node id="n1" otu="t1"/>
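A consumer of such a document has to resolve these references itself. The following Python sketch shows one way to do so with the standard library; the wrapping <doc> element, the extra otu and node elements, and the function name resolve_otus are assumptions made for illustration, not part of the slides.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment; the <doc> wrapper is hypothetical and only makes
# the snippet a well-formed document.
DOC = """
<doc>
  <otu id="t1"/>
  <otu id="t2"/>
  <node id="n1" otu="t1"/>
  <node id="n2" otu="t2"/>
  <node id="n3"/>
</doc>
"""

def resolve_otus(xml_text):
    """Map each node id to the <otu> element its 'otu' attribute points at."""
    root = ET.fromstring(xml_text)
    otus_by_id = {otu.get("id"): otu for otu in root.iter("otu")}
    resolved = {}
    for node in root.iter("node"):
        ref = node.get("otu")
        if ref is None:
            continue  # this sketch assumes nodes without an otu are allowed
        if ref not in otus_by_id:
            raise KeyError(f"node {node.get('id')} refers to unknown otu {ref}")
        resolved[node.get("id")] = otus_by_id[ref]
    return resolved
```

Because the attribute name matches the referenced element's name, the lookup table to consult is always obvious from the attribute itself.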
Implementation (1/6) Approach
• Schema design
• Community feedback through wiki, email, telecon, projects (evoinfo, ppod, MIAPA) etc.
• Processors (perl, java, python, c++, javascript, VB) developed in parallel
• Experiments with xml tools (ws, db, data binding tools)
Implementation (2/6) Entity relationships
Implementation (3/6) Inheritance tree for elements
Implementation (4/6) Anatomy of a “block”
<characters id="c1" xsi:type="nex:DnaSeqs" otus="t1">
  <meta id="m1" datatype="xsd:string" xsi:type="nex:LiteralMeta" property="dwc:catalogNumber" content="12345"/>
  Contents…
</characters>
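Reading such a block with a namespace-aware parser could look like the Python sketch below. The namespace URIs bound to the nex, xsi, xsd and dwc prefixes are assumptions added so that the fragment parses on its own (the slide omits the declarations), the block's elided contents are left out, and describe_block is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the slide does not show the declarations.
NEX = "http://www.nexml.org/2009"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

BLOCK = f"""
<characters xmlns="{NEX}" xmlns:nex="{NEX}" xmlns:xsi="{XSI}"
            xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:dwc="http://rs.tdwg.org/dwc/terms/"
            id="c1" xsi:type="nex:DnaSeqs" otus="t1">
  <meta id="m1" datatype="xsd:string" xsi:type="nex:LiteralMeta"
        property="dwc:catalogNumber" content="12345"/>
</characters>
"""

def describe_block(xml_text):
    """Collect the block's id, concrete type, otus reference and annotations."""
    block = ET.fromstring(xml_text)
    annotations = {
        meta.get("property"): meta.get("content")
        for meta in block.findall(f"{{{NEX}}}meta")
    }
    return {
        "id": block.get("id"),
        # ElementTree exposes xsi:type under its expanded {namespace}name key.
        "type": block.get(f"{{{XSI}}}type"),
        "otus": block.get("otus"),
        "annotations": annotations,
    }
```

The xsi:type value names the concrete subclass of the block, the otus attribute is a reference of the kind described under Referencing, and each meta element is a key/value attachment.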
Implementation (5/6) Character classes
Data type     Cells              Sequence
DNA           DnaCells           DnaSeqs
RNA           RnaCells           RnaSeqs
Protein       ProteinCells       ProteinSeqs
Standard      StandardCells      StandardSeqs
Continuous    ContinuousCells    ContinuousSeqs
Restriction   RestrictionCells   RestrictionSeqs
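The relation between the two columns can be illustrated with a small Python sketch that expands a compact sequence row into one cell per character. The row, seq and cell element names and the char and state attributes are assumptions about the layout, not taken from the slides, and the sketch stores the raw symbol as the state rather than a reference to a state definition.

```python
import xml.etree.ElementTree as ET

# Assumed compact ("Seqs") row layout: one <seq> string per row.
COMPACT_ROW = '<row id="r1" otu="t1"><seq>ACGT</seq></row>'

def to_verbose(row_xml, char_ids):
    """Expand a compact row into a verbose ("Cells") row, one <cell> per character."""
    compact = ET.fromstring(row_xml)
    # Drop any whitespace inside the sequence string.
    symbols = "".join(compact.findtext("seq", default="").split())
    if len(symbols) != len(char_ids):
        raise ValueError("sequence length does not match number of characters")
    verbose = ET.Element("row", {"id": compact.get("id"), "otu": compact.get("otu")})
    for char_id, symbol in zip(char_ids, symbols):
        # Each cell names the character (column) it belongs to and its state.
        ET.SubElement(verbose, "cell", {"char": char_id, "state": symbol})
    return verbose
```

The verbose form gives every cell its own element, so individual observations can carry annotations, at the cost of a much larger file.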
Implementation (6/6) Tree classes
           Int          Float
Tree       IntTree      FloatTree
Network    IntNetwork   FloatNetwork
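Since trees follow graphml, a tree is a flat list of nodes and edges rather than a nested parenthetical string. The Python sketch below turns such a list into a Newick string; the source, target, length and root attribute names are assumptions in the graphml style, namespaces are omitted, and to_newick is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Illustrative tree with float edge lengths (the FloatTree case);
# an IntTree would parse the lengths with int() instead.
TREE = """
<tree id="tree1">
  <node id="n1" root="true"/>
  <node id="n2" otu="t1"/>
  <node id="n3" otu="t2"/>
  <edge id="e1" source="n1" target="n2" length="0.5"/>
  <edge id="e2" source="n1" target="n3" length="1.25"/>
</tree>
"""

def to_newick(xml_text):
    """Render a node/edge list as Newick, labelling nodes by their otu reference."""
    tree = ET.fromstring(xml_text)
    children = {}
    for edge in tree.findall("edge"):
        children.setdefault(edge.get("source"), []).append(
            (edge.get("target"), float(edge.get("length")))
        )
    labels = {n.get("id"): n.get("otu", "") for n in tree.findall("node")}
    roots = [n.get("id") for n in tree.findall("node") if n.get("root") == "true"]

    def render(node_id):
        kids = children.get(node_id, [])
        if not kids:
            return labels[node_id]
        inner = ",".join(f"{render(child)}:{length}" for child, length in kids)
        return f"({inner}){labels[node_id]}"

    return render(roots[0]) + ";"
```

A network differs from a tree only in that a node may be the target of more than one edge, which the flat node/edge layout can express and nested parentheses cannot.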
Current status (1/4) Schema blocks
• Done:
  o OTUs
  o characters: dna, rna, nucleotide, protein, categorical, continuous, restriction (compact and verbose)
  o trees: graphml trees and networks, various edge formats and rootings
Current status (2/4) Parsers and writers
• Nexml parsers and writers:
  o Phenex
  o TreeBASE
  o Mesquite
  o Bio::Phylo
  o DendroPy
  o DAMBE
  o Etc.
Current status (3/4) Experiments
• Included schema in soap wsdl
• Indexed files in dbxml
• Created large files from tolweb, rbcl
• XInclude with tinyseq xml
• REST service described using nexml
Current status (4/4) To do
• Cross-reference with glossary, ontology
• Substitution model descriptions
• Publish standard
• Compact trees
• Distances
• Splits
Resources
• NeXML base URL: http://nexml.org
  o Wiki: /wiki
  o Mailing list: /mail
  o Issue tracker: /tracker
  o SVN repository: /code
• EvoInfo: http://evoinfo.nescent.org
• CDAO: http://www.evolutionaryontology.org
Acknowledgements
• Contributions: Jason Caravas, Mark Holder, Peter Midford, Jeet Sukumaran, Xuhua Xia, Chase Miller, Anurag Priyam, Jaime Huerta-Cepas, Matt Yoder, Andrew Hill, Sam Smits, Mike Keesey, Apurv Verma, Mark Jensen
• Feedback: wg-evoinfo, pPOD, Wayne Maddison, David Maddison
• Additional funding, support: NESCent, GSoC