Matúš Kalaš AUTHORS, THEN ALSO IN 2 A data model for ...bioxsd.org/bxPoster.pdf · update them...

TECHNOLOGY CHOICES

MOTIVATION

EXAMPLE WORKFLOW

SIMPLE EXAMPLE: BioXSD sequence record COMPLEX EXAMPLE: BioXSD feature record

Matúš Kalaš1, Sveinung Gundersen2, Inge Jonassen1, and the BioXSD and GTrack contributors

blue rectangles are Web-service calls, red ovals are data

Using common data formats makes

workflow construction and maintenance

easier and faster.

a)

Different formats

Without a common format,

using diverse tools in a workflow

demands conversions, “shims”,

or do-it-yourself parsing.

And worst of all, maintaining these

in the future.

The 2 scenarios show demands for

connecting 2 tools (e.g. Web services)

that use:

a) Different formats

b) A common format

b)

Common format

Smooth!

Example data instance, BioXSD in XML: In BioJSON: In BioYAML:

BioXSD Sequence alignment

Other types:

references to databases and ontologies, accessions,

restricted subset of numeric datatypes, global URI, names, text,

enumerations

would be great to update to Clustal Omega and FreeContact / PP Unable to sendViaPost to url http://gbio-pbil.ibcp.fr:8099/Clustalw/bx/ SocketTimeoutException: Read timed out

multi-sequence/chain features(associations)|structure|contacts|interactions|basepairing|…

Everything from here down is a bit too crowded! Other possible negatives: - no main part to focus on - no linear story line - crowded with boxes of disputable sizes and shapes

ADD IN PRINTED VERSION: - Remove year, CC, DOI - This poster can be downloaded

from bioxsd.org and QR CODE - QR codes for all social links,

also second for follow/subscribe to twitter/googlegroups

- Underline presenter - Add presenter picture

(IDEALLY PICTURES OF ALL AUTHORS, THEN ALSO IN WEB VERSION)

ONLINE & F1000 VERSION: - Remove “download from”,

presenter underline & picture (UNLESS, IDEALLY, ALL AUTHORS PICTURES), all QR codes

- Add year and CC BY-SA 4.0 (click)

- (Online & GH only, after F1000) Add DOI (click), after year and CC BY-SA 4.0, REMOVE INFO ABOUT VENUE (IF APPLICABLE)

- Align authors and affiliation to the whole width

- Update metadata!!

Update this with proper pptx shapes saying “Tool” not “WS”

references to data, identifiers/accessions, …

sequence alignments

analysis conclusions

provenance metadata

evolutionary relations

gene regulation

relations between features

variants

protein 3D structure

ncRNA structure

mRNA

genomic features

reliability metadata semantic

annotation with

ontologies

alignments

ChIP-seq values

transcriptomic reads

inferred features

experimental measurements

sequence

BioXSD data model

RDF XML

JSON

Traditional formats

Binary XML

“Efficient XML Interchange (EXI)”

Objects for programming

File “mash-ups” for annotation of any data in any format

“Interoperability” with other

schema-based formats

This figure: colour shade of XSD (green 87/157/28, blue 0/51/153), good shadow, lighter colours for rectangles, !Schemata!

references to data, identifiers/accessions, …

feature records (annotation)

sequence alignments

feature records (annotation): Example


provenance metadata


gene regulation


variants


ncRNA structure

mRNA

genomic features


annotation with

ontologies

alignments

ChIP-seq values


inferred features


sequence


provenance metadata


gene regulation


variants


ncRNA structure

mRNA

genomic features


annotation with

ontologies

alignments

ChIP-seq values


inferred features


sequence

biomolecular sequences: Example

<mySequence xsi:type=“bx:GeneralAminoacidSequenceRecord"> <bx:sequence>MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLA ELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLG

GKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL</bx:sequence> <bx:species

dbName=“NCBI Taxonomy“

accession=“9606” entryUri=“http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606”

speciesName=“Human” /> <bx:reference

dbName=“UniProt”

accession=“P43353”

entryUri=“http://www.uniprot.org/uniprot/P43353”

sequenceVersion=“1“

variantAccession=“P43353-1“ />

<bx:name>Aldehyde dehydrogenase family 3 member B1 (ALDH3B1)</bx:name> </mySequence>

1Computational Biology Unit, Department of Informatics, University of Bergen, Norway; 2Department of Informatics, University of Oslo, Norway; [email protected].

BioXSD can represent diverse interconnected features of a sequence,

together with related data and metadata, in an integrated data record.

Matúš Kalaš1, Sveinung Gundersen2, László Kaján3, Jon Ison4, Steve Pettifer5, Christophe Blanchet6, Rodrigo Lopez4, Kristoffer Rapacki7 and Inge Jonassen1

1Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway;

2Institute for Cancer Research, Oslo University Hospital and Department of Informatics, University of Oslo, Oslo, Norway;

3unaffiliated, previously Bioinformatics and

Computational Biology Department, Technische Universität München, Garching, Germany; 4European Bioinformatics Institute, EMBL, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK;

5School of Computer Science, University of Manchester, Manchester, UK;

6L'Institut Français de Bioinformatique, Gif-sur-Yvette, and Institut de Biologie et Chimie des Protéines, CNRS and Université Claude Bernard Lyon 1, Lyon, France;

7Center for Biological Sequence Analysis, Technical University of Denmark, Kongens Lyngby, Denmark.

Logos a bit of corporate shit, but actually very practical for quick visual recognition

THIS BOX SHOULD TURN INTO SUMMARY, WITH ADDITIONAL DIAGRAM OF SIZE COMPARISONS AND WITH A LIST OF SUPPORTING RESOURCES

or Bio*?

bx

Leave UniComp in just like Lyon & Rost & others left there???

Somehow also that simpleTypes for RDF!?!

BOSC 2015: BioXSD — a data model for sequences, alignments, features, measured and inferred values ISMB/ECCB2015: BioXSD — A universal data model for sequences, alignments, features, measured and infererred values ECCB2016: BioXSD | BioJSON | BioYAML — Integrated formats for sequence data / Towards unified formats for sequences, alignments, features, and annotations

CHANGE THESE FINALLY TO “TOOL”!!!

<mySequenceRecord xsi:type=“bx:GeneralAminoacidSequenceRecord“ xmlns:bx="http://bioxsd.org/BioXSD-1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://bioxsd.org/BioXSD-1.1 http://bioxsd.org/BioXSD-1.1.xsd"

> <bx:sequence>MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTL AELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLG

GKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL</bx:sequence> <bx:species

dbName="NCBI Taxonomy"

accession="9606” entryUri="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606"

speciesName="Human" /> <bx:reference

dbName="UniProt"

accession="P43353"

entryUri="http://www.uniprot.org/uniprot/P43353"

sequenceVersion="1"

variantAccession="P43353-1" />

<bx:name>Aldehyde dehydrogenase family 3 member B1 (ALDH3B1)</bx:name> </mySequenceRecord> { "sequence": "MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLAELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL", "species": { "dbName": "NCBI Taxonomy", "accession": "9606", "entryUri": "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606", "speciesName":"Human" }, "reference": { "dbName": "UniProt", "accession": "P43353", "entryUri": "http://www.uniprot.org/uniprot/P43353", "sequenceVersion": "1", "variantAccession": "P43353-1" }, "name": "Aldehyde dehydrogenase family 3 member B1 (ALDH3B1)" } --- sequence: MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLAELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL species: dbName: NCBI Taxonomy accession: 9606 entryUri: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606 speciesName: Human reference: dbName: UniProt accession: P43353 entryUri: http://www.uniprot.org/uniprot/P43353 sequenceVersion: 1 variantAccession: P43353-1 name: Aldehyde dehydrogenase family 3 member B1 (ALDH3B1)

+RDF???

SIMPLE EXAMPLE: BioXSD sequence record

{ "sequence": "MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDAL", "species": { "dbName": "NCBI Taxonomy", "accession": "9606", "entryUri": "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606", "speciesName": "Human" }, "reference": { "dbName": "UniProt", "accession": "P43353", "entryUri": "http://www.uniprot.org/uniprot/P43353", "sequenceVersion": "1", "variantAccession": "P43353-1", "subsequencePosition": { "segment": { "min": 1, "max": 48 } } }, "name": "Aldehyde dehydrogenase family 3 member B1 (ALDH3B1), N-terminus" }

ONGOING DEVELOPMENT: single data model with multiple choices of exchange formats and conversions

… growing noodly appendages …

BioXSD.org

<mySequenceRecord xsi:type="bx:GeneralAminoacidSequenceRecord" xmlns:bx="http://bioxsd.org/BioXSD-1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://bioxsd.org/BioXSD-1.1 http://bioxsd.org/BioXSD-1.1.xsd"

> <bx:sequence>MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDAL</bx:sequence> <bx:species dbName="NCBI Taxonomy" accession="9606" entryUri="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606" speciesName="Human" /> <bx:reference dbName="UniProt" accession="P43353" entryUri="http://www.uniprot.org/uniprot/P43353" sequenceVersion="1" variantAccession="P43353-1" > <bx:subsequencePosition> <bx:segment min="1" max="48"/> </bx:subsequencePosition> </bx:reference> <bx:name>Aldehyde dehydrogenase family 3 member B1 (ALDH3B1), N-terminus</bx:name> </mySequenceRecord>

--- sequence: MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDAL species: dbName: NCBI Taxonomy accession: "9606" entryUri: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606 speciesName: Human reference: dbName: UniProt accession: P43353 entryUri: http://www.uniprot.org/uniprot/P43353 sequenceVersion: "1" variantAccession: P43353-1 subsequencePosition: segment: min: 1 max: 48 name: Aldehyde dehydrogenase family 3 member B1 (ALDH3B1), N-terminus

AFTER THE TECHNOLOGY EXPLANATION, ADD THAT NO SUCH FORMAT “STANDARDISED”

UPDATE!!! +note what bx includes ?make 3 icons/colors: uses|in progress|considered +their legend? also Apache Avro? & YAML tags stuf????

BioXSD is optimised/using standards…

● loss-less compression

ADD LOGOS OF XSDS!!!

BioXSD is rich enough to enable:

● loss-less capture of diverse data that would

otherwise require use of multiple different formats

● loss-less conversions between diverse formats

REMOVE WHITE BACKGROUNDS FROM ICONS HERE AND ELSEWHERE (and update them in the figures) SEE “multiFormatDataModel_WITH SOME NONTRANSPARENT WHITE BKGRNDS.png”!!! check out and add JSON-LD under JSON as (incl. JSON-LD) – if with bx “context” or under RDF if with EDAM&co

A data model for sequences, alignments,

features, measured and inferred values

BioJSON BioYAML

Matúš Kalaš1, Sveinung Gundersen2, László Kaján3, Hervé Ménager4, Jon Ison5, Christophe Blanchet6, Steve Pettifer5, Rodrigo Lopez4, Kristoffer Rapacki7 and Inge Jonassen1 and open for contributions

/bioxsd/bioxsd @BioXSD http://groups.google.com/group/bioxsd http://bioxsd.org [email protected]

Latest stable release: http://bioxsd.org/BioXSD-1.1.xsd

BioXSD has been developed as a tree-structured data model and

exchange format for basic bioinformatics data, centered around

bio-polymer sequence [1,2]. BioXSD allows integration of diverse

features, information, measurements, and inferred values about a

biological molecule or its part or context, annotated with

provenance and reliability metadata, ontology concepts, scientific

remarks, and conclusions.

BioJSON and BioYAML are alternatives to the XML serialization of

BioXSD, following the same data model. As tree-structured data

formats, BioXSD, BioJSON, and BioYAML are particularly suitable

for programming in object-oriented languages, and for use with

web applications and web APIs (Web services), while at the same

time allowing a reasonable level of human readability.

BioXSD|BioJSON|BioYAML are now further developed together

with GTrack, GSuite, and BTrack (a family of universal tabular

formats for features and metadata [2,3]), under the umbrella of

ELIXIR Norway, with an open group of international contributors

(http://bioxsd.org/#Contact). The BioXSD|GTrack family of generic,

convenient, and interoperable data formats provides an added

value for integrative bioinformatics and omics.

ONGOING DEVELOPMENTS: single data model with multiple choices of exchange formats and conversions

BioXSD data model

RDF BioXSD in XML BioJSON

Traditional

bioinformatics formats

EXI (binary XML),

BSON (binary JSON) for omic-scale data

Objects for programming

Ad-hoc custom formats tabular and key-value

conversion

serialisation

usability

BioYAML GTrack GSuite

TSVs with column definitions

http://gtrack.no

“Interoperability” with other schema-

based and tree-based formats

File “mash-ups” with metadata for annotation of any data in any format

Key-value pairs

Tab-separated values

Binary format

Traditional formats:

RDF format

Semantic Web: Tree-structured:

XML format

JSON format

1D

2D

Tree Graph

Different paradigms of data formatting represent data differently.

YAML format

A machine-understandable definition of a specific format (a data model, a schema) is highly beneficial for validation and maintainability.

XML Schema (XSD) 1.0

XML Schema 1.1

Relax NG

JSON Schema

…

OWL

…

GTrack format TSV with column

definitions

http://gtrack.no

ADD LINK TO RG PROJECT?? (when updated)

2016 DOI: 10./

2016 DOI: 10./

BioXSD Towards unified formats for sequences,

alignments, features, and annotations BioJSON BioYAML

2018

[1] Kalaš, M., Puntervoll, P., Joseph, A., Bartaševičiūtė (now Karosiene), E., Töpfer, A., Venkataraman, P., Pettifer, S., Bryne, J.C., Ison, J., Blanchet, C.,

Rapacki, K. and Jonassen, I. (2010). BioXSD: the common data-exchange format for everyday bioinformatics web services. Bioinformatics, 26, i540-i546.

DOI: 10.1093/bioinformatics/btq391 PMID: 20823319 Open Access

[2] Gundersen, S., Kalaš, M., Abul, O., Frigessi, A., Hovig, E. and Sandve, G.K. (2011). Identifying elemental genomic track types and representing them

uniformly. BMC Bioinformatics, 212, 494. DOI: 10.1186/1471-2105-12-494 PMID: 22208806 Open Access

[3] Gundersen S., Kalas M., Simovski B. et al. (2018). The GTrack ecosystem - expressive file formats for analysis of genomic track data [version 1;

not peer reviewed]. F1000Research, 7(ELIXIR):270. Poster. DOI: 10.7490/f1000research.1115292.1 Open Access

http://gbio-pbil.ibcp.fr:8099/Clustalw/bx/





mailto:[email protected]



https://github.com/bioxsd/bioxsd

https://twitter.com/BioXSD

http://groups.google.com/group/bioxsd

http://bioxsd.org/


http://bioxsd.org/BioXSD-1.1.xsd





http://gtrack.no/

http://bioxsd.org/#Contact





http://gtrack.no/

http://gtrack.no/

https://creativecommons.org/licenses/by-sa/4.0/

http://creativecommons.org/licenses/by/4.0/

http://creativecommons.org/licenses/by/4.0/

http://doi.org/10.1093/bioinformatics/btq391



http://www.ncbi.nlm.nih.gov/pubmed/20823319

http://doi.org/10.1186/1471-2105-12-494

http://doi.org/10.1186/1471-2105-12-494

http://doi.org/10.1186/1471-2105-12-494

http://doi.org/10.1186/1471-2105-12-494

http://doi.org/10.1186/1471-2105-12-494

http://doi.org/10.1186/1471-2105-12-494

http://doi.org/10.1186/1471-2105-12-494

http://doi.org/10.1186/1471-2105-12-494

http://doi.org/10.1186/1471-2105-12-494

http://www.ncbi.nlm.nih.gov/pubmed/22208806

https://doi.org/10.7490/f1000research.1115292.1










Matúš Kalaš AUTHORS, THEN ALSO IN 2 A data model for ...bioxsd.org/bxPoster.pdf · update them...

Documents

Transcript of Matúš Kalaš AUTHORS, THEN ALSO IN 2 A data model for ...bioxsd.org/bxPoster.pdf · update them...