The CLARION Project for the Infrastructure for Integration in Structural Sciences (I2S2) mtg,...

The CLARION Project for the “Infrastructure for Integration in Structural Sciences” (I2S2) meeting, Rutherford Labs, 11th February 2010. CLARION: Chemical Laboratory Repository In/Organic Notebooks. Principal Investigator: Peter Murray-Rust. Co-Investigator: Jim Downing. Project Team: Nick Day, Sam Adams, Brian Brooks. Unilever Centre, Department of Chemistry, University of Cambridge.


Page 1:

The CLARION Project for the “Infrastructure for Integration in Structural Sciences” (I2S2) meeting, Rutherford Labs, 11th February 2010

CLARION – Chemical Laboratory Repository In/Organic Notebooks

Principal Investigator: Peter Murray-Rust

Co-Investigator: Jim Downing

Project Team: Nick Day, Sam Adams, Brian Brooks

Unilever Centre, Department of Chemistry, University of Cambridge

Page 2:

CHEM-0 repository

EmMa Embargo Manager

ELN (IDBS)

Crystallography files (CIF)

NMR files

CML, RDF

RDF triplestores

SPARQL interface

CLARION query app

CLARION overview

CHEM-1 repository

Data Releaser

Publications database

JUMBO converters

EmMa user interface

External Scientist

Internal Scientist

1. Scientist collects data & stores it in a variety of locations
2. EmMa is notified about the new content
3. Scientist specifies the release conditions for the data
4. Timer waits until release conditions are met
5. Data is moved into CHEM-1 repository...
6. ... and (at some time) into CHEM-0 repository
7. Repository queried by scientists
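The numbered workflow above can be sketched as a small state model. This is only an illustration, not the project's code: the class names, the `tick` method and the date-based release condition are all assumptions, and the slide does not say what triggers the later move into CHEM-0, so that step is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Location(Enum):
    SOURCE = "source"   # ELN, CIF or NMR file store (step 1)
    CHEM1 = "CHEM-1"    # repository the data is moved into first (step 5)
    CHEM0 = "CHEM-0"    # reached "at some time" (step 6); not modelled below


@dataclass
class DataRecord:
    identifier: str
    location: Location = Location.SOURCE
    release_at: Optional[datetime] = None  # assumed: a date-based release condition


class EmbargoManager:
    """Toy stand-in for EmMa; every name here is hypothetical."""

    def __init__(self):
        self.records = {}

    def notify(self, record):                      # step 2
        self.records[record.identifier] = record

    def set_release_date(self, identifier, when):  # step 3
        self.records[identifier].release_at = when

    def tick(self, now):                           # step 4, driven by a timer
        """Release every record whose condition is met; return their ids."""
        released = []
        for record in self.records.values():
            due = record.release_at is not None and now >= record.release_at
            if due and record.location is Location.SOURCE:
                record.location = Location.CHEM1   # step 5
                released.append(record.identifier)
        return released


emma = EmbargoManager()
emma.notify(DataRecord("cif-001"))
start = datetime(2010, 2, 11)
emma.set_release_date("cif-001", start + timedelta(days=30))
early = emma.tick(start)                      # embargo still running
late = emma.tick(start + timedelta(days=31))  # condition met, record released
```

Passing the current time into `tick` keeps the timer outside the model, so the release logic can be exercised without waiting.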

Data Loader

Page 3:

ELN server

File Feed

ELN Feed

Lensfield Loader

ELN

Data Files

CHEM-0/1 repository

• Jetty webserver
• cron jobs
• Java

• Jetty webserver
• cron jobs
• Java

GUI client

Design principles used:
• Decoupling through standard web interfaces (http, Atom)
• Avoid data duplication (by using http references unless a copy is required)
• Don’t do manually that which can be done automatically
• Manual semantification as early as possible
• Automatic semantification as late as possible
• Give ability to undo an action during a grace period rather than getting confirmation
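The last principle above (undo during a grace period instead of asking for confirmation) can be illustrated with a short sketch. It is an assumption about how such a mechanism might look, not the CLARION implementation; `UndoableAction`, `poll` and the injectable `clock` are hypothetical names.

```python
import time


class UndoableAction:
    """Runs `commit` only once `grace_seconds` have passed without an undo."""

    def __init__(self, commit, grace_seconds, clock=time.monotonic):
        self._commit = commit
        self._clock = clock
        self._deadline = clock() + grace_seconds
        self.state = "pending"

    def undo(self):
        """Cancel the action; returns True only if the undo came in time."""
        if self.state == "pending" and self._clock() < self._deadline:
            self.state = "undone"
            return True
        return False

    def poll(self):
        """Called periodically (for example from a cron-style job)."""
        if self.state == "pending" and self._clock() >= self._deadline:
            self._commit()
            self.state = "committed"
        return self.state


# Usage with a fake clock so the grace period can be stepped through by hand.
now = [0.0]
released = []
action = UndoableAction(lambda: released.append("cif-001"), 60, clock=lambda: now[0])
action.poll()   # still inside the grace period, nothing happens
now[0] = 61.0
action.poll()   # grace period over, the release is committed
```

The user is never interrupted with a confirmation dialog; the cost is that the action takes effect only after the grace period has elapsed.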

• Jetty webserver
• Java
• H2db for metadata

• JUMBO converters
• Ontologies:
  • ChemAxiom
  • ORE
  • ORE Chem Expt

• Jetty webserver
• Java & Clojure

CML

RDF

RDF Triplestore

Chemical Structure index

• Jetty webserver
• Java
• SPARQL

Blue boxes indicate logical machine environments

CLARION architecture

• SOAP

CLARION repository

• Sesame
• Chemicx

EmMa’s role:
• Adds metadata
• Defines embargo release conditions
• Is the gatekeeper for metadata quality
• Is the gatekeeper for security (trust, authentication, authorisation)
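A gatekeeper of the kind described above might be sketched as a single check that a record must pass before it leaves EmMa. The required field names, the `release_condition` key and the `gate` function are invented for illustration; the slides do not specify EmMa's actual metadata schema or security model.

```python
# Hypothetical minimum metadata; the real EmMa schema is not given in the slides.
REQUIRED_FIELDS = ("title", "creator", "data_type")


def gate(record, user, authorised_users):
    """Return a list of problems; an empty list means the record may pass."""
    problems = []
    # Security gate: only known users may release data.
    if user not in authorised_users:
        problems.append(f"user {user!r} is not authorised")
    # Metadata quality gate: every required field must be present and non-blank.
    for name in REQUIRED_FIELDS:
        if not str(record.get(name, "")).strip():
            problems.append(f"missing metadata field: {name}")
    # Embargo gate: a release condition has to be defined.
    if "release_condition" not in record:
        problems.append("no embargo release condition set")
    return problems


record = {"title": "Crystal structure 17", "creator": "A. Scientist", "data_type": "CIF"}
before = gate(record, "alice", {"alice"})
record["release_condition"] = "on publication"
after = gate(record, "alice", {"alice"})
```

Returning all problems at once, rather than failing on the first, lets a review tool show the scientist everything that still needs attention.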

Embargo Manager (EmMa)

Query System

Page 4:

Scientists presented with data records to which they add metadata and then set embargo release conditions

EmMa

Sources

Repository

Data Loader

Stage 1

Stage 2

Stage 3

CLARION development stages & timings

Stage 1: First data-feed into EmMa
• Atom-feeds from file stores
• EmMa feed-readers
• EmMa user review tool
• EmMa output atom-feeds

Stage 2: Basic functionality to store first data-type into repository
• Lensfield reads EmMa feeds
• Process data to CML
• Process CML to RDF
• Store triples into triple-store
• Indexing of chemical structures

Stage 3: Basic querying functionality
• Authentication & authorisation
• Pilot users loading data
• V1 query tool
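The "Process CML to RDF" step of Stage 2 can be pictured with a minimal sketch. The project itself uses the JUMBO converters and a triplestore; the code below is a simplified stand-in using only the Python standard library, and the `http://example.org/clarion/` vocabulary and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
EX = "http://example.org/clarion/"  # hypothetical vocabulary, illustration only

CML_DOC = f"""
<molecule xmlns="{CML_NS}" id="m1">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="O"/>
  </atomArray>
</molecule>
"""


def cml_to_triples(cml_text):
    """Turn a CML molecule into (subject, predicate, object) tuples."""
    molecule = ET.fromstring(cml_text.strip())
    subject = EX + molecule.get("id")
    triples = []
    for atom in molecule.iter(f"{{{CML_NS}}}atom"):
        atom_uri = f"{subject}/{atom.get('id')}"
        triples.append((subject, EX + "hasAtom", atom_uri))
        triples.append((atom_uri, EX + "elementType", atom.get("elementType")))
    return triples


def to_ntriples(triples):
    """Serialise as N-Triples lines; objects that are not http URIs become
    plain literals (no escaping is attempted in this sketch)."""
    lines = []
    for s, p, o in triples:
        obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


triples = cml_to_triples(CML_DOC)
ntriples = to_ntriples(triples)
```

Text in this form could then be loaded into a triplestore and queried over SPARQL, which is the role Stage 3 assigns to the query tool.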

Data stored in RDF and chemical structures indexed.

System in use by pilot users & simple query interface for SSS (substructure search) & RDF queries.

Querying by outside users.

Page 5:

EmMa: A general tool for controlling data release between systems?

Diagram labels:
• EmMa
• ISIS, ELN, XRay, NMR, etc.
• PubChem, PDB, Chem-1, Chem-0, NCS eCrystals
• Atom feed (shown five times)
• Private Atom feed
• Public Atom feed
• Original data plus basic metadata
• Fully semantified data (RDF)
• Pump (shown twice)

Page 6:

Institution A

EmMa

Rutherford

neutron

Institution B

EmMa

Events:
1. Scientist sends sample to Rutherford
2. Rutherford stores data locally and sends copy back to scientist
3. Institution’s EmMa is informed about new data
4. Scientist specifies data release conditions
5. Release conditions reached, data released to public repository
6. Rutherford monitors institution’s Atom feed, detects data is released
7. Rutherford makes data visible in their own public-access repository
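Events 6 and 7 amount to a feed reader that remembers what it has already seen. The sketch below shows one way a monitoring site might do that with the standard library; the sample feed, its URLs and the `newly_released` function are made up for illustration, and a real deployment would fetch the feed over http rather than parse a string.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical public feed published by the institution's EmMa.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Institution A released data</title>
  <entry>
    <id>urn:example:dataset-17</id>
    <title>Neutron run 17</title>
    <link rel="alternate" href="http://example.org/public/dataset-17"/>
  </entry>
</feed>"""


def newly_released(feed_xml, already_seen):
    """Return (id, href) for entries not seen before and remember their ids."""
    fresh = []
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        if entry_id in already_seen:
            continue
        link = entry.find(ATOM + "link")
        fresh.append((entry_id, link.get("href") if link is not None else None))
        already_seen.add(entry_id)
    return fresh


seen = set()
first_poll = newly_released(FEED, seen)   # event 6: the release is detected
second_poll = newly_released(FEED, seen)  # nothing new on the next poll
```

Each entry carries an http reference to the data rather than a copy, so the monitoring site can decide for itself when to make the data visible (event 7).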

Private repository

Public repository

How EmMa could facilitate data release in collaborating institutions