Royal Society of Chemistry activities to develop a data repository for chemistry-specific data
Aileen Day, Alexey Pshenichnov, Ken Karapetyan, Colin Batchelor, Peter Corbett, Jon Steele, Valery Tkachenko and Antony Williams
ACS Dallas
March 2014
Data in a Scientific Publication
• This is not new, you know the story…
• So much data of value is contained within a publication and delivered in PDF form
• PDF files, and especially unclear licensing, don’t let me get at the data so I can rework, reuse, repurpose, text mine etc.
• I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with capabilities I need, and the publishers should just do it
And over the years, progress…
• There is much progress with open access, data access, licensing, enhanced articles, open data, free online tools, open source codes, publishers waking up, scientists contributing
• We should be excited at what is available now, what the future holds, what opportunities exist in front of us
But it’s not easy…US
• Not everything we would like around data handling is there for sure
• Many systems, tools and platforms are already available, but we don’t know about them, or even if we did, contributing is “more work”
• “What’s in it for me?”, “It’s my data”, “It’s too much work”, “What credit do I get?”
An Initial “Vague” Vision Set
• Manage “all” of the chemistry data associated with chemical substances
• Data to be downloadable, reusable, interactive
• Build a platform that enables the scientist
• Data storage, validation, standardization and curation
• Collaborative data sharing
• Provide a data platform that can enable and enhance publishing of scientific papers
Data Repository
• Registration of chemical compounds
• Deposition of chemical syntheses
• Addition of analytical data
• Integration to electronic notebooks
• Rewards and recognition for data sharing
• Document processing
• Hosting of data as private, embargoed or public
I hate text mining data
• DERA: Developing pipelining tools for text-mining so we will be able to process documents for mark-up
• Compound extraction/markup
• Reaction extraction/conversion
• Convert “text spectra” to generate spectral libraries… AGGHHHHH!
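To illustrate what converting a “text spectrum” involves, here is a minimal sketch that pulls peaks out of one common way of reporting 1H NMR data in a paper. The regular expression, the function name `parse_text_spectrum` and the sample string are assumptions made for this example. Published text spectra vary far more than this pattern covers, which is much of the reason the task is painful.

```python
import re

# Matches entries such as "3.72 (q, J = 7.0 Hz, 2H)" or "2.61 (s, 1H)".
# Hypothetical pattern for illustration; real reports use many more variants.
PEAK = re.compile(
    r"(\d+\.\d+)\s*\(([a-z]+),\s*(?:J\s*=\s*([\d.]+)\s*Hz,\s*)?(\d+)H\)"
)


def parse_text_spectrum(text):
    """Return a list of peak dictionaries found in a textual 1H NMR report."""
    peaks = []
    for shift, multiplicity, coupling, protons in PEAK.findall(text):
        peaks.append({
            "shift_ppm": float(shift),
            "multiplicity": multiplicity,
            "j_hz": float(coupling) if coupling else None,
            "protons": int(protons),
        })
    return peaks


report = ("1H NMR (400 MHz, CDCl3) δ 3.72 (q, J = 7.0 Hz, 2H), "
          "1.25 (t, J = 7.0 Hz, 3H), 2.61 (s, 1H)")
peaks = parse_text_spectrum(report)
for peak in peaks:
    print(peak)
```

Even when this works, the result is only a peak list: line shapes, the FID and acquisition parameters are gone, which is why depositing the original data is preferable.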
Data Preferences - total bias
• Views of a spectroscopist
• Give me the data: an interactive, downloadable spectrum is way more valuable to me (processed spectrum and FID available)
• The spectral header in the JCAMP standard is very incomplete (as in most spectral standards)
• I want ASSIGNED/ANNOTATED spectra if possible – don’t “textify” a spectrum!
Solving the problem here..
• Binary file formats are problematic – think of the variations in instrumentation and software
• Standards can be defined – are they correctly implemented? CIF and its Checking, Spectral standards - JCAMP versions, Structure formats, etc…
• Metadata is crucial
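One way to check whether a standard is “correctly implemented” in a deposited file is to validate its header on arrival. The sketch below reads the `##LABEL=value` records at the top of a JCAMP-DX file and reports labels that are missing. The list in `REQUIRED_LABELS`, the label normalization and the function names are assumptions made for this example, not the checks used by the repository.

```python
import re

# Labels treated as required here. This list is an assumption for illustration;
# check the JCAMP-DX version you target before relying on it.
REQUIRED_LABELS = ("TITLE", "JCAMPDX", "DATATYPE", "ORIGIN", "OWNER")


def normalize_label(label):
    """Upper-case a label and drop spaces, hyphens, slashes and underscores."""
    return re.sub(r"[\s\-/_]", "", label).upper()


def parse_header(text):
    """Collect ##LABEL=value records that appear before the data table."""
    header = {}
    for line in text.splitlines():
        match = re.match(r"##([^=]*)=(.*)", line.strip())
        if not match:
            continue
        label = normalize_label(match.group(1))
        if label in ("XYDATA", "PEAKTABLE", "END"):
            break  # stop once tabular data or the end marker starts
        header[label] = match.group(2).strip()
    return header


def missing_labels(text):
    """Return the required labels that are absent or empty."""
    header = parse_header(text)
    return [name for name in REQUIRED_LABELS if not header.get(name)]


sample = """##TITLE=ethanol
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##XYDATA=(X++(Y..Y))
400 0.1 0.2
##END=
"""
print(missing_labels(sample))  # ['ORIGIN', 'OWNER']
```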
…and what does it solve?
• “Fixing the data” – data can’t be faked as easily
• Reprocessing of analytical data can be done…weighting functions, baseline correction, deconvolution etc.
• I can convert and store it locally
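As an example of the reprocessing that becomes possible once the FID is available, the sketch below applies an exponential weighting function (line broadening) to a synthetic FID. The function name, the dwell time and the synthetic signal are assumptions for illustration. A real workflow would read the FID from the deposited file and follow the weighting with a Fourier transform.

```python
import math


def exponential_weighting(fid, dwell_time_s, line_broadening_hz):
    """Multiply each FID point by exp(-pi * LB * t), with t = index * dwell time."""
    return [
        point * math.exp(-math.pi * line_broadening_hz * index * dwell_time_s)
        for index, point in enumerate(fid)
    ]


# Synthetic FID: one decaying 50 Hz cosine stored as complex points.
dwell = 0.001  # seconds between points (hypothetical value)
fid = [
    complex(math.cos(2 * math.pi * 50 * i * dwell), 0.0) * math.exp(-i * dwell / 0.2)
    for i in range(8)
]

weighted = exponential_weighting(fid, dwell, line_broadening_hz=1.0)
for original, damped in zip(fid, weighted):
    print(f"{abs(original):.4f} -> {abs(damped):.4f}")
```

The first point is unchanged and later points are damped progressively, trading resolution for signal-to-noise. That choice cannot be revisited if only a processed picture of the spectrum is published.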
But solve it for many things
• I want molecules as structure formats not images
• Please don’t make us hack tables of data
• Tell us how you generated your files – software version, software libraries, etc.
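A minimal sketch of what “tell us how you generated your files” could look like in practice: a small provenance record written alongside a data file. The field names and the `provenance_record` helper are assumptions for this example, not a metadata standard.

```python
import json
import platform
import sys
from importlib import metadata


def provenance_record(libraries):
    """Describe the software environment that produced a data file."""
    versions = {}
    for name in libraries:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed; recorded rather than hidden
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "libraries": versions,
    }


# The second name is deliberately one that should not exist.
record = provenance_record(["pip", "a-package-that-is-not-installed"])
print(json.dumps(record, indent=2))
```

Storing such a record next to each deposited file lets a later reader tell which software and library versions produced it.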
Input data pipeline
Deposition sources:
• Web UI for unified depositions
• DropBox, Google Drive, SkyDrive, etc
• LabTrove and other templated data
• Documents
• API, FTP, etc
Deposition Gateway
Data stages: Raw data, Staging databases, Validated data
Processing modules: Compounds Module, Spectra Module, Reactions Module, Materials Module, Textmining Module, … Module
Databases: Compounds, Reactions, Spectra, Materials, Articles / CSSP
All databases are sliced by data sources/data collections and have a simple security model where each data slice/source is private, public or embargoed
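The private / public / embargoed model for data slices could be sketched as follows. The class names, the `embargo_until` field and the rule that an embargoed slice opens once its date has passed are assumptions for illustration. The slides do not say how embargoes are lifted or how collaborators are granted access.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass
class DataSlice:
    name: str
    owner: str
    visibility: Visibility
    embargo_until: Optional[date] = None  # only meaningful for EMBARGOED


def can_read(data_slice, user, today):
    """Decide whether a user may read a slice on a given day."""
    if user == data_slice.owner:
        return True
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    if data_slice.visibility is Visibility.EMBARGOED:
        # Assumption: an embargoed slice opens to everyone once the date passes.
        return (data_slice.embargo_until is not None
                and today >= data_slice.embargo_until)
    return False  # PRIVATE and not the owner


thesis = DataSlice("thesis-spectra", "alice", Visibility.EMBARGOED, date(2015, 1, 1))
print(can_read(thesis, "bob", date(2014, 3, 16)))  # False
print(can_read(thesis, "bob", date(2015, 6, 1)))   # True
```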
User Interface Approach
• Data tier: Compounds, Reactions, Spectra, Materials, Documents
• Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API
• User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets
• User interface tier (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Chemical Inventory application, paid 3rd party integrations (various platforms: SharePoint, Google, etc)
Use cases by role (diagram)
• Analytical Chemist: Characterize, Measure, Search, Store
• Synthetic Chemist: Search (synthetic procedure), Document (publish synthetic procedure), Retrosynthetic analysis
• Medicinal Chemist: Search (against database of properties), Source (find vendor), Analyse (cluster, dock, screen)
• Computational Chemist: Search or Develop algorithm, Store results, Run calculations
• Also shown: Synthesize, Measure activity
Addition of Analytical Data
• Spectral Container is in development using componentized widgets for display
• NIST spectra converted into standardized JCAMP format for deposition - 296,103 spectra deposited
• 10% of remaining NIST spectra need to be curated as there are obvious structure issues
Electronic Notebook Data
• Development work integrating chemistry into the Southampton LabTrove notebook
• Stoichiometry table development
• Analytical data integration
• “ChemTrove” rolled out to a small test group in January
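To show the kind of calculation a notebook stoichiometry table performs, here is a minimal sketch that computes moles and equivalents relative to the limiting reagent. It assumes 1:1 stoichiometry between reagents, and the class and function names are hypothetical. It is not the ChemTrove implementation.

```python
from dataclasses import dataclass


@dataclass
class Reagent:
    name: str
    molar_mass_g_mol: float
    mass_g: float

    @property
    def moles(self):
        return self.mass_g / self.molar_mass_g_mol


def stoichiometry_table(reagents):
    """Return one row per reagent with moles and equivalents.

    Assumes 1:1 stoichiometry, so the limiting reagent is simply the one
    present in the smallest molar amount.
    """
    limiting = min(reagents, key=lambda reagent: reagent.moles)
    return [
        {
            "name": reagent.name,
            "moles": reagent.moles,
            "equivalents": reagent.moles / limiting.moles,
            "limiting": reagent is limiting,
        }
        for reagent in reagents
    ]


# Illustrative amounts only.
table = stoichiometry_table([
    Reagent("benzoic acid", 122.12, 1.2212),
    Reagent("methanol", 32.04, 3.204),
])
for row in table:
    print(f"{row['name']}: {row['moles']:.4f} mol, {row['equivalents']:.1f} equiv")
```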
Present activities – ACS Fall
• Deposition process development of compounds, reactions and spectral data by Spring
• FTP, DropBox, Web-upload, ELN integration
• Compounds, Reactions, Spectral data search, display, download
• Data sharing – private, public, collaborative
• Metadata, metadata, metadata standards!
• Open Sourcing CRD and CVSP
Acknowledgments
• Jeremy Frey and Simon Coles, University of Southampton
• Will Dichtel and Leah McEwan, Cornell University
• Stuart Chalk, University of North Florida
• Bob Hanson and Bob Lancashire, Jmol and JSpecView
Thank you
Email: [email protected]
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams