RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Post on 11-May-2015

The increasing availability of free and open access resources for scientists on the internet presents us with a revolution in data availability. The Royal Society of Chemistry hosts ChemSpider, a free access website for chemists built with the intention of building community for chemists (http://www.chemspider.com/). ChemSpider is an aggregator of chemistry-related information: at present, over 20 million unique chemical entities are linked out to over 300 separate data sources. ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. It is also a public deposition platform where chemists can deposit their own data, including novel structures, analytical data and synthesis procedures, and host data associated with the growing activities of Open Notebook Science. This presentation will examine chemistry on the internet, the dubious quality of what is available, and how the ChemSpider crowdsourced curation platform is fast becoming one of the centralized hubs for resourcing information about chemical entities. We will also review our efforts to provide free resources for synthesis procedures, spectral data and structure-based searching of the chemistry literature, and how chemists can contribute directly to each of these projects.

Transcript of RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Lawrence Berkeley National Laboratory, March 2010

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches required to resource data

The Final Search Strategy

All Those Names, One Structure: A problem to solve…

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches required to resource data

Trustworthy Chemistry?
• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science

Where Would You look? What Do You Trust?

Question Everything online: www.dhmo.org

Di-Hydrogen Monoxide

2H + 1O = H2O = Water

It’s all on Wikipedia…

Chemistry on The Internet Is Messy

It’s Methane…

What’s Methane?

What ELSE is Methane???

Drugs are REALLY Messy

Vancomycin

Who will curate?

How would you clean such a large dataset?

Assertions!!!

The EXPERTS must get it right?!

Wikipedia, C&E News, PubChem C&E News (from ACS)

Feedback from C&E Senior Editor

“Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”

“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”

Structural Data for Life Sciences: DailyMed

Lack of Stereochemistry

Incorrect Structures

Ugh…

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches required to resource data

Just “Public Compound” Databases

• PubChem
• Drugbank
• ChEBI/ChEMBL
• KEGG
• LipidMAPs
• ChemIDPlus
• eMolecules
• ZINC
• Lots of chemical vendors
• ChemSpider

media.obsessable.com

As few interfaces as possible

What do humans want?

A Pragmatic Vision: “Build a Structure Centric Community to Serve Chemists”

December 2006 – A hobby project initiated to connect chemistry on the web

• Integrate chemical structure data on the web
• Create a “structure-based hub” to information and data
• Provide access to structure-based “algorithms”
• Let chemists contribute their own data
• Allow the community to curate/correct data

Answer Questions

Questions a student might ask…
• What is the structure of levulinic acid?
• Chemically, what is phenolphthalein?
• What are the stereocenters of cholesterol?
• Where can I find publications about xylene?
• What are the different trade names for Ketoconazole?
• What is the NMR spectrum of Aspirin?
• How can I synthesize 2,4-dichlorophenol?
• What are the safety handling issues for Thymol Blue?

What is Levulinic Acid?

Basic Info

Wikipedia and External Links

External Links to Data

Linked across the internet

Kyoto Encyclopedia of Genes and Genomes

Google Patent Integration

Access to Articles

• RSC Journals
• RSC Books
• PubMed
• Google Scholar
• Google Books
• Microsoft Academic Search

Access to Articles

Google Scholar

Experimental and Predicted Properties

ChemSpider : Spectra Linked

Search “OEA”

Linked Patents for OEA

Statistics for Today

>25 million compounds from >300 data sources

About 7000 unique users per day and up to ½ million transactions per day

A crowdsourced deposition and curation platform

Grows daily – more depositions, more links, more data

Searching Chemistry on the Internet

How complete a result set will we get if we search for “chemicals” by name?

Is there a better way to link chemistry databases? Linking by “names” is dangerous

Chemists want structure and SUBstructure searching
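Why linking by name is dangerous can be shown with a minimal sketch. It assumes two hypothetical sources that list the same compound under different names; the record layout and the `link_by` helper are illustrative only and are not part of ChemSpider. The InChIKey used is the standard key for water.

```python
# Two hypothetical data sources describing the same compound under different names.
source_a = [{"name": "water", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"}]
source_b = [{"name": "dihydrogen monoxide", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"}]


def link_by(field, left, right):
    """Return (left, right) record pairs whose `field` values match exactly
    (case-insensitively)."""
    index = {}
    for rec in right:
        index.setdefault(rec[field].lower(), []).append(rec)
    return [(l, r) for l in left for r in index.get(l[field].lower(), [])]


name_links = link_by("name", source_a, source_b)      # names differ: no link found
key_links = link_by("inchikey", source_a, source_b)   # same structure: link found

print(len(name_links), len(key_links))
```

The name join misses the match entirely, while the structure-derived identifier finds it. A name join can also fail in the other direction, by linking two different compounds that happen to share a trivial name.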

The InChI Identifier

Multiple Layers

InChI Strings Hash to InChIKeys

Link the Internet with InChIKeys!

Taken from: Rafael Sidis’ Blog
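The layered layout of an InChIKey can be sketched in a few lines. This is a minimal sketch that only splits an existing key into its documented blocks: 14 letters hashed from the connectivity layers, 8 letters hashed from the remaining layers (stereochemistry, isotopes and so on), a standard/non-standard flag, a version letter and a protonation letter. It does not compute the hash itself. The names `split_inchikey` and `InChIKeyParts` are hypothetical.

```python
import re
from typing import NamedTuple

# Layout: 14 letters, hyphen, 8 letters + flag + version, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    skeleton: str      # hash of the connectivity layers
    detail: str        # hash of the remaining layers (stereo, isotopes, ...)
    standard: bool     # True when the flag is "S" (standard InChI)
    version: str       # InChI version letter
    protonation: str   # "N" means neutral


def split_inchikey(key: str) -> InChIKeyParts:
    """Split a well-formed InChIKey into its blocks; raise ValueError otherwise."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, detail, flag == "S", version, protonation)


parts = split_inchikey("RCINICONZNJXQF-MZXODVADSA-N")
print(parts)
```

Because the key is a fixed-length string of capital letters and hyphens, it behaves well in general-purpose web search engines, which is what makes it usable as a linking token.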

Vancomycin – Search the Internet

Vancomycin

Search Molecular SKELETON

Search Full Molecule

Full Molecule Search: 4 Hits

Full Skeleton Search: 104 Hits
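The difference between the two searches can be sketched against a toy index. The first block of an InChIKey encodes the molecular skeleton (connectivity), so a skeleton search matches on that block alone, while a full-molecule search matches the whole key. Only the first key below appears in these slides; the other keys and all page titles are made-up placeholders, and the function names are hypothetical.

```python
# Toy index of web pages keyed by InChIKey (placeholder data for illustration).
index = {
    "RCINICONZNJXQF-MZXODVADSA-N": "page A: fully specified stereochemistry",
    "RCINICONZNJXQF-AAAAAAAASA-N": "page B: same skeleton, different stereo (placeholder key)",
    "RCINICONZNJXQF-UHFFFAOYSA-N": "page C: same skeleton, no stereo layer (placeholder entry)",
    "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N": "page D: unrelated compound (placeholder key)",
}


def full_molecule_search(key, pages):
    """Exact match on the complete InChIKey."""
    return [title for k, title in pages.items() if k == key]


def skeleton_search(key, pages):
    """Match on the first (connectivity) block only."""
    prefix = key.split("-")[0] + "-"
    return [title for k, title in pages.items() if k.startswith(prefix)]


query = "RCINICONZNJXQF-MZXODVADSA-N"
full_hits = full_molecule_search(query, index)
skeleton_hits = skeleton_search(query, index)
print(len(full_hits), len(skeleton_hits))
```

As in the hit counts above, the skeleton search returns more results because it also picks up records where the stereochemistry is missing or drawn differently.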

Vancomycin on ChemSpider: 1 compound – 3 days

InChIKeys

RCINICONZNJXQF-MZXODVADSA-N

Make the internet searchable by adding InChIKeys

Publishers add InChIKeys to papers now…
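Once InChIKeys are embedded in pages and papers, finding them is a plain text-matching problem. The following is a minimal sketch, assuming the keys appear verbatim in the text; the `find_inchikeys` helper and the sample sentence are illustrative, and the pattern checks the shape of a key only, not whether it corresponds to a real structure.

```python
import re

# Shape of an InChIKey: 14 capital letters, 10 capital letters, 1 capital letter.
INCHIKEY_PATTERN = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")


def find_inchikeys(text):
    """Return the distinct InChIKey-shaped tokens found in `text`, sorted."""
    return sorted(set(INCHIKEY_PATTERN.findall(text)))


sample = (
    "The compound (RCINICONZNJXQF-MZXODVADSA-N) was dissolved in water, "
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N."
)
found = find_inchikeys(sample)
print(found)
```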

RCINICONZNJXQF-MZXODVADSA-N is what???

The InChI “Resolver”

InChI Resolver to DOIs

Structure Search the Web
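An InChIKey is a hash, so the structure cannot be computed back from it; a resolver has to look it up. The following is a minimal in-memory sketch of that idea, not the actual resolver service. The InChI strings and keys for water and methane are the standard ones, while the DOIs are made-up placeholders and the `resolve` function is hypothetical.

```python
# Lookup table: InChIKey -> InChI string plus linked documents.
# The DOIs are placeholders, not real articles.
RESOLVER_TABLE = {
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N": {
        "inchi": "InChI=1S/H2O/h1H2",
        "dois": ["10.0000/example.water.1"],
    },
    "VNWKTOKETHGBQD-UHFFFAOYSA-N": {
        "inchi": "InChI=1S/CH4/h1H4",
        "dois": ["10.0000/example.methane.1", "10.0000/example.methane.2"],
    },
}


def resolve(inchikey):
    """Return the record for a key, or None when the key is unknown."""
    return RESOLVER_TABLE.get(inchikey.strip().upper())


water = resolve("XLYOFNOQVPJJNP-UHFFFAOYSA-N")
unknown = resolve("RCINICONZNJXQF-MZXODVADSA-N")
print(water["inchi"], unknown)
```

A key that was never registered resolves to nothing, which is why the resolver is only as complete as the set of structures deposited behind it.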

Most Chemistry is NOT Published

Only a fraction of chemistry is published

Only a tiny fraction of chemistry is patented

What of the “Lost Chemistry”: never published and cannot be abstracted
• Reactions performed
• Structures made and studied
• Spectra acquired and then disposed of
• Available chemicals never found

The CAS Registry

CAS Registry

Crowd-sourcing Curation and Deposition

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Building a Structure Centric Community for Chemists

Multi-level Curation and Approval

Entity-Extraction, Mark-up, Annotate

Semantic Markup: Project Prospect

Success Depends on Dictionaries

Link to a Structure or the Right Structure?

Name-Structure Pairs
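Dictionary-driven mark-up can be sketched as follows. This is a minimal illustration, assuming a small hand-made dictionary of name-structure pairs; it is not the Project Prospect pipeline, and the `mark_up` helper is hypothetical. The InChIKeys are the standard keys for water and methane.

```python
import re

# Name-structure pairs: chemical name (lower case) -> InChIKey.
NAME_TO_KEY = {
    "water": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "dihydrogen monoxide": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "methane": "VNWKTOKETHGBQD-UHFFFAOYSA-N",
}


def mark_up(text, dictionary):
    """Append the dictionary's identifier after every recognised name."""
    # Longest names first, so multi-word names win over shorter ones they contain.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )
    return pattern.sub(
        lambda m: f"{m.group(0)} [{dictionary[m.group(0).lower()]}]", text
    )


annotated = mark_up("Methane dissolves poorly in water.", NAME_TO_KEY)
print(annotated)
```

The mark-up is only as good as the dictionary: a wrong name-structure pair links every occurrence of that name to the wrong structure, which is why curated dictionaries matter.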

Semantic Linking of Structures

What would you want to link off a structure?
• Chemical suppliers
• Other publications
• Analytical Data
• Related Reactions
• Wikipedia
• Patents
• “Everything”

Org Prep Daily (Blog)

Micro- and Nano-publications

Blogs, wiki entries and even Amazon book reviews are micro/nano-publications

ChemSpider SyntheticPages will be DOI’ed – students can add these “micro-publications” to their resume

Structures and spectra are nano-publications – these (depositions, curations, etc.) can also be tracked and referenced. Students participate in building one of the premier sources of chemistry data.

ChemSpider SyntheticPages

Submission process
• Register as a user
• Use the Submit button and fill in the fields…

Submission Process

Submissions reviewed by editorial board

Published as is or comments sent to author

Online Peer Review process

Data supported include web movies, images, live spectra etc.

ChemSpider : Spectra Linked

ChemSpider ID 24528095 H1 NMR

ChemSpider ID 24528095 C13 NMR

ChemSpider ID 24528095 HHCOSY

ChemSpider ID 24528095 HSQC

ChemSpider ID 24528095 HMBC

Full C13 assignment uploaded

Not Just NMR Data

Spectra on ChemSpider

Available Spectra http://www.chemspider.com/spectra.aspx

Sources of Spectra

Sourced from online sources with permission

Private collections

The MAJORITY deposited by ChemSpider users

How Could Students Help? Part 1

Students can help “curate” the data – check whether the spectra are consistent with the compound

If not then flag them, annotate them and provide feedback

OR…play the game

www.SpectralGame.com

http://www.jcheminf.com/content/1/1/9

Spectral Game

Increasing Complexity

Spectral Game

True Curation of Data

How Could Students Help? Part 2

Add their own data to the database!

Spectra from:
• research projects
• lab sessions
• supplementary data sections in publications

Spectral Uploading

Locate the structure of interest and deposit spectrum

Spectral Uploading: various types of NMR spectra supported

Deposit spectra against new structure

If a NEW compound has spectral data then deposit the structure onto ChemSpider first

How Else Can Students Help?

Students can deposit single structures or thousands of structures – UNIQUE chemistry can be added and “claimed”

Data can be curated/edited and annotated – simply register and request the rights

25 million structures, >300 data sources…there are errors of course!

NMRShiftDB

NMRShiftDB: http://www.ebi.ac.uk/nmrshiftdb/

NMR Prediction

Multinuclear NMR Prediction

NMRShiftDB Data Review

• High quality NMR shift set of ca. 100,000 shifts
• Multiple outliers identified
• Removed following publication
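The slides do not say how the outliers were identified. One plausible approach, sketched here purely as an assumption, is to compare each assigned shift with a predicted value and flag large deviations for human review. The threshold, the data and the `flag_outliers` helper are all hypothetical.

```python
def flag_outliers(assignments, threshold_ppm=10.0):
    """Return assignments whose observed shift differs from the predicted
    shift by more than `threshold_ppm` (an arbitrary illustrative cut-off)."""
    return [
        a for a in assignments
        if abs(a["observed"] - a["predicted"]) > threshold_ppm
    ]


# Made-up 13C assignments (ppm); C2 looks like a transcription error.
assignments = [
    {"atom": "C1", "observed": 128.5, "predicted": 127.9},
    {"atom": "C2", "observed": 17.0, "predicted": 170.2},
    {"atom": "C3", "observed": 55.1, "predicted": 56.0},
]

flagged = flag_outliers(assignments)
print([a["atom"] for a in flagged])
```

A flag is only a prompt for a curator to check the record; a large deviation can also mean the prediction is poor for that class of compound.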

ChemSpider Integrated NMR Prediction

Initial integration in place

A Game Through Embedding Data

Embedding Structures

Do you write Wikipedia Articles?

ChemSpider Web Services

How Can You Help ChemSpider?

Deposit your data and share with the community:
• Structures – one or many
• Spectra
• Links
• Syntheses into SyntheticPages

Curate data – most basic level…just add comments

Spread the word – ChemSpider is an untapped resource

Chemistry on the Internet FUTURE
• The semantic web for chemistry is in place
• Crowdsourced contributions are commonplace
• Chemists will search by structure/substructure
• Chemistry articles indexed and searchable
• Reduced number of searches to find data
• Data are integrated – compounds, vendors, syntheses, data, publications and patents
• A world of Open Access and Open Data

Classical business models will have to morph

Thank you

antony.williams@chemspider.com
Twitter: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams