ChemSpider hosting linking and curating chemistry data for the community

ChemSpider – Hosting, Linking and Curating Chemistry Data for the Community Valery Tkachenko SLA Meeting, June 2011

description

The Internet is the world’s publicly accessible container for a myriad of resources containing chemistry-related data. Whether it be collections of millions of chemical compounds with their associated properties, interactive displays for analytical data, access to publications and patents, or the increasing availability of online computational engines, the web has become the primary enabling technology for sourcing information and data. Scientists collectively applaud and use such resources, and an increasing proportion of the community is willing to support them by contributing both data and skills to help curate and validate information on the web. This “crowdsourcing” has started to contribute large amounts of data to the commons and serves as a valuable platform for reference and, potentially, discovery. ChemSpider is one of the chemistry community’s primary online resources. It allows scientists to search across 25 million unique chemical compounds linked out to over 400 original data sources, and it has become a central hub for searching for chemistry-related data. The platform, however, offers much more to the community: it has become a central repository for analytical data, specifically spectra, it hosts community-authored chemical syntheses, and it facilitates data curation and annotation by any of its users. This presentation will provide an overview of the ChemSpider platform in terms of available data and its efforts to act as a public repository and clearing ground for data curation. We will discuss how such a platform, when coupled with game-based approaches, facilitates both teaching and data validation, and whether public domain resources such as ChemSpider will ultimately become authorities for chemistry.

Transcript of ChemSpider hosting linking and curating chemistry data for the community

Page 1: ChemSpider  hosting linking and curating chemistry data for the community

ChemSpider – Hosting, Linking and Curating Chemistry Data for the Community

Valery Tkachenko

SLA Meeting, June 2011

Page 2: ChemSpider  hosting linking and curating chemistry data for the community

Chemistry on the Internet

100s of websites hosting chemistry-related data

Chemistry information is generally “compound-based”

Chemical “structures”
Identifiers, names and synonyms
Properties
Analytical data
How to synthesize
Articles, patents, safety information

Chemistry “language and dialects”

Page 3: ChemSpider  hosting linking and curating chemistry data for the community

Dialects describing chemicals

Page 4: ChemSpider  hosting linking and curating chemistry data for the community

A Pragmatic Vision

“Build a Structure Centric Community”

Integrate chemistry across the internet based on “chemical structure”

A “structure-based hub” to information and data
Let chemists contribute their own data
Allow the community to curate & annotate data
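The idea of a structure-based hub can be sketched as a small in-memory index in which every deposited fact hangs off one structure key. This is only an illustrative sketch, not ChemSpider's implementation: the class `StructureHub`, its methods, the source names and the placeholder key (standing in for a real structure identifier such as an InChIKey) are all assumptions made for the example.

```python
from collections import defaultdict


class StructureHub:
    """Toy hub: structure key -> field -> list of (source, value).

    Hypothetical illustration only; not ChemSpider's data model.
    """

    def __init__(self):
        self._records = defaultdict(lambda: defaultdict(list))

    def deposit(self, structure_key, source, field, value):
        # Any contributor (a database or an individual chemist) may add data,
        # and the source is kept so each value can be traced and curated later.
        self._records[structure_key][field].append((source, value))

    def lookup(self, structure_key, field):
        # Return every value recorded for this structure and field.
        record = self._records.get(structure_key, {})
        return list(record.get(field, []))

    def sources(self, structure_key):
        # All distinct sources linked to one structure.
        record = self._records.get(structure_key, {})
        return sorted({src for items in record.values() for src, _ in items})


hub = StructureHub()
key = "structure-key-001"  # placeholder for a real structure identifier
hub.deposit(key, "vendor-catalogue", "synonym", "Aspirin")
hub.deposit(key, "community-user", "synonym", "Acetylsalicylic acid")
print(hub.lookup(key, "synonym"))
print(hub.sources(key))
```

Keying everything on the structure rather than on a name is the design choice that lets data from unrelated sites be linked, since names differ between sources while the structure does not.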

Page 5: ChemSpider  hosting linking and curating chemistry data for the community

www.chemspider.com

Page 6: ChemSpider  hosting linking and curating chemistry data for the community

Answering Questions for Chemists

Questions a chemist might ask…

What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Aspirin?
What is the NMR spectrum of Benzoic Acid?
What are the safety handling issues for toluene?

Page 7: ChemSpider  hosting linking and curating chemistry data for the community

Search for a Chemical…by name

Page 8: ChemSpider  hosting linking and curating chemistry data for the community

Available Information…

Linked to chemical vendors, safety data, toxicity, metabolism…

Page 9: ChemSpider  hosting linking and curating chemistry data for the community

Available Information…

Page 10: ChemSpider  hosting linking and curating chemistry data for the community

ChemSpider Today

Over 26 million unique chemicals
Over 420 data sources
Grows daily – community and RSC depositions
Community annotation and curation

We curate, edit, change, enhance data daily

Page 11: ChemSpider  hosting linking and curating chemistry data for the community

Three Years of Experience

Internet-based chemistry is a mess!

Public compound databases are contaminated

The annotation/curation of data online is difficult

Most database hosts are non-responsive to feedback – “We are a host/repository of data”

Who cares? We all should!!!

Page 12: ChemSpider  hosting linking and curating chemistry data for the community

Linked Data on the Web

Page 13: ChemSpider  hosting linking and curating chemistry data for the community

Where is chemistry online?

Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Page 14: ChemSpider  hosting linking and curating chemistry data for the community

What is the Structure of Vitamin K1?

Page 15: ChemSpider  hosting linking and curating chemistry data for the community

What is the Structure of Vitamin K1?

Page 16: ChemSpider  hosting linking and curating chemistry data for the community

Chemical Abstracts “Common Chemistry” Database

Page 17: ChemSpider  hosting linking and curating chemistry data for the community

Wikipedia

Page 18: ChemSpider  hosting linking and curating chemistry data for the community
Page 19: ChemSpider  hosting linking and curating chemistry data for the community
Page 20: ChemSpider  hosting linking and curating chemistry data for the community

Internet-Based Chemistry is a Mess

Algorithms can get you only so far

Human curation is necessary

Only the crowds can help with big data… ChemSpider has over 26 million compounds

Imagine if we worked together to create a centralized validated structure-name dictionary!

Enhances text-mining, searching, linking…
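A validated structure-name dictionary could look roughly like the sketch below: names are proposed by algorithms or users, and only links that a curator has confirmed are returned on lookup. The class `NameDictionary`, its method names and the placeholder structure keys are hypothetical and chosen purely for illustration; they do not describe an existing ChemSpider interface.

```python
class NameDictionary:
    """Toy name -> structure dictionary with a validation flag per link.

    Hypothetical sketch; keys such as "key-biotin" are placeholders.
    """

    def __init__(self):
        # normalized name -> {structure_key: validated?}
        self._entries = {}

    @staticmethod
    def _norm(name):
        # Case-insensitive, whitespace-insensitive matching of names.
        return " ".join(name.lower().split())

    def propose(self, name, structure_key):
        # A new link starts unvalidated; an existing link keeps its status.
        links = self._entries.setdefault(self._norm(name), {})
        links.setdefault(structure_key, False)

    def validate(self, name, structure_key):
        links = self._entries.get(self._norm(name), {})
        if structure_key not in links:
            raise KeyError(f"no proposed link {name!r} -> {structure_key!r}")
        links[structure_key] = True

    def reject(self, name, structure_key):
        # Remove an incorrect name-structure link, if present.
        self._entries.get(self._norm(name), {}).pop(structure_key, None)

    def resolve(self, name):
        # Only validated links are trusted for text-mining or searching.
        links = self._entries.get(self._norm(name), {})
        return sorted(key for key, ok in links.items() if ok)


names = NameDictionary()
names.propose("Vitamin H", "key-biotin")
names.propose("Vitamin H", "key-wrong-structure")
names.validate("vitamin h", "key-biotin")
names.reject("Vitamin H", "key-wrong-structure")
print(names.resolve("VITAMIN H"))
```

Returning only validated links is what would make such a dictionary useful for text-mining and linking: an unreviewed name never silently resolves to a structure.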

Page 21: ChemSpider  hosting linking and curating chemistry data for the community

Search “Vitamin H”

Page 22: ChemSpider  hosting linking and curating chemistry data for the community

Search “Vitamin H”

Page 23: ChemSpider  hosting linking and curating chemistry data for the community

“Curate” Identifiers

Page 24: ChemSpider  hosting linking and curating chemistry data for the community

“Curate” Identifiers

Page 25: ChemSpider  hosting linking and curating chemistry data for the community

“Curate” Identifiers

Page 26: ChemSpider  hosting linking and curating chemistry data for the community

Crowd-sourcing Chemistry Curation

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Page 27: ChemSpider  hosting linking and curating chemistry data for the community

“Curate” Identifiers

General curation activities

Remove incorrect names
Correct spellings
Add multilingual names
Add alternative names

In 3 years over 1 million structure-identifier relationships have been validated – robotically and manually

130 people have participated in validation or annotation. “Crowds” can be quite small!
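The curation activities listed on this slide (remove incorrect names, correct spellings, add alternative names) can be modelled as a list of attributed actions applied to a record's synonym list. The sketch below is an assumption-laden illustration: `CurationAction`, `apply_actions` and the curator names are hypothetical, and the aspirin synonyms are used only as a familiar example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CurationAction:
    """One attributed edit to a synonym list (hypothetical model)."""
    curator: str
    kind: str                      # "remove", "correct" or "add"
    name: str
    replacement: Optional[str] = None


def apply_actions(synonyms, actions):
    """Return a new synonym list with the curation actions applied in order."""
    result = list(synonyms)
    for action in actions:
        if action.kind == "remove":
            # Drop a name that does not belong to this structure.
            result = [s for s in result if s != action.name]
        elif action.kind == "correct":
            if action.replacement is None:
                raise ValueError("a correction needs a replacement name")
            # Fix a misspelling in place, keeping the list order.
            result = [action.replacement if s == action.name else s
                      for s in result]
        elif action.kind == "add":
            # Add an alternative or multilingual name once.
            if action.name not in result:
                result.append(action.name)
        else:
            raise ValueError(f"unknown curation kind: {action.kind!r}")
    return result


record = ["Asprin", "Salicylic acid"]  # deliberately flawed example record
log = [
    CurationAction("curator-1", "correct", "Asprin", "Aspirin"),
    CurationAction("curator-2", "remove", "Salicylic acid"),
    CurationAction("curator-1", "add", "Acetylsalicylic acid"),
]
print(apply_actions(record, log))
```

Keeping each action as a separate, attributed entry means the edits of a small group of curators can be reviewed, replayed or shared with other databases.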

Page 28: ChemSpider  hosting linking and curating chemistry data for the community

Vancomycin – Curate This!!!

Page 29: ChemSpider  hosting linking and curating chemistry data for the community

Vancomycin on ChemSpider

1 compound – 3 days

Page 30: ChemSpider  hosting linking and curating chemistry data for the community

Crowdsourced “Annotations”

Users can add:

Descriptions/Syntheses/Commentaries
Links to articles
Spectral data
Photos
MP3 files
Videos

Page 31: ChemSpider  hosting linking and curating chemistry data for the community

Multimedia Content Holder

Page 32: ChemSpider  hosting linking and curating chemistry data for the community

Gaming for Validation of Spectra

Page 33: ChemSpider  hosting linking and curating chemistry data for the community

Crowdsourced Validation of Spectra

Page 34: ChemSpider  hosting linking and curating chemistry data for the community

“Game-based” Validation of Data

Page 35: ChemSpider  hosting linking and curating chemistry data for the community

ChemSpider SyntheticPages

Page 36: ChemSpider  hosting linking and curating chemistry data for the community

Sharing Our Activities

Presently defining approaches with other public compound databases to share results of curation activities

Member of large European project to link data from the Life Sciences. Sharing results of curation is essential

Making curation and contribution interfaces mobile.

Page 37: ChemSpider  hosting linking and curating chemistry data for the community

Thank you

Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
Slides: www.slideshare.net/AntonyWilliams