ChemSpider – Is This The Future of Linked Chemistry on the Internet?
Antony Williams
BAGIM, Boston, August 2010
Our dog has fleas
It’s not an Advantage…
What is the structure of “Advantage”?
Audience Participation Time….
Where would you look? What would you trust? Where would you look ONLINE?
What is the Structure of Vitamin K?
MeSH
A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K
What is the Structure of Vitamin K1?
Wikipedia
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
PubChem
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
Variants of systematic names on PubChem
2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
2-methyl-3-(3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
Bioassay Data are Associated…
Structures on DailyMed
Lack of Stereochemistry
Does Stereochemistry Matter?
Does one stereocenter matter?
Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide
Incorrect Structures
Wow!
ChEBI – Manual Curation
The InChI Identifier
Multiple Layers
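The layered structure of an InChI string can be illustrated with a short sketch. This is a simplified reading of the format, not the official InChI software: it assumes layers are separated by "/", that the first segment is the version, that the formula layer carries no prefix letter, and that every later layer starts with a one-letter prefix (for example c for connectivity, h for hydrogens, t/m/s for stereochemistry). The function name `split_inchi_layers` is hypothetical.

```python
def split_inchi_layers(inchi: str) -> list[tuple[str, str]]:
    """Split an InChI string into (layer name, layer content) pairs.

    Simplified sketch: it does not validate the content of any layer.
    A list of pairs is returned (not a dict) because a prefix letter
    can occur more than once in a full InChI.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: missing 'InChI=' prefix")

    parts = inchi[len(prefix):].split("/")
    layers = [("version", parts[0])]
    rest = parts[1:]

    # The formula layer has no prefix letter; prefixed layers start
    # with a lowercase letter such as 'c', 'h', 'q', 'p', 't', 'm', 's'.
    if rest and rest[0] and not rest[0][0].islower():
        layers.append(("formula", rest[0]))
        rest = rest[1:]

    for part in rest:
        if part:  # skip empty segments
            layers.append((part[0], part[1:]))
    return layers


if __name__ == "__main__":
    # Methane: version, formula and hydrogen layers only.
    for name, content in split_inchi_layers("InChI=1S/CH4/h1H4"):
        print(f"{name}: {content}")
```

A structure drawn without stereochemistry simply lacks the t, m and s layers, which is why two records for the "same" compound can carry different InChIs.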
InChI Strings Hash to InChIKeys
PubChem InChIKeys
MBWXNTAXLNYFJB-NKFFZRIASA-N
MBWXNTAXLNYFJB-LKUDQCMESA-N
MBWXNTAXLNYFJB-UHFFFAOYSA-N
MBWXNTAXLNYFJB-FAKCLFGASA-N
MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
MBWXNTAXLNYFJB-ODDKJFTJSA-N
MBWXNTAXLNYFJB-KSVLJPARSA-N
MBWXNTAXLNYFJB-UDCSOKOMSA-N
MBWXNTAXLNYFJB-JHBCSKSVSA-N
MBWXNTAXLNYFJB-JXAKDHTRSA-N
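All ten keys listed for Vitamin K1 share the same first block, which encodes the skeleton (connectivity); they differ only in the second block, which covers stereochemistry and isotopes. The sketch below groups the keys from this slide by first block. It assumes the usual 14-10-1 character layout of a standard InChIKey and the convention that a second block beginning UHFFFAOY means no stereo or isotope information was encoded. The helper names are hypothetical, and nothing here calls PubChem or ChemSpider.

```python
from collections import defaultdict

# InChIKeys copied from the slide above.
VITAMIN_K1_KEYS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",  # O-18 label
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]


def split_inchikey(key: str) -> tuple[str, str, str]:
    """Return (skeleton block, stereo/isotope block, protonation flag)."""
    blocks = key.split("-")
    if len(blocks) != 3 or [len(b) for b in blocks] != [14, 10, 1]:
        raise ValueError(f"unexpected InChIKey layout: {key!r}")
    return blocks[0], blocks[1], blocks[2]


def group_by_skeleton(keys: list[str]) -> dict[str, list[str]]:
    """Group full InChIKeys by their first (connectivity) block."""
    groups: dict[str, list[str]] = defaultdict(list)
    for key in keys:
        groups[split_inchikey(key)[0]].append(key)
    return dict(groups)


def lacks_stereo_and_isotopes(key: str) -> bool:
    """True when the second block signals no stereo/isotope layers."""
    return split_inchikey(key)[1].startswith("UHFFFAOY")


if __name__ == "__main__":
    for skeleton, members in group_by_skeleton(VITAMIN_K1_KEYS).items():
        print(skeleton, len(members), "records")
    print([k for k in VITAMIN_K1_KEYS if lacks_stereo_and_isotopes(k)])
```

Matching on the first block alone finds every record with the same skeleton; matching on the full key finds only records with identical stereochemistry and isotopic labeling.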
InChIs
InChIs are proliferating across databases
InChIs are increasingly used by publishers
Single code base – no multi-flavored SMILES
InChIs are “incomplete” but very useful…
Vancomycin – Search the Internet
Full Skeleton Search: 104 Hits
Full Molecule Search: 4 Hits
Is this the structure of Vitamin K1?
Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science
Linked Data on the Web
Taken from: Rafael Sidis’ Blog
Where Would You look? What Do You Trust?
Question Everything online: www.dhmo.org
It’s all on Wikipedia…
What’s Methane?
What ELSE is Methane???
The EXPERTS must get it right?!
Wikipedia, C&E News, PubChem C&E News (from ACS)
Feedback from Steve Ritter
“As for where we source our structures, our primary source is the researcher and peer-reviewed papers, because many compounds are novel.
..we always double check them against one or more primary sources, typically Merck Index and SciFinder.
Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”
Feedback from Steve Ritter
“As a rule, we at C&EN don’t use Wikipedia as a primary source for structures or chemical information, and I recommend that policy to anyone.”
“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”
A vision…
Authoritative web-based source of standard, well-drawn structures
With associated data – spectra, property data, ADME/Tox data, Bioassay data
Linked to encyclopedic articles, publications, patents, MSDS/safety sheets
Links to chemical vendors
Links to property predictions
A Pragmatic Vision
“Build a Structure Centric Community”
December 2006 – A hobby project initiated to connect chemistry on the web
Integrate chemical structure data on the web
Create a “structure-based hub” to information and data
Provide access to structure-based “algorithms”
Let chemists contribute their own data
Allow the community to curate/correct data
media.obsessable.com
As few interfaces as possible
What do humans want?
www.chemspider.com
We’re Out to Answer Questions
Questions a chemist might ask…
What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
Search for a Chemical…by name
Available Information…
Linked to vendors, safety data, toxicity, metabolism
Available Information….
Search for a chemical…by structure
Substructure search coming…
Annotating, Cleaning and Growing...
Almost 25 million chemicals from 400 diverse data sources
“Diverse” data sources…
Quality ranges from high through questionable to wrong
Content ranges from rich records with Wikipedia links, YouTube videos and photographs to “Stub Records” containing “just a structure”
All records can be further enhanced…25 million compounds need annotation by the masses
Search “Vitamin H”
“Curate” Identifiers
General curation activities
Remove incorrect names
Correct spellings
Remove names whose stereochemistry (present or absent) does not match the structure
Correct registry numbers and other numeric identifiers (Beilstein, EINECS, etc.)
Add multilingual names
Add alternative names
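One way a curator could pre-screen registry numbers before correcting them by hand is the CAS Registry Number check digit. The sketch below is an illustration only and is not presented as ChemSpider's own curation logic. It relies on the published check-digit rule: weight the digits before the check digit 1, 2, 3… from right to left, sum them, and take the result modulo 10. The function name `is_valid_cas_rn` is hypothetical. A number that passes is well formed, but it may still be attached to the wrong structure.

```python
import re

# CAS RN layout: 2 to 7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas_rn(cas_rn: str) -> bool:
    """Check the format and check digit of a CAS Registry Number."""
    match = CAS_PATTERN.match(cas_rn.strip())
    if not match:
        return False

    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))

    # Rightmost digit has weight 1, the next one weight 2, and so on.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


if __name__ == "__main__":
    # 74-82-8 is methane; 74-82-9 has a deliberately wrong check digit.
    for candidate in ("74-82-8", "74-82-9", "not-a-number"):
        print(candidate, is_valid_cas_rn(candidate))
```

A check like this catches typing and transcription errors only; mis-associated but well-formed numbers still need human curation.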
Crowdsourced “Annotations”
Registered users can add:
Descriptions/Syntheses/Commentaries
Links to PubMed articles
Links to articles via DOIs
Spectral data
Crystallographic Information Files
Photos
MP3 files
Videos
Spectra Linked
Link off a structure in ChemSpider
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”
Semantic Markup: Project Prospect
Success Depends on Dictionaries
Semantic Linking of Structures
What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”
“Chemicalizing” Pages
ChemSpider SyntheticPages
ChemSpider Everywhere: What do computers want?
Web services
ChemSpider Everywhere
Linked from Wikipedia and many Public Databases
Linked from Open Notebook Science sites
Linked from Blogs using Structure/Spectra EMBED
Integrated into structure drawing packages
Integrated into software offerings from Thermo, Waters, Agilent, Bruker
Structure Database Lookup
Reaction Database Look-up
There will always be gaps...
What ChemSpider does not deal with, yet...
Materials
Minerals
Polymers
Biological macromolecules
ChemSpider Tomorrow
6 months: >1.2M compounds/month
6 months: >800,000 new uniques
6 months: >60 new data sources added
Continue the curation effort and keep cleaning
Finish depositions – millions left to deposit
Integrate RSC content – a massive archive!
Integrate RSC publishing workflows and databases
Enable the semantic web for chemistry – RDF was layered on last week
The Future of Linked Chemistry on the Internet?
I can buy my wife a “methane ring” for Xmas
There are more than 10 compounds called Vitamin K1 on PubChem…
Most databases online cannot be annotated
The public funds the generation of data that is then mis-associated, cannot be used for modeling, for reference, for…
Low quality databases become authorities
The community accepts the status quo
The PREFERABLE Future of Linked Chemistry on the Internet?
Public compound databases federate to build a truly linked environment of validated data!
Data validation needs are not ignored
Publishers layer on information to make publications discoverable
Public-Private databases can be linked
Open Data proliferate
RDF is everywhere
Business models WILL change
Thank you
Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams