Checking, Curating And Qualifying Chemistry

Checking, Curating and Qualifying Chemistry to Build a Structure Centric Community for Chemists
Rutgers University, 12/2/2008
Antony Williams

description

An overview of what we do to curate and annotate small molecules, and how that work forms the basis of ChemMantis. A presentation given to the PDB team at Rutgers University.

Transcript of Checking, Curating And Qualifying Chemistry

Page 1: Checking, Curating And Qualifying Chemistry

Checking, Curating and Qualifying Chemistry to Build a Structure Centric Community for Chemists

Rutgers University, 12/2/2008

Antony Williams

Page 2: Checking, Curating And Qualifying Chemistry

Building a Structure Centric Community for Chemists

ChemSpider - A Search Engine for Chemists

Questions a chemist might ask…
    What is the melting point of n-butanol?
    What is the chemical structure of Xanax?
    Chemically, what is phenolphthalein?
    What are the stereocenters of cholesterol?
    Where can I find publications about xylene?
    What are the different trade names for Ketoconazole?
    What is the NMR spectrum of Aspirin?
    What are the safety handling issues for Thymol Blue?

ChemSpider can answer all of these questions

Page 3: Checking, Curating And Qualifying Chemistry

Tell Me About Glutathione

Page 4: Checking, Curating And Qualifying Chemistry

Tell Me About Glutathione

Page 5: Checking, Curating And Qualifying Chemistry

Tell Me About Glutathione

Page 6: Checking, Curating And Qualifying Chemistry

Tell Me About Glutathione

Page 7: Checking, Curating And Qualifying Chemistry

Tell Me About Glutathione

Page 8: Checking, Curating And Qualifying Chemistry

Link outs

Page 9: Checking, Curating And Qualifying Chemistry

Links out to KEGG
Kyoto Encyclopedia of Genes and Genomes

Page 10: Checking, Curating And Qualifying Chemistry

How many names does a compound have?

Page 11: Checking, Curating And Qualifying Chemistry

ChemSpider Data Content

Over 21.5 million unique chemical structures from ca. 150 data sources
    Online Databases – PubChem, DrugBank, KEGG, Wikipedia
    Literature – PubMed, J Het Chem, Nature, RSC, Open Access
    Chemical Vendors – over 40 different vendors and growing
    Personal Depositions – individual contributions
    Content database vendors
    Analytical data collections
    Patents
    Web scraping

Content is linked back to the original data sources

Page 12: Checking, Curating And Qualifying Chemistry

Complex Search

Page 13: Checking, Curating And Qualifying Chemistry

The Quality of Data Online…

Aggregating data opens up quality issues
Structure-identifier associations are “dirty”
Structures are COMMONLY incorrect
Manual curation of small databases is enough work – what about millions of structures?
Structures are far from perfect. What is a “correct structure”?
    Full stereochemistry?
    Historical timeline of structure?
    Who is the authority?

Page 14: Checking, Curating And Qualifying Chemistry

Quality is a Major Issue - Search Butanol

OLD EXAMPLE... now fixed

Page 15: Checking, Curating And Qualifying Chemistry

Wikipedia Chemistry Curation project

Only ca. 5000 organic structures, 7000 total structures
Almost a year of work so far for a team of 6 people
Many errors removed in the process. Curation process is a daily event for users/depositors
Slow and torturous process

http://en.wikipedia.org/wiki/Talk:Tacrolimus#IUPAC_Name_and_structure

Page 16: Checking, Curating And Qualifying Chemistry

Wikipedia Curation

Looking for self-consistency across a Wikipedia page
Primary key is the article TITLE
The chemical shown needs to match the title
Cyclic self-consistency – and decisions must get made

Page 17: Checking, Curating And Qualifying Chemistry

Other issues…

Page 18: Checking, Curating And Qualifying Chemistry

Charges

Page 19: Checking, Curating And Qualifying Chemistry

Sugars – Machine Readable vs Aesthetics

Haworth Stereo Fischer

Page 20: Checking, Curating And Qualifying Chemistry

Wikipedia – Crowdsourcing Chemistry

Page 21: Checking, Curating And Qualifying Chemistry

Thymol Blue on ChemSpider

Data online includes:
    UV-vis spectrum
    Measured experimental properties
    Link to Wikipedia article
    Links to chromatography details
    Multiple identifiers/trade names etc.
    Links to vendors/suppliers/other databases
    Safety information

http://www.chemspider.com/q/thymol%20blue

Page 22: Checking, Curating And Qualifying Chemistry

Crowd-sourcing Curation

How to curate data for millions of structures?

Robot processes can clean up depositions
    Search for Chloride and check molecular formula for Cl
    Check for stereochemistry and remove names with stereo
Provide a simple-to-use platform to curate, annotate and tag data
Provide curator administration to prevent vandalism (Veropedia)
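The two robot checks named on this slide can be sketched roughly as below. This is only an illustration under stated assumptions, not ChemSpider's actual robot code: the function names, the keyword table and the stereo-descriptor pattern are all hypothetical, and "bromide" is an assumed extension of the slide's chloride example.

```python
import re

# Name keyword -> element the molecular formula should then contain.
# "chloride" -> Cl comes from the slide; "bromide" is an assumed extension.
KEYWORD_TO_ELEMENT = {"chloride": "Cl", "bromide": "Br"}

# Stereo descriptors such as (R), (S), (E), (Z) or (2R,3S) inside a name.
STEREO_DESCRIPTOR = re.compile(r"\((?:\d+[a-z]?)?[RSEZ](?:,(?:\d+[a-z]?)?[RSEZ])*\)")


def elements_in_formula(formula):
    """Return the element symbols in a simple formula such as 'C4H9Cl'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def names_contradicting_formula(names, formula):
    """Names that mention an element keyword the formula does not support."""
    present = elements_in_formula(formula)
    return [
        name for name in names
        if any(keyword in name.lower() and element not in present
               for keyword, element in KEYWORD_TO_ELEMENT.items())
    ]


def names_claiming_stereo(names):
    """Names carrying a stereo descriptor; suspect when the structure has none."""
    return [name for name in names if STEREO_DESCRIPTOR.search(name)]


if __name__ == "__main__":
    # A record whose formula has no Cl but whose synonym list mentions chloride.
    print(names_contradicting_formula(["butan-1-ol", "butyl chloride"], "C4H10O"))
    # Names to drop from a record whose structure has no defined stereocenters.
    print(names_claiming_stereo(["butan-2-ol", "(R)-butan-2-ol"]))
```

Checks like these only flag candidates; deciding whether the name or the structure is wrong still falls to a curator.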

Page 23: Checking, Curating And Qualifying Chemistry

Post Comments

Anyone can “Post Comments” associated with a structure. To curate data we require login to track

Page 24: Checking, Curating And Qualifying Chemistry

Multi-level Curation and Approval

Page 25: Checking, Curating And Qualifying Chemistry

Crowd-sourcing Chemistry

Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation

ALSO

Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)

Page 26: Checking, Curating And Qualifying Chemistry

Vancomycin

Originally 12 structures with vancomycin
    Incomplete stereochemistry
    Complete but different stereochemistry
    Different charge states

1 remains after community collaboration with ChEBI
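One way to surface records that differ only in stereochemistry or charge state is to group them by the InChI layers that describe formula, connectivity and hydrogens. The presentation does not say how the vancomycin duplicates were found, so the sketch below is an assumed approach: the function names and record IDs are hypothetical, and small alanine and glycine InChI strings stand in for vancomycin.

```python
from collections import defaultdict

# InChI layer prefixes for charge (q), protonation (p) and stereo (b, t, m, s).
VARIANT_LAYER_PREFIXES = ("q", "p", "b", "t", "m", "s")


def skeleton_key(inchi):
    """Drop charge and stereo layers, keeping formula, connectivity, hydrogens."""
    header, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(VARIANT_LAYER_PREFIXES)]
    return "/".join([header, *kept])


def group_variants(records):
    """Map each skeleton key to the IDs of the records that share it."""
    groups = defaultdict(list)
    for record_id, inchi in records:
        groups[skeleton_key(inchi)].append(record_id)
    return dict(groups)


# Illustrative records only (hypothetical IDs).
ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
RECORDS = [
    ("rec-1", ALANINE),                   # no stereochemistry defined
    ("rec-2", ALANINE + "/t2-/m0/s1"),    # one enantiomer
    ("rec-3", ALANINE + "/t2-/m1/s1"),    # the other enantiomer
    ("rec-4", ALANINE + "/p-1"),          # deprotonated form
    ("rec-5", "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"),  # glycine
]

if __name__ == "__main__":
    for key, ids in group_variants(RECORDS).items():
        print(len(ids), ids)
```

The grouping only proposes candidates for merging. Which single record should remain is a curation decision, as with the ChEBI collaboration described on this slide.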

Page 27: Checking, Curating And Qualifying Chemistry

“Collaboration” with ChEBI

Page 28: Checking, Curating And Qualifying Chemistry

Ginkgolide B

Page 29: Checking, Curating And Qualifying Chemistry

DailyMed

Page 30: Checking, Curating And Qualifying Chemistry

Quality of Structures

Page 31: Checking, Curating And Qualifying Chemistry

Quality of Structures!!!

Page 32: Checking, Curating And Qualifying Chemistry


Page 33: Checking, Curating And Qualifying Chemistry

“Entity Extraction”

Rule-based recognition of systematic names:
    Use a lexicon of name fragments
    Rules for identifying bounds of a name

Look-up dictionary:
    Drug Names
    Trivial Names
    Numbers: Registry IDs, EINECS/ELINCS
    Massive look-up dictionary of validated identifiers on ChemSpider
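The look-up dictionary half of this slide can be illustrated with a small longest-match scan. This is a sketch under assumptions, not the ChemSpider implementation: the four dictionary entries, the record IDs and the function name `find_entities` are placeholders.

```python
import re

# Tiny hypothetical dictionary: lower-cased surface form -> placeholder record ID.
DICTIONARY = {
    "dichloromethane": "rec-dichloromethane",
    "ch2cl2": "rec-dichloromethane",
    "ethanol": "rec-ethanol",
    "mgso4": "rec-magnesium-sulfate",
}


def find_entities(text, dictionary):
    """Return (surface form, record ID) pairs in the order they occur in text."""
    # Longest terms first, so a long name wins over a shorter one inside it.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(" + "|".join(map(re.escape, terms)) + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return [(m.group(1), dictionary[m.group(1).lower()]) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sample = "A solution in dry CH2Cl2 was dried over MgSO4 and washed with dichloromethane."
    for surface, record in find_entities(sample, DICTIONARY):
        print(surface, "->", record)
```

The boundary checks on either side of the pattern keep "ethanol" from matching inside "methanol". A dictionary of millions of validated identifiers would need a more scalable matcher than one regular expression.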

Page 34: Checking, Curating And Qualifying Chemistry

Name Recognition

Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol).

The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91 %)

Page 35: Checking, Curating And Qualifying Chemistry

Name Recognition

(Recognized names are shown in square brackets.)

Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry [CH2Cl2] (30.00 mL) at 0 °C were successively added [(3,4-diaminophenyl)phenyl methanone] 1 (0.40 g, 1.88 mmol) and an excess of anhydrous [MgSO4] (2.00 g, 16.67 mmol).

The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with [dichloromethane]. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from [ethanol] 95% (1.28 g, 91 %)

Page 36: Checking, Curating And Qualifying Chemistry

ChemMantis

Chemical Markup And Nomenclature Transformation Integrated System

Page 37: Checking, Curating And Qualifying Chemistry

Document markup

Page 38: Checking, Curating And Qualifying Chemistry

Markup – 3 seconds!

Page 39: Checking, Curating And Qualifying Chemistry

On the fly conversion

Page 40: Checking, Curating And Qualifying Chemistry

Shorthand Formulae Supported

Page 41: Checking, Curating And Qualifying Chemistry

One Click to more Info…

Page 42: Checking, Curating And Qualifying Chemistry

Names and Structures

Dichloroacetone

Trichloromethylsilane

Page 43: Checking, Curating And Qualifying Chemistry

Ambiguity

Page 44: Checking, Curating And Qualifying Chemistry

Ambiguity in Abbreviations - DPA

Page 45: Checking, Curating And Qualifying Chemistry

IUPAC PAC Articles

Page 46: Checking, Curating And Qualifying Chemistry

Patents

Page 47: Checking, Curating And Qualifying Chemistry

Single Configuration File defines entities for markup

Algorithms can be built for certain entities but the majority are dictionaries – vendors, Phys Properties, Analytical

We can extend our system – should we integrate to PDB somehow?
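A single configuration that mixes dictionary-backed and algorithm-backed entity types might look something like the sketch below. The layout, keys and function names are assumptions, not the actual ChemMantis configuration format. The algorithmic example validates CAS Registry Numbers by their check digit, which is the sum of the other digits weighted 1, 2, 3… from the right, taken modulo 10.

```python
import re


def find_cas_numbers(text):
    """Return CAS-style numbers in text whose check digit is valid."""
    hits = []
    for match in re.finditer(r"\b(\d{2,7})-(\d{2})-(\d)\b", text):
        digits = match.group(1) + match.group(2)
        total = sum(int(d) * weight
                    for weight, d in enumerate(reversed(digits), start=1))
        if total % 10 == int(match.group(3)):
            hits.append(match.group(0))
    return hits


# Hypothetical configuration: every entity type is declared in one place.
ENTITY_CONFIG = {
    "phys_property": {"kind": "dictionary", "terms": ["melting point", "boiling point"]},
    "analytical": {"kind": "dictionary", "terms": ["NMR", "UV-vis"]},
    "registry_number": {"kind": "algorithm", "finder": find_cas_numbers},
}


def tag_entities(text, config):
    """Return {entity type: matches} by applying each configured recognizer."""
    found = {}
    for entity_type, spec in config.items():
        if spec["kind"] == "dictionary":
            found[entity_type] = [
                term for term in spec["terms"]
                if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text, re.IGNORECASE)
            ]
        else:
            found[entity_type] = spec["finder"](text)
    return found


if __name__ == "__main__":
    sample = "Ethanol (64-17-5): boiling point and NMR data attached; 64-17-6 is a typo."
    print(tag_entities(sample, ENTITY_CONFIG))
```

Under this layout, adding a new dictionary such as PDB codes would mean one more entry in the configuration rather than new markup code.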

Page 48: Checking, Curating And Qualifying Chemistry

Nature Publications

Page 49: Checking, Curating And Qualifying Chemistry

Entity Balloons

Structures are the language of chemistry
Show structures to chemists and search/link from there
Link to PDB?

Page 50: Checking, Curating And Qualifying Chemistry

Other Dictionaries - Species

We are considering
    Bacteria
    Fungi
    Enzymes
    Viruses
    PDB codes?

Page 51: Checking, Curating And Qualifying Chemistry

Integrations Out to Other Sources

Page 52: Checking, Curating And Qualifying Chemistry

Reactions

Page 53: Checking, Curating And Qualifying Chemistry

Conclusions

The quality of structure-based data online should always be questioned, and that includes ChemSpider

Data on ChemSpider are being added and curated on a daily basis, but we always need more eyeballs helping

ChemSpider has a large validated structure-name dictionary

Chemical name extraction and document markup is very enabling