Checking, Curating And Qualifying Chemistry
Uploaded by orcid-0000-0002-2668-4821
Checking, Curating and Qualifying Chemistry to Build a Structure Centric Community for Chemists
Rutgers University 12/2/2008
Antony Williams
Building a Structure Centric Community for Chemists

ChemSpider - A Search Engine for Chemists
Questions a chemist might ask…
  What is the melting point of n-butanol?
  What is the chemical structure of Xanax?
  Chemically, what is phenolphthalein?
  What are the stereocenters of cholesterol?
  Where can I find publications about xylene?
  What are the different trade names for Ketoconazole?
  What is the NMR spectrum of Aspirin?
  What are the safety handling issues for Thymol Blue?
ChemSpider can answer all of these questions

Tell Me About Glutathione

Link outs

Links out to KEGG (Kyoto Encyclopedia of Genes and Genomes)

How many names does a compound have?

ChemSpider Data Content
Over 21.5 million unique chemical structures from ca. 150 data sources
  Online Databases – PubChem, Drugbank, KEGG, Wikipedia
  Literature – PubMed, J Het Chem, Nature, RSC, Open Access
  Chemical Vendors – over 40 different vendors and growing
  Personal Depositions – individual contributions
  Content database vendors
  Analytical data collections
  Patents
  Web scraping
Content is linked back to the original data sources

Complex Search

The Quality of Data Online…
  Aggregating data opens up quality issues
  Structure-identifier associations are “dirty”
  Structures are COMMONLY incorrect
  Manual curation of small databases is enough work – what about millions of structures?
  Structures are far from perfect. What is a “correct structure”?
    Full stereochemistry?
    Historical timeline of structure?
    Who is the authority?

Quality is a Major Issue - Search Butanol
OLD EXAMPLE... now fixed

Wikipedia Chemistry Curation project
  Only ca. 5000 organic structures, 7000 total structures
  Almost a year of work so far for a team of 6 people
  Many errors removed in the process. Curation process is a daily event for users/depositors
  Slow and torturous process
http://en.wikipedia.org/wiki/Talk:Tacrolimus#IUPAC_Name_and_structure

Wikipedia Curation
  Looking for self-consistency across a Wikipedia Page
  Primary key is the article TITLE
  The chemical shown needs to match the title
  Cyclic self-consistency – and decisions must get made
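
The self-consistency check above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LOOKUP` dictionary, the structure keys "S1" and "S2", the field labels, and the function names are all hypothetical and are not taken from the talk or from Wikipedia.

```python
# Hypothetical name -> structure-key dictionary; the keys "S1" and "S2" are made up.
LOOKUP = {"tacrolimus": "S1", "fk-506": "S1", "sirolimus": "S2"}


def resolve(name):
    """Return the structure key for a name, or None when the name is unknown."""
    return LOOKUP.get(name.strip().lower())


def check_page(title, fields):
    """Report, per page field, whether it resolves to the same structure as the title."""
    expected = resolve(title)
    if expected is None:
        raise ValueError(f"title {title!r} is not in the dictionary")
    return {label: resolve(value) == expected for label, value in fields.items()}


# The article title is the primary key; every other field is compared against it.
report = check_page("Tacrolimus", {"infobox name": "FK-506", "caption": "Sirolimus"})
print(report)
```

A field that comes back False is only flagged here; which side is wrong is the decision a curator still has to make.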

Other issues…

Charges

Sugars – Machine Readable vs Aesthetics
[Figure panels: Haworth, Stereo, Fischer]

Wikipedia – Crowdsourcing Chemistry

Thymol Blue on ChemSpider
Data online includes:
  UV-vis spectrum
  Measured experimental properties
  Link to Wikipedia article
  Links to chromatography details
  Multiple identifiers/trade names etc.
  Links to vendors/suppliers/other databases
  Safety information
http://www.chemspider.com/q/thymol%20blue

Crowd-sourcing Curation
How to curate data for millions of structures?
  Robot processes can clean up depositions
    Search for Chloride and check molecular formula for Cl
    Check for stereochemistry and remove names with stereo
  Provide a simple-to-use platform to curate, annotate and tag data
  Provide curator administration to prevent vandalism (Veropedia)
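
The two robot checks named above could look roughly like the following Python sketch. It is an assumption about how such rules might be written, not ChemSpider's actual code: the function names, the stereo-descriptor pattern, and the reading of "remove names with stereo" as "drop stereo-specific names from a record whose structure has no stereochemistry" are all illustrative.

```python
import re

# Assumed pattern for stereo descriptors in names: "(R)-", "(2S,3R)-", "(E)-", "cis-", "L-".
STEREO_NAME = re.compile(
    r"\((?:\d*[RSEZ])(?:,\d*[RSEZ])*\)-|(?:^|[\s-])(?:cis|trans|[DL])-"
)


def elements(formula):
    """Return the element symbols found in a simple molecular formula."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def chloride_is_consistent(name, formula):
    """A name containing 'chloride' is only kept if the formula contains Cl."""
    if "chloride" in name.lower():
        return "Cl" in elements(formula)
    return True


def clean_names(names, formula, structure_has_stereo):
    """Drop names that contradict the formula or claim stereo the structure lacks."""
    kept = []
    for name in names:
        if not chloride_is_consistent(name, formula):
            continue
        if not structure_has_stereo and STEREO_NAME.search(name):
            continue
        kept.append(name)
    return kept


names = ["butan-2-ol", "sec-butyl chloride", "(R)-butan-2-ol"]
print(clean_names(names, "C4H10O", structure_has_stereo=False))
```

Checks like these are cheap enough to run over every deposition, which is what makes them suitable for robots; anything they cannot decide is left for human curators.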

Post Comments
Anyone can “Post Comments” associated with a structure. To curate data we require login to track

Multi-level Curation and Approval

Crowd-sourcing Chemistry
  Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
  ALSO
  Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)

Vancomycin
Originally 12 structures with vancomycin
  Incomplete stereochemistry
  Complete but different stereochemistry
  Different charge states
1 remains after community collaboration with ChEBI
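
One way to find candidate duplicates like these is to compare records on a key that ignores stereochemistry and charge. The sketch below does this by dropping the relevant layers from InChI strings. The use of InChI is an assumption (the slides do not say how the 12 records were found), alanine stands in for vancomycin to keep the strings short, and the record ids are made up.

```python
from collections import defaultdict

# InChI layer prefixes ignored when comparing records:
# q, p = charge / protonation; b, t, m, s = stereochemistry; i = isotopes.
IGNORED = set("qpbtmsi")


def skeleton_key(inchi):
    """Keep only the version, formula, connectivity and hydrogen layers."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if layer[:1] not in IGNORED]
    return "/".join([head] + kept)


def group_by_skeleton(records):
    """Map each skeleton key to the ids of the records that share it."""
    groups = defaultdict(list)
    for record_id, inchi in records.items():
        groups[skeleton_key(inchi)].append(record_id)
    return dict(groups)


BASE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
records = {
    "rec-1": BASE,                   # no stereochemistry given
    "rec-2": BASE + "/t2-/m0/s1",    # one enantiomer
    "rec-3": BASE + "/t2-/m1/s1",    # the other enantiomer
    "rec-4": BASE + "/p-1",          # deprotonated form
}
groups = group_by_skeleton(records)
print(groups)
```

Grouping only proposes which records belong together; choosing the one record that should remain is still a curation decision.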

“Collaboration” with ChEBI

Ginkgolide B

DailyMed

Quality of Structures

Quality of Structures!!!

“Entity Extraction”
Rule-based recognition of systematic names:
  Use a lexeme of name fragments
  Rules for identifying bounds of a name
Look-up dictionary:
  Drug Names
  Trivial Names
  Numbers: Registry IDs, EINECS/ELINCS
  Massive look-up dictionary of validated identifiers on ChemSpider
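
The two strategies above, dictionary look-up and rule-based recognition from name fragments, can be combined as in this minimal Python sketch. Everything in it is hypothetical: the three-entry dictionary, the record ids, and the ten-fragment lexicon are toy stand-ins, and the rule only looks at single whitespace-delimited tokens, so it does not implement real rules for finding the bounds of a multi-word name.

```python
import re

# Hypothetical look-up dictionary: lower-cased name -> record id (ids are made up).
DICTIONARY = {"dichloromethane": "rec-1", "ethanol": "rec-2", "aspirin": "rec-3"}

# Tiny, hypothetical lexicon of systematic-name fragments.
FRAGMENTS = ["methan", "ethan", "phenyl", "amino", "chloro", "di", "tri", "one", "ol", "yl"]

FRAGMENT = "|".join(sorted((re.escape(f) for f in FRAGMENTS), key=len, reverse=True))
PREFIX = r"(?:\d+(?:,\d+)*-|[()\[\]-])*"  # locants such as "3,4-" and brackets
# A token counts as systematic if it is built from at least two known fragments.
SYSTEMATIC = re.compile(rf"(?:{PREFIX}(?:{FRAGMENT})){{2,}}[()\[\]]*", re.IGNORECASE)


def extract_entities(text):
    """Return (token, source) pairs for tokens recognised as chemical names."""
    found = []
    for raw in text.split():
        token = raw.strip(".,;:")
        if token.lower() in DICTIONARY:
            found.append((token, "dictionary"))
        elif SYSTEMATIC.fullmatch(token):
            found.append((token, "rule"))
    return found


text = ("To a stirred solution in dry dichloromethane was added "
        "(3,4-diaminophenyl)phenyl methanone; the solid was recrystallized from ethanol.")
for token, source in extract_entities(text):
    print(source, token)
```

The dictionary is checked first because a validated identifier is more trustworthy than a pattern match; the rule exists to catch systematic names that no dictionary contains yet.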

Name Recognition
Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol).
The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%)

Name Recognition
Names marked in the passage above: CH2Cl2; (3,4-diaminophenyl)phenyl methanone; MgSO4; dichloromethane; ethanol

ChemMantis
Chemical Markup And Nomenclature Transformation Integrated System

Document markup

Markup – 3 seconds!

On the fly conversion

Shorthand Formulae Supported

One Click to more Info…

Names and Structures
  Dichloroacetone
  Trichloromethylsilane

Ambiguity

Ambiguity in Abbreviations - DPA

IUPAC PAC Articles

Patents

Single Configuration File defines entities for markup
Algorithms can be built for certain entities but the majority are dictionaries – vendors, Phys Properties, Analytical
We can extend our system – should we integrate to PDB somehow?
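
As a rough illustration of a single configuration file that mixes dictionary entities with algorithmic ones, the Python sketch below loads a small JSON configuration and builds one recogniser per entity type. The JSON layout, the key names (`kind`, `terms`, `pattern`), the entity labels, and the sample sentence are all assumptions made for this example; the actual configuration format is not shown in the slides.

```python
import json
import re

# Hypothetical configuration: two dictionary entities and one algorithmic entity.
CONFIG_JSON = r"""
{
  "physical_property": {"kind": "dictionary", "terms": ["melting point", "boiling point"]},
  "analytical":        {"kind": "dictionary", "terms": ["NMR", "UV-vis"]},
  "registry_number":   {"kind": "algorithm",  "pattern": "\\b\\d{2,7}-\\d{2}-\\d\\b"}
}
"""


def build_recognisers(config):
    """Turn each configured entity into a compiled regular expression."""
    recognisers = {}
    for entity, spec in config.items():
        if spec["kind"] == "dictionary":
            alternatives = "|".join(re.escape(term) for term in spec["terms"])
            pattern = rf"\b(?:{alternatives})\b"
        else:
            pattern = spec["pattern"]
        recognisers[entity] = re.compile(pattern, re.IGNORECASE)
    return recognisers


def mark_up(text, recognisers):
    """Return (offset, matched text, entity) triples in document order."""
    hits = []
    for entity, regex in recognisers.items():
        hits += [(m.start(), m.group(0), entity) for m in regex.finditer(text)]
    return sorted(hits)


recognisers = build_recognisers(json.loads(CONFIG_JSON))
hits = mark_up("The melting point of a sample (CAS 89-83-8) was checked by NMR.", recognisers)
for offset, matched, entity in hits:
    print(offset, entity, matched)
```

With this layout, adding a new dictionary such as species names or PDB codes would mean adding one more entry to the configuration rather than changing the code.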

Nature Publications

Entity Balloons
  Structures are the language of chemistry
  Show structures to chemists and search/link from there
  Link to PDB?

Other Dictionaries - Species
We are considering
  Bacteria
  Fungi
  Enzymes
  Viruses
  PDB codes?

Integrations Out to Other Sources

Reactions

Conclusions
  The quality of structure-based data online should always be questioned – that includes ChemSpider
  Data on ChemSpider are being added and curated on a daily basis but we need more eyeballs helping always
  ChemSpider has a large validated structure-name dictionary
  Chemical name extraction and document markup is very enabling