Overview
![Page 1: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/1.jpg)
Delivering an online service for validating and standardizing chemical structure files using the ChemSpider platform
![Page 2: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/2.jpg)
Overview
• Introduction
– Why do we need to validate/standardise data
– Examples of problems in general
– Examples of problems in ChemSpider
– Why InChI is not enough
– FDA rules
![Page 3: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/3.jpg)
What are we trying to achieve?
• Everyone wants high quality data
• The ChemSpider team is building a reputation on data quality
• Many data sources have errors
• We need to identify:
– Errors
– Inconsistencies
– Data duplication/inappropriate separation of data
• Requires a process of validation and standardization
![Page 4: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/4.jpg)
What do we mean by Validation and Standardisation?
• Validated
– Check for hypervalency, charge balance, missing stereo
– Name-structure relationships, etc.
• Standardized
– Use standard rules to “standardize” compounds: nitro groups, O-metal bonds, tautomers, etc.
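Two of the validation checks named above, hypervalency and charge balance, can be illustrated with a small sketch. This is not the CVSP implementation: the `Atom` class, the `MAX_VALENCE` table and the crude charge adjustment for N and O are all assumptions made for illustration, and the toy model only counts explicit bonds.

```python
from dataclasses import dataclass

# Simplified neutral-atom valence limits (an assumption for illustration only).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


@dataclass
class Atom:
    element: str
    charge: int = 0


def validate(atoms, bonds):
    """Return warnings for a toy molecule.

    bonds is a list of (index_a, index_b, bond_order) tuples.
    """
    warnings = []
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    for i, atom in enumerate(atoms):
        limit = MAX_VALENCE.get(atom.element)
        if limit is None:
            continue  # element not covered by this toy table
        if atom.element in ("N", "O"):
            limit += atom.charge  # crude rule: N+ may have 4 bonds, O- only 1
        if used[i] > limit:
            warnings.append(f"hypervalent {atom.element} at index {i}")
    net = sum(atom.charge for atom in atoms)
    if net != 0:
        warnings.append(f"unbalanced charge: {net:+d}")
    return warnings


# Nitro group drawn with a pentavalent nitrogen (C-N(=O)=O), heavy atoms only.
pentavalent_nitro = validate(
    [Atom("C"), Atom("N"), Atom("O"), Atom("O")],
    [(0, 1, 1), (1, 2, 2), (1, 3, 2)],
)
# The same group in charge-separated form (C-[N+](=O)[O-]).
charged_nitro = validate(
    [Atom("C"), Atom("N", 1), Atom("O"), Atom("O", -1)],
    [(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)
# Acetate drawn without its counter-ion.
lone_acetate = validate(
    [Atom("C"), Atom("C"), Atom("O"), Atom("O", -1)],
    [(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)
```

In this sketch the pentavalent nitro drawing is flagged, the charge-separated drawing passes, and the acetate without a counter-ion is reported as unbalanced.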
![Page 5: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/5.jpg)
Where will CVSP be useful?
• Currently, a standalone system
• In the future, validation/standardisation routines will be used:
– Built in to our deposition system
– At registration for new compounds
– To improve existing data in ChemSpider: pass through the ChemSpider backfile
• Potential to offer optional checking service to authors
![Page 6: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/6.jpg)
What we want to avoid
![Page 7: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/7.jpg)
What do we do now?
• Currently, ChemSpider uses structures (as InChIs) as the database key
• Need structures for depositions
• 2 steps:
– Pre-processing prior to deposition
– InChI algorithm: provides standardisation and mapping
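Using the InChI as the database key means that depositions whose structures produce the same InChI string map onto the same record. A minimal sketch of that mapping follows; the record fields `source_id` and `inchi` are hypothetical names, not ChemSpider's schema, and the InChI strings are assumed to have been computed upstream.

```python
from collections import defaultdict


def group_by_inchi(depositions):
    """Group deposited records by InChI string, mimicking a keyed lookup."""
    groups = defaultdict(list)
    for record in depositions:
        groups[record["inchi"]].append(record["source_id"])
    return dict(groups)


depositions = [
    {"source_id": "vendorA-001", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"source_id": "vendorB-417", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"source_id": "vendorA-002", "inchi": "InChI=1S/CH4O/c1-2/h2H,1H3"},
]
grouped = group_by_inchi(depositions)
```

Here the two ethanol depositions collapse onto one key while methanol gets its own, which is the mapping behaviour the slide describes.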
![Page 8: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/8.jpg)
What are the common errors?
• Records without a structure
• Incorrect valences
• Atom labels
![Page 9: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/9.jpg)
What are the common errors?
• Unbalanced charge
– Name-structure errors
• Salts
• Polymers/Organometallics
• Missing stereochemistry
![Page 10: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/10.jpg)
Side Effects of InChI on ChemSpider: Sort of helpful
![Page 11: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/11.jpg)
Side Effects of InChI on ChemSpider
• Advantages and disadvantages
– The depictions are meant to represent the same molecule
– Not easy to pick out “bad” representations
![Page 12: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/12.jpg)
Substance Registry System
• How do you decide your standardisation rules?
• Avoid standards in isolation
http://www.fda.gov/downloads/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/ucm127743.pdf
• Note: This document is only a starting point
![Page 13: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/13.jpg)
Salt and Ionic Bonds
![Page 14: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/14.jpg)
Nitro groups
![Page 15: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/15.jpg)
Ammonium salts
![Page 16: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/16.jpg)
Validation rules
In XML:
Code generated dynamically from rule set.
Indigo API used behind the scenes.
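The actual CVSP rule format is not reproduced in this transcript, so the following is only a sketch of the general idea of turning an XML rule set into executable checks. The element and attribute names (`rule`, `field`, `op`, `value`, `severity`, `message`) are invented for illustration, and the checks run on precomputed properties rather than through the Indigo API.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule format; not the real CVSP schema.
RULES_XML = """
<rules>
  <rule severity="error" field="atom_count" op="eq" value="0"
        message="Record has no structure"/>
  <rule severity="warning" field="net_charge" op="ne" value="0"
        message="Unbalanced charge"/>
</rules>
"""

OPS = {
    "eq": lambda a, b: a == b,
    "ne": lambda a, b: a != b,
    "gt": lambda a, b: a > b,
}


def load_rules(xml_text):
    """Build (severity, message, predicate) tuples from the XML rule set."""
    rules = []
    for el in ET.fromstring(xml_text.strip()).iter("rule"):
        field = el.get("field")
        op = OPS[el.get("op")]
        value = int(el.get("value"))
        # Bind the loop variables as defaults so each predicate keeps its own rule.
        predicate = lambda props, f=field, o=op, v=value: o(props[f], v)
        rules.append((el.get("severity"), el.get("message"), predicate))
    return rules


def apply_rules(rules, props):
    """Return (severity, message) for every rule that fires on props."""
    return [(sev, msg) for sev, msg, predicate in rules if predicate(props)]


rules = load_rules(RULES_XML)
empty_record = apply_rules(rules, {"atom_count": 0, "net_charge": 0})
charged_record = apply_rules(rules, {"atom_count": 5, "net_charge": -1})
```

Keeping the rules as data means a different rule set can be swapped in without touching the checking code.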
![Page 17: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/17.jpg)
Standardization rules
Corrections stored in database:
SMIRKS-based corrections and also proximity-based metal–non-metal reconnection.
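The proximity-based reconnection step can be pictured as adding a bond wherever a metal atom sits close enough to a non-metal atom. The sketch below is one possible reading of that idea, not the CVSP code: the `METALS` set, the 2.5 Å `cutoff` and the tuple-based atom representation are all assumptions.

```python
import math

# Small illustrative set of metals (an assumption, not a complete list).
METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn"}


def reconnect(atoms, bonds, cutoff=2.5):
    """Add metal/non-metal bonds for atom pairs closer than cutoff.

    atoms: list of (element, x, y, z) tuples.
    bonds: set of frozenset({i, j}) index pairs.
    Returns a new bond set; the input set is left unchanged.
    """
    new_bonds = set(bonds)
    for i, (el_i, *pos_i) in enumerate(atoms):
        if el_i not in METALS:
            continue
        for j, (el_j, *pos_j) in enumerate(atoms):
            if el_j in METALS:
                continue  # only metal to non-metal pairs are considered
            pair = frozenset((i, j))
            if pair not in new_bonds and math.dist(pos_i, pos_j) <= cutoff:
                new_bonds.add(pair)
    return new_bonds


# Sodium next to an oxygen that is already bonded to a carbon.
atoms = [("Na", 0.0, 0.0, 0.0), ("O", 2.3, 0.0, 0.0), ("C", 3.5, 0.0, 0.0)]
existing = {frozenset((1, 2))}
reconnected = reconnect(atoms, existing)
```

With these coordinates only the Na and O pair falls inside the cutoff, so one bond is added and the existing O-C bond is kept.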
![Page 18: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/18.jpg)
Case study: DrugBank
• DrugBank (http://www.drugbank.ca/) maintained by David Wishart
• Database contains 6711 structures
• Widely regarded as a well curated, high quality dataset
DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, Djoumbou Y, Eisner R, Guo AC, Wishart DS, Nucleic Acids Res., 2011, 39, Jan, D1035-41.
![Page 19: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/19.jpg)
![Page 20: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/20.jpg)
ChemSpider Standardization
• Entire ChemSpider database will be standardized using modified FDA rule set
• Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, names) will be regenerated
• Standardization procedures automatically applied to all future depositions
![Page 21: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/21.jpg)
CVSP as a Flexible System
• There will be various rule sets
– Rigid pre-defined rules: e.g. meeting FDA specifications as written, Open PHACTS modified rule set, etc.
– Flexible user-defined rules: users upload their rules in our custom format (XML)
– The Open PHACTS rule set will be open to the community to reuse
![Page 22: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/22.jpg)
Incorporating CVSP into data processing platforms: KNIME
• The workflow includes:
– SDF reader
– Indigo nodes
– Calls to ChemSpider validation Web services
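The first stage of such a workflow, reading an SDF file and passing records on for checking, can be approximated outside KNIME with the standard library alone. The sketch below only mirrors the SDF reader step and flags records without a structure; the Indigo nodes and the ChemSpider Web service calls are not reproduced, and the function names and sample data are hypothetical. It assumes V2000 molblocks, where the fourth line is the counts line and its first three characters hold the atom count.

```python
def read_sdf(text):
    """Yield each SDF record as a list of lines (records end with '$$$$')."""
    record = []
    for line in text.splitlines():
        if line.startswith("$$$$"):
            yield record
            record = []
        else:
            record.append(line)


def atom_count(record):
    """Read the atom count from the V2000 counts line (4th line of the record)."""
    if len(record) < 4:
        return 0
    try:
        return int(record[3][:3])
    except ValueError:
        return 0


def find_empty_records(text):
    """Return the title line of every record that contains no atoms."""
    return [rec[0] for rec in read_sdf(text) if atom_count(rec) == 0]


SAMPLE_SDF = "\n".join([
    "water", "  sketch", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "$$$$",
    "empty", "  sketch", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    "$$$$",
])
flagged = find_empty_records(SAMPLE_SDF)
```

Records flagged this way correspond to the "records without a structure" error class listed earlier in the deck.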
![Page 23: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/23.jpg)
Incorporating CVSP into data processing platforms: KNIME
• A warning is returned as a result of processing
![Page 24: Overview](https://reader036.fdocuments.in/reader036/viewer/2022062310/568164a1550346895dd69348/html5/thumbnails/24.jpg)
Summary
• Will release back results of DrugBank
• Alpha version of CVSP available: http://cv.beta.rsc-us.org/Batches.aspx
• Will be a resource for the community
• Will improve ChemSpider
• Still a long way to go…