
Validation of the KnowItAll® Stereochemistry Toolkit

Ty Abshear, Gregory Banik, Ph.D., Sonali Dalvi, Michelle D’Souza, Ph.D., Keith Kunitsky, Karl Nedwed

Bio-Rad Laboratories, Inc., Informatics Division, 2000 Market Street, Suite 1460, Philadelphia, PA 19103-3212, USA

Stereochemistry 210434

Abstract

Bio-Rad’s KnowItAll® Stereochemistry Toolkit [1] (TK) provides a fast, efficient, and highly reliable tool for interpreting the implicit stereochemical information in traditionally drawn structures, which allows them to be used for indexing, searching, and other forms of processing. In addition, the Bio-Rad Stereochemistry TK generates accurate Cahn-Ingold-Prelog (CIP) stereodescriptors, as defined in the IUPAC Nomenclature of Organic Chemistry (Blue Book), for all descriptor classes (R/S, E/Z, M/P) of organic molecules. [2] For each of the following stereochemistry categories (chair/boat representations, Haworth projections, Fischer projections, and 2.5D projections), we used 150 matched pairs from the PubChem Substance Database [3] and the PubChem Compound Database [4] to validate the toolkit.

Background

Chemists draw chemical structures using various styles or conventions, some of which have been used for over a century. While chemists are trained to recognize the sometimes subtle 3D information implicit in a 2D chemical structure drawing, chemical information software often fails to recognize stereochemical information.

Recent advances such as the IUPAC International Chemical Identifier (InChI) [5] have made the problem much greater by allowing large databases to be interconnected by standardized structures, which can introduce major errors. In some cases, there may not be a way to standardize a complex chemical structure.

Many conventions, such as chair/boat representations, [6] Fischer projections, [7] Haworth projections, [8] and 2D or pseudo-3D projections, are used by chemists to represent 3D structures in 2D media. However, these different styles create difficulties for computer interpretation. For example, the InChI code only perceives stereocenters at the point of a “wedge” or “hashed wedge” bond. Unfortunately, all of the stereochemical representations mentioned above use simple single bonds. Chemists are taught to recognize and interpret the 2D information as 3D. Designing software code to recognize this information is far more difficult.

That said, Bio-Rad’s KnowItAll® 2018 software has been redesigned to accurately interpret traditional 2D representations of 3D molecules by adding the ability to understand the sometimes subtle 3D intentions in a 2D drawing. [9] The validation study in this Tech Note uses PubChem data to validate the KnowItAll Stereochemistry Toolkit, which has been created to allow the technology to be embedded in cheminformatics workflows.

Methods

The InChI has layers that encode the entire chemical structure, including atoms, bonds, stereochemistry, charges, and isotopes, and is therefore of variable length. Every InChI string can also be represented as a 27-character InChIKey: a hashed 14-letter representation of the skeletal information of the structure, followed by a dash, followed by a hashed 10-character representation of the stereochemistry, charge, and isotope layers, followed by another dash and a single character that designates the type of InChI that was generated. Comparing InChIKeys, therefore, is a fast and efficient way to determine whether compounds are identical. Furthermore, comparing only the first 14 characters of two InChIKeys is a fast and efficient way to determine whether compounds have the same skeleton, with possible differences in stereochemistry, charge, or isotopes.

This study uses data from the PubChem Database of the United States National Institutes of Health (NIH). [10] PubChem has received chemical structure contributions from over 500 data sources, so the chemical structures contributed to the PubChem Substance Database may have been drawn by hundreds or thousands of different chemists. PubChem stores each chemical structure as it was originally contributed and matches it as best it can to the appropriate PubChem Compound Database record. [11] This makes the chemical structures in PubChem a goldmine for surveying how chemists draw chemical structures. Using a maximum diversity algorithm, 150 matched pairs of PubChem Substances and PubChem Compounds were selected in each of the following categories to validate the KnowItAll Stereochemistry Toolkit: chair/boat representations, Haworth projections, Fischer projections, and 2.5D projections.
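The InChIKey layout described earlier under Methods (14 letters, a dash, 10 characters, a dash, and one final character) can be illustrated with a short Python sketch. It is not part of the toolkit or of the InChI software. The function names are hypothetical, the uppercase-only pattern is an assumption of the sketch, and the two keys in the usage lines are illustrative inputs only.

```python
import re
from typing import NamedTuple

# Layout as described under Methods: 14 letters, a dash, 10 characters,
# a dash, and 1 final character (27 characters in total). Restricting every
# block to uppercase A-Z is an assumption of this sketch.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


class InChIKeyParts(NamedTuple):
    skeleton: str  # hashed skeletal information
    details: str   # hashed stereochemistry, charge, and isotope layers
    kind: str      # single trailing character


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its three dash-separated blocks."""
    match = INCHIKEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a 27-character InChIKey: {key!r}")
    return InChIKeyParts(*match.groups())


def same_compound(key_a: str, key_b: str) -> bool:
    """True when all three blocks are identical."""
    return split_inchikey(key_a) == split_inchikey(key_b)


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True when the first 14 characters agree, whatever the other layers say."""
    return split_inchikey(key_a).skeleton == split_inchikey(key_b).skeleton


if __name__ == "__main__":
    # Illustrative inputs: two keys that share a first block but differ in the second.
    key_with_stereo = "QNAYBMKLOCPYGJ-REOHZTLVSA-N"
    key_without_stereo = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"
    print(same_compound(key_with_stereo, key_without_stereo))
    print(same_skeleton(key_with_stereo, key_without_stereo))
```

Comparing only the first block is what allows two drawings of the same skeleton to be paired even when one of them has lost its stereochemistry.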

The structures of the PubChem Substances were input directly into the KnowItAll Stereochemistry Toolkit. The toolkit identified the implicit stereochemistry from the 2D representations and added appropriate stereobonds so that the InChI code could recognize stereochemistry at stereocenters it would otherwise have missed. Once the stereobonds were added, InChIs and InChIKeys were generated using the standard InChI code. The structures of the PubChem Compounds were then given to four chemists, who added explicit stereobonds to the implicit stereocenters. The four manually assigned results were compared to create a consensus, and InChIKeys were generated from it. The InChIKeys from the chemists’ consensus and from the KnowItAll Stereochemistry Toolkit were then compared. A match was counted when the human and computer interpretations were identical, and a mismatch when they differed. Mismatches were examined manually to determine whether the chemists or the computer was correct.
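The match and mismatch counting described above can be sketched as a simple tally over pairs of InChIKeys. This is a minimal illustration, not the authors' procedure. The function names and the placeholder keys are hypothetical, and splitting mismatches by their first 14 characters is an extra step added here to show which ones would need manual review of the stereochemistry. The sketch counts per structure, whereas Table 1 reports per stereocenter.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_pair(chemist_key: str, toolkit_key: str) -> str:
    """Label one structure by comparing the consensus key with the toolkit key."""
    if chemist_key == toolkit_key:
        return "match"
    # Same first 14 characters: same skeleton, so the difference lies in the
    # stereochemistry, charge, or isotope layers.
    if chemist_key[:14] == toolkit_key[:14]:
        return "mismatch_same_skeleton"
    return "mismatch_different_skeleton"


def tally(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how many (chemist, toolkit) key pairs fall into each label."""
    return Counter(classify_pair(chemist, toolkit) for chemist, toolkit in pairs)


if __name__ == "__main__":
    # Hypothetical placeholder keys, not real compounds.
    pairs = [
        ("AAAAAAAAAAAAAA-BBBBBBBBBB-N", "AAAAAAAAAAAAAA-BBBBBBBBBB-N"),
        ("AAAAAAAAAAAAAA-BBBBBBBBBB-N", "AAAAAAAAAAAAAA-CCCCCCCCCC-N"),
        ("AAAAAAAAAAAAAA-BBBBBBBBBB-N", "DDDDDDDDDDDDDD-BBBBBBBBBB-N"),
    ]
    for label, count in sorted(tally(pairs).items()):
        print(label, count)
```

Anything labeled as a mismatch would then go to the manual examination step, since the comparison alone cannot say which interpretation is correct.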

Results

The validation data from the entire study are summarized in Table 1. The KnowItAll Stereochemistry Toolkit yields the correct stereochemical interpretation 98.5% of the time and is close to perfect in its R/S assignments, at 99.6%. Reversed R/S assignments were very rare, at 0.4%. Considering how long it takes a chemist to assign every stereocenter in every structure by hand, the toolkit is far more efficient.

The following examples from the study demonstrate the complexities of stereochemical interpretation using the KnowItAll Stereochemistry Toolkit.

In Figure 1, the implicit stereochemistry (red question mark) is not recognized by the InChI software [12] used by PubChem; therefore, the stereochemistry portion of the InChIKey (red text) is incorrect. The chemists and the KnowItAll Stereochemistry Toolkit, on the other hand, interpret the implicit stereochemistry in the PubChem Substance structure and generate the correct InChIKey.

Figure 1. Haworth projection example.

Table 1. KnowItAll Stereochemistry Toolkit validation study statistics.


Preliminary Validation Results

Number of stereocenters: 1,663
Number of matching stereocenters overall: 1,638 (98.5%)
Number of matching R/S assignments: 1,626 (99.6%)
Number of missed stereocenters: 15 (0.9%)
Number of surplus stereocenters: 3 (0.2%)
Number of stereocenters with reversed R/S assignments: 7 (0.4%)
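The percentages in Table 1 can be rechecked from the raw counts with a few lines of Python. The counts are copied from the table, but the denominators are not stated in the source, so the ones used here are assumptions: the total of 1,663 stereocenters for most rows, and matching plus reversed R/S assignments (1,626 + 7 = 1,633) for the 99.6% figure.

```python
# Counts copied from Table 1.
TOTAL = 1663          # number of stereocenters
MATCHING = 1638       # matching stereocenters overall
MATCHING_RS = 1626    # matching R/S assignments
MISSED = 15           # missed stereocenters
SURPLUS = 3           # surplus stereocenters
REVERSED_RS = 7       # reversed R/S assignments


def percent(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


if __name__ == "__main__":
    # Assumed denominator for these rows: all 1,663 stereocenters.
    print("matching overall:", percent(MATCHING, TOTAL))
    print("missed:", percent(MISSED, TOTAL))
    print("surplus:", percent(SURPLUS, TOTAL))
    print("reversed R/S:", percent(REVERSED_RS, TOTAL))
    # Assumed denominator for the R/S row: matching plus reversed assignments.
    print("matching R/S:", percent(MATCHING_RS, MATCHING_RS + REVERSED_RS))
    # The counts are also consistent with matching = total - missed - surplus - reversed.
    print(TOTAL - MISSED - SURPLUS - REVERSED_RS == MATCHING)
```

Under these assumptions the sketch reproduces the published 98.5%, 0.9%, 0.2%, 0.4%, and 99.6% values. Dividing the matching R/S count by all 1,663 stereocenters instead would give 97.8%, so the 99.6% figure appears to refer only to stereocenters that received an R/S assignment.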


In Figure 2, the PubChem Substance was erroneously drawn using three foreshortened bonds in the top-left structure, where a single bold bond should be used for the center bond. The InChI software recognizes stereochemistry only at the tip or hash of a wedge bond; therefore, one stereocenter is not recognized and the other is wrong. (See the InChIKey in red.) Both the chemists and the KnowItAll Stereochemistry Toolkit, however, were able to interpret the implicit stereochemistry directly from the PubChem Substance structure and generate the correct InChIKey.

In Figure 3, no wedge bonds are used in the Fischer projections of the PubChem Substance or the PubChem Compound, which prevents the InChI software from recognizing the implicit stereochemistry. (See the InChIKey in red.) The chemists and the KnowItAll Stereochemistry Toolkit, however, interpret the PubChem Substance Database structure and generate the correct InChIKey, as displayed in the bottom structures.

Figure 2. Haworth projection example with erroneous bond.

Figure 3. Fischer projection example.


The InChI software used by PubChem does not recognize the implicit stereochemistry of the chair representation of the PubChem Substance or of the standardized representation of the PubChem Compound in Figure 4. Both the chemists and the KnowItAll Stereochemistry Toolkit, however, recognize the implicit stereocenters and produce the correct InChIKey.

To avoid any ambiguities caused by overlapping bonds, the chemists drew the bottom-left structure in Figure 5 with no overlapping bonds. The InChI code used by PubChem fails to recognize the implicit stereochemistry of the two 2.5D projections at the top, but the chemists and the KnowItAll Stereochemistry Toolkit successfully recognize the stereochemistry in the PubChem Substance and generate the correct InChIKey.

Figure 4. Cyclohexane chair example.

Figure 5. Simple 2.5D projection example.


Figure 6 showcases the potential complexity of the implicit stereochemistry problem in complex structures, as well as the tremendous benefit gained by using the KnowItAll Stereochemistry Toolkit. Once again, to avoid any ambiguities caused by overlapping bonds, the chemists drew the bottom-left structure in Figure 6 with no overlapping bonds. While no chemist would ordinarily draw structures in this way, the example demonstrates that the chemists and the KnowItAll Stereochemistry Toolkit can accurately interpret the stereochemistry of this complex structure, and that the InChI code cannot.

Figure 7 shows an example where the toolkit and the chemists disagree. This happens when a stereocenter in a chemical structure is drawn in an ambiguous way. The chemists in this study interpreted the stereocenter circled in the figure as “R”, whereas the KnowItAll Stereochemistry Toolkit determined that the stereocenter was ambiguous. This type of “gray area” can arise when structures are not drawn precisely and unambiguously.

Conclusion

This study demonstrates that it is possible for software to accurately interpret the sometimes very subtle implicit stereochemical intent of 2D chemical structure representations in the same way that a chemist would. It also highlights the problems and dangers associated with chemical structure standardization software. The KnowItAll Stereochemistry Toolkit accurately, reliably, and efficiently interprets traditional structure drawing styles such as chair/boat representations, Fischer projections, Haworth projections, and 2.5D projections, and avoids errors caused by structure standardization.

Figure 6. Complex 2.5D projection example.

Figure 7. Ambiguous stereochemistry example.


References

[1] KnowItAll Stereochemistry Toolkit, version 1.0 [Computer Software] (2017). Bio-Rad Laboratories, Inc., Informatics Division: Philadelphia.

[2] Favre, H.A. and Powell, W.H. (2013). Nomenclature of Organic Chemistry, IUPAC Recommendations and Preferred Names 2013. (London: Royal Society of Chemistry), Ch. 9.

[3] PubChem Substance Database, National Center for Biotechnology Information of the National Institutes of Health. http://www.ncbi.nlm.nih.gov/pcsubstance, accessed August 2017.

[4] PubChem Compound Database, National Center for Biotechnology Information of the National Institutes of Health. http://www.ncbi.nlm.nih.gov/pccompound, accessed August 2017.

[5] The IUPAC International Chemical Identifier (InChI). https://iupac.org/who-weare/divisions/division-details/inchi/, accessed November 1, 2017.

[6] Sachse, H. (1890). Über die Geometrischen Isomerien der Hexamethylenderivate. Ber. Dtsch. Chem. Ges. 23: 1363–1370. doi:10.1002/cber.189002

[7] Fischer, E. (1891). Über die Configuration des Traubenzuckers und seiner Isomeren. Ber. Dtsch. Chem. Ges. 24: 1836–1845. doi:10.1002/cber.189102401311

[8] Haworth, W. N., et al. (1926). Organic Chemistry. Annu. Rep. Prog. Chem. 23: 74–185. doi:10.1039/AR9262300074

[9] Bio-Rad Laboratories, Inc. (2017). KnowItAll Stereochemistry Toolkit Product Information Sheet. http://www.knowitall.com/stereochem, accessed November 1, 2017.

[10] National Center for Biotechnology Information of the National Institutes of Health (2017). About PubChem. https://pubchemdocs.ncbi.nlm.nih.gov/about, accessed November 1, 2017.

[11] Kim, S., et al. (2016). PubChem Substance and Compound databases. Nucleic Acids Research. 44 (Database issue): D1202–D1213.

[12] InChI software, version 1.05 [Computer Software] (2017). InChI Trust: Cambridge, UK.

210434-REV20171130

Bio-Rad Laboratories, Inc.

Informatics Division Website: www.knowitall.com Contact Us: www.knowitall.com/contactus

© 2017 Bio-Rad Laboratories, Inc.