SDF file analysis: creation, composition, checking


Concerning chemical table files

• Chemical table files are files that contain information about chemicals

• Various formats: RGfiles, Rxnfiles, RDfiles, XDfiles, Clipboard, Molfile, SDF

MDL Molfile

• A file format for holding information about the atoms, bonds, connectivity and coordinates of a molecule

• Most cheminformatics programs and some computational chemistry packages are able to read it

• Standard version: V2000

• Contains a header and a connection table

MDL Molfile content

Generated by Molgen 5.0

11 9 0 0 0 0
-0.0666 -1.5989 0.0514 C 0 0 0 0 0 0 0 0 0 0 0 0 0
1.2913 -1.6184 -0.1221 C 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.9621 -1.2620 -0.9586 O 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.0783 1.8974 -0.4702 O 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.4844 1.6346 0.9333 O 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.5244 -1.8601 1.0528 H 0 0 0 0 0 0 0 0 0 0 0 0 0
1.7535 -1.3543 -1.1238 H 0 0 0 0 0 0 0 0 0 0 0 0 0
1.9833 -1.8974 0.7324 H 0 0 0 0 0 0 0 0 0 0 0 0 0
-1.9833 -1.2177 -0.8648 H 0 0 0 0 0 0 0 0 0 0 0 0 0
0.8090 1.5332 -0.8167 H 0 0 0 0 0 0 0 0 0 0 0 0 0
-1.3677 1.1615 1.1238 H 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
1 3 1 0 0 0 0
1 6 1 0 0 0 0
2 7 1 0 0 0 0
2 8 1 0 0 0 0
3 9 1 0 0 0 0
4 5 1 0 0 0 0
4 10 1 0 0 0 0
5 11 1 0 0 0 0
M END
$$$$

1-3 Header

1 Molecule name

2 User/Program/Date/etc. information

3 Comment (blank)

4-25 Connection table (Ctab)

4 Counts line: 11 atoms, 9 bonds, ..., V2000 standard

5-15 Atom block (1 line for each atom): x, y, z, element, etc.

16-24 Bond block (1 line for each bond): 1st atom, 2nd atom, type, etc.

25 M END

26 $$$$ Record delimiter (only for SDF)
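The line layout above can be followed mechanically. The sketch below is a minimal illustration, not a full reader: it splits each line on whitespace, whereas real V2000 files use fixed-width columns, so fields that run together would break it. The function name parse_molfile is hypothetical, and placing "Generated by Molgen 5.0" on header line 2 is an assumption about the original layout.

```python
from collections import Counter

# Molfile from the slide; header line placement is an assumption.
MOLFILE = """\

Generated by Molgen 5.0

11 9 0 0 0 0
-0.0666 -1.5989 0.0514 C 0 0 0 0 0 0 0 0 0 0 0 0 0
1.2913 -1.6184 -0.1221 C 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.9621 -1.2620 -0.9586 O 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.0783 1.8974 -0.4702 O 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.4844 1.6346 0.9333 O 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.5244 -1.8601 1.0528 H 0 0 0 0 0 0 0 0 0 0 0 0 0
1.7535 -1.3543 -1.1238 H 0 0 0 0 0 0 0 0 0 0 0 0 0
1.9833 -1.8974 0.7324 H 0 0 0 0 0 0 0 0 0 0 0 0 0
-1.9833 -1.2177 -0.8648 H 0 0 0 0 0 0 0 0 0 0 0 0 0
0.8090 1.5332 -0.8167 H 0 0 0 0 0 0 0 0 0 0 0 0 0
-1.3677 1.1615 1.1238 H 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
1 3 1 0 0 0 0
1 6 1 0 0 0 0
2 7 1 0 0 0 0
2 8 1 0 0 0 0
3 9 1 0 0 0 0
4 5 1 0 0 0 0
4 10 1 0 0 0 0
5 11 1 0 0 0 0
M END
"""


def parse_molfile(text):
    """Return (header, atoms, bonds) from a whitespace-separated molfile."""
    lines = text.splitlines()
    header = lines[0:3]                      # lines 1-3: header
    counts = lines[3].split()                # line 4: counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []                               # atom block: x, y, z, element
    for line in lines[4:4 + n_atoms]:
        x, y, z, element = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))

    bonds = []                               # bond block: atom 1, atom 2, type
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    return header, atoms, bonds


header, atoms, bonds = parse_molfile(MOLFILE)
formula = Counter(element for element, *_ in atoms)
print(len(atoms), len(bonds), dict(formula))
```

Counting the element symbols gives a quick consistency check against the expected composition (two C, three O, six H).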

MDL SDF file

• SDF = structure-data file

• Wraps the molfile format

SDF content §1 – molecular information

./MinCheck/C2_H6_N0_O3_F0_S0_1.log
 OpenBabel04161413273D
Gaussian 09 # G3MP2B3 Opt(Cartesian,Tight,CalcAll,MaxStep=1,MaxCycles=300) QCISD
11 9 0 0 0 0 0 0 0 0999 V2000
0.4466 -1.5390 0.0292 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4790 -2.1676 -0.5273 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.2693 -0.5704 -0.6322 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.3941 2.0659 0.3307 O 0 0 0 0 0 0 0 0 0 0 0 0
-1.5836 1.3451 0.7668 O 0 0 0 0 0 0 0 0 0 0 0 0
0.1141 -1.7508 1.0446 H 0 0 0 0 0 0 0 0 0 0 0 0
1.7979 -1.9482 -1.5413 H 0 0 0 0 0 0 0 0 0 0 0 0
2.0238 -2.9170 0.0345 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.0239 -0.2837 -0.0806 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0506 1.3459 -0.1697 H 0 0 0 0 0 0 0 0 0 0 0 0
-2.2708 1.8377 0.2828 H 0 0 0 0 0 0 0 0 0 0 0 0
1 6 1 0 0 0 0
2 1 2 0 0 0 0
2 8 1 0 0 0 0
3 9 1 0 0 0 0
3 1 1 0 0 0 0
4 5 1 0 0 0 0
7 2 1 0 0 0 0
10 4 1 0 0 0 0
11 5 1 0 0 0 0
M END

1-3 Header

1 Filename

2 User/Program/Date/etc. information

3 Command

4-25 Connection table (Ctab)

4 Counts line: 11 atoms, 9 bonds, ..., V2000 standard

5-15 Atom block (1 line for each atom): x, y, z, element, etc.

16-24 Bond block (1 line for each bond): 1st atom, 2nd atom, type, etc.

25 M END

SDF content §2 – input and calculated parameters

> <Scale factor> 0.96

> <Stoichiometry> C2H6O3

> <Charge> 0

> <Multiplicity> 1

> <Molecular mass> 78.03169

> <DegreeOfFreedom> 27

> <Permanent dipole moment(B3LYP, Debye)> 1.475

> <ABC(cm-1)> 14.133 1.731 1.655

> <Scaled freq(cm-1)> 49.1 59.1 80.1 182.8 222.6 335.5 460.0 529.6 663.0 762.0 812.3 911.3 928.1 944.3 1124.8 1287.3 1299.6 1321.8 1403.2 1483.7 1689.2 3041.9 3064.2 3147.0 3408.9 3472.7 3557.0

> <IR intensities(rel.)> 4.5 3.8 6.6 7.8 25.1 93.3 16.9 79.8 60.8 214.2 73.0 2.9 55.0 16.5 33.8 210.3 56.9 126.8 4.4 22.8 90.0 19.2 0.4 8.3 59.4 559.4 26.8

> <Temp(K)> 298.150

> <Pressure(atm)> 1.00000

> <DfHg_G3MP2B3(kJ/mol)> -269.7

> <Scaled S(J/molK)> 363.4

> <UNScaled CV(J/molK)> 98.9
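Data items like these can be collected into a dictionary and cross-checked. The sketch below assumes the usual SDF layout (a "> <name>" header line, the value on the following line or lines, then a blank line), which is how the fields would appear before being flattened onto single lines here. The function name parse_data_items is hypothetical. The check relies on the fact that a nonlinear molecule with N atoms has 3N-6 vibrational modes; reading DegreeOfFreedom as that number is an assumption.

```python
# Three of the fields shown above, written in the usual SDF layout (assumed).
SDF_DATA = """\
> <Stoichiometry>
C2H6O3

> <DegreeOfFreedom>
27

> <Scaled freq(cm-1)>
49.1 59.1 80.1 182.8 222.6 335.5 460.0 529.6 663.0 762.0 812.3 911.3 928.1 944.3 1124.8 1287.3 1299.6 1321.8 1403.2 1483.7 1689.2 3041.9 3064.2 3147.0 3408.9 3472.7 3557.0

$$$$
"""


def parse_data_items(text):
    """Collect '> <name>' data items of one SDF record into a dict."""
    fields = {}
    name = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "$$$$":               # end of the record
            break
        if stripped.startswith(">"):         # data header line
            name = stripped[stripped.index("<") + 1:stripped.rindex(">")]
            fields[name] = []
        elif not stripped:                   # blank line closes the item
            name = None
        elif name is not None:               # value line
            fields[name].append(stripped)
    return {key: " ".join(value) for key, value in fields.items()}


fields = parse_data_items(SDF_DATA)
freqs = [float(v) for v in fields["Scaled freq(cm-1)"].split()]
n_atoms = 11                                 # from the counts line
print(len(freqs), int(fields["DegreeOfFreedom"]), 3 * n_atoms - 6)
```

All three printed numbers should agree; a mismatch would point to a truncated frequency list or a wrong atom count.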


SDF content §3 – molecular descriptors

> <MPD> 2;1-1-2;1-1-9;1-1-13;2-3-13; 2;1-1-2;1-2-13;2-1-9;2-1-13; 9;1-1-2;1-1-13;2-1-2;2-1-13; 8;1-1-8;1-1-13;2-1-13; 8;1-1-8;1-1-13;2-1-13; 13;1-1-2;2-1-2;2-1-9; 13;1-1-2;2-1-2;2-1-13; 13;1-1-2;2-1-2;2-1-13; 13;1-1-9;2-1-2; 13;1-1-8;2-1-8; 13;1-1-8;2-1-8;

> <MNA> -C(-H(-C)-C(-H-H-C)-O(-H-C))-C(-H(-C)-H(-C)-C(-H-C-O))-O(-H(-O)-C(-H-C-O))-O(-H(-O)-O(-H-O))-O(-H(-O)-O(-H-O))-H(-C(-H-C-O))-H(-C(-H-H-C))-H(-C(-H-H-C))-H(-O(-H-C))-H(-O(-H-O))-H(-O(-H-O))

> <SMI> C(=C)O.OO

> <MolRT> 3

> <InChi> InChI=1S/C2H4O.H2O2/c1-2-3;1-2/h2-3H,1H2;1-2H

> <InChiKey> JJZZTHKXWWHOAE-UHFFFAOYSA-N

> <MCDL> CH;CHH;3OH[2,3;;;5]

$$$$


Molecular fragment schemes

• Developed in the 1950s

• Screens (structural keys, fingerprints) were developed in the 1970s

• Generally they are long strings that can be stored efficiently in compressed form

• Important role:

in providing efficient substructure searching capabilities in large chemical databases,

in similarity searching, in clustering large data sets, in assessing chemical diversity, in conducting SAR and QSAR studies

Images of the optimized structure (depicted differently)

[Figure panels: GaussView, ChemDraw, www.chemicalize.org (searched by InChI)]

MPD (MOLPRINT 2D)

• MPD = Molecular Populational Dynamics

• A molecular similarity searching technique based on atom environments

• Atom environments are count vectors of heavy atoms present at a topological distance from each heavy atom of a molecule

> <MPD> 2;1-1-2;1-1-9;1-1-13;2-3-13; 2;1-1-2;1-2-13;2-1-9;2-1-13; 9;1-1-2;1-1-13;2-1-2;2-1-13; 8;1-1-8;1-1-13;2-1-13; 8;1-1-8;1-1-13;2-1-13; 13;1-1-2;2-1-2;2-1-9; 13;1-1-2;2-1-2;2-1-13; 13;1-1-2;2-1-2;2-1-13; 13;1-1-9;2-1-2; 13;1-1-8;2-1-8; 13;1-1-8;2-1-8;
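The idea of an atom environment can be sketched with a breadth-first walk over the bond block. The example below uses the bond list of the SDF record shown earlier and counts the atoms found at topological distances 1 and 2 from a chosen atom. Two simplifications are assumptions of this sketch: element symbols stand in for the numeric atom types of the MPD field, and hydrogens are counted as well. The function names are hypothetical.

```python
from collections import Counter, deque

# Elements and bonds taken from the SDF record above (atoms numbered from 1).
ELEMENTS = ["C", "C", "O", "O", "O", "H", "H", "H", "H", "H", "H"]
BONDS = [(1, 6), (2, 1), (2, 8), (3, 9), (3, 1),
         (4, 5), (7, 2), (10, 4), (11, 5)]


def build_adjacency(bonds, n_atoms):
    """Map each atom number to the list of atoms bonded to it."""
    adjacency = {i: [] for i in range(1, n_atoms + 1)}
    for first, second in bonds:
        adjacency[first].append(second)
        adjacency[second].append(first)
    return adjacency


def atom_environment(start, adjacency, elements, max_distance=2):
    """Count (distance, element) pairs reachable from `start` by BFS."""
    distance = {start: 0}
    queue = deque([start])
    environment = Counter()
    while queue:
        atom = queue.popleft()
        if distance[atom] == max_distance:
            continue                          # do not expand beyond the limit
        for neighbour in adjacency[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                environment[(distance[neighbour], elements[neighbour - 1])] += 1
                queue.append(neighbour)
    return environment


adjacency = build_adjacency(BONDS, len(ELEMENTS))
env_atom1 = atom_environment(1, adjacency, ELEMENTS)
for (dist, element), count in sorted(env_atom1.items()):
    print(f"distance {dist}: {count} x {element}")
```

For atom 1 this yields one C, one O and one H at distance 1 and three H at distance 2. If each entry of the MPD field is read as distance-count-type (an interpretation, not something the slide states), these counts agree with its first entry, 1-1-2;1-1-9;1-1-13;2-3-13.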

MNA

• MNA = Multilevel Neighbourhood of Atoms

• 2D molecular fragments suitable for use in QSAR modelling

• Output: a complete descriptor fingerprint per molecule

• Fragment: starting at the origin, each atom is appended to the descriptor immediately followed by a parenthesized list of its neighbours

> <MNA> -C(-H(-C)-C(-H-H-C)-O(-H-C))-C(-H(-C)-H(-C)-C(-H-C-O))-O(-H(-O)-C(-H-C-O))-O(-H(-O)-O(-H-O))-O(-H(-O)-O(-H-O))-H(-C(-H-C-O))-H(-C(-H-H-C))-H(-C(-H-H-C))-H(-O(-H-C))-H(-O(-H-O))-H(-O(-H-O))
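The construction rule can be tried on the example molecule. The sketch below is a simplified illustration, not the full MNA algorithm: it writes each atom followed by its neighbours, and each neighbour followed by its own neighbours, with neighbours ordered by atomic number. That ordering is inferred from the string above and is an assumption, as are the function names; ring markers and other refinements of the real format are ignored.

```python
# Elements and bonds taken from the SDF record above (atoms numbered from 1).
ELEMENTS = ["C", "C", "O", "O", "O", "H", "H", "H", "H", "H", "H"]
BONDS = [(1, 6), (2, 1), (2, 8), (3, 9), (3, 1),
         (4, 5), (7, 2), (10, 4), (11, 5)]
ATOMIC_NUMBER = {"H": 1, "C": 6, "O": 8}


def mna_sketch(elements, bonds):
    """Build a two-level neighbourhood string for every atom."""
    adjacency = {i: [] for i in range(1, len(elements) + 1)}
    for first, second in bonds:
        adjacency[first].append(second)
        adjacency[second].append(first)

    def symbol(atom):
        return elements[atom - 1]

    def rank(atom):
        return ATOMIC_NUMBER[symbol(atom)]

    def first_level(atom):
        # The atom followed by the bare symbols of its neighbours.
        neighbours = sorted(adjacency[atom], key=rank)
        return "-" + symbol(atom) + "(" + "".join(
            "-" + symbol(n) for n in neighbours) + ")"

    def second_level(atom):
        # The atom followed by the first-level strings of its neighbours.
        neighbours = sorted(adjacency[atom],
                            key=lambda n: (rank(n), first_level(n)))
        return "-" + symbol(atom) + "(" + "".join(
            first_level(n) for n in neighbours) + ")"

    return "".join(second_level(atom) for atom in range(1, len(elements) + 1))


mna = mna_sketch(ELEMENTS, BONDS)
print(mna)
```

For this small acyclic example the output can be compared character by character with the MNA field above.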

SMILES (SMI)

• SMILES = Simplified Molecular Input Line Entry Specification

• A linear text format which can describe the connectivity and chirality of a molecule

• Specifically represents a valence model of a molecule, not a computer data structure, a mathematical abstraction, or an "actual substance"

> <SMI> C(=C)O.OO
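In SMILES a dot separates disconnected components, so this record describes two molecules. The short sketch below splits the string and counts heavy atoms per component. The counting shortcut (every letter is one atom) is an assumption that only holds for strings like this one, with single-letter element symbols and no bracket atoms.

```python
smiles = "C(=C)O.OO"

# A '.' in SMILES marks a break between disconnected components.
components = smiles.split(".")

# Crude heavy-atom count: valid only for single-letter symbols, no brackets.
heavy_atoms = [sum(ch.isalpha() for ch in part) for part in components]

for part, count in zip(components, heavy_atoms):
    print(part, count)
```

The total of five heavy atoms matches the two carbons and three oxygens in the atom block.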

MolRT

(easter egg, it’s molarity…)

InChI

• InChI = International Chemical Identifier

• A reliable computerized method to represent identities

• A representation of the chemical structure with details

• Simple, but unique identifier for molecules (like a barcode)

• Different layers separated with delimiters (/)

Main layer, Charge layer, Stereochemical layer, Isotopic layer, Fixed-H layer, Reconnected layer

> <InChi> InChI=1S/C2H4O.H2O2/c1-2-3;1-2/h2-3H,1H2;1-2H
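Because the layers are separated by "/", the string can be taken apart with plain string operations. The sketch below is a minimal illustration: the function name split_inchi is hypothetical, and keying each layer by its first letter is a simplification that would overwrite entries if a prefix letter occurred more than once.

```python
inchi = "InChI=1S/C2H4O.H2O2/c1-2-3;1-2/h2-3H,1H2;1-2H"


def split_inchi(identifier):
    """Split an InChI string into its '/'-separated layers."""
    _, _, body = identifier.partition("=")    # drop the 'InChI' prefix
    parts = body.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]            # e.g. 'c' -> connectivity
    return layers


layers = split_inchi(inchi)
for name, value in layers.items():
    print(f"{name}: {value}")
```

Here the formula layer already shows the two components (C2H4O and H2O2), and the "c" and "h" sublayers list connectivity and hydrogens for each component, separated by ";".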


InChIKey

• A shortened form of the InChI code that is easier to use in web searches

• Its length is fixed at 27 characters

• The first 14 characters represent the molecular skeleton/connectivity matrix

• The next block contains 8+1 characters:

the first 8 characters encode stereochemistry and isotopic substitution information

the +1 character defines the kind of InChIKey (S = standard, N = non-standard)

• Next character: the InChI version used

• Final character: protonation indicator

> <InChiKey> JJZZTHKXWWHOAE-UHFFFAOYSA-N
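The block structure described above can be checked on the example key. The sketch below simply slices the string at the positions listed in the bullets; the function name and the dictionary keys are hypothetical labels chosen for illustration.

```python
inchikey = "JJZZTHKXWWHOAE-UHFFFAOYSA-N"


def split_inchikey(key):
    """Slice a 27-character InChIKey into its blocks."""
    if len(key) != 27:
        raise ValueError("an InChIKey has exactly 27 characters")
    first, second, protonation = key.split("-")
    return {
        "connectivity": first,            # 14 characters: skeleton/connectivity
        "other_layers": second[:8],       # 8 characters: stereo, isotopes
        "kind": second[8],                # S = standard, N = non-standard
        "version": second[9],             # InChI version character
        "protonation": protonation,       # final character
    }


blocks = split_inchikey(inchikey)
for name, value in blocks.items():
    print(f"{name}: {value}")
```

The lengths add up as 14 + 1 + 8 + 1 + 1 + 1 + 1 = 27, counting the two hyphens.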

MCDL

• MCDL = Modular Chemical Descriptor Language; first published in 2001

• Developed for linear representation of structural and other chemical information for chemical databases

• Similar to InChI: both languages are modular; constitution, connectivity, and stereochemistry are represented by individual "modules"

• MCDL provides direct placement of hydrogen atoms, whereas InChI uses a separate block

> <MCDL> CH;CHH;3OH[2,3;;;5]

Other useful links and references

• Todeschini, Roberto / Consonni, Viviana: Molecular Descriptors for Chemoinformatics, 2nd, revised and enlarged edition, 2009. ISBN 978-3-527-31852-0, Wiley-VCH, Weinheim

• Bender A, Mussa HY, Glen RC, Reiling S.: Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): evaluation of performance, J Chem Inf Comput Sci. 2004 Sep-Oct; 44(5):1708-18.

• Gakh AA, Burnett MN.: Modular Chemical Descriptor Language (MCDL): composition, connectivity, and supplementary modules, J Chem Inf Comput Sci. 2001 Nov-Dec; 41(6):1494-9.

• http://arxiv.org/ftp/arxiv/papers/1311/1311.3723.pdf

• http://openbabel.org/wiki/Multilevel_Neighborhoods_of_Atoms

• http://openbabel.org/wiki/SMILES

• http://www.daylight.com/meetings/summerschool98/course/dave/smiles-intro.html

• http://www.inchi-trust.org/ (and references therein)

• http://www.iupac.org/home/publications/e-resources/inchi/download.html (and references therein)

• http://www.chemspider.com/inchi-resolver/

Your objectives for today

• To check your .sdf file for two chosen isomers

• To collect all the codes

• To compare them with each other and find differences

Thank you for your attention!