
Useful Information

• The web address for these lectures is

http://www-jmg.ch.cam.ac.uk/cil/partii/ (on front of

handout)

• Assessment Exercises are also at this address. They

will be marked out of ten. Your (hard copy) answers

should be submitted to Dr Rafel Cabot Mesquida*,

Chief Teaching Technician, ( [email protected] )

• Glen exercises due: Feb 15th 2020

• Lectures and handout available on Moodle

*a metal tray in the office G12 labelled “Part II Cheminformatics”


Molecular Informatics

1 molecules and computers

An Introduction to Chemoinformatics, Andrew R.

Leach, Valerie J. Gillet, Springer 2007.

Chemoinformatics: Basic Concepts and Methods, Johann Gasteiger and Thomas

Engel, Wiley-VCH 2003. (2018)

Handbook of Chemoinformatics: From Data to Knowledge, Johann Gasteiger,

Wiley-VCH 2003. (2018)

Chemoinformatics: An Approach to Virtual Screening

By Alexandre Varnek , Alex Tropsha, RSC Publishing

Bunin, Barry A. Chemoinformatics: Theory, Practice, and Products. Dordrecht:

Springer, 2007


Drug Metabolism Prediction: Ed. R Mannhold, H Kubinyi, G Folkers, Ed.

Johannes Kirchmair, Methods and Principles in Medicinal Chemistry, Vol 63,

Pub. Wiley-VCH.

Sources:- textbooks/online you may wish to consider if you want

to take the subject further

Journals of Molecular/Cheminformatics you may wish

to follow up on

Journal of Chemical Information and Modeling

Journal of Chemical Theory and Computation

Journal of Cheminformatics

Journal of Computer-Aided Molecular Design

Journal of Molecular Graphics & Modeling

Journal of Computational Chemistry

Journal of Medicinal Chemistry

Reviews in Computational Chemistry

Drug Discovery Today

BMC Bioinformatics

Nature Reviews Drug Discovery

Expert Opinion on Drug Discovery

WIREs Computational Molecular Science

ArXiv….read this first

http://pubs.acs.org/journals/jcisd8/index.html

http://pubs.acs.org/journals/jctcce/

http://jcheminf.com/

http://www.springer.com/chemistry/physical+chemistry/journal/10822

http://www.elsevier.com/wps/find/journaldescription.cws_home/525012/description

https://onlinelibrary.wiley.com/journal/1096987x

http://pubs.acs.org/journals/jmcmar/

http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471215767.html

http://www.elsevier.com/wps/find/journaldescription.cws_home/30921/description

http://www.biomedcentral.com/bmcbioinformatics

http://www.nature.com/nrd/index.html

http://www.informahealthcare.com/loi/edc

http://wires.wiley.com/WileyCDA/WiresJournal/wisId-WCMS.html

https://arxiv.org/search/?query=chemistry&searchtype=all&source=header

Molecular Informatics

Includes all aspects of the study of molecules on computers.

Also includes Chem(o)informatics

This includes the representation of molecules, databases, display,

simulation, prediction of their properties and the discovery and

design of new molecules and materials.

Molecular informatics is closely related to bioinformatics,

computational chemistry, molecular modelling, simulation, machine

learning and statistics, as well as online publications - but the area

has principally been driven by investment in new methods for drug

discovery, hence the concentration on small organic molecules.

Cambridge HPC

Places to find Molecular

Informatics apps

• https://www.macinchem.org/

• Molecules (e.g. RSC ChemSpider)

• Publishers (e.g. ACS/RSC mobile)

• Calculations (e.g. Yield101 for Rxns)

• Visualisation (e.g. Pymol for proteins)

J. Chem. Educ., 2013, 90 (3), pp 320–325

DOI: 10.1021/ed300329e


Cheminformatics 101:

How do we store molecules on the computer

There are estimated to be 10^60 possible small molecules that

could be made. How do we find the best molecule for the

problem we are addressing? Let’s take a look “under the

bonnet” of the way molecules are actually manipulated on the

computer. You will be familiar with:

1. Trivial name: e.g. Morphine

2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-

methylmorphinan-3,6-diol

However, these names do not convey the structure of molecules

in a way the computer can readily understand. We need to

convert these into “machine readable formats” which allows

ease of searching based on the complexities of molecular

structure. But what is a molecule?

Bear this in mind. Molecules are complicated. When we look at this scene, we add a

huge amount of information from our senses and knowledge – but it nearly all gets

lost in computational representation !

Representing chemistry needs to be engineered to represent materials and processes.

As you will see, we are moving in that direction with more complete representations

of molecules and materials.

Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol !!

A real life mixture

What is a molecule ?

is it a series of connected points ?

a wave function

the sum of its properties ?

In the computer, molecules are therefore abstractions and interpretations of data.

So more experimental data and an appropriate description of a molecule may

translate to a wider reality.

In some recent applications using Deep Neural Networks, a graph theoretical

approach is being used. Molecular graph convolutions: moving beyond fingerprints. J Comput Aided Mol Des. 2016, 30(8), 595-608

Storing molecules: different methods for different purposes

Methods for storing molecules can conveniently be broken down

into

1-Dimensional (simple and very compact and fast to access)

2-Dimensional (contains the chemical diagram)

3-Dimensional approaches (the shapes of molecules)

1D Line notations (a string of characters from a keyboard)

2D Molecular Diagrams (Graphs)

3D Graphs plus X,Y,Z coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the

notes).

1-Dimensional - Line notations – a string representation of

molecules- here are three examples of different line notations.

SMILES is the most useful and widely used. These are

‘strings’ all of the same molecule.

• Line notations

– WLN: L66J BMR& DSWQ IN1&1

– ROSDAL: 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24

– SMILES: c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3

– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice – all these notations use just the characters on a standard

typewriter keyboard.
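Because a SMILES is just a string, even a few lines of code can pull information out of it. The sketch below counts the heavy (non-hydrogen) atoms in a SMILES with a regular expression. It is a minimal illustration, not a SMILES parser: the function name `heavy_atom_counts` is made up for this example, and the pattern only covers bracket atoms, the organic subset and aromatic lower-case atoms.

```python
import re
from collections import Counter

# One atom token: a bracket atom such as [C@@H], a two-letter halogen,
# an organic-subset atom, or an aromatic (lower-case) atom.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Element symbol inside a bracket atom, skipping any leading isotope number.
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element (illustrative, not a full parser)."""
    counts = Counter()
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            match = BRACKET_ELEMENT.match(token)
            if match is None:  # e.g. a wildcard atom [*]
                continue
            element = match.group(1).capitalize()
        else:
            element = token.capitalize()
        if element != "H":
            counts[element] += 1
    return counts


dye = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
print(heavy_atom_counts(dye))
```

Ring-closure digits, bond symbols and brackets are simply skipped, which is why this works for counting but would not be enough to rebuild the molecular graph.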

O=C2O[C@@H]1O[C@@](C)(OO4)CC[C@](C(C)CC3

)([H])[C@@]41[C@@]3([H])[C@H]2C

This is an example of a ‘molecule’ and is what is actually stored in the

computer – does it make sense to you ?


It’s artemisinin, which has anti-malarial properties (Nobel Prize)

This language is called SMILES.


A SMILES tutorial is available at :

http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practice by drawing a structure and a smiles will be available

by picking the smiley face.

http://www.molinspiration.com/cgi-bin/properties?textMode=1


InChI

• A more recent line notation is called InChI.

• It aims to address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.

• InChI is generated using computer algorithms and is virtually uninterpretable by humans.

• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.

• Websites are available that can generate InChI, or convert from InChI to structure, from different names and formats.

Again, a string like this is easily

matched on a computer.

RSC ChemSpider

https://iupac.org/who-we-are/divisions/division-details/inchi/

http://www.chemspider.com/

Storing chemical diagrams on computers

• The valence model of a molecule can be represented by a

chemical graph. A simple graph contains nodes (atoms) and edges

(bonds) joining pairs of nodes.

• The spatial position of the nodes, length of the edges and

crossings are irrelevant. Generally, we ignore hydrogens unless

tautomerism or pKa is an issue. Computers handle graphs very

well and molecules represented like this are examples of labelled

graphs (the atoms have names e.g. Oxygen)

• Chemical structures are of course more complex than this and

aromaticity, stereochemistry, tautomerism, non-stoichiometric

compounds etc. are often problematic. The computer would

(using a simple graph) deduce these two canonical structures are

different molecules!

• E.g. to solve this, we could introduce the concept of an aromatic

bond
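The point about canonical structures can be made concrete with a small labelled graph. The sketch below stores o-xylene twice, once for each Kekulé drawing, as a set of labelled edges; the atom numbering (ring atoms 1 to 6, methyl carbons 7 and 8) and the helper names are assumptions made for this illustration. Compared naively the two graphs differ, but after relabelling the alternating ring bonds with a single "ar" label they become identical. The ring test used here is a toy check for one six-membered ring, not a general aromaticity perception.

```python
from collections import Counter


def labelled_graph(bonds):
    """Store a molecule as a set of (atom pair, bond label) edges."""
    return {(frozenset(pair), order) for pair, order in bonds}


RING = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
METHYLS = [((1, 7), 1), ((2, 8), 1)]  # methyl carbons on ring atoms 1 and 2

# The two Kekulé drawings: double bonds on alternate ring edges.
kekule_a = labelled_graph(list(zip(RING, [2, 1, 2, 1, 2, 1])) + METHYLS)
kekule_b = labelled_graph(list(zip(RING, [1, 2, 1, 2, 1, 2])) + METHYLS)


def with_aromatic_bonds(graph, ring_atoms):
    """Relabel ring bonds as 'ar' if every ring atom has exactly one double ring bond."""
    ring_atoms = set(ring_atoms)
    ring_edges = {(pair, order) for pair, order in graph if pair <= ring_atoms}
    doubles = Counter(atom for pair, order in ring_edges if order == 2 for atom in pair)
    if any(doubles[atom] != 1 for atom in ring_atoms):
        return set(graph)  # not an alternating ring: leave the graph unchanged
    relabelled = {(pair, "ar") for pair, _ in ring_edges}
    return (set(graph) - ring_edges) | relabelled


print(kekule_a == kekule_b)  # the simple graphs look like different molecules
print(with_aromatic_bonds(kekule_a, range(1, 7)) == with_aromatic_bonds(kekule_b, range(1, 7)))
```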

An example of storing the molecular

graph: SD file format

MDL SD (structure data) files store the information about each molecule in the following format:

Header Block

describes the molecule, e.g. its name.

Connection table

defines the molecular structure (atoms and bonds).

Data block (optional)

Properties e.g. volume

Terminator line

a line containing four dollar signs ($$$$),

indicating the end of information on this molecule.

This is probably the most common format to store small molecules – SD files in

software are widely used to store many molecule structures in databases.

A “format” in computer science is a precisely described order of data. The program

reading it expects the information in exactly the right place (or it screws up!).

The Connection Table – describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format

Connect 3 2 5 4

e.g. SD file format (reduced)

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

Why store molecules in 2D?

• Quite often, we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search

• It is often the case that we don’t

know the conformation (shape)

of a molecule – so storing it in

3D would be pointless. Look at

the changes in conformation in

this molecule at room

temperature.

Example SD file - benzene

benzene
ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Annotations for the SD file, in order:

Line 1: molecule name

Line 2: information on this molecule

Line 3: comment (e.g. date is used here)

“Counts line”: has the number of atoms and bonds as a minimum

“Atom block”: has x,y,z coordinates of the atom and element as a minimum

“Bond block”: one line for each bond (from atom, to atom, bond type)

“M END” and “$$$$”: these identify the end of this molecule.
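To see why the "exactly the right place" rule matters, here is a minimal sketch of a reader for the benzene record. It assumes the fixed-width V2000 layout (three-character count fields, ten-character coordinate fields, element symbol starting in column 32); the function name `parse_molblock` is made up for this example, and error handling, the data block and multi-record files are left out.

```python
import math

MOLBLOCK = """\
benzene
ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molblock(text):
    """Return (atoms, bonds) from one V2000 record, reading fixed-width columns."""
    lines = text.splitlines()
    counts = lines[3]  # three header lines come first
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z = (float(line[start:start + 10]) for start in (0, 10, 20))
        atoms.append((line[31:34].strip(), x, y, z))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # from-atom, to-atom, bond type (1 = single, 2 = double)
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


atoms, bonds = parse_molblock(MOLBLOCK)
first, second = atoms[0], atoms[1]
print(len(atoms), len(bonds), round(math.dist(first[1:], second[1:]), 2))
```

Note that the coordinates here are drawing coordinates for a 2D diagram, so the distance printed is the length of a line on the page rather than a measured bond length.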

Alanine SD file

Bond length is 1.53 Å

x,y,z symbol, mass diff, charge, stereo, h-count….

Charge: 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3

Molecule is chiral
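The charge field is a code, not the charge itself, so a reader has to translate it. A minimal sketch of that lookup, using only the mapping listed above (the names `CHARGE_CODES` and `formal_charge` are invented for this example):

```python
# Atom-block charge code -> formal charge, as listed in the annotation.
# Code 4 marks a doublet radical, which carries no formal charge.
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 4: 0, 5: -1, 6: -2, 7: -3}


def formal_charge(code: int) -> int:
    """Translate an atom-block charge code into a formal charge."""
    return CHARGE_CODES[code]


def is_doublet_radical(code: int) -> bool:
    return code == 4


print(formal_charge(3), formal_charge(5), is_doublet_radical(4))
```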

What if we only know the atom positions and not the bonds?

A key example would be x-ray crystallography. Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge.

The Protein Data Bank (PDB) contains thousands of protein structures. Here

is an electron density map from an x-ray experiment. We see the electrons

but not the nuclei. In a small molecule this is very accurate and we can

almost see the bonds. However, in e.g. a protein structure, which is very

large, we can’t always determine the atom positions exactly so these are

stored and the bonds imputed. The format for storing these is a bit different.

Which conformation should this ring have?

https://www.rcsb.org/

Example protein databank file (pdb)

(CONECT records list each atom’s bonded neighbours, i.e. an adjacency list)

HEADER OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A

TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684

EXPDTA X-RAY DIFFRACTION

REMARK 2 RESOLUTION. 1.80 ANGSTROMS.

REMARK 200 TEMPERATURE (KELVIN) : 113

REMARK 200 PH : 6.9

SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN

SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO

CRYST1 54.588 55.106 64.827 90.00 90.00 90.00 P 21 21 21 4

ORIGX1 1.000000 0.000000 0.000000 0.00000

ORIGX2 0.000000 1.000000 0.000000 0.00000

ORIGX3 0.000000 0.000000 1.000000 0.00000

SCALE1 0.018319 0.000000 0.000000 0.00000

SCALE2 0.000000 0.018147 0.000000 0.00000

SCALE3 0.000000 0.000000 0.015426 0.00000

ATOM 1 N VAL A 1 3.036 -1.035 -3.538 1.00 31.28

ATOM 2 CA VAL A 1 4.283 -0.343 -3.015 1.00 28.41

ATOM 3 C VAL A 1 4.565 1.067 -3.593 1.00 27.64

ATOM 4 O VAL A 1 4.653 1.287 -4.853 1.00 27.17

ATOM 5 CB VAL A 1 5.510 -1.245 -3.141 1.00 30.01

.

.

HETATM 1510 S SO4 A 187 25.993 -1.362 5.893 1.00 22.34

HETATM 1511 O1 SO4 A 187 25.504 -1.159 4.508 1.00 22.13

HETATM 1512 O2 SO4 A 187 27.282 -0.770 6.155 1.00 24.10

HETATM 1513 O3 SO4 A 187 24.953 -1.035 6.850 1.00 15.34

.

.

CONECT 1510 1511 1512 1513 1514

CONECT 1511 1510

CONECT 1512 1510

.

.

MASTER 334 0 6 6 18 0 11 6 1583 1 95 15

END

Space group and unit cell

dimensions

Protein atoms – including x,y,z,

occupancy and temperature factor

Non-protein atoms – including x,y,z,

occupancy and temperature factor

Bonds

New format: mmCIF
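Imputing bonds from atom positions can be sketched with a simple distance rule: two atoms are called bonded if they are closer than the sum of their covalent radii plus a tolerance. The example below applies this to the sulfate HETATM records shown in the file. The radii and the 0.4 Å tolerance are approximate values chosen for this illustration, the element is crudely taken from the first letter of the atom name, and the records are split on whitespace only because the listing here has lost its column alignment (real PDB files are fixed-column).

```python
import math

RECORDS = """\
HETATM 1510 S SO4 A 187 25.993 -1.362 5.893 1.00 22.34
HETATM 1511 O1 SO4 A 187 25.504 -1.159 4.508 1.00 22.13
HETATM 1512 O2 SO4 A 187 27.282 -0.770 6.155 1.00 24.10
HETATM 1513 O3 SO4 A 187 24.953 -1.035 6.850 1.00 15.34
"""

COVALENT_RADII = {"S": 1.05, "O": 0.66}  # angstroms, approximate
TOLERANCE = 0.4  # angstroms, an assumed slack for coordinate error


def parse_atoms(text):
    """Map atom serial number -> (element, (x, y, z)) for ATOM/HETATM lines."""
    atoms = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] in ("ATOM", "HETATM"):
            element = fields[2][0]  # crude: first letter of the atom name
            xyz = tuple(float(value) for value in fields[6:9])
            atoms[int(fields[1])] = (element, xyz)
    return atoms


def infer_bonds(atoms):
    """Return pairs of serial numbers that are close enough to be bonded."""
    bonds = []
    serials = sorted(atoms)
    for index, a in enumerate(serials):
        for b in serials[index + 1:]:
            (element_a, pos_a), (element_b, pos_b) = atoms[a], atoms[b]
            limit = COVALENT_RADII[element_a] + COVALENT_RADII[element_b] + TOLERANCE
            if math.dist(pos_a, pos_b) <= limit:
                bonds.append((a, b))
    return bonds


print(infer_bonds(parse_atoms(RECORDS)))
```

The pairs found agree with the first CONECT record in the listing, which links atom 1510 to its oxygen neighbours.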

What if we want to vary the atom

positions e.g. in driving a

reaction coordinate?

Using Cartesian (x,y,z)

coordinates is very

cumbersome, so instead

we use the natural angles

and distances

This uses internal coordinates

• Also called a Z-matrix

– Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction).

– An early way of specifying a starting geometry for molecules: people sometimes drew the molecule on graph paper to get a starting set of coordinates before optimisation

– A z-matrix uses the following geometric descriptions to describe molecules:

Bond Length

Bond angle

Torsion angle

Out of plane bending (e.g. a carbonyl)

Non-bonded distance

[Figure: internal coordinates, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number "1", and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O

H 1 l1

H 1 l1 2 a1

l1 0.96

a1 104.0

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 180.0

l1 1.42

l2 1.09

l3 1.09

l4 1.09

l5 1.09

l6 1.0

a1 109.0

a2 110.0

a3 108.0

a4 110.0

a5 110.0

da1 60.0

da2 120.0

da3 60.0

‘improper’ torsion angles

What about comprehensive

properties of molecules? – they

are more than x,y,z.

XML and molecules

• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.

• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry

• In the future, much more information will be stored with molecules allowing greater re-use of data

• see : Chemical Markup, XML and the World-Wide Web. Part I. Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928.

Ethanol

<CML>

-Can be parsed

-Can contain reactions,

properties etc.

-Can contain

relationships to other

molecules and also

concepts
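As a rough illustration of what "can be parsed" means, the sketch below builds a small CML-style record for ethanol with Python's standard XML library and reads it back. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `atomRefs2`, `property`, `scalar`) follow common CML usage as best recalled and have not been validated against the CML schema; the `units` value and the boiling point entry are illustrative metadata.

```python
import xml.etree.ElementTree as ET


def ethanol_cml() -> str:
    """Build a small CML-style XML record for ethanol (heavy atoms only)."""
    molecule = ET.Element("molecule", id="ethanol")

    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)

    bond_array = ET.SubElement(molecule, "bondArray")
    for first, second in [("a1", "a2"), ("a2", "a3")]:
        ET.SubElement(bond_array, "bond", atomRefs2=f"{first} {second}", order="1")

    # Metadata travels with the structure: what was measured, and in which units.
    prop = ET.SubElement(molecule, "property", title="boiling point")
    scalar = ET.SubElement(prop, "scalar", units="degC")
    scalar.text = "78.4"

    return ET.tostring(molecule, encoding="unicode")


record = ethanol_cml()
parsed = ET.fromstring(record)
print([atom.get("elementType") for atom in parsed.iter("atom")])
print(parsed.find("./property/scalar").get("units"))
```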

InChI:

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES:

C(C)O
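The InChI string is built from layers separated by "/", which is part of why it is easy for a computer to compare and hard for a human to read. A minimal sketch that splits the ethanol InChI above into its layers (the function name `split_inchi` is made up here, and the simple one-letter-prefix handling assumes each layer prefix occurs once, which holds for this example but not for every InChI):

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into version, formula and prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string")
    version, formula, *layers = body.split("/")
    return {
        "version": version,
        "formula": formula,
        # Each remaining layer starts with a one-letter prefix,
        # e.g. c = connections, h = hydrogens.
        "layers": {layer[0]: layer[1:] for layer in layers},
    }


print(split_inchi("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"))
```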

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not

really flat because of thermal fluctuations. So we represent 3D

molecules by including their coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (x-

ray, electron or neutron diffraction e.g. the Cambridge

Crystallographic Database or the Protein Databank – PDB).

• From these can be obtained atom positions, bonds, coordinates etc.

• There are a number of 3-D construction methods available such as

Corina (put in a SMILES and get a 3-D molecule) which use rules

derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods

that deduce conformations, usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot – but often

in many more torsional angle dimensions).

http://www.ccdc.cam.ac.uk/pages/Home.aspx

http://www.rcsb.org/pdb/home/home.do

https://www.mn-am.com/online_demos/corina_demo

http://en.wikipedia.org/wiki/Molecular_mechanics

http://en.wikipedia.org/wiki/Quantum_mechanics

http://www.ncbi.nlm.nih.gov/pubmed/12602956

http://www.ncbi.nlm.nih.gov/pubmed/23935893

http://en.wikipedia.org/wiki/Ramachandran_plot

Molecules in 3D - uses

• More accurate calculation of molecular

properties

• Comparison of the shapes (conformations)

of molecules

• Comparison of the dynamics of molecules

• Calculation of bulk properties

• Simulation of chemical reactions……

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon).

Molecular Dynamics is the most popular method for larger systems. Here is

an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a

series of snapshots of SD files concatenated together to make a movie, just

like a film strip.

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org/

http://www.ks.uiuc.edu/Research/namd/

http://www.ks.uiuc.edu/Research/vmd/


We can simulate an entire biological receptor

and test new molecules to see how they might

work

• Here is a simulation of an important new cancer target

called ‘Apelin’, with water and the membrane. We used

these to design new molecules that block this receptor.

Apelin’s role in Cancer

‘Apelin’ (a peptide receptor) is involved in cancer – therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al. Pharmacological targeting of apelin impairs glioblastoma

growth, Brain, 2017, awx253, https://doi.org/10.1093/brain/awx253


There are very many file formats – which could be a real pain....but they can be converted

alc -- Alchemy file prep -- Amber PREP file

bs -- Ball & Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language

file

crk2d -- CRK2D: Chemical Resource

Kit 2D file

crk3d -- CRK3D: Chemical Resource

Kit 3D file

box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI Biosym/Insight II CAR

file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball & Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table

file

cht -- Chemtool file cml -- Chemical Markup Language

file

crk2d -- CRK2D: Chemical

Resource Kit 2D file

crk3d -- CRK3D: Chemical

Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 3.5 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats

-the bane of our lives

Interconversion program – Babel

Recent IUPAC moves are towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content.

http://openbabel.org/wiki/Main_Page

Molecules on computers – things to look out for, since what is stored is actually quite crude, e.g.

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins, only the HEAVY atom positions are observed (sometimes at low resolution)

so bonds and hydrogen atoms are added, and are not always correct. Sometimes Nitrogen

and oxygen get confused.

Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved

correctly (problems with storing bonding)

Tautomers can be incorrect – check they look reasonable.

Mesomers can be incorrect (double bonds)

Polymers (which are mixtures of MWt and topology) are very difficult to store

Only valence bonds are accurately stored (hydrogen-bonds, crystal fields in inorganics,

halogen-aromatic bonds, etc. may be inferred but not observed in the file)

Counter ions of salts may be stripped out – could be important

2 Finding molecules

A typical problem involves finding the

‘right’ molecules

All the molecules synthesisable: 10^60 ?

Molecules with relevance to the problem (that we can generate): 10^10

Molecules we can search efficiently: 10^7

Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred

molecules in our heads, computers can evaluate millions.

1. Finding molecules using Substructure searching

One of the most useful ‘properties’ of a molecule is its molecular graph.

Many software systems allow searching for whole molecules or

fragments of molecules or even ‘similar’ molecules.

For a specific molecule, this means specifying the required search pattern to get an

exact match.

An example might be: search for an exact match for dimethylaniline. Here, we’ve

used the PubChem database which contains compounds and pharmacological

screening data on 236 million substances: http://pubchem.ncbi.nlm.nih.gov/


A more complex example – find a molecule

containing our query as part of a larger molecule

Results – takes a few seconds to search all

92 million molecules

How does this work?

The first thing to notice is that the searching is so fast that clearly the whole database is not being searched.

The first part of the process is to pre-compute pointers to only the sets of

compounds in the database we are interested in.

This is called Hashing.

Speeding up compound matching using hash codes

• The simplest Hash code registers the presence or absence

of fragments in a molecule. e.g.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0

Contains P

Contains NH2

Doesn’t contain F

.......X

So, a quick match of the bitstring of the fragment searched over the

(pre-computed) bitstrings in the database, will eliminate most of the

structures
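The bitstring screen can be sketched in a few lines. Each molecule gets one bit per fragment in a fixed dictionary, computed once in advance; a molecule can only contain the query if every bit set in the query is also set in the molecule. The fragment dictionary, the tiny "database" and the function names below are all hypothetical, and the fragment sets are typed in by hand rather than detected from structures.

```python
FRAGMENT_KEYS = ["OH", "NH2", "COOH", "phenyl", "P", "F"]  # hypothetical dictionary


def fingerprint(fragments) -> int:
    """Set one bit per fragment present; the bit position is fixed by FRAGMENT_KEYS."""
    bits = 0
    for position, key in enumerate(FRAGMENT_KEYS):
        if key in fragments:
            bits |= 1 << position
    return bits


# A toy database with hand-assigned fragments.
DATABASE = {
    "aniline": {"NH2", "phenyl"},
    "phenol": {"OH", "phenyl"},
    "tyrosine": {"OH", "NH2", "COOH", "phenyl"},
    "ethanol": {"OH"},
}
PRECOMPUTED = {name: fingerprint(fragments) for name, fragments in DATABASE.items()}


def screen(query_fragments):
    """Return molecules whose bitstring contains every bit of the query."""
    query = fingerprint(query_fragments)
    return [name for name, bits in PRECOMPUTED.items() if bits & query == query]


print(format(fingerprint({"NH2", "phenyl"}), "06b"))
print(screen({"NH2", "phenyl"}))
```

Tyrosine passes the screen for the aniline-like query even though its NH2 is not attached to the ring. The bitstring only removes molecules that cannot match; the survivors still need the graph matching step.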

Next step – Substructure (pattern) matching

Exact match.

Match the canonical SMILES, InChI or full structure.

Beware of salts etc.: you may need to strip out counterions.

Substructure Search.

Supposing we wish to find part of a molecule in a database.

To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is

an example of substructural fragments of a larger molecule,

shown in different colours. You could e.g. look for all the

molecules with the phenol fragment.

There are two steps commonly used to do this. First, we have to number the atoms of each molecule consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• ‘Subgraphs’ can be identified in a structure graph

corresponding to fragments of the whole structure

e.g. Rings, substituents etc.

– –OH

– –NH2

– –COOH

– phenyl

• this can be done by

tracing appropriate

paths in the graph

• subgraphs may overlap


Matching the query structure to the database

Two algorithms are commonly used,

The Morgan algorithm which numbers the molecule uniquely, and

The Ullmann algorithm which matches the fragments.

Note – a number of ‘substructural’ fragments can be matched here (6 in all)

Read these for more details:

Gund, P., Ann. Rep. in Med. Chem., 14:299, 1979.

Sheridan, R.P. et al., J. Chem. Inf. Comp. Sci., 29:255, 1989.

Brint, A.T., Willett, P., J. Mol. Graphics, 5:49, 1987.

Ullmann, J.R., "An Algorithm for Subgraph Isomorphism", Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976. http://portal.acm.org/citation.cfm?id=321925

Morgan, H.L., "The generation of a unique machine description for chemical structures", Journal of Chemical Documentation, 5, 107-112, 1965. https://pubs.acs.org/doi/abs/10.1021/c160017a018

The basic concept is that molecules are considered as

Graphs.

Atoms are ‘nodes’

Bonds are ‘edges’

A molecule is an example of a labelled graph, nodes can

have labels such as atomic number. The nodes are

connected by the edges (and these can also be labelled

e.g. double bond).

http://upload.wikimedia.org/wikipedia/commons/2/28/6n-graph2.svg

The Morgan algorithm iteratively computes an integer label

for each node in a structure (atoms). For a computer, the key

point is that the numbering is consistent (unlike humans) and

fast. The Morgan algorithm works as follows:

(i) Assign an integer label i to each node considering its

atomic number, degree (number of substituents) and types of

bonds

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms), and

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan. The generation of a unique machine description for chemical structures -

A technique developed at chemical abstracts service. J. Chemical Documentation,

5:107–113, 1965.

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

•This is continued until the number of equivalence classes is unchanged

•Numbering starts from highest, then uses next highest connection as 2 etc. Like an onion,

concentric circles from the highest numbered.

•Rules are used to break equalities, such as C before N, double bond before single etc.

•Some equivalent atoms are arbitrarily numbered
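The steps above can be sketched as follows. This is a simplified version under stated assumptions: the initial label is just the degree (not atomic number or bond types), ties are broken by element symbol and then by input order, and the numbering spreads outwards from the highest-labelled atom. The function names and the two atom numberings of cyclohexanone are made up for this example. The point is that two different input numberings of the same molecule end up with the same canonical description.

```python
def morgan_labels(neighbours):
    """Iterate 'sum of neighbour labels' until the number of distinct labels stops rising."""
    labels = {atom: len(adjacent) for atom, adjacent in neighbours.items()}
    while True:
        updated = {atom: sum(labels[n] for n in adjacent)
                   for atom, adjacent in neighbours.items()}
        if len(set(updated.values())) <= len(set(labels.values())):
            return labels  # keep the last labels that still added information
        labels = updated


def canonical_form(edges, elements):
    """Renumber atoms outwards from the highest label; return (elements, bonds)."""
    neighbours = {atom: [] for atom in elements}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    labels = morgan_labels(neighbours)

    def rank(atom):
        # Highest label first, then element symbol as a simple tie-breaker.
        return (-labels[atom], elements[atom])

    start = min(neighbours, key=rank)
    order, queue, seen = [start], [start], {start}
    while queue:
        atom = queue.pop(0)
        for n in sorted((n for n in neighbours[atom] if n not in seen), key=rank):
            seen.add(n)
            order.append(n)
            queue.append(n)
    number = {atom: index + 1 for index, atom in enumerate(order)}
    atoms = tuple(elements[atom] for atom in order)
    bonds = frozenset(frozenset((number[a], number[b])) for a, b in edges)
    return atoms, bonds


# Cyclohexanone (heavy atoms only), entered with two different numberings.
edges_a = [(1, 4), (2, 4), (3, 4), (2, 5), (3, 6), (5, 7), (6, 7)]
elements_a = {1: "O", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "C"}
edges_b = [(7, 3), (1, 3), (6, 3), (1, 2), (6, 5), (2, 4), (5, 4)]
elements_b = {1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "O"}

print(canonical_form(edges_a, elements_a) == canonical_form(edges_b, elements_b))
```

With only degree as the starting label and so few tie-breaking rules, this sketch will not give a unique numbering for every molecule; production canonicalisers use richer starting labels and more rules.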

Urea: adjacency matrix

   O1 C2 C3 C4
O1  .  .  .  1
C2  .  .  .  1
C3  .  .  .  1
C4  1  1  1  .

Cyclohexanone adjacency matrix

   O1 C2 C3 C4 C5 C6 C7
O1  .  .  .  1  .  .  .
C2  .  .  .  1  1  .  .
C3  .  .  .  1  .  1  .
C4  1  1  1  .  .  .  .
C5  .  1  .  .  .  .  1
C6  .  .  1  .  .  .  1
C7  .  .  .  .  1  1  .

Numbering is crucial. Renumber

for different starting points. Also,

the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part

of ‘matching subgraph isomorphism’ and is of the class of NP-complete problems

The Ullmann algorithm is a way of detecting substructures
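Substructure detection on adjacency matrices can be sketched with a plain backtracking search. This is written in the spirit of Ullmann's approach (a table of candidate atom-to-atom assignments, extended one query atom at a time) but it leaves out his refinement step, so it should be read as a simplified stand-in rather than the published algorithm. Bond orders are ignored, the function names are made up, and the 0-based atom numbering of cyclohexanone is an assumption for this example.

```python
def adjacency_matrix(n_atoms, edges):
    """Build a symmetric 0/1 adjacency matrix from a list of bonded atom pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in edges:
        matrix[a][b] = matrix[b][a] = 1
    return matrix


def find_substructure(query_elements, query_adj, target_elements, target_adj):
    """Return target atom indices matched to each query atom, or None if no match."""
    n_query, n_target = len(query_elements), len(target_elements)
    # Candidate table: query atom i may map to target atom j only if the
    # element matches and j has at least as many neighbours as i.
    candidates = [
        [j for j in range(n_target)
         if query_elements[i] == target_elements[j]
         and sum(query_adj[i]) <= sum(target_adj[j])]
        for i in range(n_query)
    ]
    mapping = []

    def extend(i):
        if i == n_query:
            return True
        for j in candidates[i]:
            if j in mapping:
                continue
            # Every query bond to an already-placed atom must exist in the target.
            if all(target_adj[mapping[k]][j] for k in range(i) if query_adj[k][i]):
                mapping.append(j)
                if extend(i + 1):
                    return True
                mapping.pop()
        return False

    return list(mapping) if extend(0) else None


# Target: cyclohexanone, atoms 0 = O, 3 = carbonyl C, the rest ring carbons.
target_elements = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency_matrix(7, [(0, 3), (1, 3), (2, 3), (1, 4), (2, 5), (4, 6), (5, 6)])

# Query: O-C-C (an oxygen on a carbon that carries another carbon).
query_elements = ["O", "C", "C"]
query_adj = adjacency_matrix(3, [(0, 1), (1, 2)])

print(find_substructure(query_elements, query_adj, target_elements, target_adj))
```

The search time can grow very quickly with molecule size in the worst case, which is why the cheap bitstring screen is run first and this step is applied only to the survivors.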

Next lecture

• How can we use this type of information to

solve our chemistry problems

– Patents/Markush Structures

– Molecular Similarity

– Molecular Properties

– 3D searching

– Virtual Screening

– Structure Property/Activity Relationships