Useful Information
• The web address for these lectures is
http://www-jmg.ch.cam.ac.uk/cil/partii/ (on front of
handout)
• Assessment Exercises are also at this address. They
will be marked out of ten. Your (hard copy) answers
should be submitted to Dr Rafel Cabot Mesquida*,
Chief Teaching Technician, ( [email protected] )
• Glen exercises due: Feb 15th 2020
• Lectures and handout available on Moodle
*a metal tray in the office G12 labelled “Part II Cheminformatics”
Molecular Informatics
1 molecules and computers
An Introduction to Chemoinformatics, Andrew R.
Leach, Valerie J. Gillet, Springer 2007.
Chemoinformatics: Basic Concepts and Methods, Johann Gasteiger and Thomas
Engel, Wiley-VCH 2003. (2018)
Handbook of Chemoinformatics: From Data to Knowledge, Johann Gasteiger,
Wiley-VCH 2003. (2018)
Chemoinformatics: An Approach to Virtual Screening
By Alexandre Varnek , Alex Tropsha, RSC Publishing
Bunin, Barry A. Chemoinformatics: Theory, Practice, and Products. Dordrecht:
Springer, 2007
Drug Metabolism Prediction: Series Ed. R. Mannhold, H. Kubinyi, G. Folkers, Ed.
Johannes Kirchmair, Methods and Principles in Medicinal Chemistry, Vol 63,
Pub. Wiley-VCH.
Sources:- textbooks/online you may wish to consider if you want
to take the subject further
Journals of Molecular/Cheminformatics you may wish
to follow up on
Journal of Chemical Information and Modeling
Journal of Chemical Theory and Computation
Journal of Cheminformatics
Journal of Computer-Aided Molecular Design
Journal of Molecular Graphics & Modeling
Journal of Computational Chemistry
Journal of Medicinal Chemistry
Reviews in Computational Chemistry
Drug Discovery Today
BMC Bioinformatics
Nature Reviews Drug Discovery
Expert Opinion on Drug Discovery
WIREs Computational Molecular Science
ArXiv….read this first
Molecular
Informatics
Includes all aspects of the study of molecules on computers.
Also includes Chem(o)informatics
This includes the representation of molecules, databases, display,
simulation, prediction of their properties and the discovery and
design of new molecules and materials.
Molecular informatics is closely related to bioinformatics,
computational chemistry, molecular modelling, simulation, machine
learning and statistics, as well as online publications - but the area
has principally been driven by investment in new methods for drug
discovery, hence the concentration on small organic molecules.
Cambridge HPC
Places to find Molecular
Informatics apps
• https://www.macinchem.org/
• Molecules (e.g. RSC ChemSpider)
• Publishers (e.g. ACS/RSC mobile)
• Calculations (e.g. Yield101 for Rxns)
• Visualisation (e.g. Pymol for proteins)
J. Chem. Educ., 2013, 90 (3), pp 320–325
DOI: 10.1021/ed300329e
Cheminformatics 101:
How do we store molecules on the computer
There are estimated to be 10^60 possible small molecules that
could be made. How do we find the best molecule for the
problem we are addressing? Let’s take a look “under the
bonnet” of the way molecules are actually manipulated on the
computer. You will be familiar with:
1. Trivial name: e.g. Morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-
methylmorphinan-3,6-diol
However, these names do not convey the structure of molecules
in a way the computer can readily understand. We need to
convert these into “machine readable formats” which allow
easy searching based on the complexities of molecular
structure. But what is a molecule?
Bear this in mind. Molecules are complicated. When we look at this scene, we add a
huge amount of information from our senses and knowledge – but it nearly all gets
lost in computational representation !
Representing chemistry needs to be engineered to represent materials and processes.
As you will see, we are moving in that direction with more complete representations
of molecules and materials.
Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol !!
A real life mixture
What is a molecule ?
is it a series of connected points ?
a wave function
the sum of its properties ?
In the computer, molecules are therefore abstractions and interpretations of data.
So more experimental data and an appropriate description of a molecule may
translate to a wider reality.
In some recent applications using Deep Neural Networks, a graph theoretical
approach is being used. Molecular graph convolutions: moving beyond fingerprints. J Comput Aided Mol Des. 2016, 30(8), 595-608
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down
into
1-Dimensional (simple and very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D Line notations (a string of characters from a keyboard)
2D Molecular Diagrams (Graphs)
3D Graphs plus X,Y,Z coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the
notes).
1-Dimensional - Line notations – a string representation of
molecules- here are three examples of different line notations.
SMILES is the most useful and widely used. These are
‘strings’ all of the same molecule.
• Line notations
– WLN: L66J BMR& DSWQ IN1&1
– ROSDAL: 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
– SMILES: c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice – all these notations use just the characters on a standard
typewriter keyboard.
O=C2O[C@@H]1O[C@@](C)(OO4)CC[C@](C(C)CC3)([H])[C@@]41[C@@]3([H])[C@H]2C
This is an example of a ‘molecule’ and is what is actually stored in the
computer – does it make sense to you ?
It’s Artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at :
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; the SMILES string is then available
by picking the smiley face.
http://www.molinspiration.com/cgi-bin/properties?textMode=1
InChI
• A more recent line notation is called InChI.
• It aims to address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate an InChI from different names and formats, or convert from InChI to structure.
Again, a string like this is easily matched on a computer.
RSC ChemSpider
Storing chemical diagrams on computers
• The valence model of a molecule can be represented by a
chemical graph. A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes.
• The spatial position of the nodes, length of the edges and
crossings are irrelevant. Generally, we ignore hydrogens unless
tautomerism or pKa is an issue. Computers handle graphs very
well and molecules represented like this are examples of labelled
graphs (the atoms have names e.g. Oxygen)
• Chemical structures are of course more complex than this and
aromaticity, stereochemistry, tautomerism, non-stoichiometric
compounds etc. are often problematic. Using a simple graph, the
computer would deduce that two canonical (resonance) structures of
the same aromatic ring are different molecules!
• E.g. to solve this, we could introduce the concept of an aromatic
bond
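A minimal Python sketch of the labelled-graph idea, using o-xylene (1,2-dimethylbenzene) with hydrogens ignored. The atom numbering, the `make_graph` helper and the "ar" bond label are assumptions made for illustration. Real software perceives aromatic rings itself rather than being told which bonds are in the ring.

```python
def make_graph(elements, bonds):
    """Nodes are numbered from 1; each edge maps an unordered atom pair to a bond label."""
    nodes = dict(enumerate(elements, start=1))
    edges = {frozenset((i, j)): label for i, j, label in bonds}
    return nodes, edges

ELEMENTS = "CCCCCCCC"  # six ring carbons (1-6) plus two methyl carbons (7, 8)
RING = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
METHYLS = [(1, 7, 1), (2, 8, 1)]

def o_xylene(ring_labels):
    """Build o-xylene with the given bond label on each ring edge."""
    ring_bonds = [(i, j, label) for (i, j), label in zip(RING, ring_labels)]
    return make_graph(ELEMENTS, ring_bonds + METHYLS)

# The two canonical (Kekule) forms: double bonds on alternate ring edges.
kekule_a = o_xylene([2, 1, 2, 1, 2, 1])
kekule_b = o_xylene([1, 2, 1, 2, 1, 2])
same_as_simple_graphs = kekule_a == kekule_b

def aromatise(graph):
    """Relabel every ring edge with one 'aromatic' bond type (ring known in advance here)."""
    nodes, edges = graph
    ring_edges = {frozenset(pair) for pair in RING}
    return nodes, {e: ("ar" if e in ring_edges else label) for e, label in edges.items()}

same_with_aromatic_bonds = aromatise(kekule_a) == aromatise(kekule_b)

print(same_as_simple_graphs)     # the simple graphs differ
print(same_with_aromatic_bonds)  # with an aromatic bond type they agree
```

The comparison here keeps the atom numbering fixed. Comparing graphs with different numberings needs a canonical numbering first.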
An example of storing the molecular
graph: SD file format
MDL SD (structure data) files store the information about each molecule in the
following format:
Header Block
describes the molecule e.g. its name.
Connection table
defines the molecular structure (atoms and bonds).
Data block (optional)
Properties e.g. volume
Terminator line
a line containing four dollar signs ($$$$),
indicating the end of information on this molecule.
This is probably the most common format to store small molecules – SD files in
software are widely used to store many molecule structures in databases.
A “format” in computer science is a precisely described order of data. The program
reading it expects the information in exactly the right place (or it screws up!).
The Connection Table – describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule
as a labelled graph.
e.g. PDB protein file format (one line per atom, listing its neighbours):
Connect 3 2 5 4
e.g. SD file format (one line per bond: from atom, to atom, bond type; reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D
• Quite often, we only need the chemical diagram e.g. to find a
molecule that matches a chemical structure search
• It is often the case that we don’t know the conformation (shape)
of a molecule – so storing it in 3D would be pointless. Look at
the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Reading the file from top to bottom:
• Line 1: molecule name
• Line 2: information on this molecule
• Line 3: comment (e.g. date is used here)
• “Counts line”: has the number of atoms and bonds as a minimum
• “Atom block”: has x,y,z coordinates of the atom and element as a minimum
• “Bond block”: 1 line for each bond (from atom, to atom, bond type)
• “M END” and “$$$$”: these identify the end of this molecule.
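A minimal Python sketch of reading one SD record such as the benzene example. It is an illustration under a stated assumption: the fields are split on whitespace, whereas the real format is fixed-column, so a production reader should slice columns instead (counts can run together for molecules with 100 or more atoms). The function name `parse_sd_record` is made up for this sketch.

```python
BENZENE_SD = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

def parse_sd_record(text):
    """Parse the header, counts line, atom block and bond block of one record."""
    lines = text.splitlines()
    name, comment = lines[0].strip(), lines[2].strip()
    counts = lines[3].split()              # simplification: whitespace split
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:      # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:  # bond block: from to type
        first, second, bond_type = (int(v) for v in line.split()[:3])
        bonds.append((first, second, bond_type))
    return {"name": name, "comment": comment, "atoms": atoms, "bonds": bonds}

benzene = parse_sd_record(BENZENE_SD)
print(benzene["name"], len(benzene["atoms"]), "atoms,", len(benzene["bonds"]), "bonds")
```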
Alanine SD file
• Bond length is 1.53 Å
• Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count, ...
• Charge codes: 0 = uncharged or a value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3
• Molecule is chiral
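The charge codes listed above can be decoded with a small lookup table. This is only a sketch: the names `CHARGE_CODES`, `formal_charge` and `is_doublet_radical` are invented here, and code 4 is treated as "no formal charge, but a radical".

```python
# Charge codes from the atom block, as listed above (code 4 is a doublet radical).
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}

def formal_charge(code):
    """Return the formal charge for a charge code; code 4 carries no charge."""
    return CHARGE_CODES.get(code, 0)

def is_doublet_radical(code):
    return code == 4

for code in range(8):
    print(code, formal_charge(code), is_doublet_radical(code))
```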
What if we only know the atom positions and not the bonds?
A key example would be x-ray crystallography. Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge.
The Protein Data Bank (PDB) contains thousands of protein structures. Here
is an electron density map from an x-ray experiment. We see the electrons
but not the nuclei. In a small molecule this is very accurate and we can
almost see the bonds. However, in e.g. a protein structure, which is very
large, we can’t always determine the atom positions exactly so these are
stored and the bonds imputed. The format for storing these is a bit different.
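One common way to impute bonds from positions alone is a distance test: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus some slack. The sketch below assumes approximate covalent radii and a tolerance of 0.45 Å (both are assumptions, and different programs use different values). The coordinates are the first five atoms (N, CA, C, O, CB of Val 1) of the 3S7A listing in this handout.

```python
import math
from itertools import combinations

COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}  # approximate, in Å
TOLERANCE = 0.45  # Å of slack; an assumed value

def impute_bonds(atoms):
    """atoms: list of (element, x, y, z). Returns bonded index pairs (0-based)."""
    bonds = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        distance = math.dist(a[1:], b[1:])
        if distance <= COVALENT_RADII[a[0]] + COVALENT_RADII[b[0]] + TOLERANCE:
            bonds.append((i, j))
    return bonds

VAL1 = [
    ("N", 3.036, -1.035, -3.538),   # N
    ("C", 4.283, -0.343, -3.015),   # CA
    ("C", 4.565, 1.067, -3.593),    # C
    ("O", 4.653, 1.287, -4.853),    # O
    ("C", 5.510, -1.245, -3.141),   # CB
]
bonds = impute_bonds(VAL1)
print(bonds)
```

A pure distance test cannot tell single from double bonds and can go wrong at low resolution, which is one reason imputed bonds are not always correct.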
Which conformation should this ring have?
Example protein databank file (pdb)
(bonds are listed atom by atom in CONECT records, i.e. as an adjacency list)
HEADER OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION. 1.80 ANGSTROMS.
REMARK 200 TEMPERATURE (KELVIN) : 113
REMARK 200 PH : 6.9
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54.588 55.106 64.827 90.00 90.00 90.00 P 21 21 21 4
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.018319 0.000000 0.000000 0.00000
SCALE2 0.000000 0.018147 0.000000 0.00000
SCALE3 0.000000 0.000000 0.015426 0.00000
ATOM 1 N VAL A 1 3.036 -1.035 -3.538 1.00 31.28
ATOM 2 CA VAL A 1 4.283 -0.343 -3.015 1.00 28.41
ATOM 3 C VAL A 1 4.565 1.067 -3.593 1.00 27.64
ATOM 4 O VAL A 1 4.653 1.287 -4.853 1.00 27.17
ATOM 5 CB VAL A 1 5.510 -1.245 -3.141 1.00 30.01
.
.
HETATM 1510 S SO4 A 187 25.993 -1.362 5.893 1.00 22.34
HETATM 1511 O1 SO4 A 187 25.504 -1.159 4.508 1.00 22.13
HETATM 1512 O2 SO4 A 187 27.282 -0.770 6.155 1.00 24.10
HETATM 1513 O3 SO4 A 187 24.953 -1.035 6.850 1.00 15.34
.
.
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
.
.
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END
Annotations to the file above:
• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including x,y,z, occupancy and temperature factor
• HETATM: non-protein atoms, including x,y,z, occupancy and temperature factor
• CONECT: bonds
New format: mmCIF
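A minimal Python sketch of pulling atoms and bonds out of PDB-style lines like those shown above. It assumes whitespace-separated fields, as in this handout's listing. The real PDB format is fixed-column and fields can touch, so a robust reader should slice by column. The function name `parse_pdb` and the dictionary keys are invented for this sketch.

```python
SAMPLE = """ATOM 1 N VAL A 1 3.036 -1.035 -3.538 1.00 31.28
HETATM 1510 S SO4 A 187 25.993 -1.362 5.893 1.00 22.34
HETATM 1511 O1 SO4 A 187 25.504 -1.159 4.508 1.00 22.13
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
"""

def parse_pdb(text):
    """Return ({serial: atom info}, {frozenset of two bonded serials})."""
    atoms, bonds = {}, set()
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("ATOM", "HETATM"):
            atoms[int(fields[1])] = {
                "name": fields[2],
                "residue": fields[3],
                "chain": fields[4],
                "residue_number": int(fields[5]),
                "xyz": tuple(float(v) for v in fields[6:9]),
                "occupancy": float(fields[9]),
                "b_factor": float(fields[10]),   # temperature factor
                "hetero": fields[0] == "HETATM",
            }
        elif fields[0] == "CONECT":
            first = int(fields[1])
            for other in fields[2:]:
                # A bond is stored once, whichever atom's record mentions it
                bonds.add(frozenset((first, int(other))))
    return atoms, bonds

atoms, bonds = parse_pdb(SAMPLE)
print(len(atoms), "atoms,", len(bonds), "bonds")
```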
What if we want to vary the atom
positions e.g. in driving a
reaction coordinate?
Using Cartesian (x,y,z)
coordinates is very
cumbersome, so instead
we use the natural angles
and distances
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction).
– Early form of specification of a starting geometry for molecules – sometimes used graph paper, draw the molecule and get a starting set of coordinates before optimisation
– A z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out of plane bending (e.g. a carbonyl)
Non-bonded distance
[Figure: internal coordinates of a molecule, with dummy atom (du) positions marked]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number "1", and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
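To see how internal coordinates map back to x,y,z, here is a sketch of a Z-matrix to Cartesian converter in Python. It assumes the simple layout described above (atoms, a blank line, then one variable per line), distances in Å and angles in degrees. The placement conventions (first atom at the origin, second on the z axis, third in the xz plane) and the sign convention of the dihedral are choices made for this sketch, not necessarily those of any particular program.

```python
import math

WATER = """O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
"""

def _unit(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def _sub(a, b):
    return tuple(p - q for p, q in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

def _combine(origin, *terms):
    """origin + sum of scale * vector over the (scale, vector) terms."""
    return tuple(o + sum(s * v[k] for s, v in terms) for k, o in enumerate(origin))

def zmatrix_to_cartesian(text):
    atom_part, _, variable_part = text.strip().partition("\n\n")
    variables = {}
    for line in variable_part.splitlines():
        if line.strip():
            name, value = line.split()
            variables[name] = float(value)

    def value_of(token):
        # A token is a variable name, a negated variable name, or a literal number
        if token in variables:
            return variables[token]
        if token.startswith("-") and token[1:] in variables:
            return -variables[token[1:]]
        return float(token)

    symbols, coords = [], []
    for line in atom_part.splitlines():
        f = line.split()
        symbols.append(f[0])
        if len(coords) == 0:
            coords.append((0.0, 0.0, 0.0))
        elif len(coords) == 1:
            coords.append((0.0, 0.0, value_of(f[2])))
        else:
            c, r = coords[int(f[1]) - 1], value_of(f[2])               # NA and distance
            b, theta = coords[int(f[3]) - 1], math.radians(value_of(f[4]))  # NB and angle
            if len(coords) == 2:
                towards_b = _unit(_sub(b, c))
                coords.append(_combine(c, (r * math.cos(theta), towards_b),
                                       (r * math.sin(theta), (1.0, 0.0, 0.0))))
            else:
                a, phi = coords[int(f[5]) - 1], math.radians(value_of(f[6]))  # NC and dihedral
                bc = _unit(_sub(c, b))
                normal = _unit(_cross(_sub(b, a), bc))
                in_plane = _cross(normal, bc)
                coords.append(_combine(c, (-r * math.cos(theta), bc),
                                       (r * math.sin(theta) * math.cos(phi), in_plane),
                                       (r * math.sin(theta) * math.sin(phi), normal)))
    return symbols, coords

symbols, xyz = zmatrix_to_cartesian(WATER)
h_h_distance = math.dist(xyz[1], xyz[2])  # about 1.51 Å for these values
print(symbols, round(h_h_distance, 3))
```

Changing `l1` or `a1` and re-running moves the atoms along a chemically meaningful coordinate, which is what makes this representation convenient for driving a reaction coordinate.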
Methanol Z-matrix (the da entries are ‘improper’ torsion angles)
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
What about comprehensive
properties of molecules? – they
are more than x,y,z.
XML and molecules
• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry
• In the future, much more information will be stored with molecules allowing greater re-use of data
• see : Chemical Markup, XML and the World-Wide Web. Part I. Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928.
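A small Python sketch of what "structure plus metadata" can look like in XML, using only the standard library. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `atomRefs2`) follow CML style, but the document is not validated against the CML schema, and the `property` element with its `units` and `source` attributes is a simplified illustration rather than exact CML.

```python
import xml.etree.ElementTree as ET

# Build a CML-style description of ethanol (heavy atoms only)
molecule = ET.Element("molecule", id="ethanol", title="Ethanol")
atom_array = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)
bond_array = ET.SubElement(molecule, "bondArray")
for pair in ["a1 a2", "a2 a3"]:
    ET.SubElement(bond_array, "bond", atomRefs2=pair, order="1")

# Metadata travels with the value: what it is, its units, where it came from
boiling_point = ET.SubElement(molecule, "property", title="boiling point",
                              units="K", source="illustrative entry")
boiling_point.text = "351.5"

xml_text = ET.tostring(molecule, encoding="unicode")
print(xml_text)

# Because it is XML, any XML parser can read it back
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
print(elements, parsed.find("property").get("units"))
```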
Ethanol
CML:
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI: InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES: C(C)O
Molecules and 3-Dimensions
• Molecules are of course not flat. Even very flat molecules are not
really flat because of thermal fluctuations. So we represent 3D
molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (x-
ray, electron or neutron diffraction e.g. the Cambridge
Crystallographic Database or the Protein Databank – PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available such as
Corina (put in a SMILES and get a 3-D molecule) which use rules
derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods
that deduce conformations, usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot – but often
in many more torsional angle dimensions).
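The cost of scanning conformational space by torsion rotation can be sketched in a few lines of Python. This assumes a simple systematic grid search (every rotatable bond stepped through the same angle increment). The function names are invented for illustration, and real conformer generators prune this grid heavily.

```python
from itertools import product

def torsion_grid(n_torsions, step_degrees):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_degrees)
    return product(angles, repeat=n_torsions)

def grid_size(n_torsions, step_degrees):
    """Number of grid points: (360 / step) raised to the number of torsions."""
    return (360 // step_degrees) ** n_torsions

for n in (2, 5, 10):
    print(n, "torsions at 30 degree steps:", grid_size(n, 30), "conformations")

first_few = list(torsion_grid(3, 60))[:3]
print(first_few)
```

The count grows exponentially with the number of rotatable bonds, which is why a plot in two torsions is manageable but a flexible drug-like molecule is not.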
Molecules in 3D - uses
• More accurate calculation of molecular
properties
• Comparison of the shapes (conformations)
of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions……
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular Dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip.
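The "film strip" picture can be made concrete: if each frame is one SD record, the frames are separated by the `$$$$` terminator line. The sketch below uses placeholder text instead of real connection tables, and `split_sd_records` is a name invented here. Real molecular dynamics packages use their own, more compact trajectory formats.

```python
TRAJECTORY = (
    "frame 1\n(connection table for snapshot 1)\n$$$$\n"
    "frame 2\n(connection table for snapshot 2)\n$$$$\n"
)

def split_sd_records(text):
    """Split concatenated SD records on the '$$$$' terminator line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records

frames = split_sd_records(TRAJECTORY)
for number, frame in enumerate(frames, start=1):
    print(number, frame.splitlines()[0])
```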
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org/
http://www.ks.uiuc.edu/Research/namd/
http://www.ks.uiuc.edu/Research/vmd/
We can simulate an entire biological receptor
and test new molecules to see how they might
work
• Here is a simulation of an important new cancer target
called ‘Apelin’, with water and the membrane. We used
these to design new molecules that block this receptor.
Apelin’s role in Cancer
‘Apelin’ (a peptide receptor) is involved in cancer – therefore we designed an
antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al. Pharmacological targeting of apelin impairs glioblastoma
growth, Brain, 2017, awx253, https://doi.org/10.1093/brain/awx253
There are very many file formats – which could be a real pain....but they can be converted
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file crk3d -- CRK3D: Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file crk3d -- CRK3D: Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
- the bane of our lives
Interconversion program: Babel
IUPAC has recently moved towards a ‘standard’ format; however,
in the near future there are likely to be many competing
requirements for file content.
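To show what a format conversion involves, here is a hand-written sketch (it does not use Babel) that writes atoms out in the simple XYZ format: atom count, a comment line, then one "symbol x y z" line per atom. The function name `atoms_to_xyz` and the four-decimal formatting are choices made for this illustration.

```python
def atoms_to_xyz(atoms, comment=""):
    """atoms: list of (symbol, x, y, z). Returns the text of an XYZ file."""
    lines = [str(len(atoms)), comment]
    lines += [f"{symbol} {x:.4f} {y:.4f} {z:.4f}" for symbol, x, y, z in atoms]
    return "\n".join(lines) + "\n"

# Two of the benzene carbons from the SD file example
xyz_text = atoms_to_xyz([("C", 1.9050, -0.7932, 0.0), ("C", 1.9050, -2.1232, 0.0)],
                        comment="benzene fragment")
print(xyz_text)
```

XYZ has nowhere to put the bond block, so converting SD to XYZ silently drops the connectivity. Conversions are not always lossless.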
Molecules on computers – things to look out for, since what is stored is actually quite crude, e.g.
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins, only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect – check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but not observed in the file).
• Counter ions of salts may be stripped out – this could be important.
2 Finding molecules
A typical problem involves finding the
‘right’ molecules
• All the molecules synthesisable: 10^60 ?
• Molecules with relevance to the problem (that we can generate): 10^10
• Molecules we can search efficiently: 10^7
• Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred
molecules in our heads, computers can evaluate millions.
1. Finding molecules using Substructure searching
One of the most useful ‘properties’ of a molecule is its molecular graph.
Many software systems allow searching for whole molecules or
fragments of molecules or even ‘similar’ molecules.
For a specific molecule, this means specifying the required search pattern to get an
exact match.
An example might be: search for an exact match for dimethylaniline. Here, we’ve
used the PubChem database, which contains compounds and pharmacological
screening data on 236 million substances: http://pubchem.ncbi.nlm.nih.gov/
A more complex example – find a molecule
containing our query as part of a larger molecule
Results – takes a few seconds to search all
92 million molecules
How does this work?
The first thing to notice is that the searching is so fast that clearly the whole
database is not being searched.
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in.
This is called Hashing.
Speeding up compound matching using hash codes
• The simplest Hash code registers the presence or absence
of fragments in a molecule. e.g.
Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Value: 0 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0
Each position stands for one fragment: a 1 records e.g. “Contains P” or “Contains NH2”, a 0 records e.g. “Doesn’t contain F”.
So, a quick match of the bitstring of the fragment searched over the
(pre-computed) bitstrings in the database, will eliminate most of the
structures
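The bitstring screen can be sketched in Python with integers as bitstrings. The fragment key (`FRAGMENT_BITS`) and the tiny hand-labelled database are hypothetical. In a real system the bits are computed from the structures.

```python
# Hypothetical fragment key: one bit position per fragment
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "phenyl": 4}

def fingerprint(fragments):
    """Set one bit for every fragment present."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BITS[fragment]
    return bits

def may_contain(molecule_bits, query_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (molecule_bits & query_bits) == query_bits

# Pre-computed once, when the compounds are registered in the database
DATABASE = {
    "aniline": fingerprint(["NH2", "phenyl"]),
    "phenol": fingerprint(["OH", "phenyl"]),
    "4-fluoroaniline": fingerprint(["NH2", "phenyl", "F"]),
    "phosphine": fingerprint(["P"]),
}

query = fingerprint(["phenyl", "NH2"])
candidates = [name for name, bits in DATABASE.items() if may_contain(bits, query)]
print(candidates)
```

The screen only removes molecules that cannot match. The survivors are candidates that still have to go through full substructure matching.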
Next step – Substructure (pattern) matching
Exact match.
Match the canonical SMILES, InChI or full structure.
Beware of salts etc.: counterions may need to be stripped out first.
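In SMILES, disconnected parts of a record (e.g. a drug and its counterion) are separated by a dot, so a crude way to strip counterions is to keep the largest dot-separated piece. The sketch below is a heuristic only: it judges size by counting letters in each piece, which is an assumption for illustration. Real toolkits count heavy atoms and keep curated lists of salts.

```python
def largest_fragment(smiles):
    """Keep the dot-separated SMILES component with the most letters (crude size measure)."""
    fragments = smiles.split(".")
    return max(fragments, key=lambda fragment: sum(ch.isalpha() for ch in fragment))

print(largest_fragment("CC(=O)[O-].[Na+]"))   # sodium acetate
print(largest_fragment("Nc1ccccc1.Cl"))       # aniline hydrochloride
```

After stripping, the remaining string still has to be made canonical before two records can be compared for an exact match.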
Substructure Search.
Supposing we wish to find part of a molecule in a database.
To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is
an example of substructural fragments of a larger molecule,
shown in different colours. You could e.g. look for all the
molecules with the phenol fragment.
Two steps are commonly used to do this. First, we number the
atoms of every molecule consistently (the Morgan algorithm),
which also lets us compare two structures to see if they are
identical. Then we match the query substructure against each
substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• ‘Subgraphs’ can be identified in a structure graph
corresponding to fragments of the whole structure
e.g. Rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• this can be done by
tracing appropriate
paths in the graph
• subgraphs may overlap
[Figure: example structure with its fragments labelled (OH, CH2, CHNH2, OH, O)]
Matching the query structure to the database
Two algorithms are commonly used:
the Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.
Note – a number of ‘substructural’ fragments can be matched here (6 in all)
Read these for more details:
Gund, P., Ann. Rep. in Med. Chem., 14:299, 1979.
Sheridan, R.P. et al., J. Chem. Inf. Comp. Sci., 29:255, 1989.
Brint, A.T., Willett, P., J. Mol. Graphics, 5:49, 1987.
Ullmann, J.R., "An Algorithm for Subgraph Isomorphism", Journal of the Association for Computing Machinery (JACM), 23:31-42, 1976. http://portal.acm.org/citation.cfm?id=321925
Morgan, H.L., "The generation of a unique machine description for chemical structures", Journal of Chemical Documentation, 5:107-113, 1965.
The basic concept is that molecules are considered as
Graphs.
Atoms are ‘nodes’
Bonds are ‘edges’
A molecule is an example of a labelled graph, nodes can
have labels such as atomic number. The nodes are
connected by the edges (and these can also be labelled
e.g. double bond).
The Morgan algorithm iteratively computes an integer label
for each node in a structure (atoms). For a computer, the key
point is that the numbering is consistent (unlike humans) and
fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its
atomic number, degree (number of substituents) and types of
bonds.
(ii) Update each label by adding the labels of the adjacent
nodes (connected atoms).
(iii) Repeat (ii) until the number of different labels does not
increase. Then order the nodes by the value of the labels.
H. L. Morgan. The generation of a unique machine description for chemical structures -
A technique developed at chemical abstracts service. J. Chemical Documentation,
5:107–113, 1965.
The first step - Numbering a molecule uniquely
A simple Morgan algorithm
•This is continued until the number of equivalent classes is unchanged
•Numbering starts from highest, then uses next highest connection as 2 etc. Like an onion,
concentric circles from the highest numbered.
•Rules are used to break equalities, such as C before N, double bond before single etc.
•Some equivalent atoms are arbitrarily numbered
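A sketch of the simple Morgan iteration in Python. Two simplifications are assumed: the starting label is just the degree (number of heavy-atom neighbours), ignoring atomic number and bond type, and no tie-breaking rules are applied. The cyclohexanone numbering (O1 on C4, ring C4-C2-C5-C7-C6-C3) is one possible reading of the figure and is an assumption.

```python
def neighbours_from_bonds(bonds):
    """Turn a list of (atom, atom) bonds into {atom: set of neighbours}."""
    neighbours = {}
    for a, b in bonds:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return neighbours

def morgan_labels(neighbours):
    """Iterate 'sum the neighbours' labels' until the number of distinct labels stops growing."""
    labels = {atom: len(nbrs) for atom, nbrs in neighbours.items()}  # start from the degree
    n_classes = len(set(labels.values()))
    while True:
        updated = {atom: sum(labels[n] for n in nbrs) for atom, nbrs in neighbours.items()}
        updated_classes = len(set(updated.values()))
        if updated_classes <= n_classes:
            return labels          # keep the last labelling that still improved
        labels, n_classes = updated, updated_classes

CYCLOHEXANONE = [(1, 4), (4, 2), (4, 3), (2, 5), (3, 6), (5, 7), (6, 7)]
PENTANE_CHAIN = [(1, 2), (2, 3), (3, 4), (4, 5)]

ketone_labels = morgan_labels(neighbours_from_bonds(CYCLOHEXANONE))
chain_labels = morgan_labels(neighbours_from_bonds(PENTANE_CHAIN))

# Numbering would start from the atom with the highest label
print(sorted(ketone_labels, key=ketone_labels.get, reverse=True))
print(chain_labels)
```

For the five-carbon chain one update separates the middle atom from its neighbours. For cyclohexanone the first update does not increase the number of classes, so the degree labels are kept and the remaining ties (e.g. C2 and C3) must be broken by extra rules like those listed above.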
Urea adjacency matrix (figure: O1, C2 and C3 each attached to the central C4)
     O1  C2  C3  C4
O1    .   .   .   1
C2    .   .   .   1
C3    .   .   .   1
C4    1   1   1   .
Cyclohexanone adjacency matrix
[Figure: cyclohexanone numbered O1, C2 to C7, with its 7 x 7 adjacency matrix. Row O1 has one entry, row C4 has three, and every other row has two.]
Numbering is crucial. Renumber for different starting points. Also,
the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part
of ‘matching subgraph isomorphism’ and is of the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
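The matching step can be sketched as a plain backtracking search in Python. This is not Ullmann's algorithm itself: it leaves out his refinement procedure, which prunes the candidate matrix much more aggressively. It only illustrates the idea of mapping query atoms onto target atoms so that elements and bonds agree. Atoms are numbered from 0 here, and the cyclohexanone numbering is an assumption.

```python
def build_adjacency(n_atoms, bonds):
    """bonds: (atom_i, atom_j, order) triples. Returns per-atom {neighbour: order} dicts."""
    adjacency = [dict() for _ in range(n_atoms)]
    for i, j, order in bonds:
        adjacency[i][j] = order
        adjacency[j][i] = order
    return adjacency

def find_substructure(q_elements, q_bonds, t_elements, t_bonds):
    """Return one mapping {query atom: target atom}, or None if the query is absent."""
    q_adj = build_adjacency(len(q_elements), q_bonds)
    t_adj = build_adjacency(len(t_elements), t_bonds)
    mapping = {}

    def compatible(q, t):
        # Same element, and the target atom has at least as many neighbours
        if q_elements[q] != t_elements[t] or len(q_adj[q]) > len(t_adj[t]):
            return False
        # Every bond to an already-mapped query atom must exist in the target, same order
        return all(t_adj[t].get(mapping[n]) == order
                   for n, order in q_adj[q].items() if n in mapping)

    def extend(q):
        if q == len(q_elements):
            return True
        for t in range(len(t_elements)):
            if t not in mapping.values() and compatible(q, t):
                mapping[q] = t
                if extend(q + 1):
                    return True
                del mapping[q]   # backtrack
        return False

    return dict(mapping) if extend(0) else None

# Target: cyclohexanone, atom 0 = O, atom 3 = carbonyl carbon
TARGET_ELEMENTS = ["O", "C", "C", "C", "C", "C", "C"]
TARGET_BONDS = [(0, 3, 2), (3, 1, 1), (3, 2, 1), (1, 4, 1), (2, 5, 1), (4, 6, 1), (5, 6, 1)]

# Query: a carbonyl carbon with one carbon neighbour, C(=O)C
match = find_substructure(["C", "O", "C"], [(0, 1, 2), (0, 2, 1)],
                          TARGET_ELEMENTS, TARGET_BONDS)
print(match)
```

The search can take exponential time in the worst case, which is why the fast bitstring screen is applied first and only the surviving candidates are matched this way.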
Next lecture
• How can we use this type of information to
solve our chemistry problems
– Patents/Markush Structures
– Molecular Similarity
– Molecular Properties
– 3D searching
– Virtual Screening
– Structure Property/Activity Relationships