SMILES

Simplified molecular input line entry specification

The simplified molecular input line entry specification, or SMILES, is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings.

SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.

SMILES

Simplified Molecular Input Line Entry System (SMILES)

– Widely used and computationally efficient
– Uses atomic symbols and a set of intuitive rules
– Uses hydrogen-suppressed molecular graphs (HSMG)

Canonical SMILES and Isomeric SMILES

The term Canonical SMILES refers to the version of the SMILES specification that includes rules for ensuring that each distinct chemical molecule has a single unique SMILES representation.

– A common application of Canonical SMILES is for indexing and ensuring uniqueness of molecules in a database.

The term Isomeric SMILES refers to the version of the SMILES specification that includes extensions to support the specification of isotopes, chirality, and configuration about double bonds.

– A notable feature of these rules is that they allow rigorous partial specification of chirality.

Graph-based definition

In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph.

The chemical graph is first trimmed to remove hydrogen atoms, and cycles are broken to turn it into a spanning tree.

Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes.

Parentheses are used to indicate points of branching on the tree.
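The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a canonical SMILES writer: the function name `write_smiles` and the input layout (a list of element symbols plus a list of index pairs) are assumptions made for this example, and it handles only hydrogen-suppressed graphs with single bonds and fewer than ten rings.

```python
def write_smiles(atoms, bonds, start=0):
    """Write a SMILES string by depth-first traversal.

    atoms: list of element symbols, hydrogens already removed.
    bonds: list of (i, j) index pairs, treated as single bonds.
    Only the connected component containing `start` is written.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    visited = set()
    seen_edges = set()
    children = {i: [] for i in range(len(atoms))}     # spanning-tree edges
    ring_digits = {i: [] for i in range(len(atoms))}  # labels for broken cycles
    next_digit = 1

    def explore(atom):
        nonlocal next_digit
        visited.add(atom)
        for nbr in neighbors[atom]:
            edge = frozenset((atom, nbr))
            if edge in seen_edges:
                continue
            seen_edges.add(edge)
            if nbr in visited:
                # This bond would close a cycle: break it and label both ends.
                ring_digits[atom].append(next_digit)
                ring_digits[nbr].append(next_digit)
                next_digit += 1
            else:
                children[atom].append(nbr)
                explore(nbr)

    def emit(atom):
        text = atoms[atom] + "".join(str(d) for d in ring_digits[atom])
        kids = children[atom]
        for child in kids[:-1]:
            text += "(" + emit(child) + ")"   # every child but the last is a branch
        if kids:
            text += emit(kids[-1])            # the last child continues the chain
        return text

    explore(start)
    return emit(start)


# 2-butanol and cyclohexane as hydrogen-suppressed graphs
print(write_smiles(["C", "C", "O", "C", "C"], [(0, 1), (1, 2), (1, 3), (3, 4)]))  # CC(O)CC
print(write_smiles(["C"] * 6, [(i, (i + 1) % 6) for i in range(6)]))              # C1CCCCC1
```

The output depends on the starting atom and on the order of the neighbor lists, which is why a separate canonicalization step is needed to obtain a unique string.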

SMILES Bonds

Bond type    Symbol
SINGLE*      -
DOUBLE       =
TRIPLE       #
AROMATIC*    :

* can be omitted

SMILES Branches

– Represented by enclosure in parentheses
– Can be nested or stacked

Examples:

CC(O)CC is 2-Butanol

OCC(C)C is iso-Butanol

OC(C)(C)C is tert-Butanol
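One way to see how nested and stacked parentheses map back onto a structure is to read the string with a stack. The sketch below is illustrative only: `parse_branches` is a hypothetical helper name, and it assumes an acyclic string made of organic-subset atoms, ignoring bond orders.

```python
import re

ATOM = re.compile(r"Br|Cl|[BCNOPSFI]")  # organic subset, two-letter symbols first


def parse_branches(smiles):
    """Return (atoms, bonds) for an acyclic SMILES string.

    Assumes organic-subset atoms only; bond symbols are skipped,
    so every bond is recorded as a plain (i, j) index pair.
    """
    atoms, bonds = [], []
    stack = []    # atoms to return to when a branch closes
    prev = None   # atom that the next atom will be bonded to
    pos = 0
    while pos < len(smiles):
        ch = smiles[pos]
        if ch == "(":
            stack.append(prev)      # remember the branch point
            pos += 1
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            prev = stack.pop()      # go back to the branch point
            pos += 1
        elif ch in "-=#":
            pos += 1                # bond order ignored in this sketch
        else:
            match = ATOM.match(smiles, pos)
            if match is None:
                raise ValueError("unexpected character %r" % ch)
            atoms.append(match.group())
            if prev is not None:
                bonds.append((prev, len(atoms) - 1))
            prev = len(atoms) - 1
            pos = match.end()
    if stack:
        raise ValueError("unbalanced '('")
    return atoms, bonds


atoms, bonds = parse_branches("OC(C)(C)C")  # tert-butanol
print(atoms)  # ['O', 'C', 'C', 'C', 'C']
print(bonds)  # [(0, 1), (1, 2), (1, 3), (1, 4)]
```

In the tert-butanol result, atom 1 appears in all four bonds, which is the stacked-branch case: two parenthesized groups attached to the same atom.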

SMILES Bonds

Ethene                    C=C
Chloroethene              ClC=C
1,1-Dichloroethene        ClC(Cl)=C
cis-1,2-Dichloroethene    ClC=CCl (as written, the cis geometry is not encoded; that requires the / and \ symbols)
Trichloroethene           ClC(Cl)=CCl
Perchloroethene           ClC(Cl)=C(Cl)Cl

SMILES Symbols

– String of alphanumeric characters and certain punctuation symbols
– Terminates at the first space encountered when read left to right
– The ORGANIC SUBSET: B, C, N, O, P, S, F, Cl, Br, I
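The symbol rules above can be illustrated with a small tokenizer. This is a sketch under stated assumptions: the name `tokenize` and the exact token pattern are choices made for this example, and the pattern covers only bracket atoms, the organic subset, lowercase aromatic atoms, bond symbols, parentheses, single-digit ring labels, and the dot.

```python
import re

# Two-letter symbols (Br, Cl) are listed before single letters so that
# "Cl" is read as chlorine rather than carbon followed by an unknown "l".
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[-=#:/\\]|[()]|\d|\.")


def tokenize(line):
    """Split the SMILES part of a line into tokens.

    Reading stops at the first space, so anything after it
    (for example a compound name) is ignored.
    """
    smiles = line.split(" ", 1)[0]
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognized character in %r" % smiles)
    return tokens


print(tokenize("ClC(Cl)=C 1,1-dichloroethene"))
# ['Cl', 'C', '(', 'Cl', ')', '=', 'C']
```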

Other SMILES Atoms

– Aliphatic or nonaromatic carbon: C
– Atom in aromatic ring: lowercase letter
– Designate ring closure with pairs of matching digits, e.g. c1ccccc1 is Benzene, whereas C1CCCCC1 is Cyclohexane

SMILES Charges

– Specify attached hydrogens and charges in square brackets
– Number of attached hydrogens is the symbol H followed by an optional digit

[H+]      proton
[OH-]     hydroxyl anion
[OH3+]    hydronium cation
[Fe++]    iron(II) cation
[NH4+]    ammonium cation
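The bracket notation can be unpacked with a regular expression. The sketch below is an illustration only: `parse_bracket_atom` is a hypothetical helper, and the pattern assumes the simple forms shown above (element symbol, optional H count, optional charge written as repeated signs or as a sign plus a number), not isotopes or chirality marks.

```python
import re

BRACKET = re.compile(
    r"\[(?P<symbol>[A-Z][a-z]?)"      # element symbol, e.g. O or Fe
    r"(?P<h>H\d?)?"                   # attached hydrogens: H, H3, H4, ...
    r"(?P<charge>\++|-+|[+-]\d+)?\]"  # charge: +, ++, -, +2, -1, ...
)


def parse_bracket_atom(text):
    """Return (symbol, hydrogen_count, charge) for a bracket atom."""
    match = BRACKET.fullmatch(text)
    if match is None:
        raise ValueError("not a simple bracket atom: %r" % text)

    h = match.group("h")
    if h is None:
        hydrogens = 0
    elif h == "H":
        hydrogens = 1            # a bare H means one hydrogen
    else:
        hydrogens = int(h[1:])

    c = match.group("charge")
    if c is None:
        charge = 0
    elif c[-1].isdigit():
        charge = int(c)          # "+2" -> 2, "-1" -> -1
    else:
        charge = len(c) if c[0] == "+" else -len(c)  # "++" -> 2

    return match.group("symbol"), hydrogens, charge


for text in ["[H+]", "[OH-]", "[OH3+]", "[Fe++]", "[NH4+]"]:
    print(text, parse_bracket_atom(text))
# [H+] ('H', 0, 1)
# [OH-] ('O', 1, -1)
# [OH3+] ('O', 3, 1)
# [Fe++] ('Fe', 0, 2)
# [NH4+] ('N', 4, 1)
```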

SMILES Cyclic Structures

– Break one single or one aromatic bond in each ring
– Number in any order
– Designate ring-breaking atoms by the same digit following the atomic symbol

Cyclic Structures

– Numbers indicate start and stop of ring
– Same number indicates start and end of the ring, entered immediately following the start/end atoms
– Only numbers 1 – 9 are used
– A number should appear only twice
– Atom can be associated with 2 consecutive numbers, e.g., Naphthalene: c12ccccc1cccc2
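The matching-digit rule can be checked mechanically. The following sketch pairs each ring-closure digit with the atom it follows; `ring_closures` is a name invented for this example, and the code assumes single-letter atom symbols and single-digit labels, which is enough for the benzene and naphthalene strings above.

```python
def ring_closures(smiles):
    """Return ring-closing bonds as (atom_index, atom_index) pairs.

    Assumes single-letter atom symbols and ring labels 1 to 9.
    A digit opens a ring the first time it is seen and closes it
    the second time, on the atom written just before the digit.
    """
    open_rings = {}   # digit -> index of the atom that opened the ring
    closures = []
    atom_index = -1
    for ch in smiles:
        if ch.isalpha():
            atom_index += 1
        elif ch.isdigit():
            if ch in open_rings:
                closures.append((open_rings.pop(ch), atom_index))
            else:
                open_rings[ch] = atom_index
    if open_rings:
        raise ValueError("unmatched ring digit(s): " + ", ".join(sorted(open_rings)))
    return closures


print(ring_closures("c1ccccc1"))        # [(0, 5)]
print(ring_closures("c12ccccc1cccc2"))  # [(0, 5), (0, 9)]
```

In the naphthalene result, atom 0 carries both labels, so it takes part in both ring-closing bonds.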

SMILES Conventions

Avoid two consecutive left parentheses if possible

Strive for the fewest number of possible branches

Tautomeric bonds are not designated; enter the appropriate form

Further Restrictions

A branch cannot begin a SMILES notation

A branch cannot immediately follow a double- or triple-bond symbol

Example: C=(CC)C is invalid, but C(=CC)C or C(CC)=C are valid SMILES
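Both restrictions are simple enough to test with string checks. The sketch below is an illustration, with `check_restrictions` as a hypothetical helper name; it looks only for these two rules and does not validate the rest of the notation.

```python
def check_restrictions(smiles):
    """Return a list of violations of the two restrictions above."""
    problems = []
    if smiles.startswith("("):
        problems.append("a branch cannot begin a SMILES notation")
    for i in range(len(smiles) - 1):
        # A "(" directly after "=" or "#" means a branch follows a bond symbol.
        if smiles[i] in "=#" and smiles[i + 1] == "(":
            problems.append(
                "branch immediately follows bond symbol %r at position %d"
                % (smiles[i], i)
            )
    return problems


print(check_restrictions("C=(CC)C"))  # one problem reported, at position 1
print(check_restrictions("C(=CC)C"))  # []
print(check_restrictions("C(CC)=C"))  # []
```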

SMILES Fragments

Nitro              N(=O)(=O)
Nitrate            ON(=O)(=O)
Nitrite            ON(=O)
Sulfonic acid      S(=O)(=O)O
Cyanide/Nitrile    C#N
Azide              N=N#N
Azido              N+=N-

SMILES Metals

[Al] [As] [Au] [Be]

[Bi] [Cd] [Ca] [Fe]

[Hg] [K] [Li] [Mg]

[Na] [Ni] [Pt] [Sb]

[Sn] [Zn] [Zr]

Disconnected Structures

Disconnected parts of a structure, such as the ions of a salt, are written as separate fragments joined by a period.

Tetramethyl ammonium bromide

C[N+](C)(C)C.[Br-]
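Because the period separates the fragments, a disconnected structure can be split with ordinary string handling. The sketch below takes C[N+](C)(C)C.[Br-] as the SMILES for tetramethylammonium bromide (nitrogen bonded to four methyl groups) and reports the net charge of each fragment; `fragment_charges` and its charge pattern are assumptions made for this example.

```python
import re

# Charge inside a bracket atom: repeated signs ("+", "++", "-") or a
# sign followed by a number ("+2"). Isotopes and chirality are ignored.
CHARGE = re.compile(r"\[[^\]]*?(\++|-+)(\d*)\]")


def fragment_charges(smiles):
    """Split on '.' and return a list of (fragment, net_charge)."""
    result = []
    for fragment in smiles.split("."):
        net = 0
        for match in CHARGE.finditer(fragment):
            signs, digits = match.groups()
            magnitude = int(digits) if digits else len(signs)
            net += magnitude if signs[0] == "+" else -magnitude
        result.append((fragment, net))
    return result


print(fragment_charges("C[N+](C)(C)C.[Br-]"))
# [('C[N+](C)(C)C', 1), ('[Br-]', -1)]
```

The fragment charges sum to zero, as expected for a neutral salt.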

Isomeric and Chiral SMILES

Isomeric configuration indicated by forward and backward slashes: / \

Examples:
– trans-1,2-dibromoethene: Br/C=C/Br
– cis-1,2-dibromoethene: Br/C=C\Br

Chirality indicated by the “@” symbol
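For the simple linear pattern used in the two dibromoethene examples, the geometry can be read off by comparing the two slash symbols: the same symbol on both sides means trans, different symbols mean cis. The sketch below covers only that pattern (a substituent written before the first double-bond carbon and one written after the second); `double_bond_geometry` is a hypothetical helper, not a general stereo parser.

```python
import re

# Matches a directional bond, C=C, and a second directional bond,
# as in Br/C=C/Br. Branched or ring-closing forms are not covered.
PATTERN = re.compile(r"([/\\])C=C([/\\])")


def double_bond_geometry(smiles):
    """Return 'trans', 'cis', or 'unspecified' for a simple X/C=C/Y pattern."""
    match = PATTERN.search(smiles)
    if match is None:
        return "unspecified"
    return "trans" if match.group(1) == match.group(2) else "cis"


print(double_bond_geometry("Br/C=C/Br"))   # trans
print(double_bond_geometry("Br/C=C\\Br"))  # cis
print(double_bond_geometry("ClC=CCl"))     # unspecified
```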

Another Application

SMILESCAS Database
http://esc.syrres.com/interkow/smilecas.htm

– Over 103,000 SMILES notations
– Input CAS Registry Number
– Leads to SMILES and thence to a structure search

Example 1

CC(C(C)(C)(Br))C

Examples 2 – 16: [structure images not included in this transcript]