Standards for Digital Data Representation 1) The IUPAC/NIST Chemical Identifier 2) IUPAC Terminology NSF Workshop Constructing a Kinetics Database NIST, April 19-20, 2004

Page 1:

Standards for Digital Data Representation

1) The IUPAC/NIST Chemical Identifier
2) IUPAC Terminology

NSF Workshop
Constructing a Kinetics Database

NIST, April 19-20, 2004

Page 2: The News

• Bad News:
  – There are more problems than you thought
• Good News:
  – NIST/IUPAC are trying to solve them for you

Page 3: Data Tags

• STM – Scientific, Technical, Medical ‘Publication’
• Chemistry
• thermo, kinetics, spectroscopy, synthesis

Page 4: Data Tags

• IUPAC/NIST
  – Chemical Identity – INChI
  – Interdisciplinary Terms – Gold & Green Books
• STM – Scientific, Technical, Medical ‘Publication’
• Chemistry

Page 6: A Digital ‘Name’ for a Chemical Entity

Convert chemical structure to a digital ‘signature’

To allow computers to:
– Organize chemical data
– Disseminate data (queries)
– Manage quality control

Page 7: Current Representations are Inadequate

• Drawing
  – For humans only
• CAS registry number
  – Arbitrary value (hard to find and confirm)
  – CAS Indexer may not match Specialist
  – Expensive, imprecise, incomplete, no hierarchy
• Connection Table
  – One compound, many representations
  – Embedded ambiguities
• ‘Canonical’ Connection Table
  – No open standard

Page 8: Reactive Intermediates

• Ions, radicals, excited states
  – In principle, no problem
• Equilibrated species
  – Must specify variability precisely
• Weakly bound complexes
  – OK if orientation is omitted
• Transition states
  – Maybe not necessary in data compilation

Page 9: ChemWeb, 3/2002

Page 11: Nature, May 23, 2002

Page 12: Requirements

• Different compounds have different identifiers
  – All distinguishing structural information is included

[Figure: two different structures, INChI-1 ≠ INChI-2]

Page 13: Requirements

• One compound has only one identifier
  – Include only necessary information

[Figure: four drawings of the same NO2 group, one of them with N+, all marked as equal: Same INChI]

Page 14: Two Problems

• Chemicals
  – Fast isomerization (esp. H-atoms)
  – Unconventional connectivity
• Chemists
  – Differing conventions
    • Depends on discipline, education and convenience
  – Imprecision/uncertainty

Page 15: 3 Steps to INChI

• Chemistry
  – ‘Normalize’ Input Structure
    • Implement chemical rules
• Math
  – ‘Canonicalize’ (label the atoms)
    • Equivalent atoms get the same label
• Format
  – ‘Serialize’ Labeled Structure
    • Output as character string (‘name’)
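The three steps above can be sketched on a toy scale. This is only an illustration under stated assumptions, not the INChI algorithm: the function names are hypothetical, the only "chemical rule" is dropping bond orders, the canonical numbering is found by brute force over all permutations (practical for a handful of atoms only), and the output format is invented for the example. It also gives every atom a distinct number rather than giving equivalent atoms the same label.

```python
from itertools import permutations

def normalize(atoms, bonds):
    """Chemistry step (toy rule): keep elements and connections, drop bond orders."""
    links = {frozenset((a, b)) for a, b, _order in bonds}
    return list(atoms), links

def canonicalize(atoms, links):
    """Math step (brute force): try every numbering, keep the smallest description."""
    best = None
    for perm in permutations(range(len(atoms))):   # perm[old] = new number
        labels = [None] * len(atoms)
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        edges = sorted(tuple(sorted(perm[i] for i in link)) for link in links)
        key = (tuple(labels), tuple(edges))
        if best is None or key < best:
            best = key
    return best

def serialize(canonical):
    """Format step: write the labeled structure as one character string."""
    labels, edges = canonical
    return "".join(labels) + "/" + ",".join(f"{a + 1}-{b + 1}" for a, b in edges)

def toy_identifier(atoms, bonds):
    return serialize(canonicalize(*normalize(atoms, bonds)))

# Heavy atoms only; bonds are (atom index, atom index, bond order).
ethanol_a = toy_identifier(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_b = toy_identifier(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])  # same compound, other atom order
ether = toy_identifier(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])      # dimethyl ether skeleton
print(ethanol_a, ethanol_b, ether)
```

The two ethanol inputs differ only in the order in which the atoms were entered and give the same string, while the ether skeleton gives a different one.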

Page 16: Normalize / Simplify

• Divide structure into ‘layers’
  – Each layer ‘refines’ structure
• Ignore ‘Electron Density’
  – Ignore bond type and electron location
• Stereochemistry
  – sp2 and sp3 only
  – Free rotation around single bonds

Page 17: Chemical Substances “Layers”

• formula
• connectivity
• stereo
• isotope
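One way to picture layers that successively refine a structure is to compare two records layer by layer and stop at the first disagreement. The sketch below assumes a plain dictionary per compound; the `shared_depth` helper, the record layout and the stereo values are hypothetical placeholders, not INChI output.

```python
LAYER_ORDER = ("formula", "connectivity", "stereo", "isotope")

def shared_depth(a, b):
    """Return the leading layers on which two layered records agree."""
    shared = []
    for name in LAYER_ORDER:
        if a.get(name) != b.get(name):
            break
        shared.append(name)
    return shared

# Two stereoisomers: same formula and connectivity, different stereo layer.
# The stereo values are placeholders, not real layer syntax.
cis = {"formula": "C4H8", "connectivity": "1-3-4-2", "stereo": "cis (placeholder)"}
trans = {"formula": "C4H8", "connectivity": "1-3-4-2", "stereo": "trans (placeholder)"}

print(shared_depth(cis, trans))
```

A query that only cares about connectivity can stop at the second layer and treat both records as the same skeleton; a query that needs stereochemistry reads one layer further.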

Page 18: 4 Connectivity ‘Sublayers’

• Disconnect H-atoms and metals
  – Create skeleton
• Reconnect fixed H-atoms
  – Represent multiple species
• Reconnect mobile H-atoms
  – A single species
• Reconnect metal-non-metal bonds
  – Represent bonds to metals

Page 19: Ignore Electron Density

• Not required for compound identification
  – Represent ‘excited states’
• Simplify representations
  – Delocalization, aromaticity, zwitterions, coordination …
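A small sketch of what ignoring electron placement buys: two drawings of one nitro group, one with a neutral pentavalent N and one charge-separated, reduce to the same record once atom charges and bond orders are dropped. The input layout and the `strip_electrons` name are assumptions made for this example, and both drawings are assumed to list their atoms in the same order.

```python
def strip_electrons(atoms, bonds):
    """Keep element symbols and connections; drop atom charges and bond orders."""
    elements = tuple(symbol for symbol, _charge in atoms)
    links = frozenset(frozenset((a, b)) for a, b, _order in bonds)
    return elements, links

# Atoms are (element, formal charge); bonds are (atom index, atom index, bond order).
pentavalent = ([("N", 0), ("O", 0), ("O", 0)], [(0, 1, 2), (0, 2, 2)])
charge_separated = ([("N", +1), ("O", 0), ("O", -1)], [(0, 1, 2), (0, 2, 1)])

same = strip_electrons(*pentavalent) == strip_electrons(*charge_separated)
print(same)
```

The raw drawings differ, but after stripping they are equal, so any identifier computed from the stripped record cannot depend on which drawing convention the chemist used.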

Page 20: Münchnones

[Figure: alternative drawings of the same R-substituted münchnone, differing only in where charges and bonds are placed]

Simplify - Ignore Electrons

Page 21: Mobile H-atom (Tautomer) Sublayer

H-migration between 1,3 heteroatoms

[Figure: two tautomers, with the H on N or on O, both represented as N O[H]]
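The idea of a mobile H-atom group can be sketched by pooling the hydrogens carried by the atoms of the group, so that it no longer matters which heteroatom holds the H. The fragment, the atom numbers and the `mobile_h_view` helper are hypothetical; finding the group (the 1,3-heteroatom rule) is not implemented here, the group is simply given as input.

```python
def mobile_h_view(h_counts, mobile_group):
    """Pool the H atoms on one mobile group; leave other H atoms fixed."""
    fixed = {atom: n for atom, n in h_counts.items()
             if atom not in mobile_group and n}
    pooled = sum(h_counts.get(atom, 0) for atom in mobile_group)
    return fixed, (pooled, tuple(sorted(mobile_group)))

# Hypothetical fragment N1-C2-O3; values are H counts per atom.
h_on_nitrogen = {1: 1, 2: 0, 3: 0}
h_on_oxygen = {1: 0, 2: 0, 3: 1}
group = {1, 3}   # the two heteroatoms that can exchange the H

print(mobile_h_view(h_on_nitrogen, group))
print(mobile_h_view(h_on_oxygen, group))
```

Both tautomers give one H shared by atoms 1 and 3, which is the sense in which the mobile H-atom sublayer describes a single species.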

Page 22: Nitrobenzene

Canonical numbering: CH 1, CH 2, CH 3, CH 4, CH 5, C 6, N+ 7, O 8, O 9

Description    Layers
formula        C6H5NO2
connectivity   8-7(9)6-4-2-1-3-5-6
H-atoms        1-5H
charges
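The connectivity string can be expanded back into a bond list. The reading used below is inferred from the examples on these slides rather than taken from a specification: numbers are canonical atom numbers, "-" continues a chain, parentheses open a branch at the preceding atom, a comma starts another branch at that same atom, and a number that appears again closes a ring. The `parse_connectivity` name is hypothetical.

```python
import re

def parse_connectivity(layer):
    """Turn a connectivity string such as '8-7(9)6' into a sorted list of bonds."""
    bonds = []
    stack = []     # atoms at which a branch was opened
    prev = None    # atom that the next number will be bonded to
    for token in re.findall(r"\d+|[(),-]", layer):
        if token.isdigit():
            atom = int(token)
            if prev is not None:
                bonds.append(tuple(sorted((prev, atom))))
            prev = atom
        elif token == "(":
            stack.append(prev)
        elif token == ",":
            prev = stack[-1]       # next branch starts at the same atom
        elif token == ")":
            prev = stack.pop()     # back to the atom where the branch opened
        # "-" needs no action: the bond is added when the next number is read
    return sorted(bonds)

nitrobenzene_bonds = parse_connectivity("8-7(9)6-4-2-1-3-5-6")
print(nitrobenzene_bonds)
```

For the nitrobenzene string this yields nine bonds: N 7 bonded to O 8, O 9 and C 6, plus the six bonds of the ring closed by the second occurrence of atom 6.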

Page 23: MSG tautomeric

Canonical numbering: CH2 1, CH2 2, CH 3, C 4, C 5, NH2 6, OH 7, O 8, O 9, O 10; Na+ 1

Description    Layers
formula        C5H8NO4.Na
connectivity   6-3(5(9)10)1-2-4(7)8;
H-atoms        1-2H2,3H,6H2(H-,7,8,9,10);
stereo sp3     3-;
charges        -1;+1
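The H-atoms layer can be checked against the formula by counting hydrogens. The sketch below assumes a reading inferred from the slide: a term such as `1-2H2` puts two H on each of atoms 1 to 2, `3H` puts one H on atom 3, and `(H-,7,8,9,10)` is read here as one mobile H (together with the negative charge) shared by atoms 7 to 10. The helper names are hypothetical and the parser covers only the forms that appear on the slide.

```python
import re

def expand(atom_list):
    """Expand '1-2', '3' or '2-3,7' into a list of atom numbers."""
    atoms = []
    for part in atom_list.split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            atoms.extend(range(lo, hi + 1))
        else:
            atoms.append(int(part))
    return atoms

def parse_h_layer(layer):
    """Split an H-atoms layer into fixed H counts per atom and mobile-H groups."""
    layer = layer.rstrip(";")
    mobile = []
    for body in re.findall(r"\(([^)]*)\)", layer):
        head, *members = body.split(",")
        count = int(re.fullmatch(r"H(\d*)(-?)", head).group(1) or 1)
        mobile.append((count, [int(m) for m in members]))
    fixed = {}
    fixed_part = re.sub(r"\([^)]*\)", "", layer)
    for atom_list, count in re.findall(r"([\d,\-]+)H(\d*)", fixed_part):
        for atom in expand(atom_list):
            fixed[atom] = int(count or 1)
    return fixed, mobile

fixed, mobile = parse_h_layer("1-2H2,3H,6H2(H-,7,8,9,10);")
total_h = sum(fixed.values()) + sum(count for count, _atoms in mobile)
print(fixed, mobile, total_h)
```

Under this reading the layer accounts for eight hydrogens, which agrees with the H8 in the formula layer C5H8NO4.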

Page 24: MSG fixed

Canonical numbering: CH2 1, CH2 2, CH 3, C 4, C 5, NH2 6, OH 7, O 8, O 9, O 10; Na+ 1

Description     Layers
formula         C5H8NO4.Na
connectivity    6-3(5(9)10)1-2-4(7)8;
H-atoms         1-2H2,3H,6H2(H-,7,8,9,10);
stereo sp3      3-;
H-atoms fixed   7H;
stereo sp3      3-;
charges         -1;+1

Page 25: Ferrocene

Canonical numbering (disconnected structure): two rings, each CH 1, CH 2, CH 3, CH 4, CH 5; Fe2+ 1
Canonical numbering (reconnected structure): CH 1, CH 2, CH- 3, CH 4, CH 5, CH 6, CH- 7, CH 8, CH 9, CH 10, Fe2+ 11

Description    Layers (disconnected)    Layers (reconnected)
formula        2C5.Fe                   C10Fe
connectivity   2*1-2-4-5-3-1;           1-2-4-5-3(1)11(1,2,4,5)6-7(11)9(11)10(11)8(6)11
H-atoms        2*1-5H;                  1-10H
charges        2*-1;+2

Page 26: Auxiliary Output

• Confirmation
  – Label stereogenic atoms
  – Identify equivalent atoms
• Warnings/Errors
  – Unusual valences
  – Unrecognized input
• ‘Reversibility’
  – Coordinates
  – Bond/Charge Location

Page 27: Testing - OK

Page 28: Beta Testing

Page 29: Performance: Most Challenging NCI-NIH Structure

[Figure: structure of NSC# 666457]

50 ms – 2 GHz PC

Page 30: INChI FAQs

• How can you represent chemistry without electrons?
  – Chemistry is not represented, just identity
  – Whole-molecule properties may be added (state, phase, ...)
• Do big molecules have big INChIs?
  – Yes, just like systematic names
• How to handle other tautomer types, substructures, ...?
  – Other software
• Is INChI reversible?
  – Partly: it contains only the data needed for ‘naming’
  – Auxiliary fields can carry structure depiction information
• Is INChI extensible?
  – New layers can add refinement

Page 31: Started Oct. 2002

Page 36: http://www.nicmila.org/Gold/Output/

Miloslav Nic, Jiri Jirat, Czech Republic

Page 37: Converted - XML

Page 41: My Point of View

• A forest of data dictionaries is growing
  – Horizontally and vertically
• We need to consider forest management
• Some day all reusable data will be tagged