August 2010, ACS National meeting, Boston Representation of Markush structures from molecules...

33
August 2010, ACS National meeting, Boston Representation of Markush structures — from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics

Transcript of August 2010, ACS National meeting, Boston Representation of Markush structures from molecules...

Page 1: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Representation of Markush structures — from molecules towards patents

Szabolcs Csepregi

Solutions for Cheminformatics

Page 2: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Contents

• ChemAxon

• What are Markush structures?

• How to get them?

• What can be done with them?– Enumeration– Storage, search

• Challenges in chemical representation

• Under development

Page 3: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

ChemAxon

• Cheminformatics toolkits and applications

• HQ: Budapest, Hungary

• Founded: 1998

• Main customers: pharma, biotech, publishing

• 3rd party applications and web sites. (e.g. Integrity, Reaxis, PDB ligand search, ELN-s, registration systems, etc)

Page 4: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

ChemAxon

Main products:– Structure drawing & visualization (Marvin family)– Chemical DB tools (JChem family)– Property predictions (Calculator plugins)– Drug discovery tools (Reactor, JKlustor, etc.)

Development strategy: customer-driven

Page 5: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

What are Markush structures

and how to get them?

Page 6: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Markush structures

Generic notation for describing many molecules (= Markush library) in a compact form.

Main usage:– Combinatorial chemistry– Chemistry-related patents

Page 7: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Markush structures

• Current features handled:– R-groups– Atom lists, bond lists– Position variation bond– Link nodes– Repeating units– Homology groups

(aryl, alkyl, etc.)

Page 8: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

ChemAxon Markush project

Goals:– Extend structural search capabilities to combinatorial Markush

structures– Markush enumeration

Complications:– Practical examples may be very complex, methods using

explicit enumeration may be impossible– Extension of current molecular formats (generic features)

Timeline – Pilot study started in 2005 Q4, – First prototype shown at UGM, 2006 June– Released in JChem 5.0, 2008– Markush DARC format support 5.3.0 2010

Page 9: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

How to get Markush structures?

• Drawing – Marvin Sketch

Page 10: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

How to get Markush structures?

• Patent literature – Markush DARC format (*.vmn)

• Compatible with Thomson Reuters MMS patent Markush database(Test set available.)

Page 11: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

How to get Markush structures?

Combinatorial chemistry – Reagent clipping 1. Replace reacting group with attachment point

(Reactor tool)

2. Turn fragments to R-group definitions (Molconvert tool)

3. Add a scaffold (Molconvert tool)

Page 12: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

How to get Markush structures?

Combinatorial chemistry – R-group decomposition1. Filter and identify ligands in chemical library

2. Create Markush structure from R-table

(R-group decomposition tool)

Page 13: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

What to do with them?

Page 14: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Markush Enumeration

• Markush enumeration plugin– Full enumeration– Selected parts only– Random enumeration– Calculate library size– Scaffold alignment

and coloring– Markush code– Optional example

homology groupenumeration

Page 15: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Markush storage & search

• JChem Base and Instant JChem

• No enumeration involved

• Can handle complex Markush structures (1040 or more)

• Substructure and Full structure search

• Broad translation of homology groups is supported. (Homology in DB, specific in query.)

Page 16: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Markush storage & search

Substructure hit visualization

Query

Result in original Markush

Page 17: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Markush storage & search

Substructure hit visualization: „Markush structure reduction”

Query

Result in original Markush

Reduced result

Page 18: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Main use cases

• Patent search hits refining / visualization,

• White space analysis,

• Patent busting,

• Markush structure curation,

• In-house storage of small Markush DB,

• etc...

Page 19: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

MMS evaluation Instant JChem project

Page 20: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Challenges in chemical representation(solved)

Page 21: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Representation - What we already had

Generic notation in queries:

• Atom lists, bond lists

• R-group queries (Problem: RGFile R-logic and patent R-logic are different! - Solution: Just ignore R-logic.)

• Link nodes

• Some generic atoms (X) – represented as pseudo atoms.

Single or doubleSingle or double

Page 22: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Challenge 1: Attachment point

• Multiple – ligand order and attachment orderHeavily used in Markush DARC (up to 8 attachments!)

• Represented as atom property

Parent group(root)

R-group definitions

Order of ligands for G15 (R15)

Order of ligands for G15 (R15)

Attachment points for definitions

Attachment points for definitions

Page 23: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Challenge 1: Attachment point

• Embedded R-groups: Grandparent relations may be needed between attachment points:

G3’s attachment point „1” is mapped to

G4’s attachment point „1”

G3’s attachment point „1” is mapped to

G4’s attachment point „1”

Page 24: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Challenge 1: Attachment point

• Temporary representation: attached data– ligand order– attachment point in R-group definition– still an atom property– ligand order sometimes in parent group

(grandparent relation)

Order of ligands for R2

Order of ligands for R2

Attachment points for definitions

Attachment points for definitions

Page 25: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Challenge 1: Attachment point

• Real attachment object with bond (under development)

– eliminates need for grandparent relations table:

Order of ligands for R4

Order of ligands for R4

Attachment point for R3

Attachment point for R3

Order of ligands for R2

Order of ligands for R2

Attachment points for definitions

Attachment points for definitions

Page 26: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Challenge 2: Abbreviations

• Superatom S-groups were originally in Marvin (~700 built-in shortcuts)– Expand / Contract – Search code already handled them

in specific structures.

• M. DARC had 21 shortcuts + 31 peptides.

• Attachment point next to abbreviations– Needed to be visible „outside” and handled

correctly „inside”.– New attachment point solves this also:

Page 27: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Challenge 3: Homology groups (generics)

• Pseudoatom representation

• Naming (Still looking for the most descriptive „long” names.)

• Extra conditions: general atom property framework (under development)

Markush DARC name „Long name”

CHK alkyl

CYC carboAlicyclyl

ARY carboAryl

HEA heteroMonoAryl

Page 28: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Challenge 4: Frequency variation

• Link nodes

• Repeating units: modified SRU

• Multipliers: – special SRU, 1 outer bonds. – (Currently visualization only.)

• Moieties: – special SRU, 0 outer bonds – to describe (variable) stoichiometry– (Currently visualization only.)

Page 29: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Challenge 5: Position variation bond

• New special S-group type

• Relocatable multicenter atom represents group for bonds

• Also useful to represent multicenter charge and coordination compounds:

Page 30: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

What (else) keep us busy

Page 31: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Under development

• Further improvements in Markush DARC support:– Ring segment groups (XX form a ring)

– New, more robust representation for attachment points

– Homology properties (low alkyl, fused aryl, C1-3, N2-5, etc)

• Ranking of results

• New ways to navigate/zoom Markush structures

• Maximum common substructure search

• Biased enumeration and covering Markush – based on examples in patent.

• Improve search speed to handle larger Markush sets.

• Other Markush formats – Markush InChI standard committee

• Overlap analysis of Markush structures

• Conditions for Markush variables

Page 32: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Summary

• Markush structure storage, search and enumeration at ChemAxon now patent coverage

• Compatible patent data is available from Thomson Reuters

• Well thought out chemical representation

• Continuous development, improvements in the pipeline

Page 33: August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.

August 2010, ACS National meeting, Boston

Acknowledgements

• Development team: Nóra Máté, Róbert Wágner, Szilárd Dóránt, Tamás Csizmazia, Tim Dudgeon, Erika Bíró, Ali Baharev, Ferenc Csizmadia, et al.

• Tim Miller, Steve Hajkowski, Gez Cross and Linda Clark at Thomson Reuters for useful discussions, help and example Markush DARC files

• Many early adopters and colleagues within the field for suggestions and feedback