Chemistry and reactions from non-US patents

248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Chemistry and reactions from non-US patents Daniel Lowe and Roger Sayle NextMove Software Cambridge, UK

description

All US patents from 1976 onwards are freely available in computer-readable formats, providing a large corpus for chemical text mining. Other patent offices are increasingly offering their back-catalogues as XML too, allowing chemical text mining to be performed in the same way as for recent US patents. We investigate how much chemistry is found in non-US patents (compared to US patents) and, where the same chemistry is present in publications from multiple patent offices, how long the delays between these publications were. We show that non-US patents can be text mined for a large number of chemical reactions and analyse the overlap with reactions from US patents. Finally, we use all the extracted chemical reactions to explore whether models for predicting reaction yield may be built from features such as the reaction type and its reaction conditions (as text mined from the patent text).

Transcript of Chemistry and reactions from non-US patents

Page 1: Chemistry and reactions from non-US patents

248th ACS National Meeting, San Francisco CA, USA 10th August 2014

Chemistry and reactions from non-US patents

Daniel Lowe and Roger Sayle

NextMove Software

Cambridge, UK

Page 2: Chemistry and reactions from non-US patents

Topics

• Coverage of USPTO vs EPO patents

• Extraction of non-chemical-compound data

• Can text mining provide insights into reaction informatics?

Page 3: Chemistry and reactions from non-US patents

What am I missing by only using the USPTO data?

Page 4: Chemistry and reactions from non-US patents

EPO and USPTO background

• USPTO: 303k grants published in 2013

• EPO: 67k grants published in 2013

• EPO patents must be in one or more of English, French or German

• USPTO documents are all in English

Page 5: Chemistry and reactions from non-US patents

EPO/USPTO Compounds (using StdInChI)

[Venn diagram of EPO and USPTO compounds; region counts: 3,345,877; 2,674,069; 445,588]

USPTO = 1976-2013 patent grants + 2001-2013 patent applications

EPO = 1978-2013 patent grants and applications

Page 6: Chemistry and reactions from non-US patents

EPO/USPTO Reactions (using separately sorted StdInChIs for reactants/agents and products)

[Venn diagram of EPO and USPTO reactions; region counts: 706,524; 317,533; 99,681]

USPTO = 1976-2013 patent grants + 2001-2013 patent applications

EPO = 1978-2013 patent grants and applications
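
The comparison of reactions by "separately sorted StdInChIs" can be illustrated with a minimal sketch. It assumes each extracted reaction is already available as two lists of StdInChI strings (reactants/agents and products). The function names `reaction_key` and `overlap_counts` are hypothetical and are not taken from the talk, and the short strings in the usage example merely stand in for real StdInChIs.

```python
def reaction_key(reactant_inchis, product_inchis):
    """Build an order-independent key for a reaction.

    Reactants/agents and products are sorted separately, so the same
    reaction written with its components in a different order gets the
    same key.
    """
    return (tuple(sorted(reactant_inchis)), tuple(sorted(product_inchis)))


def overlap_counts(keys_a, keys_b):
    """Return (only in a, in both, only in b) for two collections of keys."""
    a, b = set(keys_a), set(keys_b)
    return len(a - b), len(a & b), len(b - a)


# Toy usage: the strings are placeholders for StdInChI strings.
epo = [reaction_key(["r2", "r1"], ["p1"]), reaction_key(["r3"], ["p2"])]
uspto = [reaction_key(["r1", "r2"], ["p1"]), reaction_key(["r4"], ["p3"])]
only_epo, shared, only_uspto = overlap_counts(epo, uspto)
```

The same set arithmetic applies to compounds, with a single StdInChI string as the key.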

Page 7: Chemistry and reactions from non-US patents

Effect of different Languages

[Chart: number of structures extracted per year, 1978-2013, for English, German and French text; y-axis 0 to 4,000,000]

Translation performed by Lexichem v2014.Jun

Page 8: Chemistry and reactions from non-US patents

First Disclosure Occurrence by Language

Excluding compounds disclosed by USPTO (1976-2013) patents

All compounds

Page 9: Chemistry and reactions from non-US patents

Timeliness

• Competitive intelligence requires up-to-date information

• Methodology:

– Consider compounds present in both EPO and USPTO data, not disclosed by either prior to 2006

– Compare the earliest publication date for each compound
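
The methodology above can be sketched as follows, assuming two dictionaries that map a compound identifier (for example a StdInChI) to its earliest publication date at each office. The function name `first_disclosure_lags`, the dictionary layout, and the choice of 1 January 2006 as the cutoff are assumptions made for illustration only.

```python
from datetime import date


def first_disclosure_lags(epo_first, uspto_first, cutoff=date(2006, 1, 1)):
    """Days by which the first USPTO publication precedes the first EPO one.

    Only compounds present in both dictionaries are considered, and only
    if neither office disclosed them before `cutoff`.
    Positive values mean the US publication was earlier.
    """
    lags = {}
    for compound in epo_first.keys() & uspto_first.keys():
        epo, us = epo_first[compound], uspto_first[compound]
        if min(epo, us) < cutoff:
            continue  # already disclosed by one office before the cutoff
        lags[compound] = (epo - us).days
    return lags


# Toy usage with made-up compounds and dates.
epo_dates = {"c1": date(2010, 3, 1), "c2": date(2007, 5, 1), "c3": date(2012, 1, 1)}
us_dates = {"c1": date(2010, 1, 1), "c2": date(2005, 5, 1)}
lags = first_disclosure_lags(epo_dates, us_dates)
mean_lag = sum(lags.values()) / len(lags)
```

Binning the resulting lags (within a week, 1-4 weeks, 1-3 months and so on) would give a histogram of the kind shown on the following slides.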

Page 10: Chemistry and reactions from non-US patents

Average lag between EPO/USPTO

[Histogram: compound disclosures (y-axis, 0 to 180,000) by lag of the first USPTO publication relative to the first EPO publication; bins run from "US more than 5 years earlier" through "Within a week" to "US more than 5 years later"]

Compounds first disclosed between 2006-2013

Mean: US 540 days earlier

Page 11: Chemistry and reactions from non-US patents

Applications vs Applications

[Histogram: compound disclosures (y-axis, 0 to 70,000) by lag of the first USPTO publication relative to the first EPO publication; bins run from "US more than 5 years earlier" through "Within a week" to "US more than 5 years later"]

Compounds first disclosed between 2006-2013

Mean: US 223 days earlier

Page 12: Chemistry and reactions from non-US patents

Grants vs Grants

[Histogram: compound disclosures (y-axis, 0 to 140,000) by lag of the first USPTO publication relative to the first EPO publication; bins run from "US more than 5 years earlier" through "Within a week" to "US more than 5 years later"]

Compounds first disclosed between 2006-2013

Mean: US 287 days earlier

Page 13: Chemistry and reactions from non-US patents

Page 14: Chemistry and reactions from non-US patents

EPO/US 1976-2013 and ChEMBL 19 overlap

[Venn diagram of EPO/US patent compounds and ChEMBL 19 compounds; region counts: 187,486; 1,155,034; 6,278,048]

Page 15: Chemistry and reactions from non-US patents

Average lag between Patents/ChEMBL 19

[Histogram: y-axis 0 to 10,000, by lag of the first patent publication relative to ChEMBL 19; yearly bins run from "Patents 4-5 years earlier" to "Patents 4-5 years later"]

Compounds first disclosed between 2006-2013

Mean: Patents 345 days earlier

Page 16: Chemistry and reactions from non-US patents

Other data in patents

• Genes/proteins

• Reactions including role of reagents

• Physical quantities

• Spectra

• Diseases

• Organisms

• Companies

• …

Page 17: Chemistry and reactions from non-US patents

Gene/protein identification

• Identify trends in drug target popularity

• Count number of patents (in a year) mentioning a given gene or its gene products and map to the HGNC symbol

• Many genes are referenced by short, ambiguous terms, so great care is required to achieve good precision
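
The counting step in the bullets above might look like the following sketch. It assumes an upstream recogniser has already produced, for each patent, its publication year and the gene/protein terms found in it. The function name `patents_per_symbol_by_year` and the small hand-written term table are illustrative assumptions. The sketch makes no attempt at the disambiguation that the last bullet warns about.

```python
from collections import Counter


def patents_per_symbol_by_year(patents, term_to_symbol):
    """Count patents per (HGNC symbol, year).

    `patents` is an iterable of (year, terms) pairs. Each patent is
    counted at most once per symbol, however many synonyms it uses.
    """
    counts = Counter()
    for year, terms in patents:
        symbols = {term_to_symbol[t] for t in terms if t in term_to_symbol}
        for symbol in symbols:
            counts[(symbol, year)] += 1
    return counts


# Hand-written illustrative mapping from terms to HGNC symbols.
term_to_symbol = {"EGFR": "EGFR", "ERBB1": "EGFR", "HGFR": "MET", "c-Met": "MET"}
patents = [
    (2010, ["EGFR", "ERBB1"]),  # two synonyms, one patent
    (2010, ["HGFR"]),
    (2011, ["EGFR", "unrelated term"]),
]
counts = patents_per_symbol_by_year(patents, term_to_symbol)
```

Dividing each count by the number of patents (or of all gene mentions) in that year would give a per-year percentage that can be tracked over time.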

Page 18: Chemistry and reactions from non-US patents

Gene/protein identification

[Charts: percentage of protein/gene mentions in that year, 1976-2012, in ChEMBL and in patents (separate y-axes), for EGFR (Epidermal growth factor receptor) and MET (Hepatocyte growth factor receptor)]

Page 19: Chemistry and reactions from non-US patents

Gene/protein identification

[Charts: percentage of protein/gene mentions in that year, 1976-2012, in ChEMBL and in patents (separate y-axes), for MTOR (Mammalian target of rapamycin) and PLAT (Tissue plasminogen activator)]

Page 20: Chemistry and reactions from non-US patents

Page 21: Chemistry and reactions from non-US patents

Melting/Boiling Point Extraction

Page 22: Chemistry and reactions from non-US patents

Melting/Boiling Point Extraction

Over 99k so far!

Page 23: Chemistry and reactions from non-US patents

Chemical reactions

Page 24: Chemistry and reactions from non-US patents

Predicting yield

• Features to consider:

– 2D fingerprints (especially around the reaction centres)

– Reaction type

– Temperature

– Time

– Change in complexity, e.g. chiral centres

Page 25: Chemistry and reactions from non-US patents

Data extraction

• Can pull out yields, quantities, times and temperatures

• Can sanity check text-mined yield by calculating from amounts (or even masses)

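$$\frac{\text{mass of compound}\ (\mathrm{g})}{\text{molar mass [calc from structure]}\ (\mathrm{g\,mol^{-1}})} = \text{mols of compound}$$

$$\frac{\text{mols of product}}{\text{mols of limiting reactant}} \times 100 = \%\,\text{yield}$$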
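
The sanity check described on this slide can be sketched as below, assuming the masses have been text mined and the molar masses have been calculated from the structures elsewhere. The function names and the 5-percentage-point default tolerance are assumptions for illustration and are not values from the talk.

```python
def moles(mass_g, molar_mass_g_per_mol):
    """Amount of substance in mol from a mass in g and a molar mass in g/mol."""
    return mass_g / molar_mass_g_per_mol


def percent_yield(product_mol, limiting_reactant_mol):
    """Yield as a percentage of the limiting reactant converted to product."""
    return 100.0 * product_mol / limiting_reactant_mol


def yield_is_consistent(reported_pct, calculated_pct, tolerance_pct=5.0):
    """True if the text-mined yield agrees with the calculated one.

    The tolerance is an arbitrary illustrative threshold.
    """
    return abs(reported_pct - calculated_pct) <= tolerance_pct


# Toy usage with made-up quantities.
reactant_mol = moles(1.00, 100.0)  # 1.00 g of a 100 g/mol limiting reactant
product_mol = moles(1.20, 150.0)   # 1.20 g of a 150 g/mol product
calculated = percent_yield(product_mol, reactant_mol)
ok = yield_is_consistent(reported_pct=80.0, calculated_pct=calculated)
suspicious = not yield_is_consistent(reported_pct=8.0, calculated_pct=calculated)
```

A large disagreement, as in the last line, is a hint that either the yield or one of the quantities was mis-extracted.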

Page 26: Chemistry and reactions from non-US patents

Outliers inevitable

Page 27: Chemistry and reactions from non-US patents

Scale vs yield

2001-2013 US applications, Suzuki couplings

Page 28: Chemistry and reactions from non-US patents

Temperatures

Page 29: Chemistry and reactions from non-US patents

Identify Synthetic Routes

Number of steps   Intermediates   Terminal Products
1                 197702          385149
2                 103114          149445
3                 56611           81837
4                 31403           47579
5                 17268           27670
6                 9230            16619
7                 5057            9320
8                 2701            5263
9                 1256            2511
10                639             1330
11                301             678
12                136             373
13                58              111
14                15              63
15                5               8
16                2               6
17                -               5

[Chart: occurrences (y-axis, 0 to 700,000) against number of steps, for intermediates and terminal products]

Page 30: Chemistry and reactions from non-US patents

Number of steps vs Complexity

ringComplexityScore = log10(nRingBridgeAtoms + 1) + log10(nSpiroAtoms + 1)

stereoComplexityScore = log10(nStereoCenters + 1)

macrocyclePenalty = log10(nMacrocycles + 1)

sizePenalty = natoms^1.005 − natoms

∑ Ertl & Schuffenhauer 2009 doi:10.1186/1758-2946-1-8
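
As a rough illustration, the terms on this slide can be combined in a few lines. The sketch assumes the atom and ring counts have already been obtained from a cheminformatics toolkit, and that the overall complexity is the plain sum of the four terms, as the ∑ on the slide suggests. The function and parameter names are hypothetical.

```python
import math


def complexity_score(n_ring_bridge_atoms, n_spiro_atoms, n_stereo_centers,
                     n_macrocycles, n_atoms):
    """Sum of the four complexity terms listed on the slide."""
    ring = math.log10(n_ring_bridge_atoms + 1) + math.log10(n_spiro_atoms + 1)
    stereo = math.log10(n_stereo_centers + 1)
    macrocycle = math.log10(n_macrocycles + 1)
    size = n_atoms ** 1.005 - n_atoms
    return ring + stereo + macrocycle + size


# Toy usage: change in complexity across a hypothetical reaction step
# that adds five atoms and one stereocentre.
before = complexity_score(0, 0, 0, 0, 20)
after = complexity_score(0, 0, 1, 0, 25)
change = after - before
```

Computing the score for reactant and product in this way gives a "change in complexity" feature of the kind listed under predicting yield.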

Page 31: Chemistry and reactions from non-US patents

Number of steps vs Complexity

[Chart: average number of product heavy atoms (y-axis, 0 to 50) against number of steps (1 to 11)]

Page 32: Chemistry and reactions from non-US patents

Yield vs Change in Complexity

2001-2013 US applications, all reactions

Page 33: Chemistry and reactions from non-US patents

Trends in Reaction Types

[Chart: Suzuki couplings as a percentage of reactions in a year, 1976-2013; y-axis 0.0% to 8.0%]

Classification performed using NameRXN

Page 34: Chemistry and reactions from non-US patents

Trends In Solvent Use

[Chart: percentage of reactions in that year using each solvent, 1976-2013; y-axis 0.0% to 20.0%. Solvents shown: Tetrahydrofuran, Dichloromethane, Water, Dimethylformamide, Methanol, Ethyl acetate, Ethanol, 1,4-Dioxane, Toluene, Acetonitrile, Acetic acid, Chloroform, Acetone, Benzene]

Page 35: Chemistry and reactions from non-US patents

Are solvents getting greener?

1976                      2013
Water (21%)               Tetrahydrofuran (15%)
Ethanol (11%)             Dichloromethane (14%)
Benzene (8%)              Water (13%)
Methanol (7%)             Dimethylformamide (10%)
Tetrahydrofuran (5%)      Methanol (8%)
Dichloromethane (4%)      Ethyl acetate (7%)
Dimethylformamide (4%)    Ethanol (5%)
Acetic acid (4%)          1,4-Dioxane (4%)
Chloroform (3%)           Toluene (3%)
Acetone (3%)              Acetonitrile (3%)
Total for top 10: 71%     Total for top 10: 82%

Page 36: Chemistry and reactions from non-US patents

Conclusions

• A significant amount of the novel chemistry in EPO patents comes from the non-English patents

• Compounds disclosed by both the USPTO and EPO are on average published earlier by the USPTO, but for many compounds an EPO patent will be the earlier disclosure

• Gene/protein identification can identify clear changes in patenting behaviour over time

• Text mining provides the tools to answer many reaction informatics questions

Page 37: Chemistry and reactions from non-US patents

Acknowledgements

• Funding provided by:

Page 38: Chemistry and reactions from non-US patents

Thank you for your time!

http://nextmovesoftware.com

http://nextmovesoftware.com/blog

[email protected]