Chemistry and reactions from non-US patents
-
Upload
nextmove-software -
Category
Technology
-
view
176 -
download
0
description
Transcript of Chemistry and reactions from non-US patents
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Chemistry and reactions from non-US patents
Daniel Lowe and Roger Sayle
NextMove Software
Cambridge, UK
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Topics
• Coverage of USPTO vs EPO patents
• Extraction of non-chemical-compound data
• Can text-mining provide insights into reaction informatics
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
What am I missing by only using the USPTO data?
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
EPO and USPTO background
• USPTO: 303k grants published in 2013
• EPO: 67k grants published in 2013
• EPO patents must be in one or more of English, French or German
• USPTO documents are all in English
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Epo/USPTO Compounds (using StdInChI)
3,345,877
2,674,069
445,588
USPTO = 1976-2013 patent grants + 2001-2013 patent applications EPO = 1978-2013 patent grants and applications
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Epo/USPTO Reactions (using separately sorted StdInChIs for reactants/agents and products)
706,524
317,533
99,681
USPTO = 1976-2013 patent grants + 2001-2013 patent applications EPO = 1978-2013 patent grants and applications
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Effect of different Languages
0
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
19
78
19
79
19
80
19
81
19
82
19
83
19
84
19
85
19
86
19
87
19
88
19
89
19
90
19
91
19
92
19
93
19
94
19
95
19
96
19
97
19
98
19
99
20
00
20
01
20
02
20
03
20
04
20
05
20
06
20
07
20
08
20
09
20
10
20
11
20
12
20
13
Nu
mb
er
of
stru
ctu
res
ext
ract
ed
English
German
French
Translation performed by Lexichem v2014.Jun
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
First Disclosure Occurrence by Language
Excluding compounds disclosed by USPTO (1976-
2013) patents
All compounds
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Timeliness
• Competitive intelligence requires up to date information
• Methodology:
– Consider compounds present in both EPO and USPTO date not disclosed by either prior to 2006
– Compare the earliest publication date for each compound
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
0
20000
40000
60000
80000
100000
120000
140000
160000
180000
US morethan 5years
earlier
US 3-5years
earlier
US 2-3years
earlier
US 1-2years
earlier
US 6-12monthsearlier
US 3-6monthsearlier
US 1-3monthsearlier
US 1-4weeksearlier
Within aweek
US 1-4weekslater
US 1-3months
later
US 3-6months
later
US 6-12months
later
US 1-2yearslater
US 2-3yearslater
US 3-5yearslater
US morethan 5yearslater
Co
mp
ou
nd
dis
clo
sure
s
Average lag between EPO/USPTO
Compounds first disclosed between 2006-2013
Mean: US 540 days earlier
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
0
10000
20000
30000
40000
50000
60000
70000
US morethan 5years
earlier
US 3-5years
earlier
US 2-3years
earlier
US 1-2years
earlier
US 6-12monthsearlier
US 3-6monthsearlier
US 1-3monthsearlier
US 1-4weeksearlier
Within aweek
US 1-4weekslater
US 1-3months
later
US 3-6months
later
US 6-12months
later
US 1-2yearslater
US 2-3yearslater
US 3-5yearslater
US morethan 5yearslater
Co
mp
ou
nd
dis
clo
sure
s
Applications Vs Applications
Mean: US 223 days earlier
Compounds first disclosed between 2006-2013
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
0
20000
40000
60000
80000
100000
120000
140000
US morethan 5years
earlier
US 3-5years
earlier
US 2-3years
earlier
US 1-2years
earlier
US 6-12monthsearlier
US 3-6monthsearlier
US 1-3monthsearlier
US 1-4weeksearlier
Within aweek
US 1-4weekslater
US 1-3months
later
US 3-6months
later
US 6-12months
later
US 1-2yearslater
US 2-3yearslater
US 3-5yearslater
US morethan 5yearslater
Co
mp
ou
nd
dis
clo
sure
s
Grants vs Grants
Compounds first disclosed between 2006-2013
Mean: US 287 days earlier
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Epo/US 1976-2013 and chembl19 overlap
187,486
1,155,034 6,278,048
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
Patents 4-5years earlier
Patents 3-4years earlier
Patents 2-3years earlier
Patents 1-2years earlier
Patents 0-1years earlier
Patents 0-1years later
Patents 1-2years later
Patents 2-3years later
Patents 3-5years later
Patents 4-5years later
Average lag between Patents/CHEMbl19
Compounds first disclosed between 2006-2013
Mean: Patents 345 days earlier
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Other data in patents
• Genes/proteins
• Reactions including role of reagents
• Physical quantities
• Spectra
• Diseases
• Organisms
• Companies
• …
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Gene/protein identification
• Identify trends in drug target popularity
• Count number of patents (in a year) mentioning a given gene or its gene products and map to the HGNC symbol
• Many genes are referenced by short ambiguous terms hence great care is required to have good precision
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Gene/protein identification
(Hepatocyte growth factor receptor) (Epidermal growth factor receptor)
0.00%
0.10%
0.20%
0.30%
0.40%
0.50%
0.60%
0.70%
0.80%
0.90%
0.00%
0.05%
0.10%
0.15%
0.20%
0.25%
0.30%
0.35%
1976197819801982198419861988199019921994199619982000200220042006200820102012
Pe
rce
nta
ge o
f p
rote
in/g
en
e m
en
tio
ns
in t
hat
ye
ar (
Ch
EMB
L)
Pe
rce
nta
ge o
f p
rote
in/g
en
e m
en
tio
ns
in t
hat
ye
ar (
Pat
en
ts)
EGFR
0.00%
0.10%
0.20%
0.30%
0.40%
0.50%
0.60%
0.70%
0.00%
0.02%
0.04%
0.06%
0.08%
0.10%
0.12%
0.14%
1976197819801982198419861988199019921994199619982000200220042006200820102012
Pe
rce
nta
ge o
f p
rote
in/g
en
e m
en
tio
ns
in t
hat
ye
ar (
Ch
EMB
L)
Pe
rce
nta
ge o
f p
rote
in/g
en
e m
en
tio
ns
in t
hat
ye
ar (
Pat
en
ts)
MET
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Gene/protein identification
(Mammalian target of rapamycin) (Tissue plasminogen activator)
0.00%
0.05%
0.10%
0.15%
0.20%
0.25%
0.30%
0.35%
0.40%
0.45%
0.00%
0.02%
0.04%
0.06%
0.08%
0.10%
0.12%
0.14%
1976197819801982198419861988199019921994199619982000200220042006200820102012
Pe
rce
nta
ge o
f p
rote
in/g
en
e m
en
tio
ns
in t
hat
ye
ar (
Ch
EMB
L)
Pe
rce
nta
ge o
f p
rote
in/g
en
e m
en
tio
ns
in t
hat
ye
ar (
Pat
en
ts)
MTOR
0.00%
0.05%
0.10%
0.15%
0.20%
0.25%
0.30%
0.35%
0.40%
0.00%
0.05%
0.10%
0.15%
0.20%
0.25%
0.30%
0.35%
0.40%
0.45%
0.50%
1976 1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012
Pe
rce
nta
ge o
f p
rote
in/g
en
e m
en
tio
ns
in t
hat
ye
ar (
Ch
EMB
L)
Pe
rce
nta
ge o
f p
rote
in/g
en
e m
en
tio
ns
in t
hat
ye
ar (
Pat
en
ts)
PLAT
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Melting/Boiling Point Extraction
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Melting/Boiling Point Extraction
Over 99k so far!
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
CHEMICAL reactions
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Predicting yield
• Features to consider:
• 2D fingerprints (especially around the reaction centers)
• Reaction type
• Temperature
• Time
• Change in complexity e.g. chiral centres
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Data extraction
• Can pull out yields, quantities, times and temperatures
• Can sanity check text-mined yield by calculating from amounts (or even masses)
𝑚𝑎𝑠𝑠 𝑜𝑓 𝑐𝑜𝑚𝑝𝑜𝑢𝑛𝑑(𝑔)
𝑚𝑜𝑙𝑎𝑟 𝑚𝑎𝑠𝑠 [𝑐𝑎𝑙𝑐 𝑓𝑟𝑜𝑚 𝑠𝑡𝑟𝑢𝑐𝑡𝑢𝑟𝑒](𝑔𝑚𝑜𝑙−1)= 𝑚𝑜𝑙𝑠 𝑜𝑓 𝑐𝑜𝑚𝑝𝑜𝑢𝑛𝑑
𝑚𝑜𝑙𝑠 𝑜𝑓 𝑝𝑟𝑜𝑑𝑢𝑐𝑡
𝑚𝑜𝑙𝑠 𝑜𝑓 𝑙𝑖𝑚𝑖𝑡𝑖𝑛𝑔 𝑟𝑒𝑎𝑐𝑡𝑎𝑛𝑡= %𝑦𝑖𝑒𝑙𝑑
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Outliers inevitable
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
sCale vs yield
2001-2013 US applications, Suzuki couplings
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Temperatures
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Identify Synthetic Routes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Intermediates 197702 103114 56611 31403 17268 9230 5057 2701 1256 639 301 136 58 15 5 2
Terminal Products 385149 149445 81837 47579 27670 16619 9320 5263 2511 1330 678 373 111 63 8 6 5
0
100000
200000
300000
400000
500000
600000
700000
Occ
urr
en
ces
Number of steps
Intermediates
Terminal Products
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Number of steps vs Complexity
ringComplexityScore = log10(nRingBridgeAtoms + 1) + log10(nSpiroAtoms + 1) stereoComplexityScore = log10(nStereoCenters + 1) macrocyclePenalty = log10(nMacrocycles + 1) sizePenalty = natoms1.005 − natoms
∑ Ertl & Schuffenhauer 2009 doi:10.1186/1758-2946-1-8
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Number of steps vs Complexity
0
5
10
15
20
25
30
35
40
45
50
1 2 3 4 5 6 7 8 9 10 11
Nu
mb
er
of
pro
du
ct h
eav
y at
om
s
Number of steps
Average number of heavy atoms
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Yield vs Change in Complexity
2001-2013 US applications, all reactions
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Trends in Reaction Types
0.0%
1.0%
2.0%
3.0%
4.0%
5.0%
6.0%
7.0%
8.0%
19
76
19
77
19
78
19
79
19
80
19
81
19
82
19
83
19
84
19
85
19
86
19
87
19
88
19
89
19
90
19
91
19
92
19
93
19
94
19
95
19
96
19
97
19
98
19
99
20
00
20
01
20
02
20
03
20
04
20
05
20
06
20
07
20
08
20
09
20
10
20
11
20
12
20
13
Suzu
ki c
ou
plin
gs a
s a
pe
rce
nta
ge o
f re
acti
on
s in
a y
ear
Classification performed using NameRXN
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Trends In Solvent Use
0.0%
5.0%
10.0%
15.0%
20.0%
19
76
19
77
19
78
19
79
19
80
19
81
19
82
19
83
19
84
19
85
19
86
19
87
19
88
19
89
19
90
19
91
19
92
19
93
19
94
19
95
19
96
19
97
19
98
19
99
20
00
20
01
20
02
20
03
20
04
20
05
20
06
20
07
20
08
20
09
20
10
20
11
20
12
20
13
Pe
rce
nta
ge o
f re
acti
on
s in
th
at y
ear
Tetrahydrofuran
Dichloromethane
Water
Dimethylformamide
Methanol
Ethyl acetate
Ethanol
1,4-Dioxane
Toluene
Acetonitrile
Acetic acid
Chloroform
Acetone
Benzene
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Are solvents getting greener?
1976 2013
Water (21%) Tetrahydrofuran (15%)
Ethanol (11%) Dichloromethane (14%)
Benzene (8%) Water (13%)
Methanol (7%) Dimethylformamide (10%)
Tetrahydrofuran (5%) Methanol (8%)
Dichloromethane (4%) Ethyl acetate (7%)
Dimethylformamide (4%) Ethanol (5%)
Acetic acid (4%) 1,4-Dioxane (4%)
Chloroform (3%) Toluene (3%)
Acetone (3%) Acetonitrile (3%)
Total for top 10: 71% 82%
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Conclusions
• A significant amount of novel chemistry, from EPO patents, comes from the non-English patents
• Compounds disclosed by both the USPTO and EPO are on average published earlier by the USPTO but for many compounds an EPO patent will be the earlier disclosure
• Gene/protein identification can identify clear changes in patenting behaviour over time
• Text mining provides the tools to answer many reaction informatics questions
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Acknowledgements
• Funding provided by:
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog