Automatic extraction of bioactivity data from patents
253rd ACS National Meeting, San Francisco CA, USA 4th April 2017
Daniel Lowe*, Stefan Senger† and Roger Sayle*
*NextMove Software Cambridge, UK †GlaxoSmithKline, Stevenage, UK
Example Use cases
• “A patent has recently come out on a topic of interest, can the key compounds be extracted with their activity data?”
• “Which compounds have been found to be active against this target?”
US Patent data freely available
patents.reedtech.com
(Or from the USPTO: bulkdata.uspto.gov)
[Figure; legend entry: “= text-mined”]
What are these compounds?
Understanding table semantics
[Figure panels: SureChEMBL; Google Patents; after text-mining for chemical entities (green = substituent, purple = molecule). Source: US20170050925A9]
[Figure panels: SureChEMBL; Google Patents; patent PDF; PatFetch (NextMove Software). Source: US20010016661A1]
Understanding table semantics
[Figure annotations: “5 columns”; “6 columns”]
• Columns merged such that header and body have same number of columns
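One way to reconcile a header and body with different column counts is sketched below. The heuristic (merge the adjacent pair of body columns that are least often both filled in the same row) and the name `merge_body_columns` are assumptions for illustration; the slides do not say how the merge is chosen. The sample table is invented.

```python
from typing import List


def merge_body_columns(body: List[List[str]], n_header_cols: int) -> List[List[str]]:
    """Merge adjacent body columns until each row has n_header_cols cells.

    Hypothetical heuristic: repeatedly merge the adjacent column pair with the
    fewest rows in which both cells are non-empty. Assumes rectangular input.
    """
    rows = [list(r) for r in body]
    while rows and len(rows[0]) > n_header_cols:
        n_cols = len(rows[0])
        # For each adjacent pair, count the rows where both cells hold text.
        clashes = [
            sum(1 for r in rows if r[i].strip() and r[i + 1].strip())
            for i in range(n_cols - 1)
        ]
        i = clashes.index(min(clashes))  # leftmost pair with the fewest clashes
        rows = [
            r[:i] + [" ".join(c for c in (r[i], r[i + 1]) if c.strip())] + r[i + 2:]
            for r in rows
        ]
    return rows


# Invented example: a 5-column header over a 6-column body.
header = ["Example", "R1", "R2", "IC50 (nM)", "Notes"]
body = [
    ["1", "Me", "", "H", "12", ""],
    ["2", "", "Et", "Cl", "340", "racemate"],
]
merged = merge_body_columns(body, len(header))
```

Here the second and third body columns are never both filled, so they are merged and each row ends up with five cells.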
Getting the compound structures
• Chemical names
• Chemical sketches
• R-group tables
• Compound identifier associated with any of the above
Chemical names
• OPSIN (Open Parser for systematic IUPAC nomenclature)
• Dictionaries (ChEMBL/PubChem/NextMove)
• Chemical line formula parsing, especially useful for peptide names and R-group definitions
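As a rough illustration of the first step of line formula parsing, the sketch below tokenizes a formula such as CON(cHex)2 into symbols with repeat counts, nesting bracketed groups. It does not build connectivity, and the vocabulary lists and function names are hypothetical placeholders rather than anything taken from this work.

```python
import re
from typing import List, Tuple, Union

Group = Tuple[Union[str, list], int]  # (symbol or nested group list, repeat count)

# Hypothetical, deliberately tiny vocabulary; a real system needs a far larger one.
ABBREVIATIONS = ["cHex", "Ph", "Me", "Et"]
ELEMENTS = ["Cl", "Br", "C", "H", "N", "O", "F", "P", "S", "I"]
SYMBOLS = sorted(ABBREVIATIONS + ELEMENTS, key=len, reverse=True)  # longest first
_DIGITS = re.compile(r"\d+")


def _read_count(text: str, pos: int) -> Tuple[int, int]:
    """Read an optional repeat count at pos; default to 1."""
    m = _DIGITS.match(text, pos)
    return (int(m.group()), m.end()) if m else (1, pos)


def _parse(text: str, pos: int) -> Tuple[List[Group], int]:
    groups: List[Group] = []
    while pos < len(text) and text[pos] != ")":
        if text[pos] == "(":
            inner, pos = _parse(text, pos + 1)
            if pos >= len(text):
                raise ValueError("unbalanced '('")
            pos += 1  # skip the closing ')'
            count, pos = _read_count(text, pos)
            groups.append((inner, count))
            continue
        symbol = next((s for s in SYMBOLS if text.startswith(s, pos)), None)
        if symbol is None:
            raise ValueError(f"unrecognised symbol at position {pos}: {text[pos:]!r}")
        pos += len(symbol)
        count, pos = _read_count(text, pos)
        groups.append((symbol, count))
    return groups, pos


def parse_line_formula(text: str) -> List[Group]:
    """Parse a line formula into (symbol, count) pairs, nesting bracketed groups."""
    groups, pos = _parse(text, 0)
    if pos != len(text):
        raise ValueError(f"unbalanced ')' at position {pos}")
    return groups


print(parse_line_formula("CON(cHex)2"))
# [('C', 1), ('O', 1), ('N', 1), ([('cHex', 1)], 2)]
```

With this small vocabulary, inputs such as "HATU" or "III-2" raise `ValueError`, which is one way to route reagent abbreviations to a dictionary and compound identifiers to identifier resolution instead.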
Chemical sketches
• Utilize the ChemDraw sketches provided by the USPTO
• Detection and handling of repeat brackets and positional variation
• Fixing obvious errors e.g. undervalent nitrogen near to H atom with no bond
• Labels reinterpreted
Formula Interpretation
Inputs compared (ChemDraw 15 vs. this work): HATU, C4F9, H3PO4, CON(cHex)2, III-2
[Structure depictions not recoverable from the transcript; “No result” is shown for CON(cHex)2 and for III-2, column not recoverable]
Resolving Identifiers
• Need to “namespace” identifiers
– “Compound 1”, “Reference compound 1”, “Example 1”
– But “Compound 1” = “cmpd 1” = “cpd. #1”
• Where a column is just called “#”, is it a compound number, an example number or just a table row number?
• An identifier may be defined multiple times, e.g. as a sketch and as a chemical name
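A minimal sketch of identifier namespacing follows: surface forms are mapped to a (namespace, number) pair so that “cmpd 1” and “cpd. #1” resolve to the same key while “Example 1” stays distinct. The pattern list and the name `normalize_identifier` are assumptions for illustration, not the rules used in this work.

```python
import re
from typing import Optional, Tuple

# Hypothetical surface forms per namespace; order matters (most specific first).
NAMESPACE_PATTERNS = [
    ("reference compound", r"ref(?:erence)?\.?\s+(?:compound|cmpd|cpd)\.?"),
    ("compound", r"(?:compound|cmpd|cpd)\.?"),
    ("example", r"(?:example|ex)\.?"),
]

_COMPILED = [
    (ns, re.compile(rf"^\s*{pat}\s*(?:#|no\.?)?\s*([0-9]+[a-z]?)\s*$", re.IGNORECASE))
    for ns, pat in NAMESPACE_PATTERNS
]


def normalize_identifier(text: str) -> Optional[Tuple[str, str]]:
    """Return (namespace, number) for a recognised identifier, else None."""
    for namespace, pattern in _COMPILED:
        match = pattern.match(text)
        if match:
            return namespace, match.group(1).lower()
    return None


print(normalize_identifier("Compound 1"))   # ('compound', '1')
print(normalize_identifier("cpd. #1"))      # ('compound', '1')
print(normalize_identifier("Example 1"))    # ('example', '1')
print(normalize_identifier("#"))            # None
```

A bare “#” header returns `None` here, leaving the decision between compound number, example number and row number to whatever context is available.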
Resolving Identifiers (text-mining)
Extracting compound-activity relationships
What is the target?
Assay identification
• Naïve Bayes classifier trained from assay descriptions identified by BindingDB curators
• 10-fold cross-validation: 98.9% recall, 94.7% precision
• Paragraph associated with next table or table mentioned in paragraph
• Target/organism detected
• Care taken to avoid common irrelevant organisms/proteins, e.g. bovine serum albumin
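To make the classifier bullet concrete, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing, written with the standard library only. The four training sentences are invented toy data, not BindingDB's curated assay descriptions, and the features and smoothing are assumptions, so this will not reproduce the cross-validation figures above.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (add-one smoothing)."""

    def fit(self, examples: Iterable[Tuple[str, str]]) -> "NaiveBayes":
        self.doc_counts: Counter = Counter()
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: Set[str] = set()
        for text, label in examples:
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def predict(self, text: str) -> Optional[str]:
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data, for illustration only.
TRAIN = [
    ("Compounds were tested for inhibition of kinase activity and IC50 values determined", "assay"),
    ("Binding affinity was measured in a radioligand displacement assay using membranes", "assay"),
    ("The mixture was stirred at room temperature and concentrated under reduced pressure", "other"),
    ("The residue was purified by column chromatography to give the title compound", "other"),
]
model = NaiveBayes().fit(TRAIN)
print(model.predict("IC50 values were determined in a kinase inhibition assay"))  # assay
print(model.predict("The crude product was purified by chromatography"))          # other
```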
Results from US patent applications (2001 to March 2017)
[Figure; legend: red = bioactivity]
Activities with associated structures per year
[Chart: activity-structure relationships extracted (0 to 600,000) by publication year]
Comparison with BindingDB
• Activity data from ~1,500 US patent grants (2013-2016), manually extracted over the course of 3 years
• ~150,000 activities
• Comparison done on the subset that was made available in ChEMBL 22_1 (98,898 activity values, 1,012 patents)
• Because some assay results are missed by the automatic extraction, and some are considered out of scope by BindingDB, it is difficult to distinguish differences in coverage from genuine disagreements
• Values normalized into nM
– 1000s of instances of measurements in nanometers!
• Midpoint of ranges taken
• Structures compared by StdInChI
• Target name normalized to ChEMBL target ID (organism specific), using either:
– ChEMBL target synonyms
– Normalize to HGNC symbol and check if HGNC symbol is a ChEMBL target synonym
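The value normalization can be sketched as follows: convert a reported value to nM, take the midpoint of a range, and optionally read the unit “nm” as nM. Treating “nm” as a miswritten nM is our reading of the nanometers remark, and the regular expression, unit table and name `to_nanomolar` are assumptions for illustration. Qualified values such as “<100 nM” are not handled.

```python
import re
from typing import Optional

# Multipliers to convert a molar unit into nM.
TO_NM = {"M": 1e9, "mM": 1e6, "uM": 1e3, "µM": 1e3, "μM": 1e3, "nM": 1.0, "pM": 1e-3}

# A number, an optional "-", "–" or "to" followed by a second number, then a unit.
_VALUE = re.compile(
    r"^\s*(?P<low>\d+(?:\.\d+)?)\s*"
    r"(?:(?:-|–|to)\s*(?P<high>\d+(?:\.\d+)?))?\s*"
    r"(?P<unit>\S+)\s*$"
)


def to_nanomolar(text: str, nm_means_nM: bool = True) -> Optional[float]:
    """Return the value in nM (midpoint for ranges), or None if not understood."""
    match = _VALUE.match(text)
    if not match:
        return None
    unit = match.group("unit")
    if unit == "nm" and nm_means_nM:
        unit = "nM"  # assumption: "nm" in an activity column is a miswritten nM
    factor = TO_NM.get(unit)
    if factor is None:
        return None
    low = float(match.group("low"))
    high = float(match.group("high")) if match.group("high") else low
    return (low + high) / 2 * factor


print(to_nanomolar("10-100 nM"))  # 55.0
print(to_nanomolar("0.5 uM"))     # 500.0
print(to_nanomolar("12 nm"))      # 12.0
```

Passing `nm_means_nM=False` makes the function return `None` for “12 nm”, so such values can be flagged for review rather than silently converted.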
Comparison
• Expected values found: 75%
• Expected structures found: 65%
• Expected value + structure found: 53%
• Expected value + structure + target: 18%
Stereochemistry and salts
[Figure panels: Patent; BindingDB; This work]
Long tail of difficult cases
What does this superscript term mean?
What are the units?
Targets of patent data compared to journal data
[Charts: common target classes as % per year (0% to 40%), 2002 to 2016, in two panels: ChEMBL 22_1 (excluding BindingDB) and US patent applications. Series: Kinase, GPCR (Family A), Protease, Nuclear receptor, Voltage-gated ion channel, Electrochemical transporter, Oxidoreductase]
Upcoming target classes
[Chart: percentage of documents with activity values against target class (0.0% to 3.0%), 2002 to 2016. Series: Epigenetic writer (Patents), Epigenetic reader (Patents), Epigenetic writer (ChEMBL ex BindingDB), Epigenetic reader (ChEMBL ex BindingDB)]
Future work
• Support for more complex R-group tables
• Improve recognition and resolution of protein target names
• Support for activities specified in text, e.g. “Example 1 has an IC50 of 12 nM measured at rat EP4”
• Resolution of symbols for activity ranges, e.g. “A” indicates an IC50 value of less than 100 nM
• Improve assay metadata extraction, cf. BioAssay Express
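For the in-text activity case, a purely illustrative sketch is given below: a single regular expression that pulls identifier, activity type, value, unit and target out of a sentence shaped like the example in the list above. Since this is listed as future work, nothing here reflects an existing implementation; the pattern and the name `extract_activity` are assumptions, and real patent prose varies far more than this handles.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern covering only "<identifier> has an <type> of <value> <unit>
# [measured at|against <target>]".
ACTIVITY_SENTENCE = re.compile(
    r"(?P<identifier>(?:Example|Compound)\s+\w+)\s+has\s+an?\s+"
    r"(?P<type>IC50|EC50|Ki|Kd)\s+of\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[pnuµμm]?M)"
    r"(?:\s+measured\s+(?:at|against)\s+(?P<target>[^.,;]+))?"
)


def extract_activity(sentence: str) -> Optional[Dict[str, str]]:
    """Return the matched fields as a dict, or None if the sentence does not fit."""
    match = ACTIVITY_SENTENCE.search(sentence)
    if not match:
        return None
    return {key: value.strip() for key, value in match.groupdict().items() if value}


print(extract_activity("Example 1 has an IC50 of 12 nM measured at rat EP4."))
# {'identifier': 'Example 1', 'type': 'IC50', 'value': '12', 'unit': 'nM', 'target': 'rat EP4'}
```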
Disambiguation of Conflicting structure descriptions
[Figure panels: image from original filing; redrawn by US patent office in ChemDraw; intended structure from chemical name]
Conclusions
• Processing all US patents from 2001 to present can be done in less than a day on a desktop PC
• Technique applicable to chemical properties other than activity values
• Compound number <-> structure relationships useful for key compound identification
• For the majority of patents, extracting structure-activity relationships can be significantly expedited
Acknowledgements
• Noel O'Boyle
• John Mayfield
• Funding provided by:
Thank you for your time!
http://nextmovesoftware.com http://nextmovesoftware.com/blog