Automatic extraction of bioactivity data from patents


253rd ACS National Meeting, San Francisco CA, USA 4th April 2017


Daniel Lowe*, Stefan Senger† and Roger Sayle*

*NextMove Software Cambridge, UK †GlaxoSmithKline, Stevenage, UK


Example Use cases

• “A patent has recently come out on a topic of interest, can the key compounds be extracted with their activity data?”

• “Which compounds have been found to be active against this target?”


US Patent data freely available

patents.reedtech.com

(Or from the USPTO: bulkdata.uspto.gov)


= text-mined

What are these compounds?


Understanding table semantics

SureChEMBL

Google Patents

After text-mining for chemical entities:

Green = substituent, Purple = molecule

Source: US20170050925A9


SureChEMBL

Google Patents

Patent PDF

PatFetch (NextMove Software)

Source: US20010016661A1


Understanding table semantics

5 columns

6 columns

• Columns merged such that header and body have same number of columns
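The slide does not say how the merge is decided. A minimal sketch of one possibility, assuming that each header cell is known to span a given number of body columns (as with a colspan-style attribute): the function name `merge_body_row`, the spans and the sample row are all hypothetical.

```python
def merge_body_row(header_spans, body_row, sep=" "):
    """Merge adjacent body cells so the row has one cell per header cell.

    header_spans: number of body columns covered by each header cell
                  (assumed to be known; this is an assumption of the sketch).
    body_row:     list of cell strings from one body row.
    """
    if sum(header_spans) != len(body_row):
        raise ValueError("header spans do not cover the body row")
    merged, start = [], 0
    for span in header_spans:
        cells = body_row[start:start + span]
        # Join the non-empty cells that sit under this header cell.
        merged.append(sep.join(c.strip() for c in cells if c.strip()))
        start += span
    return merged


# Hypothetical 5-column header whose last cell spans two body columns.
header = ["Cmpd", "R1", "R2", "Assay", "IC50 (nM)"]
spans = [1, 1, 1, 1, 2]
row = ["1", "Me", "Cl", "EP4", "<", "10"]
print(merge_body_row(spans, row))
```

After merging, header and body both have five columns, so each value can be read against exactly one header cell.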


Getting the compound structures

• Chemical names

• Chemical sketches

• R-group tables

• Compound identifier associated with any of the above


Chemical names

• OPSIN (Open Parser for Systematic IUPAC Nomenclature)

• Dictionaries (ChEMBL/PubChem/NextMove)

• Chemical line formula parsing, especially useful for peptide names and R-group definitions
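As a rough illustration of the first step of line formula parsing, the sketch below tokenizes a simple formula and counts atoms. It recovers composition only, not connectivity, and is not the parser described in the talk; the function name `element_counts` is hypothetical.

```python
import re
from collections import Counter

# One token is an element symbol with optional count, "(", or ")" with optional multiplier.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")


def element_counts(formula):
    """Count atoms in a simple line formula such as 'H3PO4' or 'C(CH3)3'."""
    stack = [Counter()]
    pos = 0
    while pos < len(formula):
        m = TOKEN.match(formula, pos)
        if m is None:
            raise ValueError(f"cannot parse {formula!r} at position {pos}")
        symbol, count, open_paren, close_paren, multiplier = m.groups()
        if symbol:
            stack[-1][symbol] += int(count or 1)
        elif open_paren:
            stack.append(Counter())
        else:  # closing parenthesis: fold the group into its parent
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            group = stack.pop()
            for element, n in group.items():
                stack[-1][element] += n * int(multiplier or 1)
        pos = m.end()
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return dict(stack[0])


print(element_counts("H3PO4"))
print(element_counts("C4F9"))
```

The sketch does not check that symbols are real elements, so abbreviations such as HATU or cHex would need a dictionary lookup before this step.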


Chemical sketches

• Utilize the ChemDraw sketches provided by the USPTO

• Detection and handling of repeat brackets and positional variation

• Fixing obvious errors e.g. undervalent nitrogen near to H atom with no bond

• Labels reinterpreted


Formula Interpretation

Columns: Input | ChemDraw 15 | This work

Inputs: HATU, C4F9, H3PO4, CON(cHex)2, III-2

[Structure images not reproduced. “No result” appears in the rows for CON(cHex)2 and III-2.]


R-group tables


Resolving Identifiers

• Need to “name space” identifiers

– “Compound 1”, “Reference compound 1”, “Example 1”

– But “Compound 1” = “cmpd 1” = “cpd. #1”

• Where a column is just called “#”, is it a compound number, example number or just a table row number?

• Identifier may be defined multiple times e.g. as a sketch and chemical name
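The name-spacing idea can be sketched as follows. This is a minimal illustration, not the rules used in this work; the synonym patterns in `NAMESPACES` and the function name `normalize_identifier` are assumptions.

```python
import re

# Hypothetical synonym table: spelling variants mapped to one namespace.
# Longer prefixes come first so "reference compound" is not read as "compound".
NAMESPACES = [
    (r"reference\s+(?:compound|cmpd|cpd)\.?", "reference compound"),
    (r"(?:compound|cmpd|cpd)\.?", "compound"),
    (r"(?:example|ex)\.?", "example"),
]


def normalize_identifier(text):
    """Return (namespace, label) for strings like 'cpd. #1', else None."""
    cleaned = text.strip().lower()
    for pattern, namespace in NAMESPACES:
        m = re.fullmatch(pattern + r"\s*(?:#|no\.?)?\s*([0-9]+[a-z]?)", cleaned)
        if m:
            return (namespace, m.group(1))
    return None


for raw in ["Compound 1", "cmpd 1", "cpd. #1", "Reference compound 1", "Example 1"]:
    print(raw, "->", normalize_identifier(raw))
```

Returning a (namespace, label) pair keeps “Compound 1” and “Example 1” distinct while collapsing spelling variants. A bare “#” returns `None`, since its namespace has to come from context.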


Resolving Identifiers (text-mining)


Resolving Identifiers (Sketches)


Resolving Identifiers (Tables)


Extracting compound-activity relationships


Excel table export


Extracting compound-activity relationships

What is the target?


Assay identification

• Naïve Bayes classifier trained from assay descriptions identified by BindingDB curators

• 10-fold cross validation: 98.9% recall, 94.7% precision

• Paragraph associated with next table or table mentioned in paragraph

• Target/organism detected

• Care taken to avoid common irrelevant organisms/proteins e.g. bovine serum albumin
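To make the classification step concrete, here is a small multinomial Naïve Bayes with add-one smoothing in plain Python. It is a sketch only: the training paragraphs are made up for illustration (the real classifier was trained on assay descriptions identified by BindingDB curators), and the tokenizer and class names are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the log likelihood of each known word.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training paragraphs, for illustration only.
train_texts = [
    "Compounds were tested for inhibition of the enzyme and IC50 values were determined",
    "Binding affinity Ki was measured in a radioligand displacement assay using membranes",
    "Cells were incubated with test compound and inhibition of kinase activity was measured",
    "The mixture was stirred at room temperature and concentrated under reduced pressure",
    "The residue was purified by column chromatography to give the title compound",
    "The crude product was dissolved in ethyl acetate and washed with brine",
]
train_labels = ["assay", "assay", "assay", "other", "other", "other"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("IC50 values were measured in a kinase inhibition assay"))
print(model.predict("The solution was stirred and purified by chromatography"))
```

A paragraph classified as an assay description would then be linked to the next table, or to a table it mentions, as described above.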


Results


Results From US Patent applications (2001-Mar 2017)

Red = Bioactivity


Activities with associated structures per year

[Chart: activity-structure relationships extracted (y-axis, 0 to 600,000) by publication year (x-axis)]


Comparison with BindingDB

• Activity data from ~1500 US patent grants (2013-2016) manually extracted over the course of 3 years

• ~150,000 activities

• Comparison done on the subset that was made available in ChEMBL 22_1 (98,898 activity values, 1012 patents)

• As some assay results are missed by the automatic extraction, and some are considered out of scope by BindingDB, it is difficult to distinguish differences in coverage from genuine disagreements


Comparison with BindingDB

• Values normalized into nM

– 1000s of instances of measurements in nanometers!

• Mid point of ranges taken

• Structures compared by StdInChI

• Target name normalized to ChEMBL target ID (organism specific), using either:

– ChEMBL target synonyms

– Normalize to HGNC symbol and check if HGNC symbol is a ChEMBL target synonym
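The value normalization can be sketched like this. It is a minimal illustration under assumptions: the unit table `TO_NM`, the accepted range separators and the function name `to_nanomolar` are hypothetical, and “nm” is read as nM because of the nanometer slip noted above.

```python
import re

# Multipliers to nanomolar. "nm" is accepted as nM because patents often
# write concentrations in "nanometers" by mistake.
TO_NM = {
    "pM": 1e-3, "nM": 1.0, "nm": 1.0,
    "uM": 1e3, "µM": 1e3, "μM": 1e3,
    "mM": 1e6, "M": 1e9,
}

# A number, an optional second number forming a range, then a unit.
VALUE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:(?:-|–|to)\s*(\d+(?:\.\d+)?))?\s*([A-Za-zµμ]+)\s*$"
)


def to_nanomolar(text):
    """Return the value in nM (midpoint for ranges), or None if not understood."""
    m = VALUE.match(text)
    if not m:
        return None
    low, high, unit = m.groups()
    if unit not in TO_NM:
        return None
    value = float(low)
    if high is not None:
        value = (value + float(high)) / 2.0  # midpoint of the range
    return value * TO_NM[unit]


for raw in ["12 nm", "0.5 uM", "10-100 nM", "1 to 5 µM"]:
    print(raw, "->", to_nanomolar(raw))
```

Once both data sets are in the same unit, values can be compared directly between the automatic extraction and BindingDB.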


Comparison

Expected values found: 75%

Expected structures found: 65%

Expected value + structure found: 53%

Expected value + structure + target: 18%


Unclear structure assignment



Stereochemistry and salts

[Structure images not reproduced. Panels: Patent | BindingDB | This work]


Long tail of difficult cases

What does this superscript term mean?

What are the units?


Targets of patent data compared to journal data

[Two charts of common target classes, % per year (0% to 40%), 2002 to 2016: ChEMBL 22_1 (excluding BindingDB) and US Patent Applications. Series: Kinase, GPCR (Family A), Protease, Nuclear receptor, Voltage-gated ion channel, Electrochemical transporter, Oxidoreductase]


Upcoming target classes

[Chart: percentage of documents with activity values against target class (y-axis, 0.0% to 3.0%) by year, 2002 to 2016. Series: Epigenetic writer (Patents), Epigenetic reader (Patents), Epigenetic writer (ChEMBL ex BindingDB), Epigenetic reader (ChEMBL ex BindingDB)]


Future work

• Support for more complex R-group tables

• Improve recognition and resolution of protein target names

• Support for activities specified in text e.g. Example 1 has an IC50 of 12 nM measured at rat EP4

• Resolution of symbols for activity ranges e.g. “A” indicates an IC50 value of less than 100 nM

• Improve assay metadata extraction cf. BioAssay Express
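For activities stated in running text, a first approximation could be a pattern match like the sketch below. This is listed as future work, so the regular expression, the covered activity types and the function name `extract_activities` are assumptions for illustration only; real patent prose varies far more than this pattern allows.

```python
import re

# Matches sentences shaped like the example on the slide:
# "<Example|Compound> N has an <type> of <value> <unit> [measured at <target>]"
ACTIVITY = re.compile(
    r"(?P<compound>(?:Example|Compound)\s+\w+)\s+has\s+an?\s+"
    r"(?P<type>IC50|EC50|Ki|Kd)\s+of\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[pnuµμm]?M)"
    r"(?:\s+measured\s+(?:at|against)\s+(?P<target>[^.,;]+))?"
)


def extract_activities(text):
    """Return one dict per activity statement found in the text."""
    return [m.groupdict() for m in ACTIVITY.finditer(text)]


sentence = "Example 1 has an IC50 of 12 nM measured at rat EP4."
print(extract_activities(sentence))
```

Each hit carries the compound identifier, activity type, value, unit and (when present) the target phrase, which is the same tuple the table-based extraction produces.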


Disambiguation of Conflicting structure descriptions

Image from original filing

Redrawn by US patent office in ChemDraw

Intended structure from chemical name


Conclusions

• Processing all US patents from 2001 to present can be done in less than a day on a desktop PC

• Technique applicable to chemical properties other than activity values

• Compound number <-> structure relationships useful for key compound identification

• For the majority of patents, extracting structure-activity relationships can be significantly expedited


Acknowledgements

• Noel O’Boyle

• John Mayfield

• Funding provided by:


Thank you for your time!

http://nextmovesoftware.com http://nextmovesoftware.com/blog

[email protected]