SureChEMBL: Open Patent Data - ChemAxon · ChemAxon UGM, Budapest 20/05/2015 SureChEMBL: Open...
ChemAxon UGM, Budapest
20/05/2015
SureChEMBL: Open Patent Data
George Papadatos, PhD
ChEMBL Group, EMBL-EBI
EMBL-EBI Resources
• Genes, genomes & variation: Ensembl, Ensembl Genomes, European Nucleotide Archive, 1000 Genomes, European Genome-phenome Archive, Metagenomics portal
• Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Chemical biology: ChEMBL, ChEBI
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
Bioactivity data
Compound
Assay/Target
>Thrombin
MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLE
RECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGT
NYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYT
TDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVT
THGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGY
CDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLF
EKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDR
WVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWR
ENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTA
NVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGG
PFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE
3. Insight, tools and resources for translational drug discovery
2. Organization, integration, curation and standardization of pharmacology data
1. Scientific facts
Ki = 4.5nM
APTT = 11 min.
ChEMBL: Data for drug discovery
Why look at patent documents?
• Patent filing and searching
• Legal, financial and commercial incentives & interests
• Prior art, novelty, freedom to operate searches
• Competitive intelligence
• Unprecedented wealth of knowledge
• Most of this knowledge will never be disclosed anywhere else
• For chemistry, journal publication lags the patent document by 2–3 years on average
From SureChem to SureChEMBL
• Digital Science/Macmillan donated SureChem to EMBL-EBI
• SureChem: commercial patent chemistry mining product
• Wellcome Trust funds further development
• EMBL-EBI provides an on-going, live service
• Full functionality freely available to everyone
• Query, view and export chemistry from patents
• Complemented with biological annotations
SureChEMBL data processing
[Diagram: SureChEMBL data processing pipeline]
• Patent offices: WO; EP applications & granted; US applications & granted; JP abstracts
• Processing steps: OCR; entity recognition; name to structure (five methods); image to structure (one method)
• Example chemical name: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine
• Components shown: patent PDFs (service), processed patents (service), SureChem IP, chemistry database, SureChEMBL system (database, API, application), users
Homepage
• Help
• Search by keyword and meta-data
• Search by chemical structure (sketch compound)
• Search by SMILES, MOL, SMARTS, name
• Search by patent number
• Filter by authority (US, EP, WO and JP)
• Filter by document section (title, claims, abstract, description and images)
• Chemical search type (substructure, similarity, identical)
• Filter by date
• Filter by MW
www.surechembl.org
Data growth
• ~80K novel compounds every month
• ~800K novel compounds since EBI took over
• 2–7 days for a published patent to be chemically annotated and
searchable in SureChEMBL
[Figure: Cumulative growth of SureChEMBL compounds. x-axis: Time; y-axis: Compound count]
EMBL-EBI chemistry resources
RDF and REST API interfaces
REST API Interface - https://www.ebi.ac.uk/unichem/
• Atlas: ligand-induced transcript response (750)
• PDBe: ligand structures from protein complexes (15K)
• ChEBI: nomenclature of primary and secondary metabolites; chemical ontology (24K)
• SureChEMBL: chemical structures from patent literature (16M)
• ChEMBL: bioactivity data from literature and depositions (1.5M)
• UniChem: InChI-based chemical resolver (full + relaxed ‘lenses’) (>90M)
• 3rd party data: ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, … (~65M)
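The idea behind an InChI-based resolver with "full" and "relaxed" lenses can be sketched in a few lines. This is only an illustration, not UniChem's implementation: it assumes the relaxed lens means matching on the first 14-character block of a standard InChIKey (the connectivity hash), while the full lens requires the whole key. The class `TinyResolver`, the source names and the record IDs are hypothetical; the two example keys are intended to be ibuprofen without and with stereochemistry and should be treated as illustrative.

```python
from collections import defaultdict


def connectivity_block(inchikey: str) -> str:
    """Return the first block of a standard InChIKey (14 characters)."""
    return inchikey.split("-")[0]


class TinyResolver:
    """Toy cross-reference index keyed on InChIKeys (hypothetical helper)."""

    def __init__(self):
        self._by_full = defaultdict(set)
        self._by_connectivity = defaultdict(set)

    def add(self, source: str, source_id: str, inchikey: str) -> None:
        ref = (source, source_id)
        self._by_full[inchikey].add(ref)
        self._by_connectivity[connectivity_block(inchikey)].add(ref)

    def resolve(self, inchikey: str, relaxed: bool = False):
        """Full lens: exact key match. Relaxed lens: same connectivity block."""
        if relaxed:
            hits = self._by_connectivity.get(connectivity_block(inchikey), set())
        else:
            hits = self._by_full.get(inchikey, set())
        return sorted(hits)


resolver = TinyResolver()
# Made-up record IDs; keys differ only in the stereochemistry block.
resolver.add("chembl", "EXAMPLE-1", "HEFNNWSXXWATRW-UHFFFAOYSA-N")
resolver.add("surechembl", "EXAMPLE-2", "HEFNNWSXXWATRW-JTQLQIEISA-N")

full_hits = resolver.resolve("HEFNNWSXXWATRW-UHFFFAOYSA-N")
relaxed_hits = resolver.resolve("HEFNNWSXXWATRW-UHFFFAOYSA-N", relaxed=True)
print(full_hits)
print(relaxed_hits)
```

With the full lens only the exact record is returned; with the relaxed lens both records are returned because they share a connectivity block. Splitting salts and mixtures into single components is not handled in this sketch.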
Data access & exports
• Full compound repository
  • FTP download, SDF and CSV format
  • Updates quarterly
• Full compound-patent map
  • FTP download, flat file
  • Updates quarterly
• Data feed client
  • Creates a local replica database of SureChEMBL
  • Updates daily
Compound-patent map
• Flat file with columns: compound, global frequency, document, section, section frequency, publication date
• Back file
  • 187,958,584 unique patent-compound pairs
  • 14,076,090 unique compound IDs
  • 3,585,233 EP, JP, WO and US patent docs
  • Coverage: 1960–2014
• Quarterly incremental updates
  • The Q1 2015 update is now also available on the FTP site
http://chembl.blogspot.co.uk/2015/03/the-surechembl-map-file-is-out.html
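A flat file with the six columns listed above could be read along the following lines. This is a sketch under assumptions: the file is taken to be tab-delimited with no header row and ISO-formatted dates, the field names in `FIELDS` are invented for readability, and the sample rows (compound IDs, document numbers, section names) are made up. Check the actual file layout on the FTP site before relying on any of this.

```python
import csv
import io
from datetime import date

# Hypothetical field names, following the column order given on the slide.
FIELDS = [
    "compound_id", "global_frequency", "document",
    "section", "section_frequency", "publication_date",
]

# Made-up sample rows standing in for the real map file.
SAMPLE = "\n".join([
    "SCHEMBL1\t3\tUS-0000001-A1\tclaims\t2\t2012-05-01",
    "SCHEMBL1\t3\tEP-0000002-A1\tdescription\t1\t2010-03-15",
    "SCHEMBL2\t1\tWO-0000003-A1\tabstract\t1\t2014-11-20",
])


def read_map(handle):
    """Yield one dict per compound-document row, with typed values."""
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        row["global_frequency"] = int(row["global_frequency"])
        row["section_frequency"] = int(row["section_frequency"])
        row["publication_date"] = date.fromisoformat(row["publication_date"])
        yield row


def earliest_document(rows):
    """Map each compound ID to (publication date, document) of its first appearance."""
    first = {}
    for row in rows:
        cid = row["compound_id"]
        if cid not in first or row["publication_date"] < first[cid][0]:
            first[cid] = (row["publication_date"], row["document"])
    return first


first = earliest_document(read_map(io.StringIO(SAMPLE)))
for cid, (pub_date, doc) in sorted(first.items()):
    print(cid, pub_date.isoformat(), doc)
```

Because `read_map` is a generator, the same function can stream a file with hundreds of millions of rows by passing an open file handle instead of the in-memory sample.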
Data feed client
http://vartree.blogspot.co.uk/2015/01/how-to-create-your-own-replica-of.html
Use cases with SureChEMBL
• Chemoinformatics
  • Chemistry landscape for a particular biological target/disease
  • Novel chemistry & scaffolds
  • MDS, MCS and R-group analysis of the chemistry claimed in a particular patent family
  • (Negative) novelty checking with UniChem
• Competitive intelligence
  • Reporting
  • Patent alerts
  • Per target/disease/company
Bioactivity data extraction?
Compounds
Target/Assay
Bioactivity
Markush structure extraction?
-alkyl
-aryl
-heteroaryl
-heterocyclyl
-cycloalkyl
….
Biological annotations
Bioannotations soon to be integrated into the SureChEMBL interface, using SciBite’s Termite text mining engine
US-9012636-B2
Future steps
• OpenPHACTS ENSO
• Biological tagging of targets, genes, indications and diseases
• Development of integrated use-cases
• Combine chemistry & biology from patents, literature, pathways, etc.
• OpenPHACTS API
• Accessible via KNIME nodes
• Further improvements/added value
• Data quality and accuracy
• Target and compound relevance score
Acknowledgements
ChEMBL team:
• John Overington
• Anne Hersey
• Anna Gaulton
• Mark Davies
• Nathan Dedman
• Michal Nowotka
Collaborators:
• James Siddle
• Richard Koks
• Lee Harland
• Kevin Clark
Support:
Webinar:
http://www.ebi.ac.uk/training/online/course/surechembl-accessing-chemical-patent-data-webinar
Technology partners
Back-up slides
ChEMBL-SureChEMBL compound overlap
• Connectivity match on single components, using UniChem
• Overlap: 21.4%
• ChEMBL: 1.5M compounds
• SureChEMBL: 16M compounds
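The slide does not say which set the 21.4% is measured against. A quick feasibility check narrows it down: the overlap can never be larger than the smaller set. The sketch below works only from the rounded figures on the slide (1.5M and 16M), so the resulting count is approximate; the function name `implied_overlap` and its result keys are made up for this example.

```python
def implied_overlap(percent: float, size_a: int, size_b: int) -> dict:
    """For each candidate denominator, return (implied overlap count, feasible?).

    An interpretation is feasible only if the implied overlap does not
    exceed the smaller of the two sets.
    """
    p = percent / 100.0
    smaller = min(size_a, size_b)
    candidates = {
        "of_a": p * size_a,
        "of_b": p * size_b,
        # Jaccard index J = x / (a + b - x), hence x = J * (a + b) / (1 + J)
        "jaccard": p * (size_a + size_b) / (1 + p),
    }
    return {name: (round(x), x <= smaller) for name, x in candidates.items()}


chembl, surechembl = 1_500_000, 16_000_000
result = implied_overlap(21.4, chembl, surechembl)
for name, (count, feasible) in result.items():
    print(f"{name}: {count:,} compounds, feasible={feasible}")
```

Only the "fraction of ChEMBL" reading (`of_a`) is feasible, which suggests that roughly 321K ChEMBL compounds also appear in SureChEMBL. The other two readings would require more shared compounds than ChEMBL contains.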
Too granular? Try scaffolds instead
Level 1 scaffold overlap
[Venn diagram: SureChEMBL and ChEMBL level 1 scaffolds; overlap 57%; set sizes 61K and 298K]
Can we have everything?
Cost
Time
Quality
Common sources of errors
• Small, poor quality images
• OCR errors in names (OCR done by IFI). There is an OCR correction step, but it cannot fix all errors, e.g.
‘2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-pyrazol-3-vDbenzamide’
• Reliability is better for US patents because they include mol files
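To make the OCR problem concrete, here is a toy rule-based correction applied to the garbled name above. It is not the correction step used in SureChEMBL: the substitution rules are hypothetical and hand-written for this single example, and the "intended" name they produce is a guess at what the patent text said. The bracket check shows one cheap way to flag names that are still broken before attempting name-to-structure conversion.

```python
import re

# Hypothetical rules for this one example; applied in order.
OCR_RULES = [
    (r"\u039b/", "N"),            # italic N misread as Greek lambda plus slash
    (r"(?<=\d) (?=[-H])", ""),    # stray space after a locant: "1 -" and "1 H"
    (r"-r\(", "-[("),             # opening square bracket misread as "r"
    (r"methvn", "methyl]"),       # "yl]" misread as "vn"
    (r"vD", "yl}"),               # "yl}" misread as "vD"
]

PAIRS = {")": "(", "]": "[", "}": "{"}


def correct_name(name: str) -> str:
    """Apply each substitution rule in turn."""
    for pattern, replacement in OCR_RULES:
        name = re.sub(pattern, replacement, name)
    return name


def brackets_balanced(name: str) -> bool:
    """Return True if (), [] and {} are properly nested and closed."""
    stack = []
    for ch in name:
        if ch in PAIRS.values():
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack


garbled = ("2,6-Difluoro-\u039b/-{1 -r(4-iodo-2-methylphenyl)"
           "methvn-1 H-pyrazol-3-vDbenzamide")
fixed = correct_name(garbled)
print(brackets_balanced(garbled), brackets_balanced(fixed))
print(fixed)
```

Rules like these are brittle: they repair this one string but would miss or mangle others, which is consistent with the point that a correction step cannot fix every error.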