Bioinformatics from a drug discovery perspective EMBRACE Workshop, 22-23 March 2007 Niclas Jareborg...

Bioinformatics from a drug discovery perspective

EMBRACE Workshop, 22-23 March 2007

Niclas Jareborg

AstraZeneca R&D Södertälje

AstraZeneca Drug Discovery

• Research Areas
– CV/GI (Cardiovascular/Gastrointestinal), RIRA (Respiratory/Inflammation), CNS/Pain, Cancer, Infection

• Discovery Sites
– UK: Charnwood (RIRA), Alderley Park (Cancer, CV/GI, RIRA)
– North America: Boston (Cancer, Infection), Wilmington (CNS/Pain), Montreal (CNS/Pain)
– Sweden: Lund (RIRA), Mölndal (CV/GI), Södertälje (CNS/Pain)
– India: Bangalore (Infection)

• Bioinformatics
– All RAs have their own bioinformatics teams
– Infrastructure at Alderley Park (databases, large Linux clusters)
  – IS organisation

A target is defined as…

• ... a biological target protein on which a chemical entity (e.g. a drug molecule) exerts its action

• A drug target must be associated with a disease

Drug discovery process

• Target identification
• Target validation
• Hit identification (HTS)
• Hit to lead (Lead identification)
• Lead optimisation
• Candidate drug
• Clinical trials

[Figure labels: Genes, Protein, Assay, Compound library, Hit, Effort]

Target Definition

• Alternative Splicing
– Identify pharmacologically relevant target variant(s)

• Sequence variation
– Function
  – Target
  – Metabolizing enzyme
– Binding of substance
– Identify most common variant
  – Might differ in different populations!

Target Definition

• Expression
– Is the target expressed in a relevant human tissue?

• Databases
– Microarrays
– Immunohistochemistry
– In situ hybridization
– Proteomics

• Literature

Target Definition

• Selectivity
– How similar are related proteins?
– Do similar proteins have functions that we do not want to affect?

• Animal models
– Orthologous genes
  – Same family size?
– Splice variants
  – Same as in human?
– Polymorphisms
  – Differences between inbred strains
– Tissue expression
  – Overlap with human?
– Available transgenes or knock-outs

Bioinformatics input to the drug discovery process

[Figure: pipeline spanning Research, Development and Commercialisation. Stage labels: Target Identification, Hit Identification, Lead Identification, Lead Optimisation, CD Pre-nomination, Development for Launch, Registration, Launch, Sales. Milestones: MS1, MS2, MS3, MS4, MS5.]

Genetics & Bioinformatics input:

• Support target identification
• Flag up population variants in target
• Primary screening: identify polymorphic and splice variants
• Selectivity screening: identify paralogues
• Support biomarker identification
• Support choice of model organism(s)

In-house generated gene centric information resource

• Gene symbol, synonyms
• DNA and protein sequence
• Similarity to other species
• Splice variants
• Genetic mutations
• Tissue expression
• Pathways
• Patents
• Literature
• Functional motifs

Target identification

Targets from different experimental approaches, as well as validation using different technologies

[Figure labels:]
• Genetics/genome information
• ESTs, sequencing campaigns
• Microarrays (Affymetrix, glass etc.)
• Proteomics
• Differential biology
• In silico
• Literature
• Target candidates
• Validation (in silico, lab bench)
• Validation as potential targets
• Specificity / selectivity

Target identification

~30000 human genes

1 potential target

What?

Where?

Novel?

Link to disease?

The human genome offers many potential drug targets

Samuel Svensson, PhD

AstraZeneca R&D Södertälje

Current Drug Targets - few target classes

Based on 483 drugs in Goodman and Gilman's "The Pharmacological Basis of Therapeutics"

• Receptors (GPCRs): 45%
• Enzymes: 28%
• Hormones & factors: 11%
• Unknown: 7%
• Ion channels: 5%
• Nuclear receptors: 2%
• DNA: 2%
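The pie-chart shares can be converted into approximate drug counts. The sketch below is only arithmetic on the figures quoted on the slide (483 drugs and the rounded percentages); the variable names are illustrative, and because the percentages are already rounded, the counts are estimates rather than the original tallies.

```python
# Shares per target class as quoted on the slide (percent of 483 drugs).
TOTAL_DRUGS = 483

shares_percent = {
    "Receptors (GPCRs)": 45,
    "Enzymes": 28,
    "Hormones & factors": 11,
    "Unknown": 7,
    "Ion channels": 5,
    "Nuclear receptors": 2,
    "DNA": 2,
}

# Sanity check: the quoted shares should cover the whole pie.
assert sum(shares_percent.values()) == 100

# Approximate number of drugs per class, rounded to whole drugs.
approx_counts = {
    target_class: round(TOTAL_DRUGS * pct / 100)
    for target_class, pct in shares_percent.items()
}

for target_class, count in approx_counts.items():
    print(f"{target_class}: ~{count} drugs")
```

On these numbers, GPCRs and enzymes together account for roughly three quarters of the drugs counted.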

~30,000 human genes

Only a subfraction of gene products play a direct role in disease pathophysiology

< 5,000 targets for small molecule drugs

Druggable genome ~2,000-3,000 genes: 500 GPCRs, 50 NHRs, >200 ion channels, >1,000 enzymes (e.g. 450 proteases, 500 kinases, >200 others)

Pathogen and commensal gut bacteria genes

~2,000-3,000 druggable targets

Number of druggable targets smaller than expected?

Updating the (shrinking?) “Targetome”

Down to 22K? (see PMID: 15174140)

Some of the 120 InterPro domains are unpromising – many potentials still functional orphans

– realistically nearer 2000 ?

OMIM still only at 1900 and only low numbers of “robust” genetic association results

Current trends

• “Blue sky genomics” -> literature

• Finding “unknown” targets -> prioritizing the lists

• Moving from single target focus
– Comparing and ranking of target candidates
  – Integration of relevant but disparate data sources

• Better understanding of the target “neighbourhood”
– Disease mechanism
– Biomarkers
– Toxicology
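Comparing and ranking target candidates across disparate data sources can be pictured as a weighted score. The sketch below is a minimal illustration, not a method described in the talk: the evidence fields, the weights, the gene names and all the numbers are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class TargetEvidence:
    """Per-candidate evidence, each value normalised to 0..1 (hypothetical fields)."""
    gene: str
    disease_link: float       # strength of the link to disease
    tissue_expression: float  # expression in a relevant human tissue
    druggability: float       # likelihood of finding a small-molecule ligand
    selectivity_risk: float   # similarity to proteins we do not want to affect


# Hypothetical weights; a negative weight penalises a risk.
WEIGHTS = {
    "disease_link": 0.4,
    "tissue_expression": 0.2,
    "druggability": 0.3,
    "selectivity_risk": -0.1,
}


def score(candidate: TargetEvidence) -> float:
    """Weighted sum over the evidence fields."""
    return sum(weight * getattr(candidate, name) for name, weight in WEIGHTS.items())


def rank(candidates: list[TargetEvidence]) -> list[TargetEvidence]:
    """Best-scoring candidate first."""
    return sorted(candidates, key=score, reverse=True)


# Invented example candidates.
candidates = [
    TargetEvidence("GENE_C", 0.8, 0.3, 0.2, 0.6),
    TargetEvidence("GENE_A", 0.9, 0.8, 0.7, 0.2),
    TargetEvidence("GENE_B", 0.5, 0.9, 0.9, 0.1),
]

ranked = rank(candidates)
for candidate in ranked:
    print(f"{candidate.gene}: {score(candidate):.2f}")
```

A single linear score is the simplest possible choice; in practice the weighting itself would be a matter of project judgement.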

Sources of Contextual Information

• Structured (20%): Mature Technology
– Internal Chemical Dbs
– Internal Biological Dbs
– External, Commercial Dbs
  – GVK Bio, Ingenuity IPA…
– External Public Dbs
  – EMBL, PDB, SNPdb, etc

• Unstructured (80%): Emerging Technology
– Internal Docs
  – Tox Reports, Clinical Trial Reports
– External Docs
  – Patents: USPTO, WIPO, EP, etc
  – Literature: Medline, Embase
  – Press Releases: competitor, supplier, collaborator, academic (etc)
  – Government Agencies
  – Conference Proceedings
  – News Feeds

The current approach to retrieving information from unstructured sources is manual extraction, i.e. finding documents and reading them!

Dissecting the Decision Making Process

• Finding
– Locating relevant documents and information
– Retrieving them in a useable format

• Extracting
– Reading information
– Locating the facts within documents

• Integrating
– Understanding what it means
– Putting the information into context
– Turning information into knowledge

• Creating
– Developing new hypotheses
– Input into decision making

Issues with the Manual Approach

• Finding
– Difficult to capture breadth
– Chance to miss things
– “White space” in failing to find things

• Extracting
– Limited time to read things
– Focus on reviews and summaries

• Integrating
– Based on individual scientists' own knowledge
– Narrow
– Biased

• Creating
– Hypotheses are “per project”
– Reactive not proactive

Emerging Systems: Text Mining

• Sources
– Literature
– Patents
– In-house reports

• Information
– Protein-protein interactions
– Tissue expression
– Pharmacological differences
  – Splice variants, polymorphisms
  – Species
– Toxicology
– etc

• Extraction of facts from unstructured data sources
– Natural Language Processing, ontologies
– Linguamatics I2E
– Knowledgebase generation
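The simplest form of fact extraction from free text is sentence-level co-occurrence of known entity names. The sketch below is a toy illustration using only the Python standard library; it is not how Linguamatics I2E or any other NLP product works. The lexicon and the sample sentences are invented for the example.

```python
import itertools
import re
from collections import Counter

# Hypothetical mini-lexicon of gene symbols to look for.
GENE_LEXICON = {"CASP3", "CASP8", "CASP9", "BCL2", "PARP"}


def cooccurrences(text: str, lexicon: set[str] = GENE_LEXICON) -> Counter:
    """Count how often two lexicon terms appear in the same sentence."""
    counts: Counter = Counter()
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence.upper()))
        found = sorted(tokens & lexicon)
        # Every unordered pair of terms found in this sentence counts once.
        for pair in itertools.combinations(found, 2):
            counts[pair] += 1
    return counts


# Invented sample text, for illustration only.
sample_text = (
    "CASP9 activates CASP3 in this toy sentence. "
    "BCL2 was measured alone. "
    "CASP3 and CASP9 were both detected together with PARP."
)

pair_counts = cooccurrences(sample_text)
for pair, n in pair_counts.most_common():
    print(pair, n)
```

Co-occurrence only says that two terms were mentioned together; recovering the type of relationship (activates, binds, inhibits) is what the NLP and ontology layers are for.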

Pilot Systems: Pathway Analysis: Ingenuity IPA (www.ingenuity.com)

Biomedical Entity-Relationship Data

[Figure: network built up in layers over the entities CASP9, PARP, CASP3, CASP8, BCL2, MTPN, TNF, Thalidomide, ADP-ribose, Neoplasia and Hyperplasia.]

• Layers
– Co-Published Information
– Gene:Gene Semantic Relationships
– Gene:Disease Semantic Relationships
– Gene:Metabolite Semantic Relationships
– Gene:Chemical/Drug Semantic Relationships

• Relationship labels shown
– Co-published, Activates, Binds, Inactivates, Increases expression, Associated with, Increases, Synthesizes, Inhibits
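Entity-relationship data of this kind is commonly stored as subject-relation-object facts, each with a pointer to its evidence. The sketch below is a minimal in-memory illustration, not the Ingenuity data model. The entity and relation names are taken from the slide labels, but which relation connects which pair is an assumption made for the example, and the document identifiers are placeholders.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str
    source: str  # pointer to the supporting document (placeholder IDs below)


class KnowledgeBase:
    """Tiny fact store indexed by entity, in either subject or object position."""

    def __init__(self) -> None:
        self._by_entity: dict[str, list[Fact]] = defaultdict(list)

    def add(self, fact: Fact) -> None:
        self._by_entity[fact.subject].append(fact)
        self._by_entity[fact.obj].append(fact)

    def facts_about(self, entity: str) -> list[Fact]:
        return list(self._by_entity.get(entity, []))

    def neighbours(self, entity: str) -> set[str]:
        """Entities directly linked to `entity` by any relation."""
        linked = set()
        for fact in self.facts_about(entity):
            linked.add(fact.obj if fact.subject == entity else fact.subject)
        return linked


kb = KnowledgeBase()
# Assumed pairings, for illustration only; sources are placeholder IDs.
for fact in [
    Fact("CASP9", "activates", "CASP3", "doc-001"),
    Fact("CASP3", "inactivates", "PARP", "doc-002"),
    Fact("PARP", "synthesizes", "ADP-ribose", "doc-003"),
    Fact("Thalidomide", "inhibits", "TNF", "doc-004"),
]:
    kb.add(fact)

print(sorted(kb.neighbours("CASP3")))
print([f.source for f in kb.facts_about("PARP")])
```

Keeping the source with every fact is what makes a literature evidence trail possible when a network is later used for hypothesis generation.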

BER System in Action

[Figure: experimental data (Genetic, Gene Expression, Proteomic, Metabonomic) yield a significant biological entity list (gene list, protein list, metabolite list). The list is passed to the ER system (gene/metabolite knowledgebase), which returns the biological environment of the list and supports hypothesis generation.]

Question: What is the underlying biology, pathology, physiology etc. associated with this list of entities? What is it telling me?

• Canonical pathways associated with the list
• Diseases, biological processes associated with the list
• Literature evidence trail

Structuring the Knowledge

Delivers facts as networks of information: Knowledge Bases

[Figure: GI Tox Knowledge Map]

• Node types
– Compound
– Genes
– Cellular Processes
– Pathology: GI toxicity, GI pathology
– Clinical Observations: diarrhoea, vomiting, loose stools, bloating, nausea, etc.
– Species: human, rat, dog, etc.

• Relationship labels
– Affects, Involved in, Is a, Linked with, Observed in

Data source integration

[Figure: layered architecture with the layers Extraction, Representation and Visualisation.]

• Data sources: Genes, Expression, Targets, Chem, Literature, Patent, CI
• Processing: automated ETL engines, ETL (business rules, scoring), focused NLP extraction, ontologies
• Stores: Disease/Target KB, DataMarts
• Access: CIRA TSR Interface, CVGI TSR Interface, DiseaseKB Interface, complex data query, direct project queries

Workflow technology

• Enables scientists to use, modify and implement solutions that specialist groups help them put in place; removes (in principle) the need to make extensive IS projects for new data types.

The Knowledge Technology Ziggurat

[Figure: stepped pyramid; layers in the order listed on the slide, each building on the previous one.]

• Content Licensing & Access
• Document Retrieval and Storage
• Fact Extraction (Text Mining)
• Information Structuring / Knowledge Structuring
• Modelling

[Side labels: Decision Making Process (Find, Extract, Integrate, Create); Unstructured Information; Developing semantic relationships; KNOWLEDGE BASES; Current focus; Systems biology]

“Bio” and “Chemo” Informatics Joins to Aid Target Selection

[Figure: data types linking Biology, Chemistry, Structures and Clinical Practice.]

• Sequences, gene names, disease and literature links
• Expression data, gene structure, SNPs & splicing
• Cross-species (orthology) comparisons
• Functional genomics: mouse, fish, yeast
• Families of known targets
• Sequence alignment, structure, homology modelling
• Linking non-homologs with analogous mechanisms and binding pockets
• AZ protein and ligand structures
• Docking & virtual screening
• Links to endogenous ligands & modulators
• Literature inhibitors and PDB ligands
• Patented inhibitors
• Competitor compounds
• HTS, focussed screens & project SAR data
• Fingerprint structure search
• Library and fragment data

What do we need to do?

[Figure: a clinical finding (Testicular Degeneration) is linked to Proteins by term association via text mining; a Candidate Compound is linked to Proteins by ligand-protein association via experimental and virtual methods. Both feed hypothesis generation using informatics/modelling.]

A multidimensional jigsaw puzzle

• Target - Biological mechanisms - Disease
• Target/Off-target - Biological mechanisms - Toxicology
• Polymorphisms
• Splice variants
• Interaction partners
• Tissues
• Compounds
• Animal models
• etc etc etc…

Current needs

• Pathways / Systems biology

• Mining of unstructured data

• Connect biology and chemistry informatics domains

• System / data integration
– Ontologies!

• Workflow technology

AZ - EBI

• AZ member of the Industry programme
– Training and Education

• Network meetings

• Research, Standards
