Predicting Drug-gene and Drug-disease Networks using Functional Flow Bioinformatics Capstone Project...

Post on 13-Dec-2015


Predicting Drug-gene and Drug-disease Networks using Functional Flow

Bioinformatics Capstone Project

School of Informatics, Indiana University

Bloomington, Indiana

2009

Ryan Tran Rene

Purpose: Given putative drug associations with genes, find other drugs that may be associated with those genes.

For each unique gene, Functional Flow will be used to determine which unannotated drugs are most likely to interact with that gene.

The method will be based on the similarity of the molecular fingerprints of drugs.

Outline: Methods, Algorithms, Results and Conclusions

Workflow:

Unique drugs (pcid) -> Daylight SMILES -> molecular fingerprints (gNova; MACCS) -> Tanimoto scores T(u,v) -> edges between nodes, E(u,v): 0 or 1

Unique genes (pcid) and known drug-gene interactions -> for each unique gene: Functional Flow from annotated drugs (R = inf) to unannotated drugs (R = 0)

Large functional flows to unannotated drugs may indicate new drug-gene interactions.

Matador (Gene Name + PubChem ID)

DrugBank (HGNC ID number + PubChem ID)

HGNC database (Gene Name to HGNC ID)

Goal: To create 2 databases mapping genes to drugs (PubChem ID) and diseases to drugs, and to map PubChem IDs to molecular fingerprints.

PDB (PDB ID number + chemical compound name)

UniProt (PDB ID to HGNC ID)

script (chemical name to PubChem ID)

HGNC database (HGNC ID to Gene Name)

PharmGKB (disease name to gene name) (disease name to drug PubChem ID)

Tools for parsing & scripting: perl, awk, sed, UNIX, Excel, MATLAB (Log-Log), eliminate duplicate pairs, …
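The merging step can be illustrated with a small Python sketch. It is not the project's own perl/awk pipeline: the records, gene names, and ID numbers below are hypothetical placeholders, and the helper names `merge_gene_drug_pairs` and `drugs_by_gene` are invented for illustration. It shows only the idea of translating HGNC IDs to gene names and eliminating duplicate pairs.

```python
# Toy records; every identifier below is a hypothetical placeholder.
matador = [("GENE_A", 1001), ("GENE_B", 1002), ("GENE_A", 1001)]  # gene name, PubChem ID
drugbank = [(11, 1003), (12, 1002)]                                # HGNC ID, PubChem ID
hgnc_id_to_name = {11: "GENE_A", 12: "GENE_B"}                     # HGNC ID -> gene name


def merge_gene_drug_pairs(matador, drugbank, hgnc_id_to_name):
    """Return sorted unique (gene name, PubChem ID) pairs from both sources."""
    pairs = set(matador)  # a set drops duplicate pairs
    for hgnc_id, pcid in drugbank:
        name = hgnc_id_to_name.get(hgnc_id)
        if name is not None:  # skip HGNC IDs that cannot be mapped
            pairs.add((name, pcid))
    return sorted(pairs)


def drugs_by_gene(pairs):
    """Group PubChem IDs by gene name."""
    grouped = {}
    for name, pcid in pairs:
        grouped.setdefault(name, []).append(pcid)
    return grouped


pairs = merge_gene_drug_pairs(matador, drugbank, hgnc_id_to_name)
print(drugs_by_gene(pairs))
```

Using a set for the pairs means a drug-gene pair reported by both sources is counted once, which matters later because each gene's annotated drug list seeds the flow.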

Daylight SMILES (from PubChem ID)

MACCS structural key molecular fingerprints (gNova; from SMILES)


Example: Sucrose (PubChem ID = 1115)

Daylight SMILES: OC1C(OC(CO)C(O)C1O)OC2(CO)OC(CO)C(O)C2O


Tanimoto coefficient (extended Jaccard coefficient)

$$T(u,v) = \frac{u \cdot v}{\|u\|^2 + \|v\|^2 - u \cdot v}, \qquad 0 \le T(u,v) \le 1$$

Molecular fingerprints (0's and 1's):

u = (1,0,1,1,0,1,0,0,1) -> $\|u\|^2 = u \cdot u = 5$

v = (0,1,1,1,1,0,1,0,1) -> $\|v\|^2 = v \cdot v = 6$

elementwise product (0,0,1,1,0,0,0,0,1) -> $u \cdot v = 3$

$T(u,v) = 3/(5+6-3) = 3/8$

Random fingerprints (N large):

u = (1, 0, 1, 0, …, 1, 0, 1, 0) -> $\|u\|^2 \to N/2$

v = (1, 0, 0, 1, …, 1, 0, 0, 1) -> $\|v\|^2 \to N/2$

elementwise product (1, 0, 0, 0, …, 1, 0, 0, 0) -> $u \cdot v \to N/4$

$T(u,v) \to (N/4)/(N/2 + N/2 - N/4) = 1/3$

Edges between nodes:

$$E(u,v) = \begin{cases} 1, & T(u,v) \ge \text{threshold} \\ 0, & T(u,v) < \text{threshold} \end{cases}$$
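The Tanimoto score and the thresholded edge can be checked with a few lines of Python. This is a minimal sketch for binary fingerprints stored as tuples. The function names `tanimoto` and `edge` are illustrative, and returning 0 for two all-zero fingerprints is an assumption, since the slides do not cover that case.

```python
def tanimoto(u, v):
    """Tanimoto coefficient T(u,v) = u.v / (|u|^2 + |v|^2 - u.v)."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    # Assumption: two all-zero fingerprints are given a score of 0.
    return dot / denom if denom else 0.0


def edge(u, v, threshold):
    """E(u,v) = 1 if T(u,v) >= threshold, else 0."""
    return 1 if tanimoto(u, v) >= threshold else 0


# Worked example from the slide.
u = (1, 0, 1, 1, 0, 1, 0, 0, 1)
v = (0, 1, 1, 1, 1, 0, 1, 0, 1)
print(tanimoto(u, v))   # 0.375, i.e. 3/8
print(edge(u, v, 0.8))  # 0: below an 80% threshold, so no edge
```

Because unrelated random fingerprints already score about 1/3, a usable threshold has to sit well above that value.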


Iterated Functional Flow

[Figure: network of drug nodes D1 to D9 joined by drug-drug edges. Annotated nodes have R_0 = ∞ and unannotated nodes have R_0 = 0. The legend distinguishes 1st-iteration flow from 2nd-iteration flow; one flow is labeled g5,6.]

[Figure: Flow from Drug D5 (u) along the edges E(D1,D5), E(D2,D5), E(D3,D5), E(D5,D6), E(D5,D7), E(D5,D8).]

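Note: Nabieva et al. (2005) accidentally omitted $R^a_{t-1}(u)$ from their published equation for $E'(u,v)$.

Edge capacity scaled by the reservoir of the source node:

$$E'(u,v) = \frac{E(u,v)\, R^a_{t-1}(u)}{\sum_y E(u,y)}, \qquad \sum_y E'(u,y) = R^a_{t-1}(u)$$

Flow from u to v at iteration t:

$$g^a_t(u,v) = \begin{cases} 0, & R^a_{t-1}(v) > R^a_{t-1}(u) \\ \min[E(u,v),\, E'(u,v)], & R^a_{t-1}(u) > R^a_{t-1}(v) \end{cases}$$

2nd iteration: u = D5, v = D6, $R^a_1(u) = 3$, $E'(u,v) = 1 \cdot 3/6 = 1/2$, $g^a_2(u,v) = 1/2$

Initial reservoirs:

$$R^a_0(u) = \begin{cases} \infty, & \text{node (drug) annotated for gene } a \\ 0, & \text{else} \end{cases}$$

Reservoirs increase by net flow into nodes:

$$R^a_t(u) = R^a_{t-1}(u) + \sum_y g^a_t(y,u) - \sum_y g^a_t(u,y)$$

Functional score = sum of all flows into a node during all iterations:

$$f_a(u) = \sum_t \sum_y g^a_t(y,u)$$

Functional Flow Input and Output

Input:

$$R^a_0 = (\infty, 0, \ldots, 0, \infty, \infty, 0, \ldots, 0)$$

$$E = \begin{pmatrix} 0 & E_{1,2} & E_{1,3} & \cdots & E_{1,N} \\ E_{2,1} & 0 & E_{2,3} & \cdots & E_{2,N} \\ E_{3,1} & E_{3,2} & 0 & \cdots & E_{3,N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ E_{N,1} & E_{N,2} & \cdots & E_{N,N-1} & 0 \end{pmatrix}$$

Output: the functional scores $f_a(u)$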

Functional Flow Algorithm

    for t = 2 : d + 1                      % row t holds iteration t-1
        f(t, :) = f(t - 1, :);
        for u = 1 : N-1
            for v = u+1 : N
                % no flow if E(u, v) = 0
                if E(u, v) ~= 0
                    if R(u) > R(v)
                        % compute flow from u to v
                        ...
                        g = min(E(u, v), R(u) * W(u, v));
                        S(v) = S(v) + g;
                        S(u) = S(u) - g;
                        f(t, v) = f(t, v) + g;
                    elseif R(v) > R(u)
                        % compute flow from v to u
                        ...
                        g = min(E(u, v), R(v) * W(v, u));
                        S(u) = S(u) + g;
                        S(v) = S(v) - g;
                        f(t, u) = f(t, u) + g;
                    end
                end
            end
        end
        R(:) = S(:);
        ...
    end
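The iteration can also be written as a short, self-contained Python sketch. It is not the project's MATLAB code: the function name `functional_flow` and its arguments are illustrative, and it assumes a symmetric 0/1 edge matrix with a zero diagonal, with the weight E(u,v)/Σ_y E(u,y) standing in for the elided `W(u, v)`. Flows in one iteration are computed from the reservoirs of the previous iteration, as in the update rule for R.

```python
import math


def functional_flow(E, annotated, iterations):
    """Return the functional score f(u) for every node.

    E          : N x N list of lists with 0/1 edges (assumed symmetric, zero diagonal)
    annotated  : set of node indices whose initial reservoir is infinite
    iterations : number of flow iterations d
    """
    n = len(E)
    degree = [sum(row) for row in E]  # sum over y of E(u, y)
    R = [math.inf if u in annotated else 0.0 for u in range(n)]
    score = [0.0] * n
    for _ in range(iterations):
        S = list(R)  # new reservoirs, built from the old ones
        for u in range(n):
            for v in range(n):
                # Flow only runs along an edge, downhill from u to v.
                if E[u][v] == 0 or R[u] <= R[v]:
                    continue
                g = min(E[u][v], E[u][v] * R[u] / degree[u])
                S[u] -= g
                S[v] += g
                score[v] += g
        R = S
    return score


# Path graph 0 - 1 - 2 with node 0 annotated, two iterations.
E = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
print(functional_flow(E, {0}, 2))  # [0.0, 2.0, 0.5]
```

In the example, node 1 receives a full unit from the infinite reservoir in each iteration, while node 2 only receives flow in the second iteration, limited to half of node 1's reservoir because node 1 has two edges.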

Functional Flow - Application and Tests

[Flow diagram. Input: genes and drugs, reduced to unique genes and unique drugs. Annotated drugs have R = infinity and unannotated drugs have R = 0; test drugs are shown with R = infinity and with R = 0. Branches: Drug Search (Application), which ranks drugs by sorted scores; Leave-one-out cross-validation, compared against random numbers, which yields precision and recall, a precision-recall plot, and an average over unique genes. The process is repeated for each gene associated with a minimal number of drugs.]

* Not necessary to sort scores for LOOCV
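One way to picture the leave-one-out test is the Python sketch below. It rests on assumptions the slides do not spell out: the held-out drug is treated as unannotated (R = 0), it is ranked against all other unannotated drugs, and ties are broken by node index. The names `flow_scores` and `loocv_ranks` are illustrative, and the flow routine is a compact version of the update rule described earlier.

```python
import math


def flow_scores(E, annotated, iterations):
    """Functional Flow scores on a 0/1 edge matrix (symmetric, zero diagonal)."""
    n = len(E)
    degree = [sum(row) for row in E]
    R = [math.inf if u in annotated else 0.0 for u in range(n)]
    score = [0.0] * n
    for _ in range(iterations):
        S = list(R)
        for u in range(n):
            for v in range(n):
                if E[u][v] and R[u] > R[v]:
                    g = min(E[u][v], E[u][v] * R[u] / degree[u])
                    S[u] -= g
                    S[v] += g
                    score[v] += g
        R = S
    return score


def loocv_ranks(E, annotated, iterations):
    """Hide each annotated drug in turn and return its rank among the candidates."""
    ranks = {}
    for held_out in sorted(annotated):
        training = set(annotated) - {held_out}
        scores = flow_scores(E, training, iterations)
        candidates = [u for u in range(len(E)) if u not in training]
        # Rank 1 is the highest score; ties are broken by node index (an assumption).
        ordered = sorted(candidates, key=lambda u: (-scores[u], u))
        ranks[held_out] = ordered.index(held_out) + 1
    return ranks


# Path graph 0 - 1 - 2 - 3 - 4 with drugs 0, 1 and 4 annotated for the gene.
E = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 0, 0, 1, 0]]
print(loocv_ranks(E, {0, 1, 4}, 2))
```

The rank itself can be found by counting how many candidates score higher than the held-out drug, which is consistent with the note that a full sort is not needed for LOOCV.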

Leave-one-out cross-validation (LOOCV)

Information Retrieval:

Precision = items found / items retrieved

Recall = items found / items sought

Classification:

Precision = True Pos / (True Pos + False Pos)

Recall = True Pos / (True Pos + False Neg) = True Pos / # Positives

F1 measure = 2 • prec. • recall / (prec. + recall)

Omit then rank Functional Flow for: Drug 1, Drug 2, Drug 3

    k   Prec.   Recall   F1
    1   1/3     1/3      0.33
    2   1/6     1/3      0.22
    3   2/9     2/3      0.33
    4   3/12    3/3      0.40
    5   3/15    3/3      0.33
    6   3/18    3/3      0.29
    7   3/21    3/3      0.25

Classification of ranks 1 to 7 (columns, higher rank first) for each omitted drug (rows), at cutoffs k = 1 to 4:

    k = 1
    Drug 1:  FP TN FN TN TN TN TN
    Drug 2:  TP TN TN TN TN TN TN
    Drug 3:  FP TN TN FN TN TN TN

    k = 2
    Drug 1:  FP FP FN TN TN TN TN
    Drug 2:  TP FP TN TN TN TN TN
    Drug 3:  FP FP TN FN TN TN TN

    k = 3
    Drug 1:  FP FP TP TN TN TN TN
    Drug 2:  TP FP FP TN TN TN TN
    Drug 3:  FP FP FP FN TN TN TN

    k = 4
    Drug 1:  FP FP TP FP TN TN TN
    Drug 2:  TP FP FP FP TN TN TN
    Drug 3:  FP FP FP TP TN TN TN
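The precision, recall, and F1 values in this example can be reproduced from the ranks of the three omitted drugs. In the sketch below the ranks 3, 1 and 4 are read off the slide's example (Drug 1 is recovered at rank 3, Drug 2 at rank 1, Drug 3 at rank 4). The function name `loocv_metrics` is illustrative, and pooling the three runs (k items retrieved per run) is inferred from the denominators 3, 6, 9, … in the table.

```python
from fractions import Fraction


def loocv_metrics(ranks, k):
    """Pooled precision, recall and F1 when the top k ranks of each run are retrieved."""
    hits = sum(1 for r in ranks if r <= k)      # true positives over all runs
    precision = Fraction(hits, k * len(ranks))  # TP / items retrieved
    recall = Fraction(hits, len(ranks))         # TP / number of positives
    if hits == 0:
        return precision, recall, 0.0
    f1 = float(2 * precision * recall / (precision + recall))
    return precision, recall, f1


ranks = [3, 1, 4]  # rank at which each omitted drug was recovered
for k in range(1, 8):
    p, r, f1 = loocv_metrics(ranks, k)
    print(k, p, r, round(f1, 2))
```

`Fraction` prints values in lowest terms, so 3/12 appears as 1/4. Precision and recall coincide at k = 1 because each run then retrieves exactly one item for exactly one positive.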

LOOCV results (Classifications)

Information Retrieval and Classifications:

Precision = items found / items retrieved = TP/(TP+FP)

Recall = items found / items sought = TP/(TP+FN) = TP / (# positives)

LOOCV Results

Parameters:

Minimum number of annotated drugs

Number of functional flow iterations

Tanimoto threshold for non-zero edge

Precision-Recall Plots:

Leave-One-Out cross-validation for rankings k of 1 through 50; averages for genes to which LOOCV was applied

Random Rankings

Comparison of 4 vs. 10 iterations for a minimum of 25 annotated drugs/unique gene and a Tanimoto threshold of 80%

10 iterations is too many (low precision). Note: prec.(1) = recall(1)

[Figure: two plots of recall (y-axis) against precision (x-axis), each with "test" and "random" curves. Panel titles: "threshold 80, annotated 25, intervals 10" and "threshold 80, annotated 25, intervals 4".]

Comparison of 4 vs. 8 iterations for a minimum of 50 annotated drugs/unique gene and a Tanimoto threshold of 80%

8 iterations is too many (low precision). Note again: For the top-ranked LOOCV functional flow scores precision equals recall (k = 1).

[Figure: two plots of recall (y-axis) against precision (x-axis), each with "test" and "random" curves. Panel titles: "threshold 80, annotated 50, iterations 8" and "threshold 80, annotated 50, intervals 4".]

Comparison of 25 vs. 50 minimum numbers of annotated drugs/unique gene (for 4 iterations and a Tanimoto threshold of 80%)

Requiring at least 50 annotated drugs increased precision and recall significantly.

Effects of averaging: KMAX = min(50, #annotated drugs - 1)

[Figure: two plots of recall (y-axis) against precision (x-axis), each with "test" and "random" curves. Panel titles: "threshold 80, annotated 25, intervals 4" and "threshold 80, annotated 50, intervals 4".]

Comparison of 60 vs. 80% Tanimoto thresholds (for 4 iterations and a minimum number of 50 annotated drugs/unique gene)

Increasing the Tanimoto score threshold from 60% to 80% doubled the precision.

[Figure: two plots of recall (y-axis) against precision (x-axis), each with "test" and "random" curves. Panel titles: "threshold 80, annotated 50, intervals 4" and "threshold 60, annotated 50, intervals 4".]

Comparison of 25 vs. 50 minimum numbers of annotated drugs/unique gene (for 4 iterations and a Tanimoto threshold of 60%)

Effects of averaging: KMAX = min(50, #annotated drugs - 1)

For a Tanimoto score threshold of 60%, the precision is low. The results are quite variable for k > 28 with fewer annotated drugs.

[Figure: two plots of recall (y-axis) against precision (x-axis), each with "test" and "random" curves. Panel titles: "threshold 60, annotated 50, intervals 4" and "threshold 60, annotated 25, intervals 4".]

Comparison of 10 vs. 25 minimum numbers of annotated drugs/unique gene (for 4 iterations and a Tanimoto threshold of 80%)

Requiring at least 25 annotated drugs increased precision significantly, but predictions using fewer annotated drugs may nevertheless be useful.

[Figure: two plots of recall (y-axis) against precision (x-axis), each with "test" and "random" curves. Panel titles: "threshold 80, annotated 25, intervals 4" and "threshold 80, annotated 10, intervals 4".]

Comparison of 70 vs. 80% Tanimoto thresholds (for 4 iterations and a minimum number of 25 annotated drugs/unique gene)

Increasing the Tanimoto score threshold from 70% to 80% decreased the precision for the top-ranked scores (k=1).

Effects of averaging: KMAX = min(50, #annotated drugs - 1)

[Figure: two plots of recall (y-axis) against precision (x-axis), each with "test" and "random" curves. Panel titles: "threshold 80, annotated 25, intervals 4" and "threshold 70, annotated 25, intervals 4".]

Using Clustered Drugs: Comparison of 60 vs. 70% Tanimoto thresholds (for 4 iterations and a minimum number of 25 annotated drugs/unique gene; graphconncomp)

[Figure: two plots of recall (y-axis) against precision (x-axis), each with "test" and "random" curves. Panel titles: "cluster threshold 60, annotated 25, iterations 4" and "cluster threshold 70, annotated 25, iterations 4".]

Average Precision of > 6% achieved for top-ranked drugs (k=1) using clustered drugs only
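The slide names MATLAB's graphconncomp, which labels the connected components of a graph. A comparable step can be sketched in Python with a breadth-first search. The slides do not define "clustered drugs", so treating a drug as clustered when its component contains at least two drugs is purely an assumption here, as are the names `connected_components` and `clustered_nodes`.

```python
from collections import deque


def connected_components(E):
    """Return (number of components, component label per node) for a 0/1 edge matrix."""
    n = len(E)
    label = [-1] * n
    count = 0
    for start in range(n):
        if label[start] != -1:
            continue  # already reached from an earlier start node
        label[start] = count
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if E[u][v] and label[v] == -1:
                    label[v] = count
                    queue.append(v)
        count += 1
    return count, label


def clustered_nodes(E, min_size=2):
    """Nodes whose component has at least min_size members (assumed meaning of 'clustered')."""
    count, label = connected_components(E)
    sizes = [label.count(c) for c in range(count)]
    return [u for u in range(len(E)) if sizes[label[u]] >= min_size]


# Edges 0-1, 2-3, 3-4; node 5 has no edge above the threshold.
E = [[0, 1, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0],
     [0, 0, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 0],
     [0, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 0, 0]]
print(connected_components(E))  # (3, [0, 0, 1, 1, 1, 2])
print(clustered_nodes(E))       # [0, 1, 2, 3, 4]
```

Dropping isolated drugs removes nodes that can never receive flow at the chosen threshold, which is one plausible reason for restricting the evaluation to clustered drugs.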

Using Clustered Drugs: 70% Tanimoto threshold (for 6 iterations and a minimum number of 20 annotated drugs/unique gene)

Effects of averaging: KMAX = min(50, #annotated drugs - 1)

Average Precision of > 6% achieved for top-ranked drugs (k=1) using clustered drugs only

[Figure: plot of recall (y-axis) against precision (x-axis) with "test" and "random" curves. Panel title: "cluster threshold 70, annotated 20, iterations 6".]

Disease to Drugs (80% Tanimoto threshold, 4 iterations and a minimum number of 50 annotated drugs/unique disease)

[Figure: plot of recall (y-axis) against precision (x-axis) with "test" and "random" curves. Panel title: "Disease to Drugs 80% threshold 50 annotations 4 intervals".]

Average precision for top ranks (k=1) is only 2%, but LOOCV precision is double that of the random model for k < 10.

Conclusions

With Tanimoto thresholds of 70-80% and relatively few iterations (~4), Functional Flow may be useful for predicting new drugs that will interact with genes and diseases.

Decisions on parameters will depend on the economics of trading less precision for greater recall (increasing k) and the performance of Leave-One-Out Cross-Validation (LOOCV) for the genes and diseases that are of most interest.

If you look at more rankings you find more drugs, but you have to test more drugs.

References

Nabieva, et al., 2005, Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps: Bioinformatics, 21, Suppl. 1, i302–i310.

MacCuish, J. D., and MacCuish, N. E., 2003, Mesa Suite Version 1.2: Fingerprint Module: Mesa Analytics & Computing, LLC.

Brown, R. D., and Martin, Y. C., 1996, Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection: J. Chem. Inf. Comput. Sci., 36, 572-584.

Gunther, et al., 2007, SuperTarget and Matador: resources for exploring drug-target relationships: Nucleic Acids Research, 1-4.

Acknowledgments

Special thanks to Drs. Predrag Radivojac, David Wild, Sun Kim, Mehemet Dalkilic, Rajarshi Guha, Haixu Tang and the faculty of Bioinformatics and Cheminformatics. Also thanks to Jefferson Davis (Math/Stat), Bob Konicek, and of course Linda Hostetter.

Thank you all and enjoy the rest of the summer!