Predicting Drug-gene and Drug-disease Networks using Functional Flow
Bioinformatics Capstone Project
School of Informatics, Indiana University
Bloomington, Indiana
2009
Ryan Tran Rene
Purpose: Given putative drug associations with genes, find other drugs that may be associated with those genes.
For each unique gene, Functional Flow will be used to determine which unannotated drugs are most likely to interact with that gene.
The method will be based on the similarity of the molecular fingerprints of drugs.
Methods, Algorithms, Results and Conclusions
Unique drugs (pcid)
Daylight SMILES
Molecular fingerprints (gNova; MACCS)
Tanimoto scores T(u,v)
Known drug-gene interactions
Edges between nodes, E(u,v): 0 or 1
For each unique gene: Functional Flow from annotated drugs (R = inf) to unannotated drugs (R = 0)
Large functional flows to unannotated drugs may indicate new drug-gene interactions
Unique genes (pcid)
Matador (Gene Name + PubChem ID)
DrugBank (HGNC ID number + PubChem ID)
HGNC database (Gene Name to HGNC ID)
Goal: To create two databases, one mapping genes to drugs (PubChem ID) and one mapping diseases to drugs, and to map each PubChem ID to a molecular fingerprint.
PDB (PDB ID number + chemical compound name)
UniProt (PDB ID to HGNC ID)
Script (chemical name to PubChem ID)
HGNC database (HGNC ID to Gene Name)
PharmGKB (disease name to gene name) (disease name to drug PubChem ID)
Tools for parsing & scripting: perl, awk, sed, UNIX, Excel, MATLAB (Log-Log), eliminate duplicate pairs, …
Daylight SMILES (from PubChem ID)
MACCS structural key molecular fingerprints (gNova; from SMILES)
Example: Sucrose, PubChem ID = 1115
Daylight SMILES: OC1C(OC(CO)C(O)C1O)OC2(CO)OC(CO)C(O)C2O
Tanimoto coefficient (extended Jaccard coefficient)
$T(u,v) = \dfrac{u \cdot v}{\|u\|^2 + \|v\|^2 - u \cdot v}$, with $0 \le T(u,v) \le 1$

Molecular fingerprints (0's and 1's):
$u = (1,0,1,1,0,1,0,0,1)$, so $\|u\|^2 = u \cdot u = 5$
$v = (0,1,1,1,1,0,1,0,1)$, so $\|v\|^2 = v \cdot v = 6$
elementwise product $(0,0,1,1,0,0,0,0,1)$, so $u \cdot v = 3$
$T(u,v) = 3/(5+6-3) = 3/8$

Random fingerprints (N large):
$u = (1,0,1,0,\dots,1,0,1,0)$, so $\|u\|^2 \to N/2$
$v = (1,0,0,1,\dots,1,0,0,1)$, so $\|v\|^2 \to N/2$
elementwise product $(1,0,0,0,\dots,1,0,0,0)$, so $u \cdot v \to N/4$
$T(u,v) \to (N/4)/(N/2 + N/2 - N/4) = 1/3$

Edges between nodes:
$E(u,v) = \begin{cases} 1, & T(u,v) \ge \text{threshold} \\ 0, & T(u,v) < \text{threshold} \end{cases}$
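The Tanimoto score and the thresholded edge can be checked with a few lines of Python. This is only a minimal sketch: the function names `tanimoto` and `edge` are illustrative, and returning 0 for two all-zero fingerprints is an assumption the slides do not address.

```python
def tanimoto(u, v):
    """Tanimoto coefficient of two equal-length 0/1 fingerprints."""
    if len(u) != len(v):
        raise ValueError("fingerprints must have the same length")
    dot = sum(a * b for a, b in zip(u, v))   # u . v
    norm_u = sum(a * a for a in u)           # ||u||^2
    norm_v = sum(b * b for b in v)           # ||v||^2
    denom = norm_u + norm_v - dot
    # Assumption: two all-zero fingerprints score 0 rather than raising.
    return dot / denom if denom else 0.0


def edge(u, v, threshold):
    """E(u,v): 1 if the Tanimoto score reaches the threshold, else 0."""
    return 1 if tanimoto(u, v) >= threshold else 0


# The two example fingerprints from the slide.
u = (1, 0, 1, 1, 0, 1, 0, 0, 1)
v = (0, 1, 1, 1, 1, 0, 1, 0, 1)
print(tanimoto(u, v))   # 0.375, i.e. 3/8
print(edge(u, v, 0.8))  # 0: no edge at an 80% threshold
```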
Iterated Functional Flow
[Figure: drug-drug network with nodes D1 to D9, labeled g5,6. Annotated drugs start with R0 = ∞ and unannotated drugs with R0 = 0; arrows mark 1st-iteration and 2nd-iteration flows.]
[Figure: flow from drug D5 (u) along the edges E(D1,D5), E(D2,D5), E(D3,D5), E(D5,D6), E(D5,D7), E(D5,D8).]
Note: Nabieva et al. (2005) accidentally omitted $R_{t-1}(u)$ from their published equation for $E'(u,v)$.
$E'(u,v) = E(u,v)\, R_{t-1}(u) / \sum_y E(u,y)$, so that $\sum_y E'(u,y) = R_{t-1}(u)$
$g_t^a(u,v) = \begin{cases} 0, & R_{t-1}(v) > R_{t-1}(u) \\ \min[E(u,v),\, E'(u,v)], & R_{t-1}(u) > R_{t-1}(v) \end{cases}$
2nd iteration: $u = D5$, $v = D6$, $R_1(u) = 3$, $E'(u,v) = 1 \cdot 3/6$, so $g(u,v) = \min[1, 1/2] = 1/2$
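Initial reservoirs:
$R_0^a(u) = \begin{cases} \infty, & \text{node (drug) } u \text{ annotated for gene } a \\ 0, & \text{else} \end{cases}$
Reservoirs increase by net flow into nodes:
$R_t^a(u) = R_{t-1}^a(u) + \sum_y g_t^a(y,u) - \sum_y g_t^a(u,y)$

Functional Flow Input and Output
Input:
$R_0^a = (\infty, 0, \dots, 0, \infty, \infty, 0, \dots, 0)$
$E = \begin{pmatrix} 0 & E_{1,2} & E_{1,3} & \cdots & E_{1,N} \\ E_{2,1} & 0 & E_{2,3} & \cdots & E_{2,N} \\ E_{3,1} & E_{3,2} & 0 & \cdots & E_{3,N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ E_{N,1} & E_{N,2} & \cdots & E_{N,N-1} & 0 \end{pmatrix}$
Output: functional score = sum of all flows into a node during all iterations:
$f_a(u) = \sum_t \sum_y g_t^a(y,u)$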
Functional Flow Algorithm
    for t = 2 : d + 1                      % t-1
        f(t, :) = f(t - 1, :);
        for u = 1 : N-1
            for v = u+1 : N
                % no flow if E(u, v) = 0
                if E(u, v) ~= 0
                    if R(u) > R(v)
                        % compute flow from u to v : ...
                        g = min(E(u, v), R(u) * W(u, v));
                        S(v) = S(v) + g;
                        S(u) = S(u) - g;
                        f(t, v) = f(t, v) + g;
                    elseif R(v) > R(u)
                        % compute flow from v to u : ...
                        g = min(E(u, v), R(v) * W(v, u));
                        S(u) = S(u) + g;
                        S(v) = S(v) - g;
                        f(t, u) = f(t, u) + g;
                    end
                end
            end
        end
        R(:) = S(:);
        % ...
    end
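The iteration can also be sketched in Python as a cross-check on the worked 2nd-iteration value of 1/2. This is a hedged reading of the slides, not the project's code: it assumes the flow capacity is E'(u,v) = E(u,v) R(u) / Σ E(u,y), and the star-shaped test graph (D5 linked to annotated D1 to D3 and unannotated D6 to D8) is a hypothetical simplification of the figure.

```python
import math


def functional_flow(E, annotated, iterations):
    """Return the functional score of each node for one gene.

    E          -- symmetric N x N list of edge weights (0 = no edge)
    annotated  -- set of node indices with an infinite starting reservoir
    iterations -- number of flow iterations
    """
    n = len(E)
    degree = [sum(row) for row in E]                    # sum_y E(u, y)
    R = [math.inf if u in annotated else 0.0 for u in range(n)]
    score = [0.0] * n                                   # total inflow so far
    for _ in range(iterations):
        S = list(R)                                     # next reservoirs
        for u in range(n - 1):
            for v in range(u + 1, n):
                if E[u][v] == 0 or R[u] == R[v]:
                    continue                            # no edge or no downhill
                src, dst = (u, v) if R[u] > R[v] else (v, u)
                # Assumed capacity: min(E, E') with E' = E * R / degree.
                g = min(E[src][dst], E[src][dst] * R[src] / degree[src])
                S[dst] += g
                S[src] -= g
                score[dst] += g
        R = S                                           # update all at once
    return score


# Hypothetical star graph around D5 (index 3).
names = ["D1", "D2", "D3", "D5", "D6", "D7", "D8"]
E = [[0] * 7 for _ in range(7)]
for other in (0, 1, 2, 4, 5, 6):
    E[3][other] = E[other][3] = 1
scores = functional_flow(E, annotated={0, 1, 2}, iterations=2)
print(dict(zip(names, scores)))
```

After two iterations D6, D7 and D8 each hold a score of 0.5, matching the slide's 2nd-iteration flow from D5 with R1 = 3 spread over six edges.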
Functional Flow - Application and Tests
Input: unique genes, unique drugs, and known gene-drug pairs
Drug Search (Application): annotated drugs R = infinity, unannotated drugs R = 0; Functional Flow; sorted scores; ranking
Leave-one-out cross-validation: test drugs R = infinity set to R = 0; Functional Flow; sorted* scores; precision & recall; precision-recall plot; average over unique genes
Random numbers: ranking used for comparison
Repeat the process for each gene associated with a minimal number of drugs
* Not necessary to sort scores for LOOCV
Information Retrieval:
Precision = items found / items retrieved
Recall = items found / items sought
Classification:
Precision = True Pos / (True Pos + False Pos)
Recall = True Pos / (True Pos + False Neg) = True Pos / # Positives
Leave-one-out cross-validation (LOOCV)
Omit, then rank by Functional Flow score, each of: Drug 1, Drug 2, Drug 3 (higher rank toward k = 1)

k    Prec.    Recall    F1
1    1/3      1/3       0.33
2    1/6      1/3       0.22
3    2/9      2/3       0.33
4    3/12     3/3       0.40
5    3/15     3/3       0.33
6    3/18     3/3       0.29
7    3/21     3/3       0.25

F1 measure = 2 • prec. • recall / (prec. + recall)
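The table can be reproduced from the ranks of the three omitted drugs. The sketch below assumes those ranks are 3, 1 and 4 (the values implied by the precision and recall columns) and that each omitted drug contributes k retrieved items at cutoff k; the function name `loocv_metrics` is illustrative.

```python
from fractions import Fraction


def loocv_metrics(ranks, k):
    """Precision, recall and F1 at cutoff k.

    ranks -- rank obtained by each omitted drug in its own LOOCV run
    """
    found = sum(1 for r in ranks if r <= k)   # true positives
    retrieved = k * len(ranks)                # k items retrieved per run
    sought = len(ranks)                       # one positive per run
    precision = Fraction(found, retrieved)
    recall = Fraction(found, sought)
    if found == 0:
        return precision, recall, Fraction(0)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


ranks = [3, 1, 4]  # assumed ranks of Drug 1, Drug 2, Drug 3
for k in range(1, 8):
    p, r, f1 = loocv_metrics(ranks, k)
    print(k, p, r, round(float(f1), 2))
```

Fractions are printed in lowest terms, so 3/12 appears as 1/4.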
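LOOCV results (Classifications)
Precision = TP/(TP+FP)
Recall = TP/(TP+FN) = TP / (# positives)
Each row is one omitted drug; columns are ranks 1 to 7; ranks up to the cutoff k are retrieved.

k = 1
rank      1   2   3   4   5   6   7
Drug 1    FP  TN  FN  TN  TN  TN  TN
Drug 2    TP  TN  TN  TN  TN  TN  TN
Drug 3    FP  TN  TN  FN  TN  TN  TN

k = 2
rank      1   2   3   4   5   6   7
Drug 1    FP  FP  FN  TN  TN  TN  TN
Drug 2    TP  FP  TN  TN  TN  TN  TN
Drug 3    FP  FP  TN  FN  TN  TN  TN

k = 3
rank      1   2   3   4   5   6   7
Drug 1    FP  FP  TP  TN  TN  TN  TN
Drug 2    TP  FP  FP  TN  TN  TN  TN
Drug 3    FP  FP  FP  FN  TN  TN  TN

k = 4
rank      1   2   3   4   5   6   7
Drug 1    FP  FP  TP  FP  TN  TN  TN
Drug 2    TP  FP  FP  FP  TN  TN  TN
Drug 3    FP  FP  FP  TP  TN  TN  TN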
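Information Retrieval Classifications
Precision = items found / items retrieved = TP/(TP+FP)
Recall = items found / items sought = TP/(TP+FN)
True negatives are not counted and are shown as "-".

k = 1
rank      1   2   3   4   5   6   7
Drug 1    FP  -   FN  -   -   -   -
Drug 2    TP  -   -   -   -   -   -
Drug 3    FP  -   -   FN  -   -   -

k = 2
rank      1   2   3   4   5   6   7
Drug 1    FP  FP  FN  -   -   -   -
Drug 2    TP  FP  -   -   -   -   -
Drug 3    FP  FP  -   FN  -   -   -

k = 3
rank      1   2   3   4   5   6   7
Drug 1    FP  FP  TP  -   -   -   -
Drug 2    TP  FP  FP  -   -   -   -
Drug 3    FP  FP  FP  FN  -   -   -

k = 4
rank      1   2   3   4   5   6   7
Drug 1    FP  FP  TP  FP  -   -   -
Drug 2    TP  FP  FP  FP  -   -   -
Drug 3    FP  FP  FP  TP  -   -   -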
LOOCV Results
Parameters:
Minimum number of annotated drugs
Number of functional flow iterations
Tanimoto threshold for non-zero edge
Precision-Recall Plots:
Leave-One-Out cross-validation for rankings k of 1 through 50; averages for genes to which LOOCV was applied
Random Rankings
Comparison of 4 vs. 10 iterations for a minimum of 25 annotated drugs/unique gene and a Tanimoto threshold of 80%
10 iterations are too many (low precision). Note: prec.(1) = recall(1)
[Figure: two precision-recall plots, each with "test" and "random" curves. Panel titles: "threshold 80, annotated 25, intervals 10" and "threshold 80, annotated 25, intervals 4".]
Comparison of 4 vs. 8 iterations for a minimum of 50 annotated drugs/unique gene and a Tanimoto threshold of 80%
8 iterations are too many (low precision). Note again: for the top-ranked LOOCV functional flow scores, precision equals recall (k = 1).
[Figure: two precision-recall plots, each with "test" and "random" curves. Panel titles: "threshold 80, annotated 50, iterations 8" and "threshold 80, annotated 50, intervals 4".]
Comparison of 25 vs. 50 minimum numbers of annotated drugs/unique gene (for 4 iterations and a Tanimoto threshold of 80%)
Requiring at least 50 annotated drugs increased precision and recall significantly.
Effects of averaging: KMAX = min(50, # annotated drugs - 1)
[Figure: two precision-recall plots, each with "test" and "random" curves. Panel titles: "threshold 80, annotated 25, intervals 4" and "threshold 80, annotated 50, intervals 4".]
Comparison of 60 vs. 80% Tanimoto thresholds (for 4 iterations and a minimum number of 50 annotated drugs/unique gene)
Increasing the Tanimoto score threshold from 60% to 80% doubled the precision.
[Figure: two precision-recall plots, each with "test" and "random" curves. Panel titles: "threshold 80, annotated 50, intervals 4" and "threshold 60, annotated 50, intervals 4".]
Comparison of 25 vs. 50 minimum numbers of annotated drugs/unique gene (for 4 iterations and a Tanimoto threshold of 60%)
[Figure: two precision-recall plots, each with "test" and "random" curves. Panel titles: "threshold 60, annotated 50, intervals 4" and "threshold 60, annotated 25, intervals 4".]
Effects of averaging: KMAX = min(50, # annotated drugs - 1)
For a Tanimoto score threshold of 60% the precision is low. The results are quite variable for k > 28 with fewer annotated drugs.
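One plausible reading of the KMAX rule is that a gene contributes to the average at rank k only while k does not exceed min(50, # annotated drugs - 1), so fewer genes are averaged at large k and the curve becomes noisier. The sketch below illustrates that reading only; the function name, the input layout and the precision values are all hypothetical.

```python
def averaged_precision(genes, k_cap=50):
    """Average per-gene precision at each k over the genes that reach k.

    genes -- list of (n_annotated, precisions), where precisions[k - 1]
             is that gene's LOOCV precision at cutoff k
    Returns a list of (k, mean precision, number of genes averaged).
    """
    curve = []
    k = 1
    while True:
        values = [
            prec[k - 1]
            for n_annotated, prec in genes
            # Assumed rule: a gene is averaged only up to its own KMAX.
            if k <= min(k_cap, n_annotated - 1) and k <= len(prec)
        ]
        if not values:
            break
        curve.append((k, sum(values) / len(values), len(values)))
        k += 1
    return curve


# Hypothetical precision values for two genes with 4 and 3 annotated drugs.
genes = [(4, [0.5, 0.25, 0.5]), (3, [1.0, 0.5])]
for k, mean, n in averaged_precision(genes):
    print(k, mean, n)
```

In this toy input only one gene remains at k = 3, which is the kind of thinning that could explain the variability noted for larger k.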
Comparison of 10 vs. 25 minimum numbers of annotated drugs/unique gene (for 4 iterations and a Tanimoto threshold of 80%)
Requiring at least 25 annotated drugs increased precision significantly, but predictions using fewer annotated drugs may nevertheless be useful.
[Figure: two precision-recall plots, each with "test" and "random" curves. Panel titles: "threshold 80, annotated 25, intervals 4" and "threshold 80, annotated 10, intervals 4".]
Comparison of 70 vs. 80% Tanimoto thresholds (for 4 iterations and a minimum number of 25 annotated drugs/unique gene)
Increasing the Tanimoto score threshold from 70% to 80% decreased the precision for the top-ranked scores (k = 1).
Effects of averaging: KMAX = min(50, # annotated drugs - 1)
[Figure: two precision-recall plots, each with "test" and "random" curves. Panel titles: "threshold 80, annotated 25, intervals 4" and "threshold 70, annotated 25, intervals 4".]
Using Clustered Drugs: Comparison of 60 vs. 70% Tanimoto thresholds (for 4 iterations and a minimum number of 25 annotated drugs/unique gene; graphconncomp)
[Figure: two precision-recall plots, each with "test" and "random" curves. Panel titles: "cluster threshold 60, annotated 25, iterations 4" and "cluster threshold 70, annotated 25, iterations 4".]
Average precision of > 6% achieved for top-ranked drugs (k = 1) using clustered drugs only.
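The slide names MATLAB's graphconncomp, which labels the connected components of a graph. A rough Python equivalent is sketched below with a breadth-first search over the thresholded edge matrix. Treating "clustered drugs" as the drugs in components with more than one member is an assumption, and the function names are illustrative.

```python
from collections import deque


def connected_components(E):
    """Label connected components of an undirected graph.

    E -- symmetric N x N list of 0/1 edges
    Returns (number of components, component label per node).
    """
    n = len(E)
    label = [-1] * n
    count = 0
    for start in range(n):
        if label[start] != -1:
            continue                      # already reached from another node
        label[start] = count
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if E[u][v] and label[v] == -1:
                    label[v] = count
                    queue.append(v)
        count += 1
    return count, label


def clustered_nodes(E):
    """Nodes in components of size > 1 (assumed meaning of 'clustered')."""
    count, label = connected_components(E)
    sizes = [label.count(c) for c in range(count)]
    return [u for u, c in enumerate(label) if sizes[c] > 1]


# Hypothetical 5-drug graph: drugs 0-1-2 are linked, drugs 3 and 4 isolated.
E = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(connected_components(E))
print(clustered_nodes(E))
```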
Using Clustered Drugs: 70% Tanimoto threshold (for 6 iterations and a minimum number of 20 annotated drugs/unique gene)
Effects of averaging: KMAX = min(50, # annotated drugs - 1)
Average precision of > 6% achieved for top-ranked drugs (k = 1) using clustered drugs only.
[Figure: precision-recall plot with "test" and "random" curves. Panel title: "cluster threshold 70, annotated 20, iterations 6".]
Disease to Drugs: 80% Tanimoto threshold (4 iterations and a minimum number of 50 annotated drugs/unique disease)
[Figure: precision-recall plot with "test" and "random" curves. Panel title: "Disease to Drugs 80% threshold 50 annotations 4 intervals".]
Average precision for top ranks (k = 1) is only 2%, but LOOCV precision is double that of the random model for k < 10.
Conclusions
With Tanimoto thresholds of 70-80% and relatively few iterations (~4), Functional Flow may be useful for predicting new drugs that will interact with genes and diseases.
Decisions on parameters will depend on the economics of trading less precision for greater recall (increasing k) and on the performance of Leave-One-Out Cross-Validation (LOOCV) for the genes and diseases that are of most interest.
If you look at more rankings you find more drugs, but you have to test more drugs.
References
Nabieva, et al., 2005, Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps: Bioinformatics, 21, Suppl. 1, i302–i310.
MacCuish, J. D., and MacCuish, N. E., 2003, Mesa Suite Version 1.2: Fingerprint Module: Mesa Analytics & Computing, LLC.
Brown, R. D., and Martin, Y. C., 1996, Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection: J. Chem. Inf. Comput. Sci., 36, 572-584.
Gunther, et al., 2007, SuperTarget and Matador: resources for exploring drug-target relationships: Nucleic Acids Research, 1-4.
Acknowledgments
Special thanks to Drs. Predrag Radivojac, David Wild, Sun Kim, Mehemet Dalkilic, Rajarshi Guha, Haixu Tang and the faculty of Bioinformatics and Cheminformatics. Also thanks to Jefferson Davis (Math/Stat), Bob Konicek, and of course Linda Hostetter.
Thank you all and enjoy the rest of the summer!