Function and Phenotype Prediction through Data and Knowledge Fusion
Transcript of Function and Phenotype Prediction through Data and Knowledge Fusion
![Page 1: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/1.jpg)
Function and Phenotype Prediction through Data and Knowledge Fusion
Karin M. Verspoor, The University of [email protected]
27 January 2016 – King Abdullah University of Science and Technology, Computational Bioscience Research Center
![Page 2: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/2.jpg)
We have the blueprints to life, but we don’t know how to read them.
• At least a quarter of protein families in PFAM have no known function (Domains of Unknown Function)
• Millions of proteins uncharacterised
![Page 3: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/3.jpg)
From sequence to function
?
![Page 4: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/4.jpg)
What is protein function?
The Gene Ontology (GO) provides a vocabulary
• Captures biological process, molecular function, cellular component
• Common representation for model organism databases to facilitate sharing
![Page 5: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/5.jpg)
What about phenotype?
Human Phenotype Ontology
![Page 6: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/6.jpg)
Knowledge-based features
Knowledge source:
![Page 7: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/7.jpg)
Exponential knowledge growth
• ~1550 peer-reviewed gene-related databases in NAR online Mol Bio collection
• Over 25 million PubMed entries (> 2,000/day)
• Breakdown of disciplinary boundaries makes more of it relevant to each of us
• “Like drinking from a firehose” – Jim Ostell (NCBI IEB Chief)
![Page 8: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/8.jpg)
Text as a primary source of knowledge
Despite ever increasing structured resources, the literature remains the primary repository of knowledge in biomedicine
[Figure: number of Swiss-Prot proteins (y-axis, 0 to 120,000) by date (x-axis, 1/02 to 1/07); series: proteins missing a FUNCTION comment, proteins gaining a FUNCTION comment]
“Manual curation is not sufficient for annotation of genomic databases”
Baumgartner et al., Bioinformatics (ISMB 2007)
![Page 9: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/9.jpg)
Why biomedical text mining?
[Figure: exponential growth in size of PubMed; x-axis: year (1914 to 2012); y-axis: publications per year (0 to 1,200,000)]
![Page 10: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/10.jpg)
Data sources, Data Integration
• Structured Resources
– Largely manually ‘curated’, high quality
– Often unannotated
– Organizes targeted information
– Computable
• Unstructured Resources
– Literature: peer reviewed, well-formed
– Natural Language: ambiguity, complexity
– Broad, current coverage of biological knowledge
– Intended for human communication
![Page 11: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/11.jpg)
Bio Text Analysis in a nutshell
![Page 12: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/12.jpg)
GO Function Prediction
Sokolov and Ben-Hur. J Bioinform Comput Biol. 2010 Apr;8(2):357-76.
Sokolov, Funk, Graim, Verspoor, Ben-Hur. BMC Bioinformatics. 2013;14 Suppl 3:S10.
![Page 13: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/13.jpg)
GOstruct: Structured output SVM
![Page 14: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/14.jpg)
Structured output
• Represent a set of annotations as a single vector
• Encodes the hierarchical structure from annotation to root
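The two bullets above can be sketched in a few lines. This is only an illustration, not the GOstruct code: the toy ontology, the term names and the helper functions `ancestors_closure` and `encode` are all hypothetical, and it assumes the vector has one 0/1 entry per term, set for every annotated term and each of its ancestors.

```python
from typing import Dict, List, Set

# Toy ontology (hypothetical IDs): each term maps to its direct parents.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "AB": ["A", "B"],  # a term may have more than one parent
}

def ancestors_closure(terms: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the given terms plus every ancestor up to the root."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents[term])
    return closed

def encode(terms: Set[str], parents: Dict[str, List[str]], index: List[str]) -> List[int]:
    """Binary vector over a fixed term order; 1 marks an annotated term or ancestor."""
    closed = ancestors_closure(terms, parents)
    return [1 if term in closed else 0 for term in index]

INDEX = sorted(PARENTS)  # fixed term order shared by all proteins
vector = encode({"A1"}, PARENTS, INDEX)
```

Annotating a protein with the leaf `A1` switches on `A1`, `A` and `root`, so the path to the root is carried by the vector itself.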
![Page 15: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/15.jpg)
GOstruct approach
“What functions does this protein perform?”
![Page 16: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/16.jpg)
Feature integration via kernels
• Cross-species (sequence-based) features
– e-values from significant BLAST hits
– features from WoLF PSORT protein localization software
– transmembrane protein prediction using TMHMM
– k-mer composition of the N and C termini
– low complexity regions
• Species-specific features
– Protein interactions
– Gene Expression
– Phylogenetic profiles
– Text-derived features
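One common way to integrate heterogeneous feature sets through kernels is to compute one kernel per feature view and add them. The sketch below shows that idea in plain Python. It is an assumption about the general technique, not a description of how GOstruct weights or normalizes its kernels; the view names and numbers are made up.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]

def linear_kernel(rows: List[Vector]) -> Matrix:
    """Gram matrix K[i][j] = <x_i, x_j> for one feature view."""
    return [[sum(a * b for a, b in zip(x, y)) for y in rows] for x in rows]

def normalize(kernel: Matrix) -> Matrix:
    """Cosine-normalize so every view contributes on a comparable scale."""
    n = len(kernel)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = (kernel[i][i] * kernel[j][j]) ** 0.5
            out[i][j] = kernel[i][j] / denom if denom else 0.0
    return out

def combine(views: Dict[str, List[Vector]]) -> Matrix:
    """Unweighted sum of the normalized per-view kernels."""
    kernels = [normalize(linear_kernel(rows)) for rows in views.values()]
    n = len(kernels[0])
    return [[sum(k[i][j] for k in kernels) for j in range(n)] for i in range(n)]

# Three proteins described by two hypothetical views.
views = {
    "blast": [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]],
    "text": [[1.0, 1.0], [0.0, 1.0], [2.0, 2.0]],
}
K = combine(views)
```

Because a sum of valid kernels is itself a valid kernel, views with very different feature spaces (sequence, network, text) can be fused without concatenating their raw features.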
![Page 17: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/17.jpg)
Extraction & Analysis pipeline
Christopher Funk (2015) PhD dissertation, U. Colorado Denver
![Page 18: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/18.jpg)
Integrating Text
• Protein – Gene Ontology term co-occurrence
• Protein – Protein co-occurrence
Sokolov et al. BMC Bioinformatics 2013, 14(Suppl 3):S10
![Page 19: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/19.jpg)
Text-based features
• Words
– (tokens)
• Entities or Concepts
– (gene/protein mentions)
– (gene ontology concepts)
• Relations
– (simple co-occurrences)
![Page 20: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/20.jpg)
Feature Extraction from text
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
![Page 21: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/21.jpg)
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
![Page 22: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/22.jpg)
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Bag of words:
WordsSent1(membrane, otherwise, known, … , proteolytic, enzyme, known, extracellular, invasion, … , progression)
WordsSent2(protein, and, message, levels, of, was, …)
![Page 23: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/23.jpg)
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Protein GO term co-mentions:
sent_comen(P50281, GO:0008237)
sent_comen(P50281, GO:0006508)
sent_comen(P50281, GO:0009056)
sent_comen(P50281, GO:0031012)
nonSent_comen(P50281, GO:0010467)
nonSent_comen(P50281, GO:0005623)
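The distinction between sentence and non-sentence co-mentions can be sketched as follows. The sketch assumes that a non-sentence co-mention means the protein and the GO term occur in the same passage but in different sentences, and that protein and GO mentions have already been recognized per sentence. The function name `co_mentions` and the toy passage are hypothetical, and the counts are not meant to reproduce the slide.

```python
from collections import Counter
from typing import List, Set, Tuple

# One passage is a list of sentences; each sentence holds the protein IDs
# and GO IDs that were recognized in it.
Sentence = Tuple[Set[str], Set[str]]

def co_mentions(passage: List[Sentence]) -> Tuple[Counter, Counter]:
    """Count (protein, GO) pairs within a sentence and across sentences."""
    sent: Counter = Counter()
    non_sent: Counter = Counter()
    for i, (proteins, go_terms) in enumerate(passage):
        for protein in proteins:
            # Same-sentence pairs.
            for go in go_terms:
                sent[(protein, go)] += 1
            # Pairs with GO terms found in the other sentences of the passage.
            for j, (_, other_go) in enumerate(passage):
                if j != i:
                    for go in other_go:
                        non_sent[(protein, go)] += 1
    return sent, non_sent

passage = [
    ({"P50281"}, {"GO:0008237", "GO:0006508"}),
    (set(), {"GO:0010467"}),
]
sent, non_sent = co_mentions(passage)
```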
![Page 24: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/24.jpg)
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Protein GO term co-mentions:
nonSent_comen(P50281, GO:0008237)
nonSent_comen(P50281, GO:0006508)
nonSent_comen(P50281, GO:0009056)
nonSent_comen(P50281, GO:0031012)
nonSent_comen(P50281, GO:0010467)
nonSent_comen(P50281, GO:0005623)
![Page 25: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/25.jpg)
Feature Representation
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Bag of Words:
P50281, known=2, membrane=1, protein=1, proteolytic=1, enzyme=1, …
Protein GO term co-mentions (sentence):
P50281, GO:0008237=1, GO:0006508=1, GO:0009056=1, GO:0031012=1
Protein GO term co-mentions (non-sentence):
P50281, GO:0010467=2, GO:0005623=2
![Page 26: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/26.jpg)
Feature Representation
Bag of Words:
UniprotID1, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
UniprotID2, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
…
UniprotIDi, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
Protein GO term co-mentions (sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
Protein GO term co-mentions (non-sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
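The row layout above (one protein per line, followed by `feature=count` pairs) can be produced from tokenized sentences as in this minimal sketch. The helpers `bag_of_words` and `format_row` are hypothetical names, the tokens are a shortened toy input rather than the full sentences from the slides, and sorting the features alphabetically is an arbitrary choice made here for reproducibility.

```python
from collections import Counter
from typing import Dict, Iterable, List

def bag_of_words(sentences: Iterable[List[str]]) -> Counter:
    """Count lowercased tokens over all sentences that mention the protein."""
    counts: Counter = Counter()
    for tokens in sentences:
        counts.update(token.lower() for token in tokens)
    return counts

def format_row(uniprot_id: str, counts: Dict[str, int]) -> str:
    """Serialize one protein as 'ID, feature=count, ...'."""
    features = ", ".join(f"{name}={n}" for name, n in sorted(counts.items()))
    return f"{uniprot_id}, {features}"

sentences = [
    ["membrane", "proteolytic", "enzyme", "known"],
    ["protein", "known"],
]
row = format_row("P50281", bag_of_words(sentences))
```

The same `format_row` helper works unchanged for co-mention counts, since those are also a mapping from a feature name (a GO ID) to a count.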
![Page 27: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/27.jpg)
An aside on GO concept recognition
• Given:
– Gene Ontology (~46,000 concepts)
In mice lacking ephrin-A5 function, cell proliferation and survival of newborn neurons… (PMID 20474079)
• Return:
– GO:0008283 cell proliferation
– GO:0005125 cytokine activity
– GO:0048666 neuron development
(can be based on a judgment about the depth of experimental evidence)
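The simplest baseline for this task is dictionary lookup: match ontology labels and synonyms against the text, longest phrase first. The sketch below is such a baseline under stated assumptions. The tiny `LEXICON` and the `recognize` function are hypothetical, and the approach is far simpler than the tools compared in this talk. It finds only literal mentions, so of the three concepts returned above it would recover "cell proliferation" but not the two that require inference.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-dictionary: lowercase label or synonym -> GO ID.
LEXICON: Dict[str, str] = {
    "cell proliferation": "GO:0008283",
    "membrane budding": "GO:0006900",
    "vesicle formation": "GO:0006900",
}

def recognize(text: str, lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Greedy longest-match lookup of lexicon phrases over word tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    max_len = max(len(key.split()) for key in lexicon)
    hits: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at token i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                hits.append((lexicon[phrase], phrase))
                i += n
                break
        else:
            i += 1  # no phrase starts here; move on one token
    return hits

sentence = ("In mice lacking ephrin-A5 function, cell proliferation "
            "and survival of newborn neurons")
found = recognize(sentence, LEXICON)
```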
![Page 28: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/28.jpg)
(CRAFT example)
Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.
GO:0005623 – “cell”
CL:0000000 – “cell”
PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
SO:0001059 – “sequence_alteration”
GO:0006810 – “transport”
SO:0001059 – “sequence_alteration”
GO:0015250 – “water channel activity”
CHEBI:15377 – “water”
![Page 29: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/29.jpg)
GO:0006900 – membrane budding
[Term]
id: GO:0006900
name: membrane budding
…
def: "The evagination of a membrane, resulting in formation of a vesicle."
…
synonym: "membrane evagination"
synonym: "nonselective vesicle assembly"
synonym: "vesicle biosynthesis"
synonym: "vesicle formation"
…
Variation in PMID: 12925238
• Lipid rafts play a key role in membrane budding…
• …involvement of annexin A7 in budding of vesicles…
• …Ca2+-mediated vesiculation process was not impaired.
• Red blood cells which lack the ability to vesiculate cause…
• Having excluded a direct role in vesicle formation…
GO vs NL
![Page 30: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/30.jpg)
Comparing tool performance on CR
• NCBO Annotator (96 combinations)
wholeWordOnly, filterNumber, stopWords, stopWordsCaseSensitive, minTermSize, withSynonyms
• MetaMap (864 combinations)
model, gaps, wordOrder, acronymAbb, derivationalVars, scoreFilter, minTermSize
• Concept Mapper (576 combinations)
searchStrategy, caseMatch, stemmer, orderIndependentLookup, findAllMatches, stopWords, synonyms
Funk et al. BMC Bioinformatics 2014, Feb 26;15:59.
![Page 31: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/31.jpg)
Literature alone is useful
[Figure: bar chart of macro-averaged F-measure (y-axis, 0 to 0.7) by Gene Ontology branch (MF, BP, CC); series: Baseline (co-mentions as predictions), Co-mentions, BoW, Co-mentions + BoW]
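For reference, a macro-averaged F-measure can be computed as in the sketch below. It assumes the average is taken over ontology terms (one F-measure per term, then an unweighted mean), which is one common convention and may differ from the exact protocol behind the chart. The function names and the toy annotations are hypothetical.

```python
from typing import Dict, Set

def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f(gold: Dict[str, Set[str]], predicted: Dict[str, Set[str]]) -> float:
    """Average the per-term F-measure; both inputs map protein -> set of terms."""
    terms = set().union(*gold.values(), *predicted.values())
    scores = []
    for term in terms:
        gold_proteins = {p for p, ts in gold.items() if term in ts}
        pred_proteins = {p for p, ts in predicted.items() if term in ts}
        scores.append(f_measure(
            len(gold_proteins & pred_proteins),
            len(pred_proteins - gold_proteins),
            len(gold_proteins - pred_proteins),
        ))
    return sum(scores) / len(scores) if scores else 0.0

gold = {"P1": {"T1", "T2"}, "P2": {"T1"}}
predicted = {"P1": {"T1"}, "P2": {"T1", "T2"}}
score = macro_f(gold, predicted)
```

In the toy data, term `T1` is predicted perfectly and `T2` is assigned to the wrong protein, so the two per-term scores of 1.0 and 0.0 average to 0.5. Macro-averaging gives rare terms the same weight as frequent ones.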
![Page 32: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/32.jpg)
Literature features approach performance of commonly used biological features
[Figure: bar chart of macro-averaged F-measure (y-axis, 0 to 0.8) by Gene Ontology branch (MF, BP, CC); series: Trans/Localization, Homology, Network, Literature, All Combined]
(and combining them with other features is even better!)
![Page 33: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/33.jpg)
Manual inspection of misclassifications
Some false positives appear to have literature support:
• GCNT1 – carbohydrate metabolic process (Q02742 - GO:0005975)
Genes related to carbohydrate metabolism include PPP1R3C, B3GNT1, and GCNT1… [PMID:23646466]
• CERS2 – ceramide biosynthetic process (Q96G23 - GO:0046513)
…CersS2, which uses C22-CoA for ceramide synthesis… [PMID:22144673]
![Page 34: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/34.jpg)
Results: Multi-view learning
![Page 35: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/35.jpg)
Results: different sources
Mouse annotations from geneontology.org
![Page 36: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/36.jpg)
Phenotype Prediction
Kahanda I, Funk C, Verspoor K and Ben-Hur A. F1000Research 2015, 4:259 (doi: 10.12688/f1000research.6670.1)
![Page 37: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/37.jpg)
PHENOstruct
Human Phenotype Ontology
Organ, Inheritance, Onset subontologies have separate models
![Page 38: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/38.jpg)
Gold annotations via transfer
![Page 39: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/39.jpg)
PHENOstruct Features
• Network (functional association data)
– protein-protein interactions
– co-expression
– co-localization
– From BioGRID, STRING, GeneMANIA
• Gene Ontology (experimental) annotations
• Literature mined data: bag of words in gene sentences
• Genetic variants (protein -> disease -> variants)
![Page 40: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/40.jpg)
Performance
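| Subont. | Terms | Method | AUC | P-value |
|---|---|---|---|---|
| Organ | 1,796 | Binary SVMs | 0.66 | 1.70E-262 |
| | | Clus-HMC-Ens | 0.65 | 0.00E+00 |
| | | PHENOstruct | 0.73 | - |
| Inheritance | 12 | Binary SVMs | 0.72 | 2.20E-01 |
| | | Clus-HMC-Ens | 0.73 | 7.30E-01 |
| | | PHENOstruct | 0.74 | - |
| Onset | 23 | Binary SVMs | 0.62 | 4.40E-03 |
| | | Clus-HMC-Ens | 0.58 | 3.30E-05 |
| | | PHENOstruct | 0.64 | - |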
![Page 41: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/41.jpg)
PHENOstruct in Organ subontology
![Page 42: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/42.jpg)
Gold vs Predicted, P43681
Gold Predicted
Hierarchical, protein-centric
P = 1.0; R = 0.62
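Hierarchical, protein-centric precision and recall for a single protein can be sketched as set overlap after both term sets are expanded with their ancestors. This is an illustration under assumptions: the toy ontology and function names are hypothetical, the root is counted like any other term, and the numbers are not meant to reproduce the P43681 example.

```python
from typing import Dict, List, Set, Tuple

# Toy ontology (hypothetical IDs): term -> direct parents.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
}

def with_ancestors(terms: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Expand a term set with every ancestor up to the root."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents[term])
    return closed

def hierarchical_pr(gold: Set[str], predicted: Set[str],
                    parents: Dict[str, List[str]]) -> Tuple[float, float]:
    """Precision and recall over ancestor-expanded term sets for one protein."""
    gold_closed = with_ancestors(gold, parents)
    pred_closed = with_ancestors(predicted, parents)
    overlap = len(gold_closed & pred_closed)
    precision = overlap / len(pred_closed) if pred_closed else 0.0
    recall = overlap / len(gold_closed) if gold_closed else 0.0
    return precision, recall

precision, recall = hierarchical_pr({"A1", "B"}, {"A"}, PARENTS)
```

Predicting only the general term `A` when the gold annotation is the more specific `A1` plus `B` gives perfect precision but incomplete recall, the same pattern as a prediction that is correct but shallower than the gold standard.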
![Page 43: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/43.jpg)
Impact of data sources
![Page 44: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/44.jpg)
Leave-one-source-out
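A leave-one-source-out analysis retrains and re-scores the model with each data source removed in turn, then reports the drop relative to the full model. The sketch below shows only the bookkeeping. The `evaluate` callable stands in for a full train-and-test run, and the source names and scores are made-up placeholders, not results from this work.

```python
from typing import Callable, Dict, FrozenSet, Iterable

def leave_one_source_out(
    sources: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Score drop caused by removing each source from the full set."""
    all_sources = frozenset(sources)
    full_score = evaluate(all_sources)
    return {
        source: full_score - evaluate(all_sources - {source})
        for source in sorted(all_sources)
    }

# Placeholder scoring function: a base score plus a fixed gain per source.
TOY_GAIN = {"source_a": 0.25, "source_b": 0.125, "source_c": 0.0}

def toy_evaluate(active: FrozenSet[str]) -> float:
    return 0.5 + sum(TOY_GAIN[source] for source in active)

drops = leave_one_source_out(TOY_GAIN, toy_evaluate)
most_important = max(drops, key=drops.get)
```

With real models the gains are not additive, so a source can look unimportant here simply because another source carries overlapping information.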
![Page 45: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/45.jpg)
Top Literature Features
| Category | Tokens |
|---|---|
| proteins & complexes | cx32, kisspeptin, -308, t308, smn2, ns5, trap-positive, mpp+-induced, 1-methyl-4-phenylpyridinium, tnf-alpha-mediated, tnf-alpha-stimulated, tnf–mediated, ink4a/arf, ns4b, hmsh6, fukutin, cdtb, ns5b, apoai, tnf–stimulated, ns4a, tnf-alpha-, rhbmp-2, tnf-alpha-treated, frataxin, ki-ras, connexin32, tcdb, recql4, =-galcer, tyrosinase-related, hpms2, her4, cd40-cd40l, lmp2a, ryrs, mg2+-atpase, ews-fli1, abeta42, fancc, p40phox, her1, bdnf-induced, trap+, gfap-ir, daf-16/foxo, hdl3, -238, [tnf-alpha], cd40/cd40l, tnf–treated, anti-ngf, tep1, recq, nt-4, pfemp1, zo-2, nphp1, tnf-alpha-dependent, pomt1, igm-positive, apoa-ii, p110alpha, fancf, tbx4, anti-cd40l, igg |
| genes | hmsh2, cx26, fkrp, smn1, cln3, nphp4, mn1, nnt, apex2, akt-2 |
| pathways | ras/raf/mek/erk, pi3k-akt-mtor |
| diseases/phenotypes | cmt1a, hnpp, hdl2, cln2, hpp, fmf, rtt, hnpcc, charcot-marie-tooth, amenorrhea, rett, anticardiolipin |
| misc. | sheldrick, shelxl97, bruker, farrugia, ortep-3, platon, shelxs97, spek, sgdid, wlds, caii, aoa, tdf, crysalis, wingx, amf |
![Page 46: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/46.jpg)
Conclusions
• The literature provides a significant resource for biological function prediction
• The literature provides one ‘view’ of biological knowledge and is best combined with other resources
• Even some simple strategies for extracting associations from the literature can provide valuable information, taken at large scale
– “bag of words” and co-occurrence models reasonable starting point: capture implied relationships
– scope for integration of more targeted extracted relationships (e.g. protein-protein interactions), with the usual Precision/Recall tradeoff
![Page 47: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/47.jpg)
Acknowledgements
• Los Alamos National Laboratory
– Michael Wall
• Colorado
– Larry Hunter (U. Colorado Denver)
– Christopher Funk (U. Colorado Denver)
– Asa Ben-Hur (Colorado State University)
– Indika Kahanda (Colorado State University)
• NICTA Victoria Research Laboratory
– Geoffrey Macintyre (U. Cambridge)
– Antonio Jimeno Yepes (IBM Research Australia)
– Cheng-Soon Ong (NICTA Canberra)
• Funding: US NIH, US NSF, NICTA, Australian Research Council
![Page 48: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/48.jpg)
Thank you!
![Page 49: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/49.jpg)
Machine learning for text analysis
[Diagram: training]
Training set: notes + labels for classes of interest
Machine learning algorithm
Words, phrases, linguistic categories; names of entities; domain concepts; document features
Biomedical knowledge sources: UMLS, OBOs
Language processing
Model: relating features of the text to classes of interest
![Page 50: Function and Phenotype Prediction through Data and Knowledge Fusion](https://reader036.fdocuments.in/reader036/viewer/2022070523/58ed8ba11a28abeb128b45d7/html5/thumbnails/50.jpg)
Machine learning for text analysis
[Diagram: prediction]
New text to be classified
Words, phrases, linguistic categories; names of entities; domain concepts; document features
Biomedical knowledge sources: UMLS, OBOs
Language processing
Model
Predicted classification (label)