Seed-based Generation of Personalized Bio-Ontologies for Information Extraction
Cui Tao & David W. Embley
Data Extraction Research Group
Department of Computer Science
Brigham Young University
Supported by NSF
Personalized Information Harvesting
• Biology domain huge (other domains too)
• Data collection
  – Many (web) sources
  – Only a tiny subpart wanted
  – Personalized view
• Personalized extraction ontology
  – Creation: Form specification
  – Application: Seed-based harvesting
Example
• Harvest information about large proteins in humans and the functions of these proteins
  – Find proteins in humans that are >20 kDa
  – Find all the proteins in humans that serve as receptors
  – ...
• Information sources: various online repositories
  – NCBI
  – GeneCards
  – The Gene Ontology
  – GPM Proteomics Database
  – …
Extraction Ontology
Instance: ^\d{1,5}(\.\d{1,2})?
Context: weight|wght|wt\.
Unit: kilodaltons?|kdas?|kds?|das?|daltons?
…
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
…
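The recognizer parts on this slide can be sketched in code. The three patterns below are copied from the slide; how they are combined (context keyword, then a short gap, then value, then unit) is an assumption made for illustration, as are the names `WEIGHT` and `find_weights`. The leading `^` of the instance pattern is dropped so the value can be found inside running text.

```python
import re

# Patterns taken from the slide (the "^" anchor is dropped for in-text search).
INSTANCE = r"\d{1,5}(?:\.\d{1,2})?"
CONTEXT = r"weight|wght|wt\."
UNIT = r"kilodaltons?|kdas?|kds?|das?|daltons?"

# Assumed combination: context keyword, up to 20 non-digit characters,
# the numeric value, optional whitespace, then a unit ending at a word boundary.
WEIGHT = re.compile(
    rf"(?:{CONTEXT})\D{{0,20}}?(?P<value>{INSTANCE})\s*(?P<unit>{UNIT})\b",
    re.IGNORECASE,
)


def find_weights(text):
    """Return (value, unit) pairs for every molecular-weight mention found."""
    return [
        (float(m.group("value")), m.group("unit").lower())
        for m in WEIGHT.finditer(text)
    ]


if __name__ == "__main__":
    print(find_weights("Molecular weight: 59.6 kDa"))
```

Requiring the context keyword near the number is what keeps an arbitrary number on the page from being read as a weight.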
Can We Make Construction Easier?
• Forms
  – General familiarity
  – Reasonable conceptual framework
  – Appropriate correspondence
    • Transformable to ontological descriptions
    • Capable of accepting source data
• Instance recognizers
  – Some pre-existing instance recognizers
  – Lexicons
• Need for a full extraction ontology?
Form Creation User Interface
Basic form-construction facilities:
• single-entry field
• multiple-entry field
• nested form
• …
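One way to picture the three form-construction facilities is as a small data structure. This is a minimal sketch, not the tool's actual representation: the class names, the `describe` helper, and the field labels in `protein_form` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class SingleEntry:
    label: str  # field that holds at most one value


@dataclass
class MultipleEntry:
    label: str  # field that holds a list of values


@dataclass
class NestedForm:
    label: str
    fields: List["Field"] = field(default_factory=list)


Field = Union[SingleEntry, MultipleEntry, NestedForm]


def describe(form, prefix=""):
    """Flatten a form into (dotted path, cardinality) pairs, one per leaf field."""
    rows = []
    base = prefix + form.label
    for f in form.fields:
        if isinstance(f, NestedForm):
            rows.extend(describe(f, base + "."))
        else:
            card = "one" if isinstance(f, SingleEntry) else "many"
            rows.append((f"{base}.{f.label}", card))
    return rows


# Hypothetical form for the protein example; labels are illustrative only.
protein_form = NestedForm("Protein", [
    MultipleEntry("Name"),
    SingleEntry("Molecular Weight"),
    NestedForm("Function", [MultipleEntry("GO Term")]),
])

if __name__ == "__main__":
    for path, card in describe(protein_form):
        print(path, card)
```

The flattened paths hint at how a form could be read as concepts with cardinalities, which is the kind of correspondence to ontological descriptions the previous slide asks for.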
Almost Ready to Harvest …
• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Split/Merge
  – Union/Selection
Name
Voltage-dependent anion-selective channel protein 3
VDAC-3
hVDAC3
Outer mitochondrial membrane protein porin 3
Name
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
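The reading path and the split problem can be illustrated with a small sketch. It assumes the pages are well-formed XML so the standard-library parser can read them (real web pages usually need a tolerant HTML parser), and it assumes a "split" means one source cell holding several values for a multiple-entry field. The page snippets and function names are invented for illustration.

```python
import xml.etree.ElementTree as ET


def path_to_text(root, seed):
    """Return child indexes from root to the first element whose own text
    contains the seed value, or None if no element does."""
    def walk(node, path):
        if node.text and seed in node.text:
            return path
        for i, child in enumerate(node):
            found = walk(child, path + [i])
            if found is not None:
                return found
        return None
    return walk(root, [])


def follow(root, path):
    """Walk the same child indexes on another page from the same site."""
    node = root
    for i in path:
        node = node[i]
    return node


def split_names(cell):
    """Split: one source cell with several names becomes a list of entries."""
    return [t.strip() for t in cell.itertext() if t.strip()]


# Hypothetical seed page: the user filled the Name field from this record.
SEED_PAGE = """<html><body><h1>Protein record</h1><table>
<tr><td>Name</td><td>T-complex protein 1 subunit theta<br/>TCP-1-theta<br/>CCT-theta</td></tr>
</table></body></html>"""

# Hypothetical sibling page with the same layout.
OTHER_PAGE = """<html><body><h1>Protein record</h1><table>
<tr><td>Name</td><td>Voltage-dependent anion-selective channel protein 3<br/>VDAC-3<br/>hVDAC3</td></tr>
</table></body></html>"""

seed_path = path_to_text(ET.fromstring(SEED_PAGE), "T-complex protein 1 subunit theta")
harvested = split_names(follow(ET.fromstring(OTHER_PAGE), seed_path))

if __name__ == "__main__":
    print(seed_path, harvested)
```

The path learned once from the seed record is reused unchanged on the sibling page, which is the sense in which a single filled-in form can drive harvesting from the rest of the site.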
Can Now Harvest
Name
14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E
Can Now Harvest
Name
Voltage-dependent anion-selective channel protein 3
VDAC-3
hVDAC3
Outer mitochondrial membrane protein porin 3
Can Now Harvest
Name
Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS
Can Harvest from Additional Sites
Name
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
Larger Picture
• Information Harvesting
  – Not only for biology, but for any application
  – Not only from one site, but from many sites
• Opportunities
  – Extraction ontology creation
  – Automating site-to-site information harvesting
  – Automatic semantic annotation
  – Data/Ontology transformations
Extraction Ontology Creation
Lexicons

Name
14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E

Name
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15

Name
Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS

Resulting lexicon:
…
14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E
…
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
…
Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS
…
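The step from harvested Name fields to a lexicon can be sketched as follows. The names are those shown on the slides; the functions `build_lexicon` and `compile_recognizer`, and the choice to turn the lexicon into one case-insensitive regular expression, are assumptions made for illustration.

```python
import re

# Names harvested into the Name field, one list per record (from the slides).
HARVESTED = [
    ["14-3-3 protein epsilon", "Protein kinase C inhibitor protein-1",
     "KCIP-1", "14-3-3E"],
    ["T-complex protein 1 subunit theta", "TCP-1-theta", "CCT-theta",
     "Renal carcinoma antigen NY-REN-15"],
    ["Tryptophanyl-tRNA synthetase, mitochondrial precursor", "TrpRS",
     "(Mt)TrpRS"],
]


def build_lexicon(harvested):
    """Union the names from every filled-in form into one set of entries."""
    return {name for names in harvested for name in names}


def compile_recognizer(lexicon):
    """Build one regex that matches any lexicon entry as a whole token."""
    # Longest entries first so a long name is preferred over a shorter one
    # that happens to be its prefix.
    alternatives = sorted(lexicon, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in alternatives)
    return re.compile(rf"(?<!\w)(?:{pattern})(?!\w)", re.IGNORECASE)


lexicon = build_lexicon(HARVESTED)
recognizer = compile_recognizer(lexicon)

if __name__ == "__main__":
    text = "Expression of KCIP-1 and CCT-theta was measured."
    print(recognizer.findall(text))
```

A lexicon built this way could then serve as the instance recognizer for the Name concept on pages that were never used as seeds.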
Ontology Transformation
OWL & RDF: standard ontology languages
XML & XMLS: data exchange
Forms: form filling to populate an ontology