Post on 05-Dec-2014
Confidential
Presenter: Marc Hadfield
marc@alitora.com
www.alitora.com
Natural Language Processing & Semantic Models in an Imperfect World
Copyright Alitora Systems, Inc. 2009
Marc Hadfield
CTO of Alitora Systems
Computer Science Research in Bioinformatics
NLP
Big (Fuzzy) Networks
Generalized Semantic Data Platform
Alitora Systems
System Approach
…Talk about Systems & Apps more than Modules.
Discussion Today
Storing Data – Semantic Repository
Generating Data – NLP
Modeling Data – Semantic Models
Analyze Data – Methodology
Using Data – Application
Alitora Systems Architecture
Alitora Systems API (ASAPI)
User Interfaces
ASAPI
Collaboration
kHarmony™ Semantic DB
Alitora Foundry Text-Mining
UMIS
Secure Distributed URIs
URI to Named Graphs
ASAPI Cloud
Multi-Billion Triples
kHarmony™ Semantic DB
Semantic / Graph DB
Cloud Deployable
Distribute Data over Servers
Layers of Cache
Data Analytics / Clustering
Determine High-Value Knowledge
Knowledge Relevancy
Embedded Scripting
Data Entitlements
Users, Teams, Organizations, Colleagues
Base Ontology
Alitora Foundry
Manages NLP processes
Annotators which add metadata to text
Includes external services like OpenCalais as annotators
Workflows to link annotators together
Common data representation across components
RDF in, RDF out
Ontology includes representation of certainty, error
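The annotator and workflow idea above can be sketched in a few lines. This is a minimal illustration, not the Foundry API: the names `Annotation`, `Document`, `gene_annotator`, `process_annotator`, and `run_workflow` are hypothetical, and plain Python objects stand in for the RDF that the real components exchange.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    label: str        # entity type, e.g. "Gene"
    text: str         # the annotated text
    certainty: float  # 0.0 .. 1.0, carried with every annotation
    source: str       # which annotator produced it


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


# An annotator takes a document and returns it with more metadata attached.
Annotator = Callable[[Document], Document]


def gene_annotator(doc: Document) -> Document:
    # Hypothetical dictionary lookup standing in for a real NLP component.
    for name in ("Bim", "Gadd45a"):
        if name in doc.text:
            doc.annotations.append(Annotation("Gene", name, 0.9, "gene_annotator"))
    return doc


def process_annotator(doc: Document) -> Document:
    if "apoptosis" in doc.text:
        doc.annotations.append(
            Annotation("Process", "apoptosis", 0.8, "process_annotator"))
    return doc


def run_workflow(doc: Document, annotators: List[Annotator]) -> Document:
    # A workflow links annotators together; each sees the output of the last.
    for annotate in annotators:
        doc = annotate(doc)
    return doc


doc = run_workflow(
    Document("Suppression of endogenous Bim greatly inhibits "
             "Gadd45a induction of apoptosis."),
    [gene_annotator, process_annotator],
)
```

Because every step reads and writes the same representation, an external service can be wrapped as one more function in the list.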
Foundry Workflow
Independent Workflows based on type of text
Combine ML & Rule-based systems
Foundry Data Model
Two-dimensional representation of tokens
Labels/Spans to tag token ranges (features in machine learning)
Allows multiple interpretations of tokens
Chemical names tokenized differently than personal names
Sequence Recognition and Categorization (with scoring/likelihood)
Entities, Entity Types, Normalized (Disambiguated) Entities (ER vs. ER)
Shared across workflow steps
Direct RDF representation
“Span”
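One way to picture the data model above is a set of alternative tokenizations with scored spans pointing into them. This is a sketch under assumptions: the `Span` fields, the layer names, and the example sentence are all made up for illustration, and reading "two-dimensional" as layer plus token position is an interpretation, not something the slides state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    layer: str    # which tokenization the indices refer to
    start: int    # first token index (inclusive)
    end: int      # last token index (exclusive)
    label: str    # tag applied to the token range
    score: float  # likelihood assigned by the recognizer


# Two interpretations of the same (invented) text: a general tokenizer splits
# on punctuation, a chemistry-aware one keeps the compound name together.
tokenizations = {
    "default": ["Dr", ".", "Smith", "measured", "2", "-", "chlorobenzoic", "acid"],
    "chemical": ["Dr.", "Smith", "measured", "2-chlorobenzoic acid"],
}

spans = [
    Span("chemical", 3, 4, "Chemical", 0.85),
    Span("default", 0, 3, "Person", 0.70),
]


def span_text(span: Span) -> str:
    # Resolve a span back to the tokens it covers in its own layer.
    return " ".join(tokenizations[span.layer][span.start:span.end])
```

Later workflow steps can read the spans from whichever layer suits them, which is how one text can carry several interpretations at once.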
NLP In Action
Sentence
“Suppression of endogenous Bim greatly inhibits Gadd45a induction of apoptosis.”
Parse
[action, inhibit,
  [action, suppress, [unknown], [gp, endogenous Bim] ],
  [action, induce, [gp, Gadd45a], [process, apoptosis] ],
]
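The nested parse above can be walked to list each action with its two arguments. The sketch below transcribes the parse as Python tuples; treating the two slots after the verb as "agent" and "target" is an assumption, and `describe` and `actions` are hypothetical helper names.

```python
# The parse from the slide, transcribed as nested tuples.
parse = (
    "action", "inhibit",
    ("action", "suppress", ("unknown",), ("gp", "endogenous Bim")),
    ("action", "induce", ("gp", "Gadd45a"), ("process", "apoptosis")),
)


def describe(node):
    """Render a node as a compact string."""
    kind = node[0]
    if kind == "action":
        _, verb, agent, target = node
        return f"{verb}({describe(agent)}, {describe(target)})"
    if kind == "unknown":
        return "?"
    return f"{kind}:{node[1]}"


def actions(node):
    """Yield (verb, agent, target) for every action, outermost first."""
    if node[0] == "action":
        _, verb, agent, target = node
        yield verb, describe(agent), describe(target)
        yield from actions(agent)
        yield from actions(target)


relations = list(actions(parse))
```

Here `relations` holds three entries, one per action, with the outer `inhibit` taking the other two actions as its arguments.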
Foundry Relationship Extraction
Alitora Knowledge Ontology
Data Representation:
Each Object is a Named Graph with a Unique URI.
“chunks” of RDF
OWL2
“Core” Model
Alitora Knowledge Ontology
Named Graphs:
•URI
•“Reified”
•Provenance
• Hash/Signature
• Creation, Modification, Expiration Dates
•Certainty/Error
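As a rough illustration of the metadata listed above, a named graph can be modeled as a small record that carries its own URI, provenance, dates, and certainty. The field names, the example URIs, and the choice of SHA-256 over sorted triples for the signature are assumptions for this sketch, not the AKO definition.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple

Triple = Tuple[str, str, str]


@dataclass
class NamedGraph:
    uri: str                      # unique name of this "chunk" of RDF
    triples: Tuple[Triple, ...]   # the statements inside the graph
    source: str                   # provenance: where the statements came from
    created: datetime
    modified: datetime
    expires: Optional[datetime]   # None means no expiration
    certainty: float              # confidence in the content, 0.0 .. 1.0
    error: float                  # estimated error, 0.0 .. 1.0

    def signature(self) -> str:
        # Hash the triples in sorted order so the signature does not depend
        # on the order in which they were stored.
        canonical = "\n".join(" ".join(t) for t in sorted(self.triples))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


now = datetime(2009, 1, 1, tzinfo=timezone.utc)
t1 = ("ex:Bim", "rdf:type", "ex:Gene")
t2 = ("ex:Gadd45a", "rdf:type", "ex:Gene")
g1 = NamedGraph("http://example.org/graph/1", (t1, t2), "foundry", now, now, None, 0.8, 0.1)
g2 = NamedGraph("http://example.org/graph/2", (t2, t1), "foundry", now, now, None, 0.8, 0.1)
```

Attaching the metadata to the graph rather than to individual triples is what lets certainty and provenance travel with each object.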
Alitora Knowledge Ontology
Lesson:
“Reification” at the model level.
Expose the topology of the knowledge.
Semantic Knowledge Statements
Domain Ontology + Instance Statements
Alitora Knowledge Ontology
Semantic Collaborative Statements
Alitora Knowledge Ontology
Fact Representation
This example has 9 Named Graphs
The “Relation” is the head
Any number of Relation-Parts
Relation-Parts are chained
“Company Merger”
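The head-plus-chain structure can be sketched as a linked list hanging off the Relation. The class names, role names, and company names below are hypothetical, and the sketch does not try to reproduce the nine named graphs in the slide's example.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class RelationPart:
    uri: str                               # each part is its own named graph
    role: str                              # what this part contributes
    value: str
    next: Optional["RelationPart"] = None  # parts are chained one to the next


@dataclass
class Relation:
    uri: str
    relation_type: str
    first: Optional[RelationPart] = None   # the Relation is the head of the chain

    def parts(self) -> Iterator[RelationPart]:
        part = self.first
        while part is not None:
            yield part
            part = part.next


# Build the chain back to front: acquirer -> target.
target = RelationPart("http://example.org/part/2", "target", "Company B")
acquirer = RelationPart("http://example.org/part/1", "acquirer", "Company A", next=target)
merger = Relation("http://example.org/relation/1", "CompanyMerger", first=acquirer)

roles = [p.role for p in merger.parts()]
```

Adding another Relation-Part means appending one more link, so the same shape covers facts with any number of participants.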
•OWL
•“Reified”
•Knowledge Representation
•Certainty, Error, Provenance, …
•Graph + Semantic
•Topology Interpretation
•Logical Interpretation
Alitora Knowledge Ontology
Memomics Bio Ontology (Domain)
Extends Alitora Knowledge Ontology
Inherits knowledge representation structures
OWL
Domain Specific
Defines types of “facts” specific to biomedical domain
A general AKO fact can be mapped/asserted into a Memomics Bio Ontology fact
Where are we?
Store Data
Generate data with NLP
Represent data in a general knowledge model
Have a domain-specific ontology
Where the “action” happens
Need some analysis to push facts into the domain ontology
Query, Inference using the domain ontology
Relevancy
The shape or “topology” of the graph helps to identify relevant knowledge.
The “paths” connecting a User to knowledge, based on search usage, factor into Relevancy
“Knowledge Rank”
“Best” facts
Relevancy based on Graph Topology
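A toy version of path-based relevancy: facts that sit closer to a user in the graph score higher. The graph, the node names, and the scoring formula 1 / (1 + distance) are assumptions chosen to keep the sketch short; the slides do not specify how "Knowledge Rank" is computed.

```python
from collections import deque

# Hypothetical graph: a user reaches facts through queries and entities.
graph = {
    "user:alice": ["query:apoptosis"],
    "query:apoptosis": ["fact:1", "entity:Bim"],
    "entity:Bim": ["fact:2"],
    "fact:1": [],
    "fact:2": [],
    "fact:3": [],  # not connected to the user at all
}


def distances(graph, start):
    """Breadth-first search: shortest hop count from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def relevancy(graph, user):
    """Score each reachable fact; shorter paths give higher scores."""
    dist = distances(graph, user)
    return {n: 1.0 / (1 + d) for n, d in dist.items() if n.startswith("fact:")}


scores = relevancy(graph, "user:alice")
best_first = sorted(scores, key=scores.get, reverse=True)
```

Unreachable facts simply receive no score, which is one way a relevancy threshold can prune a query.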
Scripting, Analysis, Inference
Submitted Scripts applied over Graph Walk
Groovy Scripts (Java Interface)
Can calculate “scores”
Offline Clustering and Analysis Algorithms
Grid/Cloud based
Inference process utilizes knowledge
Asserting statements (Relation Statement)
Prolog, HiLog, F-Logic
Use all features in inferencing (such as certainty)
Certainty
How accurate (F-score) are your NLP extractions?
How accurate is the source material?
How dynamic is your domain?
Can facts be independently verified?
Do multiple sources reinforce a “fact”?
Can your community of users curate or validate information?
How sensitive are you to error?
Will users tolerate error (such as in search) or are you trying to infer over absolute “truth”?
Certainty
Choose to assert facts (or not) based on certainty assessments
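One simple way to let multiple sources reinforce a fact, and then decide whether to assert it, is a noisy-OR combination followed by a threshold. Noisy-OR is a technique swapped in here for illustration and assumes the sources are independent; the slides do not say how certainty is combined, and the function names and the 0.9 threshold are hypothetical.

```python
from math import prod


def combined_certainty(source_certainties):
    """Noisy-OR: probability that at least one independent source is right."""
    return 1.0 - prod(1.0 - c for c in source_certainties)


def should_assert(source_certainties, threshold=0.9):
    """Assert the fact only if the combined certainty clears the threshold."""
    return combined_certainty(source_certainties) >= threshold


single = should_assert([0.7])            # one source at 0.7 is not enough
reinforced = should_assert([0.7, 0.7])   # two agreeing sources reach about 0.91
```

A fact that fails the test is not discarded; it stays in the general model and can be reconsidered when more evidence arrives.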
Guided Inference
Inference is guided by ranked knowledge
Analysis can be performed offline
Guided Inference
Dynamic Inference / Rules
A question/query is posed to initiate the inference
Knowledge base is queried to collect relevant data
Certainty Thresholds can be used
Relevancy Thresholds can be used
AKO Relations are asserted as “facts” to extend the inference
Process is repeated to add assertions
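The loop described above (collect data above the thresholds, assert new facts, repeat) can be sketched as follows. Everything named here is hypothetical: the `Fact` fields, the thresholds, and in particular `chain_rule`, which is a toy rule for illustration and not a biological claim or the rule language used by the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    certainty: float
    relevancy: float


def chain_rule(facts):
    """Toy rule: X inhibits Y and Y induces Z gives X inhibits Z."""
    derived = []
    for a in facts:
        for b in facts:
            if (a.predicate == "inhibits" and b.predicate == "induces"
                    and a.obj == b.subject):
                derived.append(Fact(a.subject, "inhibits", b.obj,
                                    a.certainty * b.certainty,
                                    min(a.relevancy, b.relevancy)))
    return derived


def guided_inference(knowledge_base, rule, min_certainty=0.5,
                     min_relevancy=0.3, max_rounds=10):
    # Collect only data that clears both thresholds.
    asserted = {f for f in knowledge_base
                if f.certainty >= min_certainty and f.relevancy >= min_relevancy}
    for _ in range(max_rounds):
        known = {(f.subject, f.predicate, f.obj) for f in asserted}
        new = {f for f in rule(asserted)
               if f.certainty >= min_certainty
               and (f.subject, f.predicate, f.obj) not in known}
        if not new:          # nothing left to add: stop repeating
            break
        asserted |= new      # assertions extend the next round of inference
    return asserted


kb = [
    Fact("X", "inhibits", "Y", 0.9, 0.8),
    Fact("Y", "induces", "Z", 0.8, 0.7),
    Fact("Y", "induces", "W", 0.4, 0.9),  # below the certainty threshold
]
result = {(f.subject, f.predicate, f.obj) for f in guided_inference(kb, chain_rule)}
```

The low-certainty fact about W never enters the working set, so nothing is derived from it.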
Demonstrations
Alitora News Tracker
Sage Commons, Biomedical Domain
Match Engine, Consumer Application
Alitora News Tracker
Track highly relevant news in domain niche
Use NLP to extract entities and relations of interest
Use certainty assessments as thresholds to consider entities/relations
Use a score (an embedded script) to assign a relevancy to news articles
Heuristic including entity types in articles, relationship types, et cetera
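A scoring heuristic of this kind might look like the sketch below. The weights, type names, and threshold are invented for illustration, and Python stands in for the embedded Groovy scripts mentioned earlier.

```python
# Hypothetical weights: how much each entity or relation type matters
# to this particular news niche.
ENTITY_WEIGHTS = {"Gene": 2.0, "Company": 1.0}
RELATION_WEIGHTS = {"inhibits": 3.0}


def article_score(entities, relations, min_certainty=0.6):
    """Sum type weights scaled by certainty, ignoring low-certainty extractions.

    entities and relations are lists of (type, certainty) pairs.
    """
    score = 0.0
    for kind, certainty in entities:
        if certainty >= min_certainty:
            score += ENTITY_WEIGHTS.get(kind, 0.0) * certainty
    for kind, certainty in relations:
        if certainty >= min_certainty:
            score += RELATION_WEIGHTS.get(kind, 0.0) * certainty
    return score


score = article_score(
    entities=[("Gene", 0.9), ("Gene", 0.4), ("Company", 0.7)],
    relations=[("inhibits", 0.8)],
)
```

The second gene mention falls below the certainty threshold and contributes nothing, so noisy extractions do not inflate an article's relevancy.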
Application: News Tracker
Application: Sage Commons
Share networks of biomedical data across the community of researchers
Million-node networks, billions of triples
Extended AKO with Sage Ontology
Use for structured data and unstructured data
Allow combination of structured data with NLP derived data
Use certainty thresholds to cut down on noise
Use relevancy for efficient queries
Expose data for guided inferencing
Application: Match Engine
Match Engine
Extended AKO with Match Ontology
Foundry for extracting music event entities
Performer, Venue, Price, Genre
Certainty for reducing noise
Match Engine uses inference with multiple sources of “evidence” to match users with events
Demo Application: Bandalay Facebook App
NLP and (Un)Certainty
Capture Error / Uncertainty in Model from NLP
“Reify” relationships so metadata will “fit”
Use multiple types of analysis
Rules, Machine Learning, Topology, Curation, User Feedback
Separate general model and domain model
Allows asserting a fact in the domain model or not (don’t “decide” everything at once)
Use semantics to make decisions about data
Inference can use thresholds to decide to assert facts (or not)
Guided Inference can make informed choice about facts to add/remove from model
Contact Information
750 Menlo Ave, Suite 340
Menlo Park, CA 94025
(415) 310-4406

155 Water Street
Brooklyn, NY 11201
(917) 463-4776

marc@alitora.com
peter@alitora.com