1 Kouznetsov A, 2 Shoebottom B, 1 Baker CJO
Algorithm to populate Telecom domain OWL-DL ontology
with A-box object properties derived from Technical Support Documents
1Kouznetsov A, 2Shoebottom B, 1Baker CJO
1 Department of Computer Science and Applied Statistics, University of New Brunswick, Saint John, Canada
2 Innovatia, Inc, Saint John, Canada
Motivation: Why Ontology-Centric?
• Problem: To respond to information requests in a timely manner, contact center workers need to search through many types of knowledge resources
• Challenge: Increasing quality of service while decreasing contact center costs
• Solution: Using an ontology-centric platform
– less escalation to more experienced workers
– less time spent resolving cases
– greatly reduced training time
Motivation: Why Text Mining?
• Problem: Highly educated experts spend significant time populating the ontology.
• Challenge: Reduce the workload
• Solution: Apply text mining, a semi-automatic method for extracting information (specifically named entities and their relations) from texts and populating a domain ontology.
Focus
• We are focused on the problem of accurately extracting and populating relations between the named entities and presenting them as object properties between A-box individuals in an OWL-DL ontology.
Populate A-box Object Property. Single Property
T-Box: Domain Class Man, Object Property hasSister, Range Class Woman
A-Box: Domain Instance Samuel, hasSister ?, Range Instance Mary
Populate A-box Object Property. Multi-properties
T-Box: Domain Class Man, Object Properties hasSister and hasMother, Range Class Woman
A-Box: Domain Instance Samuel, hasSister ? / hasMother ?, Range Instance Mary
More complicated case…
A-Box: Domain Instance Samuel, hasSister ? / hasMother ? / hasSameLastName ?, Range Instance Mary
Methodology
• Ontology-based information retrieval applies Natural Language Processing (NLP) to link text segments, named entities and relations between named entities to existing ontologies.
• The algorithm leverages a customized gazetteer list, including lists specific to object property synonyms.
• A-box property candidates are scored using functions of the distance between co-occurring terms.
• A-box property prediction and population are based on these scores (thresholds, fuzzy approach).
Main Implementation tools
Java
GATE/JAPE
OWLAPI
Semi-Automatic Ontology populating pipeline
[Pipeline diagram components]
• Source Documents (XML)
• Term List (Excel)
• Unpopulated Ontology (OWL)
• Preprocessing: Synonyms Lists
• Text Segments Processing: Text Segments Separation into Sentences, Tables, Other Text Segments
• Ontology Population: Named Entities, Single Relations, Multi Relations
• Populated Ontology
• Using Ontology: Reasoning, Visualizing, Visual Queries, Connecting Resources
Populating Ontology
[Diagram components: Scoring Framework (Co-occurrence Based Scores generator); Relation Framework for A-box candidates extraction; Candidate; Scores; Decision Framework (Decision module, Reasoning); Ontology; Labelled Data; Tres; Focus]
Co-occurrence Based Scores generator (Light version)
[Diagram components: A-box Candidate; All related content; Synonyms List; Relations Framework; Relation Object; Tokenizer; Gazetteer; Score calculator; Integrator; Fragments Processor; Scores]
Generation of Scores
• Relation Collection: framework to process Relation objects
• Relation Object: integrates an object property with
– all types of related text fragments
– ontology objects
– intermediate and final score processing results
• A Relation Object is identified as: Domain Class:Domain Instance:Object Property:Range Class:Range Instance
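The identifier layout above can be illustrated with a minimal Python sketch. The class name `RelationKey` and the function `parse_relation_key` are hypothetical and not part of the presented system; the sample string is the relation shown on the Example Sentence Type 1 slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationKey:
    """Hypothetical container for the five parts of a relation identifier."""
    domain_class: str
    domain_instance: str
    object_property: str
    range_class: str
    range_instance: str


def parse_relation_key(text: str) -> RelationKey:
    """Split 'DomainClass:DomainInstance:Property:RangeClass:RangeInstance'."""
    parts = [part.strip() for part in text.split(":")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 colon-separated parts, got {len(parts)}")
    return RelationKey(*parts)


key = parse_relation_key(
    "Telecommunications_Chassis:8010co_Chassis:"
    "hasChassis_Shipping_Accessories:"
    "Telecommunications_Chassis_Screws:Screws"
)
print(key.domain_instance, key.object_property, key.range_instance)
```

Keeping the five parts in a frozen dataclass makes the key hashable, so it could serve as a dictionary key when collecting evidence per relation; that usage is a suggestion, not something stated on the slides.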
Scores Generator: Details
Score Calculator:
• Calculates scores for the text fragments associated with the Relation.
• The current version is based on the distance between co-occurring entities and the number of text fragments with co-occurrence.
• Includes by Text Fragments Processor and Integrator
2-terms and 3-terms scoring system
[Diagram components: Tokenizer; Score Gazetteer; Score Processor; Domain Synonyms list; Range Synonyms list; Object Property Synonyms list; Tokenized sentence; sentence score]
Legend: Legacy (2 terms) system; Modified/added in the new (3 terms) system
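A rough Python sketch of how synonym lists could be matched against a tokenized sentence follows. It rests on several assumptions not confirmed by the slides: tokens are lower-cased and split on whitespace, distance is the smallest difference between the start positions of a domain match and a range match, and the 3-terms variant only adds a check for an object property synonym. The function names and the sample sentence are hypothetical; the synonyms are taken from the example slides.

```python
def find_positions(tokens, synonyms):
    """Return sorted start indexes where any synonym occurs as a token run."""
    positions = set()
    for synonym in synonyms:
        syn_tokens = synonym.lower().split()
        if not syn_tokens:
            continue  # ignore empty synonym entries
        width = len(syn_tokens)
        for start in range(len(tokens) - width + 1):
            if tokens[start:start + width] == syn_tokens:
                positions.add(start)
    return sorted(positions)


def co_occurrence(sentence, domain_syns, range_syns, property_syns=None):
    """Return (distance, has_property), or None without co-occurrence.

    Assumption: distance is the minimum gap between start positions.
    """
    tokens = sentence.lower().split()
    domain_pos = find_positions(tokens, domain_syns)
    range_pos = find_positions(tokens, range_syns)
    if not domain_pos or not range_pos:
        return None
    distance = min(abs(d - r) for d in domain_pos for r in range_pos)
    has_property = bool(property_syns) and bool(
        find_positions(tokens, property_syns)
    )
    return distance, has_property


sentence = "The chassis has two power supply units"  # hypothetical sentence
domain = ["chassis", "switch chassis"]
rng = ["power supply", "power module"]
prop = ["have", "has"]

two_term = co_occurrence(sentence, domain, rng)          # legacy 2-terms check
three_term = co_occurrence(sentence, domain, rng, prop)  # 3-terms check
print(two_term, three_term)
```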
Multiple Formats Score Generation
Technical documentation contains knowledge displayed in multiple formats, each requiring different processing subroutines:
• Table Processing
• Sentence Processing
• Other segments
Extensible Data Model
• Corpus
• Document: Doc ID
• Document Segment
– Table Segment: Data Cell, Row Header, Column Header, Table Header (each with ID and Content)
– Text Segment: Sentence (ID and Content)
• Options: Sections, Paragraphs, Bullet lists, Headings
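One possible way to express this data model in code is sketched below in Python. All class and field names are hypothetical; the slides only name the elements and state that each carries an ID and Content. The sample table content is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Any addressable piece of content: an ID plus its text."""
    id: str
    content: str


@dataclass
class TextSegment:
    sentences: List[Element] = field(default_factory=list)


@dataclass
class TableSegment:
    table_header: Element
    column_headers: List[Element] = field(default_factory=list)
    row_headers: List[Element] = field(default_factory=list)
    data_cells: List[Element] = field(default_factory=list)


@dataclass
class Document:
    doc_id: str
    text_segments: List[TextSegment] = field(default_factory=list)
    table_segments: List[TableSegment] = field(default_factory=list)


@dataclass
class Corpus:
    documents: List[Document] = field(default_factory=list)


# Hypothetical content, only to show how the pieces nest.
table = TableSegment(
    table_header=Element("t1", "Chassis components"),
    column_headers=[Element("t1-c1", "Power Supply")],
    row_headers=[Element("t1-r1", "8010co chassis")],
    data_cells=[Element("t1-r1-c1", "2")],
)
text = TextSegment(sentences=[Element("s1", "The chassis has two power supplies.")])
corpus = Corpus(documents=[Document("doc-1", [text], [table])])
print(len(corpus.documents[0].table_segments))
```

Further segment types (sections, paragraphs, bullet lists, headings) could be added as additional dataclasses next to `TextSegment` and `TableSegment`, which is one reading of "extensible" here.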
A-Box Prop. Population
• Ontology processing: T-Box Obj. Properties (102)
• A-Box property candidates list: A-Box Obj. Properties (399)
• Text Mining: corpus, Gazetteer List
• A-Box scoring
– Properties with occurrence of domain or range individuals (256)
– Properties with co-occurrence of domain and range individuals (143)
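The slides do not spell out how the A-box candidates list is derived from the T-box properties. A minimal Python sketch, assuming candidates are all pairings of individuals from a property's domain class with individuals from its range class, and reusing the Man/Woman example from the earlier slides, could look like this. The function name and data layout are hypothetical.

```python
from itertools import product


def generate_candidates(tbox_properties, individuals_by_class):
    """Pair every domain individual with every range individual per property.

    Assumption: this cross product is how candidates are enumerated.
    """
    candidates = []
    for name, domain_class, range_class in tbox_properties:
        domains = individuals_by_class.get(domain_class, [])
        ranges = individuals_by_class.get(range_class, [])
        for domain_individual, range_individual in product(domains, ranges):
            candidates.append((domain_individual, name, range_individual))
    return candidates


# T-box: (object property, domain class, range class)
tbox = [("hasSister", "Man", "Woman"), ("hasMother", "Man", "Woman")]
# A-box individuals grouped by class
individuals = {"Man": ["Samuel"], "Woman": ["Mary"]}

candidates = generate_candidates(tbox, individuals)
for candidate in candidates:
    print(candidate)
```

Each candidate would then be scored against the corpus evidence before deciding whether to assert it.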
Evidence for A-box Obj. Property candidates
Current A-box Object Property Candidate
• Evidence for current A-box (co-occurrence of Domain and Range): Text Segments (Sentence: ID, Content) and Table Segments (Data Cell, Row Header, Column Header, Table Header: ID, Content)
• Evidence for current A-box (occurrence of Domain or Range): Text Segments (Sentence: ID, Content) and Table Segments (Data Cell, Row Header, Column Header, Table Header: ID, Content)
Table Segments: Primary Scoring
[Diagram: Table Segment elements (Data Cell, Row Header, Column Header, Table Header, each with ID and Content) linked to the current A-box Object Property Candidate (Domain, Property, Range) for A-Box scoring]
Table Segments: Secondary Scoring
[Diagram: Table Segment elements (Data Cell, Row Header, Column Header, Table Header, each with ID and Content) linked to the current A-box Object Property Candidate (Domain, Property, Range) for A-Box scoring]
Sentence Scoring
• A-box Object Property score for a sentence: SentenceScore = 1/(distance+1) + Bonus
• Integrated Object Property score over all related sentences: IntegratedScore = SUM(SentenceScore)
• The Integrated Score is summed with the Table Scores
• Normalized Object Property score: NormalizedScore = IntegratedScore/Norm
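The scoring formulas above translate directly into a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, table scores are assumed to be simply added to the sentence total, and the `norm` value in the example is an arbitrary placeholder. The distance and bonus pairs are the four cases shown on the sentence scoring slide.

```python
def sentence_score(distance, bonus=0.0):
    """SentenceScore = 1/(distance+1) + Bonus."""
    return 1.0 / (distance + 1.0) + bonus


def integrated_score(sentence_scores, table_scores=()):
    """Sum of sentence scores; adding table scores is an assumption."""
    return sum(sentence_scores) + sum(table_scores)


def normalized_score(integrated, norm):
    """NormalizedScore = IntegratedScore / Norm."""
    return integrated / norm


# (distance, bonus) pairs from the four sentence types on the slide
cases = [(1000, 0), (4, 0), (6, 3), (4, 10)]
scores = [sentence_score(distance, bonus) for distance, bonus in cases]
total = integrated_score(scores)
print([round(score, 6) for score in scores])
print(round(total, 6))
print(round(normalized_score(total, norm=2.0), 6))  # norm=2.0 is a placeholder
```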
Sentence scoring: Score = 1/(distance+1) + Bonus
Legend: D = Domain Synonym, R = Range Synonym, P = Object Property Synonym
• Type 1 (< > </ > D R): Distance: 1000, Bonus = 0, Score = 1/(1000+1)+0 = 0.000999
• Type 2 (< > </ > 1 2 3 D 4 R): Distance: 4, Bonus = 0, Score = 1/(4+1)+0 = 0.2
• Type 3 (< > </ > 1 2 3 D 4 R 6 P): Distance: 6, Bonus = 3, Score = 1/(6+1)+3 = 3.14
• Type 4 (< > </ > 1 2 P D 4 R): Distance: 4, Bonus = 10, Score = 1/(4+1)+10 = 10.2
Example Sentence Type 1
Layout: < > </ > D R
Distance: 1000, Bonus = 0, Score = 1/(1000+1)+0 = 0.000999
sentence before cleaning: ["<Paragraph></Action> <Figure Numbered="Unnumbered" Position="Inline" TextSize="medium" Width="column" frame="all" id="DLM-11334063" xml:lang="en"><image border-style="none" border-width="medium" xml:lang="en" href="ERGNN46205-301Loosening_screws_on_the_SDM_FW4_8010co_chassis33b.png"/></Figure></Step><Step xml:lang="en"><Action><Paragraph xml:lang="en">Rotate the insert/extractlevers to eject the 8660 SDM from the chassis.] Final Score=9.99000999000999E-4 Best Bonus=0.0 Final Distance=1000.0
Telecommunications_Chassis:8010co_Chassis:hasChassis_Shipping_Accessories:Telecommunications_Chassis_Screws:Screws
Property Synonyms: need, have, require, has
Domain Synonyms: 8010co chassis, 8010co Chassis, 8010 CO chassis, 8010co, 8010CO chassis
Range Synonyms: Screws, screws
Example Sentence Type 2
sentence after cleaning: In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other.
Final Score=0.05 Best Bonus=0.0 Final Distance=19
Telecommunications_Chassis:Chassis:hasChassis_Components:Telecommunications_Chassis_Power_Supply:Power_Supply
Property Synonyms: have, has
Domain Synonyms: chassis, switch chassis, 8000 series, Chassis, CO chassis
Range Synonyms: Power Supply, transformer, power supply, power module, Power supply
Layout: < > </ > 1 2 3 D 4 R
Example Sentence Type 4
sentence after cleaning: In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other.
Final Score=10.05 Best Bonus=10.0 Final Distance=19
Telecommunications_Chassis_Power_Supply:Power_Supply:isPart_of_Chassis:Telecommunications_Chassis:Chassis
Property Synonyms: used in, include
Domain Synonyms: Power Supply, transformer, power supply, power module, Power supply
Range Synonyms: chassis, switch chassis, 8000 series, Chassis, CO chassis
Layout: < > </ > 1 2 P D 4 R
Bonus Calculation
Bonus = Bonus Constant * Number of tokens in property
• Distance: 6, Bonus Constant = 10, Tokens in Property = 2, Score = 1/(6+1)+2*10 = 20.14
• Distance: 6, Bonus Constant = 10, Tokens in Property = 1, Score = 1/(6+1)+1*10 = 10.14
Sentence Example: Device X does not support Device Y
• Object Property "Support": Tokens Number = 1, Obtained Score = 1/(3+1)+1*10 = 10.25
• Object Property "Not Support": Tokens Number = 2, Obtained Score = 1/(3+1)+2*10 = 20.25
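The bonus rule can be sketched in Python as follows. The bonus constant of 10 and the distance of 3 come from the slide; the function names, the whitespace tokenization, and the choice to keep the largest bonus among all matching property synonyms are assumptions (the example output reports a "Best Bonus", which suggests but does not confirm a maximum).

```python
BONUS_CONSTANT = 10.0  # value used on the slide


def contains_run(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    width = len(phrase_tokens)
    return any(
        tokens[start:start + width] == phrase_tokens
        for start in range(len(tokens) - width + 1)
    )


def best_bonus(sentence, property_synonyms, bonus_constant=BONUS_CONSTANT):
    """Bonus = Bonus Constant * number of tokens in the matched property.

    Assumption: when several synonyms match, the largest bonus is kept.
    """
    tokens = sentence.lower().split()
    best = 0.0
    for synonym in property_synonyms:
        syn_tokens = synonym.lower().split()
        if syn_tokens and contains_run(tokens, syn_tokens):
            best = max(best, bonus_constant * len(syn_tokens))
    return best


def score(distance, bonus):
    return 1.0 / (distance + 1.0) + bonus


sentence = "Device X does not support Device Y"
support_score = score(3, best_bonus(sentence, ["support"]))
not_support_score = score(3, best_bonus(sentence, ["not support"]))
print(support_score, not_support_score)
```

Because the two-token property earns twice the bonus, "not support" outscores "support" for the same sentence, which keeps a negated relation from being read as its positive counterpart.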
Normalization
• Norm coefficient for an A-box object property:
Norm = Log(1.0 + (NSD + 1.0/Cd) * (NSR + 1.0/Cr))
NSD – Number of sentences in which the Domain occurred
Cd – Domain Synonyms List cardinality
NSR – Number of sentences in which the Range occurred
Cr – Range Synonyms List cardinality
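As a worked illustration, the norm coefficient can be computed as below. The slide does not state the logarithm base, so the natural logarithm is assumed here; the function name and all input counts are hypothetical values chosen only to show the arithmetic.

```python
import math


def norm_coefficient(nsd, cd, nsr, cr):
    """Norm = log(1 + (NSD + 1/Cd) * (NSR + 1/Cr)); natural log assumed."""
    return math.log(1.0 + (nsd + 1.0 / cd) * (nsr + 1.0 / cr))


# Hypothetical counts: the domain occurs in 3 sentences and has 5 synonyms,
# the range occurs in 2 sentences and has 2 synonyms.
norm = norm_coefficient(nsd=3, cd=5, nsr=2, cr=2)

# (3 + 0.2) * (2 + 0.5) = 8.0, so norm = log(9.0)
integrated = 10.25  # hypothetical integrated score
print(round(norm, 4), round(integrated / norm, 4))
```

Since both 1/Cd and 1/Cr are positive, the argument of the logarithm stays above 1 and the norm stays above zero even when NSD or NSR is 0, so the division is always defined.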
Gold Standard and Evaluation Framework
[Diagram components]
• Ontology populating pipeline: Source Documents (XML); Term List (Excel); Unpopulated Ontology (OWL); Preprocessing (Synonyms Lists); Text Segments Processing (Text Segments Separation: Sentences, Tables, Bullet Lists); Ontology Population (Named Entities, Single Relations, Multi Relations); Populated Ontology; Using Ontology (Reasoning, Visualizing, Visual Queries, Connecting Resources)
• Evaluation: T-Box Ontology; A-Box Ontology; Labels; Evaluation Report; Populate Ontology; Prediction Evaluation Framework (Evaluate predicted properties / Update DB); Gold Standard Database; Import labels; Knowledge Engineer
Thresholds: Decision Boundary
All scores for each A-box property candidate are summed over the eligible sources of evidence for the A-box in question.
Threshold in use: trade-off between Recall and Precision
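The trade-off can be made concrete with a small Python sketch. It assumes that a candidate is predicted positive when its score is at or above the threshold, which the slides do not state explicitly. The scored candidates and their gold labels below are invented for illustration and are not results from the study.

```python
def precision_recall(scored, threshold):
    """Positive-class precision and recall for a score threshold.

    scored: list of (score, is_positive_in_gold_standard) pairs.
    Assumption: score >= threshold means "predicted positive".
    """
    tp = sum(1 for s, gold in scored if s >= threshold and gold)
    fp = sum(1 for s, gold in scored if s >= threshold and not gold)
    fn = sum(1 for s, gold in scored if s < threshold and gold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # 0.0 when nothing predicted
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical candidates: (score, gold label)
scored = [(10.2, True), (3.14, True), (0.2, False), (0.05, True), (0.001, False)]

strict = precision_recall(scored, threshold=5.0)   # favours precision
loose = precision_recall(scored, threshold=0.01)   # favours recall
print(strict, loose)
```

With the high threshold only the best-scored candidate is accepted, so precision is perfect and recall is low; lowering the threshold recovers every true relation at the cost of one false positive.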
Results for Tables: Baseline result
Focus on Positive class Recall and Positive class Precision
Class of interest (Positive class) Recall =0.80 Precision=0.85
Results for Tables: Continued
Focus on Positive class Precision
Class of interest (Positive class) Recall =0.25 Precision=1.0
Results for Tables: Continued
Focus on Positive class Recall
Class of interest (Positive class) Recall =1.0 Precision=0.775
Results for Sentences
Focus on Positive class Precision
Class of interest (Positive class) Recall =0.14 Precision=1.0
Results for Sentences and Tables
Focus on Positive class Precision Class of interest (Positive class)
Recall =0.4 Precision=1.0
Synergetic effect of using Sentences and Tables (wrt Precision=1.0):
Recall (sentences)= 0.14 Recall (tables)= 0.25 Recall (sentences & tables)= 0.4
Advantages
• Improve Quality of Knowledge Base
– Managing the argumentation process KB vs KE
– Iterative improvement of accuracy
• Tier 1 doing Tier 2 task (improve service)
– Tier 1 (high precision): KB query
– Tier 2 (high recall): knowledge integration
– Facilitate information processing without KE
• Reduce workload (saving)
Improve Quality of Knowledge Base
• Offline task by Knowledge Engineer
• Disambiguation
– The expert can pay special attention to any significant inconsistency between human and machine outputs, such as highly scored A-box candidates labeled as negatives
• Human Expert & Machine Committee vs. single human expert
Real Time Integration of New Evidence
• Online, by call centre worker, at knowledge use stage
– Extracting additional object properties from new documents for emergency cases
– High Positive Precision focused scenario
• Offline, by senior call centre worker, at knowledge use stage
– Extracting additional object properties from new documents for questions not answered online
– High Positive Recall focused scenario
Reduce Workload
• Online and Offline
• Automatically Extracted Evidence
• Ranked Solutions with a notified level of confidence
Gold Standard Corpus and Evaluation Framework
[Diagram repeated from the "Gold Standard and Evaluation Framework" slide]
Future Work: Extend Literature Scheme
• Sections
• Paragraphs
• Bullet Lists
• Connect with Headings and Topics