Another approach to Information Extraction using Extended Ontologies
Marek Nekvasil
agenda
gathering information with wrappers
ways to build a wrapper
using and extending an ontology
templates and patterns
suggesting a simple wrapper induction method
wrapping up a document
wrapping a document is a synonym for identifying the relevant information in it
there are many ways to wrap a document
wrapper classes
string-based wrappers: Kushmerick's wrapper classes
tree-based wrappers: XPath, Elog
finite automata
<HTML>
<TITLE>Ceny pobytů</TITLE>
<BODY>
<B>Řecko - Lefkada</B> <I>16 299 Kč</I><BR>
<B>Mallorca - Santa Ponsa</B> <I>21 100 Kč</I><BR>
<B>Egypt - Sharm El Sheikh</B> <I>18 500 Kč</I><BR>
<B>Egypt - Ghiza</B> <I>19 049 Kč</I><BR>
</BODY>
</HTML>
LR class
basic class (LR stands for Left-Right)
2n parameters (2 delimiters for every part of the extracted tuple)
example: for the page above, a suitable wrapper is LR(<B>; </B>; <I>; </I>)
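The LR idea can be sketched in a few lines of Python. This is only an illustration, not Kushmerick's implementation: the function name `lr_extract` and the shortened copy of the example page are assumptions made for the sketch. For every attribute of the tuple the wrapper looks for the next left delimiter, then for the right delimiter that follows it, and keeps the text in between.

```python
def lr_extract(page, delimiters):
    """Extract tuples with an LR wrapper.

    delimiters is a list of (left, right) string pairs, one pair per
    attribute of the extracted tuple (2n parameters for n attributes).
    """
    tuples = []
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:          # no further tuple on the page
                return tuples
            start += len(left)
            end = page.find(right, start)
            if end == -1:            # unterminated attribute, stop here
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))


# Shortened copy of the example page (hypothetical test input).
PAGE = (
    "<HTML><TITLE>Ceny pobytů</TITLE><BODY>"
    "<B>Řecko - Lefkada</B> <I>16 299 Kč</I><BR>"
    "<B>Mallorca - Santa Ponsa</B> <I>21 100 Kč</I><BR>"
    "<B>Egypt - Sharm El Sheikh</B> <I>18 500 Kč</I><BR>"
    "<B>Egypt - Ghiza</B> <I>19 049 Kč</I><BR>"
    "</BODY></HTML>"
)

rows = lr_extract(PAGE, [("<B>", "</B>"), ("<I>", "</I>")])
for destination, price in rows:
    print(destination, "|", price)
```

The sketch works on this page only because `<B>` and `<I>` never occur outside the data rows; the HLRT and OCLR classes exist for pages where that does not hold.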
other LR class derivatives
Nicolas Kushmerick's classes:
HLRT (Head-Left-Right-Tail)
OCLR (Opening-Closing-Left-Right)
HOCLRT (…)
N-LR or N-HLRT (Nested-…)
XPath wrappers
using XPath queries to identify data in the tree representation of a document
often using just the very basic features of the XPath language
usually building queries from the root of a document
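A minimal sketch of such a wrapper, using Python's standard `xml.etree.ElementTree`, which supports only a small subset of XPath. The input is an assumed well-formed variant of a price page (with `<BR/>` closed), because ElementTree cannot parse arbitrary HTML; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed variant of a price page.
XHTML = """<HTML><TITLE>Ceny pobytů</TITLE><BODY>
<B>Řecko - Lefkada</B> <I>16 299 Kč</I><BR/>
<B>Egypt - Ghiza</B> <I>19 049 Kč</I><BR/>
</BODY></HTML>"""

root = ET.fromstring(XHTML)  # root is the <HTML> element

# Both queries are built from the root of the document and use
# only the most basic path steps.
destinations = [b.text for b in root.findall("./BODY/B")]
prices = [i.text for i in root.findall("./BODY/I")]

records = list(zip(destinations, prices))
print(records)
```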
Elog
declarative language similar to Prolog
uses predicates to generate instances
used in the Lixto tool
[example of an Elog wrapper]
finite automata
FSM can be used for wrapping in various ways
usually used for searching in the linear representation of a document
Carme shows it is possible to use FSM for searching in the tree structure
methods comparison
Tree-based wrappers are more error-prone than linear string-based wrappers
Elog and N-LR allow extraction not only from tabular data structure but also from a general hierarchical data structure
XPath wrappers reuse a well defined standard
building a wrapper
by hand
Oracle and PAC analysis
interactive visual pattern design
tree-fragment queries
tree traversal pattern generalization
and many others …
PAC analysis
uses an abstract function called Oracle to gather enough example instances of the extracted class (assuming the Oracle is played by a human)
gathers examples until their number N is large enough to suggest a wrapper of the given class with a designated error e at a given probability level 1-d, using the formula:
e·N > log2(R/d)
finally searches for the first set of wrapper parameters that matches all the examples
interactive visual pattern design
used in the Lixto tool to craft wrappers in the Elog language
first the user points out example instances, which creates a generating rule, a pattern
then the user visually adds conditions (filters) that restrict the patterns
tree-fragment queries
searching for a minimal XPath query that forms a tree-prefix of all examples
[figure: tree-prefix examples]
tree traversal pattern generalization
application of graph theory to the generalized document tree
searching for the shortest path through the document tree and thus forming an efficient XPath query
ontologies and wrappers
an ontology is a knowledge model
we can make a knowledge model that summarizes what information we are going to extract
with a small extension we can use the ontology to identify examples of what we are going to extract
these examples can be used to build a wrapper with any method
ontology in OWL
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Ontology rdf:about="">
    <owl:imports rdf:resource="http://www.somedomain.com/x"/>
  </owl:Ontology>
  <owl:Class rdf:ID="class_A">
    <owl:disjointWith rdf:resource="#class_B"/>
  </owl:Class>
  <owl:Class rdf:ID="class_C">
    <rdfs:subClassOf rdf:resource="#class_A"/>
  </owl:Class>
  <owl:DatatypeProperty rdf:ID="property_A">
    <rdfs:domain rdf:resource="#class_A"/>
  </owl:DatatypeProperty>
</rdf:RDF>
extending OWL
in terms of ontologies, we extract values of datatype properties
therefore we need some technique to identify (and rank) possible instances of these values
we suggest a way to define complex templates of typical values of a datatype property
placing a template into the ontology
we establish a new namespace: xmlns:ot="http://st.vse.cz/~XNEKM06/ontologytemplates#"
in the new namespace we use an element <ot:Template> to write a template down
such a template can only be attached to a datatype property:
<owl:DatatypeProperty rdf:ID="property_A">
  <rdfs:domain rdf:resource="#class_B"/>
  <ot:Template ...> ... </ot:Template>
</owl:DatatypeProperty>
patterns
pattern – a general rule that can be evaluated against any contiguous part of a document to determine the degree to which it matches
template
template – a set of rules that can be evaluated as a whole against any contiguous part of a document to determine the degree to which it matches
a template is a special case of a pattern
thus a template can contain other templates
simple patterns
a pattern has an internal algorithm that can (given some parameters) identify possible matches throughout the document, with a pattern match degree as its output
moreover we need to infer a degree of evidence certainty, that is, our confidence that the match really is a value the pattern was meant to identify
deriving the degree of evidence certainty 1
let us define two propositions:
A – the pattern algorithm identified a given part of a document
E – the part really should have been identified by that pattern
A and E are logical propositions and in fuzzy logic their truth value is a real number from the interval [0, 1]
deriving the degree of evidence certainty 2
intuitively there should be a relation A ⇒ E
thanks to the modus ponens rule we can write in basic logic
(A & (A ⇒ E)) ⇒ E
from that we can derive
val(E) ≥ val(A & (A ⇒ E))
and, not wanting to overestimate the evidence certainty, we set
val(E) = val(A & (A ⇒ E))
deriving the degree of evidence certainty 3
now we introduce a parameter of the pattern
val(A ⇒ E) = p
we call it pattern precision
using, for example, Łukasiewicz logic we can derive
e = max (0, a + p - 1)
where e stands for val(E) and a for val(A)
deriving the degree of evidence certainty 4
without doubt it is true that
(E & A) ⇒ E, and (¬A & E) ⇒ E
while in Łukasiewicz logic we can derive
(A & E) ⇔ ¬(E ⇒ ¬A)
and therefore
¬(E ⇒ A) ⇔ (¬A & E)
deriving the degree of evidence certainty 5
when we substitute ¬(E ⇒ A) for (¬A & E) we can derive
¬(E ⇒ A) ⇒ E
and we introduce a second parameter
val(E ⇒ A) = c
which we call pattern completeness
deriving the degree of evidence certainty 6
combining the two rules above we can derive an ultimate rule
((A & (A ⇒ E)) ∨ ¬(E ⇒ A)) ⇒ E
and, still not wanting to overestimate the evidence certainty, we can write down (in Łukasiewicz logic)
e = max (max (0, a + p - 1), 1 - c)
simple patterns summary
a pattern identifies a given place in the document with a pattern match degree denoted as a
every pattern has two parameters: p – precision and c – completeness
the degree of pattern evidence certainty can then be calculated as
e = max (a + p -1, 1 – c)
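The summary formula can be checked with a short Python sketch. The function name `evidence_certainty` is an assumption made for illustration; the arithmetic follows the formula above, with the Łukasiewicz conjunction max(0, a + p - 1) for the precision rule and 1 - c for the completeness rule.

```python
def evidence_certainty(a, p, c):
    """Degree of evidence certainty e of a simple pattern.

    a: pattern match degree, p: precision, c: completeness,
    all expected to lie in the interval [0, 1].
    """
    from_precision = max(0.0, a + p - 1.0)   # val(A & (A => E))
    from_completeness = 1.0 - c              # val(not (E => A))
    return max(from_precision, from_completeness)


# A strong match by a fairly precise pattern.
print(evidence_certainty(a=0.9, p=0.8, c=0.7))
# No match at all: only the completeness term contributes.
print(evidence_certainty(a=0.0, p=0.8, c=0.7))
```

Because 1 - c is never negative, the inner max with 0 can be dropped, which is why the summary writes e = max (a + p - 1, 1 - c).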
composite patterns
to form a template we combine several simple patterns together
the evidence certainty is computed in the same way as for simple patterns; however, we first have to derive the pattern match degree of the composite
deriving the composite pattern match degree
joining evidences of two patterns can be viewed as joining two fuzzy sets
for this we can use either a set union (associated with disjunction) or a set intersection (associated with conjunction)
therefore we compute the composite pattern match degree as the conjunction or disjunction of the evidence certainties of all component patterns
so we get two kinds of templates: conjoint and disjoint
the nature of templates
for the calculations we use the formulae of min-conjunction and max-disjunction
the parameters p and c of component patterns now get a new meaning
in a disjoint template a high value of p means that the pattern forms a sufficient condition
in a conjoint template a high value of c means that the pattern forms a necessary condition
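A small sketch of how a template could be evaluated, assuming the component evidence certainties have already been computed. The function names and the numeric values are illustrative only. The composite match degree is the minimum (conjoint) or maximum (disjoint) of the component certainties, and the template's own p and c are then applied exactly as for a simple pattern.

```python
def evidence_certainty(a, p, c):
    """e = max(max(0, a + p - 1), 1 - c), as for simple patterns."""
    return max(max(0.0, a + p - 1.0), 1.0 - c)


def composite_match_degree(component_certainties, kind):
    """Join the evidence certainties of the component patterns."""
    if kind == "conjoint":
        return min(component_certainties)   # min-conjunction
    if kind == "disjoint":
        return max(component_certainties)   # max-disjunction
    raise ValueError("kind must be 'conjoint' or 'disjoint'")


def template_certainty(component_certainties, kind, p, c):
    """Evidence certainty of a template built from component patterns."""
    a = composite_match_degree(component_certainties, kind)
    return evidence_certainty(a, p, c)


components = [0.7, 0.4]  # hypothetical certainties of two components
print(template_certainty(components, "disjoint", p=0.95, c=0.8))
print(template_certainty(components, "conjoint", p=0.95, c=0.8))
```

In the disjoint case one strong component is enough to carry the template, while in the conjoint case the weakest component limits the result.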
writing down the templates
we write the template down so that it fits into the ontology as shown before:
<ot:Template ot:p="0.95" ot:c="0.8" ot:type="disjoint">
...
</ot:Template>
the component patterns are written as nested XML tags
a few kinds of patterns
<ot:String ot:p="0.7">Egypt</ot:String>
<ot:Stringlist ot:source="c:\temp\zeme.txt" ot:c="0.62"/>
<ot:Concatenation> ..</..>
<ot:Context ot:side="left" ot:maxdistance="1" ot:c="0.5">..</..>
<ot:Number ot:min="1" ot:max="10"/>
<ot:Distribution ot:type="gauss" ot:mean="10900" ot:variance="9200000"/>
<ot:Regexp> ..</..>
…
example template
<ot:Template ot:type="disjoint" ot:c="0.9">
  <ot:Concatenation>
    <ot:Distribution ot:type="gauss" ot:mean="10900" ot:variance="9200000"/>
    <ot:Stringlist>
      <ot:String ot:case="any">kc</ot:String>
      <ot:String ot:case="any">kč</ot:String>
      <ot:String ot:case="same">,-</ot:String>
    </ot:Stringlist>
  </ot:Concatenation>
  <ot:Context ot:side="left" ot:maxdistance="2" ot:p="0.6">
    <ot:Template>
      <ot:String ot:case="any">cena</ot:String>
      <ot:String ot:case="any">cena:</ot:String>
    </ot:Template>
  </ot:Context>
</ot:Template>
annotating the document
first of all we can use the ontology as a model of the extracted data
then we use the templates included in the ontology to identify possible example instances of the extracted values
these examples can be used with any wrapper induction method
purifying the evidences
since every pattern has the precision attribute, we can say that up to 100·(1-p) % of the template evidences can be false
we can group the evidences into segments based on their absolute XPath
then we calculate the sum of confidences of all evidences in each segment and ignore the 100·(1-p) % of segments with the lowest sums
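The purification step might look as follows in Python. Several details are assumptions of this sketch rather than part of the method as stated: evidences are given as (absolute XPath, certainty) pairs, a segment is formed by evidences whose XPaths differ only in positional indices, and the number of dropped segments is rounded down.

```python
import math
import re
from collections import defaultdict


def purify(evidences, p):
    """Drop the weakest segments of evidences.

    evidences: list of (absolute_xpath, certainty) pairs
    p: precision of the template that produced the evidences
    Returns a dict mapping each kept segment key to its evidences.
    """
    segments = defaultdict(list)
    for xpath, certainty in evidences:
        # Assumption: the segment key is the XPath without indices.
        key = re.sub(r"\[\d+\]", "", xpath)
        segments[key].append((xpath, certainty))

    # Rank segments by the sum of their certainties, weakest first.
    ranked = sorted(segments, key=lambda k: sum(c for _, c in segments[k]))

    # Assumption: the share (1 - p) of segments is rounded down.
    n_drop = math.floor((1.0 - p) * len(ranked))
    return {key: segments[key] for key in ranked[n_drop:]}


evidences = [
    ("/HTML/BODY/I[1]", 0.9),
    ("/HTML/BODY/I[2]", 0.8),
    ("/HTML/BODY/I[3]", 0.9),
    ("/HTML/TITLE", 0.3),
]
kept = purify(evidences, p=0.5)
print(list(kept))
```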
generalizing the segments
we generalize the segment using the variable index in the XPath
comparing the number of this generalized segment's elements with the original, we can use the completeness parameter to measure the probable error of such a generalization
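One way to sketch the generalization in Python, assuming XPaths made of simple steps such as `TR[3]` and evidences that all lie at the same depth. The names `generalize` and `coverage` are hypothetical. A step whose index varies across the segment loses its index, so the resulting query selects every sibling of that name.

```python
import re

STEP = re.compile(r"^(?P<name>[^\[\]]+)(\[(?P<index>\d+)\])?$")


def generalize(xpaths):
    """Generalize absolute XPaths by dropping the indices that vary."""
    split = [x.strip("/").split("/") for x in xpaths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("all paths must have the same depth")
    result = []
    for steps in zip(*split):
        parsed = [STEP.match(s) for s in steps]
        if any(m is None for m in parsed):
            raise ValueError("unsupported step in path")
        names = {m.group("name") for m in parsed}
        if len(names) != 1:
            raise ValueError("paths differ in element names")
        name = names.pop()
        indices = {m.group("index") for m in parsed}
        if len(indices) == 1 and None not in indices:
            result.append(f"{name}[{indices.pop()}]")  # constant index
        else:
            result.append(name)                        # variable index
    return "/" + "/".join(result)


def coverage(n_evidences, n_generalized):
    """Share of the generalized segment's elements backed by evidence."""
    return n_evidences / n_generalized


query = generalize(["/HTML/BODY/I[1]", "/HTML/BODY/I[3]", "/HTML/BODY/I[4]"])
print(query)
# If the generalized query selects 4 elements and 3 had evidence:
print(coverage(3, 4))
```

The resulting ratio could then be set against the completeness c of the template: a ratio far below c would hint that the generalization covers elements the template should have recognized but did not.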
matching the segments
we can match the segments found for several datatype properties and thus form complex rules for extracting instances of ontology classes
the matching can be based on the number of their elements or on the conformity of their XPath
future work suggestions
integration with some wrapper generation tool
automatic learning of the patterns
using other properties of ontologies, such as cardinalities