Learning to Extract Symbolic Knowledge from the World Wide Web
Changho Choi
Source: http://www.cs.cmu.edu/~knigam/
Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum
Carnegie Mellon University, J.Stefan Institute
AAAI-98
3/6/2001 Changho Choi, University at Buffalo
Abstract
Information on the Web (understandable to humans) → extract information → knowledge base (KB)
Introduction (#1/4)
Two types of input to the information extraction system:
- Ontology
  - Specifies the classes and relations of interest
  - For example, a hierarchy of classes including Person, Student, Research.Project, Course, etc.
- Training examples
  - Represent instances of the ontology classes and relations
  - For example, a course web page for the Course class, faculty web pages for the Faculty class, a pair of pages for Courses.Taught.By, etc.
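The two inputs can be pictured as plain data structures. The following is only a minimal sketch: the `Ontology` class, the parent links (Student and Faculty under Person), the argument classes of `Courses.Taught.By`, and the `example.edu` URLs are all illustrative assumptions, not details given on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Ontology:
    # Class name -> parent class name (None for a top-level class).
    classes: Dict[str, Optional[str]] = field(default_factory=dict)
    # Relation name -> (class of first argument, class of second argument).
    relations: Dict[str, Tuple[str, str]] = field(default_factory=dict)

    def ancestors(self, name: str) -> List[str]:
        """Return the chain of parent classes, nearest first."""
        chain = []
        parent = self.classes[name]
        while parent is not None:
            chain.append(parent)
            parent = self.classes[parent]
        return chain


# Assumed hierarchy: the slide lists the class names but not their parents.
ONTOLOGY = Ontology(
    classes={
        "Person": None,
        "Student": "Person",
        "Faculty": "Person",
        "Research.Project": None,
        "Course": None,
    },
    relations={"Courses.Taught.By": ("Course", "Faculty")},
)

# Hypothetical training examples: labeled pages and one labeled pair of pages.
CLASS_EXAMPLES = {
    "http://cs.example.edu/courses/cs101/": "Course",
    "http://cs.example.edu/~smith/": "Faculty",
}
RELATION_EXAMPLES = [
    (
        "Courses.Taught.By",
        "http://cs.example.edu/courses/cs101/",
        "http://cs.example.edu/~smith/",
    ),
]
```

A class example is a single labeled page, while a relation example is a labeled pair of pages, which mirrors the "this pair of pages" wording on the slide.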
Introduction (#2/4)
[Figure: example ontology showing classes and their relations (relation : value)]
Introduction (#3/4)
Assumptions about the mapping between the ontology and the Web:
1. Each instance of an ontology class is represented by a single Web page, a contiguous string of text, or a collection of several Web pages.
2. Each instance of a relation is represented by a segment of hypertext, a contiguous segment of text, or the hypertext segment.
Introduction (#4/4)
Three primary learning tasks are involved in extracting knowledge-base instances from the Web:
1. Recognizing class instances by classifying bodies of hypertext.
2. Recognizing relation instances by classifying chains of hyperlinks.
3. Recognizing class and relation instances by extracting small fields of text from Web pages.
Experimental Testbed
Experiments
- Based on the ontology
  - Classes: department, faculty, staff, student, research_project, course, other
  - Relations: Instructors.Of.Course (251), Members.Of.Project (392), Department.Of.Person (748)
- Data sets
  - A set of pages (4,127) and hyperlinks (10,945) from 4 CS departments
  - A set of pages (4,120) from numerous other CS departments
- Evaluation
  - Four-fold cross-validation: 3 folds for training, 1 for testing
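The four-fold scheme (three folds for training, one for testing) can be sketched as below. This is an illustration only: the function name `four_fold_splits` is hypothetical, and treating each of the four departments as one fold is an assumption the slide does not state explicitly.

```python
from typing import Dict, Iterator, List, Tuple


def four_fold_splits(
    folds: Dict[str, List[str]]
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, train_pages, test_pages), holding out each fold once."""
    names = sorted(folds)
    for held_out in names:
        # Training data: every page from the folds that are not held out.
        train = [page for name in names if name != held_out for page in folds[name]]
        test = list(folds[held_out])
        yield held_out, train, test


# Hypothetical page identifiers grouped by department (assumed fold unit).
FOLDS = {
    "dept_a": ["a1", "a2", "a3"],
    "dept_b": ["b1", "b2"],
    "dept_c": ["c1"],
    "dept_d": ["d1", "d2"],
}

for held_out, train, test in four_fold_splits(FOLDS):
    print(held_out, len(train), len(test))
```

Holding out a whole department at a time, if that is indeed how the folds were formed, keeps pages from the test site out of the training data.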
Statistical Text Classification
Process
- Build a probabilistic model of each class using labeled training data
- Classify newly seen pages by selecting the class that is most probable given the evidence of the words describing the new page

Train three classifiers
- Full-text
- Title/Heading
- Hyperlink
Statistical Text Classification
Approach
- Naïve Bayes, with minor modifications
- Based on Kullback-Leibler divergence
- Given a document d to classify, we calculate a score for each class c as follows:

$$\mathrm{Score}_c(d) = \frac{\log \Pr(c)}{n} + \sum_{i=1}^{T} \Pr(w_i \mid d)\, \log\left(\frac{\Pr(w_i \mid c)}{\Pr(w_i \mid d)}\right)$$

where n is the number of words in d, T is the size of the vocabulary, and w_i is the i-th word in the vocabulary.
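The score can be sketched in code: the log prior of class c divided by n, plus, for each vocabulary word, Pr(w|d) times the log of Pr(w|c) over Pr(w|d). The function names and toy documents below are hypothetical, and the add-one (Laplace) smoothing of Pr(w|c) is an assumption made for this sketch, since the slide does not say how the word probabilities are estimated.

```python
import math
from collections import Counter


def train(docs_by_class, vocabulary):
    """Estimate Pr(c) and add-one-smoothed Pr(w|c) from tokenized documents."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    priors, word_probs = {}, {}
    for c, docs in docs_by_class.items():
        priors[c] = len(docs) / total_docs
        counts = Counter(w for doc in docs for w in doc if w in vocabulary)
        denom = sum(counts.values()) + len(vocabulary)
        word_probs[c] = {w: (counts[w] + 1) / denom for w in vocabulary}
    return priors, word_probs


def score(doc, c, priors, word_probs, vocabulary):
    """log Pr(c) / n + sum_i Pr(w_i|d) * log(Pr(w_i|c) / Pr(w_i|d))."""
    words = [w for w in doc if w in vocabulary]
    n = len(words)
    if n == 0:
        # No usable evidence: fall back to the log prior alone.
        return math.log(priors[c])
    total = math.log(priors[c]) / n
    # Words absent from d have Pr(w|d) = 0 and contribute nothing to the sum.
    for w, k in Counter(words).items():
        p_wd = k / n
        total += p_wd * math.log(word_probs[c][w] / p_wd)
    return total


def classify(doc, priors, word_probs, vocabulary):
    """Pick the class with the highest score."""
    return max(priors, key=lambda c: score(doc, c, priors, word_probs, vocabulary))


# Hypothetical toy training data (already tokenized).
DOCS = {
    "course": [["homework", "exam", "lecture"], ["exam", "syllabus"]],
    "student": [["advisor", "thesis"], ["thesis", "resume", "advisor"]],
}
VOCAB = {w for docs in DOCS.values() for doc in docs for w in doc}
PRIORS, WORD_PROBS = train(DOCS, VOCAB)
print(classify(["exam", "homework"], PRIORS, WORD_PROBS, VOCAB))
```

Dividing the log prior by n and weighting each word by Pr(w|d) normalizes the score for document length, so long and short pages are scored on a comparable scale.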
Statistical Text Classification
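Experimental evaluation (rows: predicted class; columns: actual class)

| Predicted \ Actual | course | student | faculty | staff | research_project | department | other | Accuracy (%) |
|---|---|---|---|---|---|---|---|---|
| Course | 202 | 17 | 0 | 0 | 1 | 0 | 552 | 26.2 |
| Student | 0 | 421 | 14 | 17 | 2 | 0 | 519 | 43.3 |
| Faculty | 5 | 56 | 118 | 16 | 3 | 0 | 264 | 17.9 |
| Staff | 0 | 15 | 0 | 4 | 0 | 0 | 45 | 6.2 |
| Research_project | 8 | 9 | 10 | 5 | 62 | 0 | 384 | 13.0 |
| Department | 10 | 8 | 3 | 1 | 5 | 4 | 209 | 1.7 |
| Other | 19 | 32 | 7 | 3 | 12 | 0 | 1064 | 93.6 |
| Coverage (%) | 82.8 | 72.4 | 77.1 | 8.7 | 72.9 | 100.0 | 35.0 | |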
Accuracy/coverage
- Coverage: the percentage of pages of a given class that are correctly classified as belonging to that class
- Accuracy: the percentage of pages classified into a given class that are actually members of that class
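Both measures can be computed directly from a confusion matrix whose rows are predicted classes and whose columns are actual classes. The sketch below uses the counts from the experimental evaluation table; the function names are hypothetical, and only the Course figures are printed.

```python
LABELS = ["course", "student", "faculty", "staff",
          "research_project", "department", "other"]

# MATRIX[i][j]: number of pages predicted as LABELS[i] whose actual class is LABELS[j].
MATRIX = [
    [202, 17, 0, 0, 1, 0, 552],
    [0, 421, 14, 17, 2, 0, 519],
    [5, 56, 118, 16, 3, 0, 264],
    [0, 15, 0, 4, 0, 0, 45],
    [8, 9, 10, 5, 62, 0, 384],
    [10, 8, 3, 1, 5, 4, 209],
    [19, 32, 7, 3, 12, 0, 1064],
]


def accuracy(matrix, labels, cls):
    """Percentage of pages predicted as cls that really belong to cls (row-wise)."""
    i = labels.index(cls)
    predicted_total = sum(matrix[i])
    return 100.0 * matrix[i][i] / predicted_total if predicted_total else 0.0


def coverage(matrix, labels, cls):
    """Percentage of pages that really belong to cls and were predicted as cls (column-wise)."""
    i = labels.index(cls)
    actual_total = sum(row[i] for row in matrix)
    return 100.0 * matrix[i][i] / actual_total if actual_total else 0.0


print(round(accuracy(MATRIX, LABELS, "course"), 1))  # 26.2 (202 of 772 predicted)
print(round(coverage(MATRIX, LABELS, "course"), 1))  # 82.8 (202 of 244 actual)
```

In more common terminology, accuracy as defined here corresponds to precision and coverage to recall.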
Accuracy/coverage tradeoff
1. Full-text classifiers
2. Hyperlink classifiers
3. Title/heading classifiers

“Hyperlink information can provide strong knowledge.”
First-Order Text Classification
Second approach to text classification: learn first-order rules for classifying pages
- 1st-order: rules with variables (Prolog-like, function-free Horn clauses). FOIL is the well-known algorithm for first-order learning.
- 0th-order: no variables. C4.5 is the well-known algorithm for zeroth-order learning.
FOIL’s input for text classification
- For each distinct word (after stemming): has_word(Page)
- For every hyperlink: link_to(Page, Page)
- Training data: Student(“http://www.cs.buffalo.edu/grads.html”), … Course(“http://www.cse.buffalo.edu/courses.html”), …
FOIL’s result
Sample learned rules:

Student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)).
Test set: 126 (+), 5 (-)

Faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).
Test set: 18 (+), 3 (-)
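To make the reading of such a rule concrete, the Faculty rule can be evaluated by hand over a tiny hypertext graph: a page A is classified as Faculty if it contains the stemmed words "professor" and "ph", and some page B that links to A contains the stemmed word "faculti". The sketch below only applies the rule; it is not FOIL itself, and the page names, word sets, and function name are made up for illustration.

```python
def faculty_rule(page, words, links):
    """Faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B)."""
    # has_professor(A), has_ph(A): both stemmed words occur on page A.
    if not {"professor", "ph"} <= words.get(page, set()):
        return False
    # link_to(B,A), has_faculti(B): some page B links to A and contains "faculti".
    return any(
        "faculti" in words.get(src, set())
        for src, dst in links
        if dst == page
    )


# Hypothetical pages with their (stemmed) word sets.
WORDS = {
    "dept/people": {"faculti", "staff", "student"},
    "home/smith": {"professor", "ph", "research"},
    "home/jones": {"professor", "ph", "teach"},
    "home/lee": {"student", "resume"},
    "news/index": {"event", "talk"},
}
# Hypothetical hyperlinks as (source page, target page) pairs.
LINKS = [
    ("dept/people", "home/smith"),
    ("dept/people", "home/lee"),
    ("news/index", "home/jones"),
]

print([p for p in sorted(WORDS) if faculty_rule(p, WORDS, LINKS)])
```

Here `home/jones` has the right words but no incoming link from a page containing "faculti", so the rule does not fire for it. That dependence on a neighboring page is what a variable such as B makes expressible.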
FOIL’s result
Compared with statistical classification:
- More accurate
- Less coverage
Classifying Hyperlinks
Use a first-order representation because
- this task involves discovering hyperlink paths of unknown and variable size, and
- we want to find patterns such as the following:

“The ProjectMember(A,B) relation holds if A is a Person, and B is a ResearchProject, and B includes a link to A near the word ‘People’”.
FOIL’s Input for classifying hyperlinks
Predicates:
- class(Page)
- link_to(Hyperlink, Page, Page)
- has_word(Hyperlink)
- all_words_capitalized(Hyperlink)
- has_alphanumeric_word(Hyperlink)
- has_neighborhood_word(Hyperlink)

Training examples: Department.Of.Person(“CSE”, “Changho Choi”), … Instructors.Of.Course(“Sargur N. Srihari”, “CSE711”), …
FOIL’s result
Sample learned rules:

members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), neighborhood_word_people(C).
Test set: 18 (+), 0 (-)

department_of_person(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), neighborhood_word_graduate(E).
Test set: 371 (+), 4 (-)
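The members_of_project rule describes a two-hop path: a research project page A has a hyperlink C, with the word "people" in its neighborhood, to some page D, and D in turn links to a person page B. A minimal sketch of applying that rule over a toy graph follows. The page names, hyperlink identifiers, and function name are hypothetical, and collapsing all person subclasses into a single "person" label is a simplification made here.

```python
def members_of_project(page_class, links, neighborhood_words):
    """Return all (A, B) pairs satisfying:
    research_project(A), person(B), link_to(C,A,D), link_to(E,D,B),
    neighborhood_word_people(C).
    """
    results = set()
    for c, (a, d) in links.items():
        if page_class.get(a) != "research_project":
            continue
        if "people" not in neighborhood_words.get(c, set()):
            continue
        # Second hop: any hyperlink E leading from D to a person page B.
        for e, (src, b) in links.items():
            if src == d and page_class.get(b) == "person":
                results.add((a, b))
    return results


# Hypothetical page classes, as a page classifier might output them.
PAGE_CLASS = {
    "proj/vision": "research_project",
    "proj/vision/members": "other",
    "proj/vision/sponsors": "other",
    "home/kim": "person",
    "home/park": "person",
    "home/cho": "person",
}
# Hyperlink id -> (source page, target page), matching link_to(Hyperlink, Page, Page).
LINKS = {
    "h1": ("proj/vision", "proj/vision/members"),
    "h2": ("proj/vision/members", "home/kim"),
    "h3": ("proj/vision/members", "home/park"),
    "h4": ("proj/vision", "proj/vision/sponsors"),
    "h5": ("proj/vision/sponsors", "home/cho"),
}
# Words appearing near each hyperlink (only those needed for the rule).
NEIGHBORHOOD = {"h1": {"people"}, "h4": {"sponsors"}}

print(sorted(members_of_project(PAGE_CLASS, LINKS, NEIGHBORHOOD)))
```

Because the rule's first conditions are page classes, a relation instance can only be found when the page classifier has already labeled both endpoints.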
FOIL’s result
- Fairly high accuracy
- Limited coverage, because of the limited coverage of the page classifiers
Extracting Text Fields
Uses a richer set of predicates:
- length(Fragment, Relop, N)
- some(Fragment, Var, Path, Attr, Value)
- position(Fragment, Var, From, Relop, N)
- relpos(Fragment, Var1, Var2, Relop, N)

Sample learned rule:

ownername(Fragment) :- some(Fragment, B, [], in_title, true), length(Fragment, <, 3), some(Fragment, B, [prev_token], word, "gmt"), some(Fragment, A, [], longp, true), some(Fragment, B, [], word, unknown), some(Fragment, B, [], quadrupletonp, false).
FOIL’s result
Conclusions
The approach we propose in this paper is to construct a system that can be trained to automatically populate such a KB.
We have presented a variety of approaches that take advantage of the special structure of hypertext by considering
- relationships among Web pages,
- their hyperlinks, and
- specific words on individual pages and hyperlinks.