Carol Jean Godby OCLC Online Computer Library Center March 6, 2001 The automatic encoding of lexical knowledge in RDF topicmaps

Page 1

Carol Jean Godby

OCLC

Online Computer Library Center

March 6, 2001

The automatic encoding of lexical knowledge in RDF topicmaps

Page 2

Topicmaps of Web resources

• For navigating complex Web sites

• For managing bookmark files

• For creating views of the Web that are organized by subject

Page 3

Terminology identification

• ...is an essential first step in the analysis of a document's content.

• ...is one of the most mature research subjects in natural language processing.

Page 4

Lexical phrases

• Are the names of persistent concepts.

• Act like words.

• Are commonly used to name new concepts in rapidly evolving technical subject domains.

Page 5

Not a lexical phrase: “Recurrent problem”

Page 6

A lexical phrase: “Recurrent erosion”

Page 7

Identifying lexical phrases

Tokenized text: ...Planetary scientists think the convex shape came about as lava welled up beneath the crater's solid floor….

Ngrams: planetary scientists think, convex shape, welled up, coincided with, five times greater than, easiest way, Milky Way, absolute magnitudes brighter than, added material, advanced study, African American

Index filter: planetary scientists, convex shape, easiest way, Milky Way, absolute magnitudes, added material, advanced study, African American

Topic filter: planetary scientists, Milky Way
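
The first stages of this pipeline (tokenize, generate n-grams, filter) can be sketched in Python. This is a minimal illustration, not the system described in the talk: the stopword list is hypothetical, and rejecting any n-gram that contains a stopword is only a crude stand-in for the index filter, whose actual criteria the slide does not state.

```python
import re

# Hypothetical stopword list; the slides do not say which words the filters use.
STOPWORDS = {"the", "a", "an", "as", "up", "with", "than", "came", "about",
             "think", "beneath", "welled"}

def tokenize(text):
    """Lowercase the text and return its word tokens (apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, max_n=3):
    """Yield every contiguous run of 2..max_n tokens."""
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def index_filter(grams):
    """Keep n-grams that contain no stopword (assumed stand-in for the index filter)."""
    return [" ".join(g) for g in grams if not any(w in STOPWORDS for w in g)]

text = ("Planetary scientists think the convex shape came about "
        "as lava welled up beneath the crater's solid floor.")
candidates = index_filter(ngrams(tokenize(text)))
print(candidates)
```

A topic filter would then prune this candidate list further, for example with the frequency and text-analysis strategies on the following slides.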

Page 8

Terminology identification: process flow

Tokenized text: 9.8 M

Ngrams: 1.9 M

Index filter: 59k, 2331 phrases

Topic filter: 35k, 1632 phrases

Page 9

Strategies in the topic filter

• Word/phrase frequency and strength of association

• “Knowledge-poor” text analysis

• More sophisticated but computable text analysis

Page 10

Word and phrase frequencies

• Word/phrase frequency

high: dublin core, metadata, element, electronic resources

low: availability period, background, applicable terminologies

• Weighted frequency

1. core element, date element, metadata element

2. author name, entity name, corporate name

3. HTML tag, end tag, meta tag
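
Phrase frequency and strength of association can both be computed from simple counts. The sketch below uses pointwise mutual information (PMI) as the association measure; that choice is an assumption, since the slides do not name the measure or the weighting actually used, and the token stream is invented for illustration.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Return {bigram: (count, PMI)} for each adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI compares the observed pair probability with chance co-occurrence.
        scores[(w1, w2)] = (count, math.log2(p_pair / (p_w1 * p_w2)))
    return scores

# Toy token stream, invented for illustration only.
tokens = ("dublin core metadata element set defines each metadata element "
          "and the dublin core element names").split()
scores = pmi_scores(tokens)
for pair, (count, pmi) in sorted(scores.items(), key=lambda kv: -kv[1][1])[:3]:
    print(" ".join(pair), count, round(pmi, 2))
```

On very small samples PMI favors pairs of rare words, so in practice it is usually combined with a minimum frequency threshold.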

Page 11

Knowledge-poor techniques 1:

• Some noun phrase heads usually appear in text only with adjective or noun modifiers.

Example: holes--black holes, grey holes, central holes

• Others usually appear without modifiers.

Example: galaxy--cartwheel galaxy, spiral galaxy

a galaxy, our galaxy, this galaxy
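
One way to measure this contrast is to count how often a head noun is preceded by a modifier rather than by a determiner or nothing at all. The sketch below is an assumption about how such a test might look: the determiner list is hypothetical, and treating any other preceding word as a modifier is a simplification that a real system would replace with part-of-speech information.

```python
# Hypothetical determiner list used to tell "a galaxy" from "spiral galaxy".
DETERMINERS = {"a", "an", "the", "our", "this", "that", "these", "those"}

def modifier_ratio(tokens, head):
    """Fraction of occurrences of `head` preceded by a non-determiner word."""
    modified = bare = 0
    # Pair each token with its predecessor (None for the first token).
    for prev, word in zip([None] + tokens, tokens):
        if word != head:
            continue
        if prev is None or prev in DETERMINERS:
            bare += 1
        else:
            modified += 1
    total = modified + bare
    return modified / total if total else 0.0

# Toy sentence built from the slide's examples.
tokens = ("black holes and grey holes surround central holes while a galaxy "
          "like our galaxy or this galaxy may be a spiral galaxy").split()
print(modifier_ratio(tokens, "holes"), modifier_ratio(tokens, "galaxy"))
```

A head with a ratio near 1, like "holes" here, mostly occurs inside longer phrases; a head with a low ratio, like "galaxy", can stand alone as a topical term.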

Page 12

Consequences

• We can identify topical single terms:

galaxy, star, sun, moon

government, abortion, communism

metadata, html, Internet, information

• We can create subject taxonomies:

galaxy (-ies): cartwheel galaxy, elliptical galaxy, spiral galaxy

*hole(s): black holes, drill holes, grey holes
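
A taxonomy of this kind can be sketched by grouping phrases under their final word. The function name and the rule that the last word is the head noun are assumptions made for illustration; the slides do not describe the grouping procedure.

```python
from collections import defaultdict

def group_by_head(phrases):
    """Group multiword phrases under their final word, treated as the head noun."""
    taxonomy = defaultdict(list)
    for phrase in phrases:
        words = phrase.split()
        if len(words) > 1:
            taxonomy[words[-1]].append(phrase)
    return dict(taxonomy)

phrases = ["cartwheel galaxy", "black holes", "elliptical galaxy",
           "drill holes", "spiral galaxy", "grey holes"]
taxonomy = group_by_head(phrases)
for head, narrower in taxonomy.items():
    print(head, "->", ", ".join(narrower))
```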

Page 13

Knowledge-poor techniques 2: subject probes

• Goal: to get high-quality subject terms.

• Look for markers of a subject that is talked about, written about or studied: topics in, study of, analysis of, (on the) subject of, major in…

• Probes differ in specificity.

topics in: sciences, arts, humanities, library science, astronomy, physics, business, data visualization, computer science, mathematics, computer and network security, mathematics, number theory, medicine

analysis of: metabolic regulation, numerical analysis, saline water phenomena, coals, iron ore, cereal grains, income dynamics among men, working hours, inflation, mass belief systems, aerial photography
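
A subject probe can be approximated with a pattern match followed by a short scan of the words to its right. In this sketch the probe phrases come from the slide, but the stop list, the four-word limit, and the sample sentences are assumptions introduced for illustration.

```python
import re

# Probe phrases taken from the slide.
PROBES = ["topics in", "study of", "analysis of", "subject of", "major in"]
# Hypothetical stop list: any of these words ends a candidate term.
STOP = {"and", "or", "the", "a", "an", "is", "are", "was", "for",
        "in", "of", "at", "by"}
PROBE_RE = re.compile(r"\b(" + "|".join(PROBES) + r")\s+", re.IGNORECASE)

def probe_terms(text, max_words=4):
    """Return (probe, term) pairs for the words that follow each probe phrase."""
    results = []
    for match in PROBE_RE.finditer(text):
        term = []
        # Scan words and punctuation marks to the right of the probe.
        for word in re.findall(r"[A-Za-z-]+|[^\sA-Za-z-]", text[match.end():]):
            if (not word[0].isalpha() or word.lower() in STOP
                    or len(term) == max_words):
                break
            term.append(word)
        if term:
            results.append((match.group(1).lower(), " ".join(term)))
    return results

# Invented sample text.
text = ("She teaches topics in number theory. "
        "The report is an analysis of income dynamics among men, "
        "and he plans to major in astronomy.")
found = probe_terms(text)
print(found)
```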

Page 14

Some results

Page 15

The identification of term relationships

Singular/Plural: Library, libraries

Acronyms: Standard Generalized Markup Language--SGML

Library of Congress Subject Headings--LCSH

Coordination: library and information science--library science, information science

information storage and retrieval--information storage, information retrieval

cataloging and interlibrary loan--cataloging, interlibrary loan

Ellipsis: abbreviated key title--abbreviated title

authority file records--authority records
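
Three of these relationships can be detected with simple string tests. The sketch below is illustrative only, and every function name in it is hypothetical. For coordination it generates candidate expansions and keeps those that are attested in the extracted vocabulary, which is an assumed way of choosing between the shared-head and shared-modifier readings.

```python
def acronym_of(phrase):
    """Build an acronym from the initial of each capitalized word."""
    return "".join(w[0] for w in phrase.split() if w[0].isupper())

def is_ellipsis(long_phrase, short_phrase):
    """True if short_phrase is long_phrase with words dropped, head word kept."""
    long_words, short_words = long_phrase.split(), short_phrase.split()
    if not 1 < len(short_words) < len(long_words):
        return False
    if long_words[-1] != short_words[-1]:
        return False
    remaining = iter(long_words)
    # Each membership test consumes the iterator, so word order is enforced.
    return all(word in remaining for word in short_words)

def coordination_candidates(phrase):
    """Propose expansions of 'A and B'; callers keep the attested ones."""
    left, sep, right = phrase.partition(" and ")
    if not sep:
        return set()
    l, r = left.split(), right.split()
    return {
        left,
        right,
        " ".join(l + r[-1:]),   # shared head: "library" + "science"
        " ".join(l[:1] + r),    # shared modifier: "information" + "retrieval"
    }

# Vocabulary of phrases assumed to have been extracted already.
attested = {"library science", "information science",
            "information storage", "information retrieval"}
print(acronym_of("Standard Generalized Markup Language"))
print(is_ellipsis("authority file records", "authority records"))
print(sorted(coordination_candidates("information storage and retrieval") & attested))
```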

Page 16

A more abstract relationship: hypernym/hyponym

• “…electronic formats, such as text/HTML, ASCII, or PostScript ….”

• Other examples from our data:

Controlled Vocabularies: Medical Subject Headings, Art and Architecture Thesaurus

metadata element set: Dublin Core

protocol server applications: NFS server, FTP server, Web server

moving images: films, videos, simulations
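
The "such as" construction in the first example can be matched with a regular expression. This is a minimal sketch under the assumption that everything before "such as" is the broader term; running text would need noun-phrase boundaries to trim the left context, and the slides do not say which patterns the system actually uses.

```python
import re

def such_as_pairs(sentence):
    """Return (hypernym, [hyponyms]) from 'X, such as A, B, or C', else None."""
    match = re.search(r"(.+?),?\s+such as\s+(.+)", sentence)
    if not match:
        return None
    hypernym = match.group(1).strip(" .…")
    # Split the enumeration on commas and on "or" / "and".
    items = re.split(r",\s*(?:or\s+|and\s+)?|\s+(?:or|and)\s+", match.group(2))
    hyponyms = [item.strip(" .…") for item in items if item.strip(" .…")]
    return hypernym, hyponyms

sentence = "…electronic formats, such as text/HTML, ASCII, or PostScript …."
result = such_as_pairs(sentence)
print(result)
```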

Page 17

A graph representation of relationships

[Figure: graph of term relationships. Node labels include Dewey Decimal, Dewey, Dewey call numbers, Dewey numbers, Dewey decimal classification numbers, cutter numbers, DDC, DDC and LCSH, Library of Congress Subject Headings, and Subject Headings. Edge labels: B/N (Broad/Narrow), Ellipsis, Acronym, Coordination.]

Page 18

RDF Topic Representation

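[Figure: RDF graph of two topics, Numbers (name “numbers”) and Dewey numbers (name “Dewey Numbers”), connected by broad and narrow arcs. Each topic has name and isDefinedIn properties. Resource labels in the figure: http://r1, http://r2, http://r3.]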
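
The topic representation can be sketched as a set of subject-predicate-object triples. The property names (name, isDefinedIn, broad, narrow) and the resource labels come from the slide; which resource plays which role, and the direction of the broad and narrow arcs, are assumptions. Plain tuples are used here in place of an RDF library.

```python
# Assumed roles for the three resources shown in the figure.
NUMBERS, DEWEY_NUMBERS, SOURCE = "http://r1", "http://r2", "http://r3"

# Each triple is (subject, predicate, object).
triples = {
    (NUMBERS, "name", "numbers"),
    (NUMBERS, "isDefinedIn", SOURCE),
    (NUMBERS, "narrow", DEWEY_NUMBERS),
    (DEWEY_NUMBERS, "name", "Dewey Numbers"),
    (DEWEY_NUMBERS, "isDefinedIn", SOURCE),
    (DEWEY_NUMBERS, "broad", NUMBERS),
}

def objects(subject, predicate):
    """Return the sorted objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

def narrower_names(topic):
    """Names of the topics reachable from `topic` by one 'narrow' arc."""
    return [name for t in objects(topic, "narrow") for name in objects(t, "name")]

print(narrower_names(NUMBERS))
```

Following a narrow arc and then reading the name property is the kind of traversal a topic-map browser would perform when a user drills down from a broad subject.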

Page 19

System flow 1: processing steps

1. Harvest Web text.

2. Extract terminology and relationships.

3. Organize terminology into an RDF graph.

4. Import the RDF graph into the Extended Open RDF Toolkit.

Page 20

System flow 2: User interaction

[Figure: diagram connecting the User, the RDF search engine, the Web, and the RDF concept graph.]

Page 21

A screen shot

Page 22

Future plans

• Develop a user interface that fully exploits the richness of the RDF graph structure.

• Merge terminology extracted from source documents with other sources of information.

• Improve processes for automatically extracting terminology.

Page 23

References

• The Extended Open RDF Toolkit. Accessible at: http://eor.dublincore.org/

• “Automatically generated topic maps of World Wide Web resources.” Accessible at: http://www.oclc.org/oclc/research/publications/review99/godby/topicmaps.htm

• “The WordSmith indexing system.” Accessible at: http://www.oclc.org/oclc/research/publications/review98/godby_reighart/wordsmith.htm