Terminology identification from full text: OCLC’s WordSmith experience
Jean Godby, Senior Research Scientist
OCLC Online Computer Library Center, Inc.
SOASIST Full-Day Workshop on Aboutness
June 21, 2001
Outline of this talk
• The need for terminology
• Sources of terminology
• Extracting terminology from free text
• Organizing it
• Mapping it to library classification schemes
Increasing subject access to document collections
[Diagram: a continuum running from more human effort and a more abstract view of the data to less human effort and a less abstract view. Labels: Cataloging, Classification, Indexing, Tokenizing, WordSmith, Scorpion, Classification Research.]
Subject terminology from library classification schemes
• Strengths
– Derived from scholarship in subject analysis and classification theory
– Permits interoperability between Web resources and traditional published materials
• Weaknesses
– Literary warrant is based on traditional published materials.
– Human effort is required to keep them current.
– They must be modified for use in automated systems.
– They aren’t free.
Subject terminology from full text
• Strengths
– Literary warrant is based on current text.
– Coverage is not restricted to traditionally published material.
– The style is closer to the user’s vocabulary.
• Weaknesses
– The data is noisy and difficult to organize.
Terminology identification
• ...is an essential first step in the analysis of a document's content.
• ...is one of the most mature research subjects in natural language processing.
Lexical phrases
• Are the names of persistent concepts.
• Act like words.
• Are commonly used to name new concepts in rapidly evolving technical subject domains.
A lexical phrase: “Recurrent erosion”
Not a lexical phrase: “Recurrent problem”
Identifying lexical phrases
Tokenized text: ...Planetary scientists think the convex shape came about as lava welled up beneath the crater's solid floor….
Ngrams: planetary scientists think, convex shape, welled up, coincided with, five times greater than, easiest way, Milky Way, absolute magnitudes brighter than, added material, advanced study, African American
Index filter: planetary scientists, convex shape, easiest way, Milky Way, absolute magnitudes, added material, advanced study, African American
Topic filter: planetary scientists, Milky Way
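The four stages above can be sketched in a few lines of Python. This is a minimal illustration, not WordSmith's implementation: the `FILTER_WORDS` list stands in for whatever part-of-speech knowledge a real index filter uses, the topic filter is reduced to a bare frequency threshold, and the sample text and all function names are assumptions made for the example.

```python
import re
from collections import Counter

# Function words and verbs (assumed list). A real index filter would rely on
# part-of-speech information rather than a hand-made word list.
FILTER_WORDS = {
    "the", "a", "an", "as", "up", "to", "is", "at", "about",
    "think", "came", "welled", "study", "see",
}


def candidate_ngrams(text, n_max=3):
    """Yield every word sequence of length 2..n_max, sentence by sentence."""
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z']+", sentence)
        for n in range(2, n_max + 1):
            for i in range(len(tokens) - n + 1):
                yield tuple(tokens[i:i + n])


def index_filter(grams):
    """Keep only n-grams that contain no function word or verb."""
    return [g for g in grams if not FILTER_WORDS.intersection(g)]


def topic_filter(counts, min_count=2):
    """Keep n-grams that recur; a stand-in for the real topic filter."""
    return sorted(" ".join(g) for g, c in counts.items() if c >= min_count)


SAMPLE = (
    "Planetary scientists think the convex shape came about as lava welled up. "
    "Planetary scientists study the Milky Way. "
    "The easiest way to see the Milky Way is at night."
)

counts = Counter(index_filter(candidate_ngrams(SAMPLE)))
index_terms = sorted(" ".join(g) for g in counts)
topic_terms = topic_filter(counts)
print("Index filter:", index_terms)
print("Topic filter:", topic_terms)
```

On this toy text the index filter keeps four phrases and the frequency threshold narrows them to the two that recur, which mirrors the narrowing shown on the slide.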
Strategies in the topic filter
• Word/phrase frequency and strength of association
• “Knowledge-poor” text analysis
• More sophisticated but computable text analysis
Word and phrase frequencies
• Word/phrase frequency
high: dublin core, metadata, element, electronic resources
low: availability period, background, applicable terminologies
• Weighted frequency
1. core element, date element, metadata element
2. author name, entity name, corporate name
3. HTML tag, end tag, meta tag
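The slides do not say which association measure is used. As one common choice, pointwise mutual information (PMI) compares how often two words occur together with how often they would co-occur by chance. The sketch below assumes PMI over adjacent word pairs; the function name and the toy token list are illustrative only.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return {(w1, w2): PMI in bits} for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0 means the pair occurs more often than chance predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


tokens = ["the", "dublin", "core", "element", "the", "metadata", "element",
          "the", "dublin", "core", "the", "tag"]
scores = pmi_scores(tokens)
for pair in [("dublin", "core"), ("the", "dublin")]:
    print(pair, round(scores[pair], 2))
```

Both pairs occur twice in the toy data, but "dublin core" scores higher because its parts rarely appear apart, while "the" combines with many words.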
Knowledge-poor techniques 1: parts of speech in local context
• Some noun phrase heads usually appear in text only with adjective or noun modifiers.
holes--black holes, grey holes, central holes
• Others usually appear without modifiers.
galaxy--cartwheel galaxies, spiral galaxy; a galaxy; if galaxies; ...however, galaxy formation
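One way to make this contrast measurable is to count, for each head noun, how often it is directly preceded by a content word. The sketch below is an assumption-laden approximation: a small function-word list replaces real part-of-speech tagging, plural handling is deliberately crude, and the sample text is invented for the example.

```python
import re
from collections import Counter

# Assumed stand-in for part-of-speech tagging: a word preceded by one of
# these (or by nothing) is treated as unmodified.
FUNCTION_WORDS = {"a", "an", "the", "if", "however", "of", "in", "and",
                  "or", "to", "that", "this", "these", "those", "is", "are"}


def singular(word):
    """Crude normalizer: galaxies -> galaxy, holes -> hole."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def modifier_ratio(text, heads):
    """Fraction of each head's occurrences that follow a content word."""
    modified, total = Counter(), Counter()
    for sentence in re.split(r"[.!?;]+", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for i, token in enumerate(tokens):
            head = singular(token)
            if head not in heads:
                continue
            total[head] += 1
            if i > 0 and tokens[i - 1] not in FUNCTION_WORDS:
                modified[head] += 1
    return {head: modified[head] / total[head] for head in total}


TEXT = ("Black holes are dense. Grey holes and central holes exist. "
        "A galaxy forms slowly; if galaxies collide, a spiral galaxy may "
        "result. However, galaxy formation is slow.")
ratios = modifier_ratio(TEXT, {"hole", "galaxy"})
print(ratios)
```

A head with a low ratio often stands alone and is a candidate single-word term; a head with a ratio near 1 tends to be meaningful only inside its phrases.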
Consequences
• We can identify topical single terms:
galaxy, star, sun, moon
government, abortion, communism
metadata, html, Internet, information
• We can create subject taxonomies:
galaxy (-ies)        *hole(s)
cartwheel galaxy     black holes
elliptical galaxy    drill holes
spiral galaxy        grey holes
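A shallow taxonomy of this kind can be approximated by grouping extracted phrases under their final word. The sketch below assumes the head is always the last word and uses a crude singular form as the group key; both are simplifications, and the function names are illustrative.

```python
from collections import defaultdict


def singular(word):
    """Crude normalizer: galaxies -> galaxy, holes -> hole."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def group_by_head(phrases):
    """Map each head noun (assumed to be the last word) to its phrases."""
    groups = defaultdict(list)
    for phrase in phrases:
        head = singular(phrase.lower().split()[-1])
        groups[head].append(phrase)
    return {head: sorted(members) for head, members in groups.items()}


phrases = ["cartwheel galaxy", "black holes", "elliptical galaxy",
           "drill holes", "spiral galaxy", "grey holes"]
taxonomy = group_by_head(phrases)
for head, members in sorted(taxonomy.items()):
    print(head, "->", ", ".join(members))
```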
Knowledge-poor techniques 2: subject probes
• Goal: to get high-quality subject terms
• Look for indications that something is talked about, written about, or studied: topics in, study of, analysis of, (on the) subject of, major in, is called, is known as
• Probes differ in specificity.
topics in: sciences, arts, humanities, library science, astronomy, physics, business, data visualization, computer science, mathematics, computer and network security, mathematics, number theory, medicine
analysis of: metabolic regulation, numerical analysis, saline water phenomena, coals, iron ore, cereal grains, income dynamics among men, working hours, inflation, mass belief systems, aerial photography
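Subject probes lend themselves to simple pattern matching. The sketch below finds each probe with a regular expression and collects the words that follow it until punctuation, a boundary word, or a length limit is reached. The boundary-word list, the four-word limit, and the sample sentences are assumptions for the example, not details taken from WordSmith.

```python
import re

PROBES = ["topics in", "study of", "analysis of", "subject of",
          "major in", "is called", "is known as"]
PROBE_RE = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in PROBES) + r")\b", re.IGNORECASE
)

# Words assumed to end a candidate term (illustrative list).
BOUNDARY_WORDS = {"and", "the", "a", "an", "that", "which", "in", "of",
                  "for", "is", "at", "it", "by", "with", "to"}


def probe_terms(text, max_words=4):
    """Return (probe, following term) pairs found in text."""
    results = []
    for match in PROBE_RE.finditer(text):
        words = []
        tail = text[match.end():]
        # Tokens are either words or single punctuation characters.
        for token in re.findall(r"[A-Za-z]+|[^\sA-Za-z]", tail):
            if (not token.isalpha() or token.lower() in BOUNDARY_WORDS
                    or len(words) == max_words):
                break
            words.append(token.lower())
        if words:
            results.append((match.group(1).lower(), " ".join(words)))
    return results


TEXT = ("She wrote an analysis of income dynamics among men. "
        "He will major in computer science. "
        "This field is known as numerical analysis, and it covers "
        "topics in number theory.")
found = probe_terms(TEXT)
for probe, term in found:
    print(f"{probe}: {term}")
```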
More clues can be identified with “knowledge-rich” processing
You can sum up the big difference between beans on the one hand and Java applets and applications on the other in one word (okay, two words) : component model. Chapter 2 contains a nice, thorough discussion of component models (which is a pretty important concept, so I devoted an entire chapter to the subject).
Java Beans for Dummies. Emily Vander Veer. Chicago, IL: IDG Books Worldwide. 1997, p. 14.
Some results
Terminology lists: tokenizing vs. indexing
Tokenizing: have, havei, havel, haven, havens, havera, haverty, havey, havice, havill, havilland
Indexing: health care, health care coverage, health insurance, housing, housing policy, ..., world trade, world trade accord, world trade agreement, world trade center, world trade center bombing
Terminology extraction works best with:
• Full text
• Collections of text, not isolated documents
• Text from a single subject domain
• Algorithms that are tuned to the style of the text
An application: browse displays
Organizing terminology
[Diagram: related terms linked by labeled relationships. Terms: Dewey, Dewey Decimal, Dewey numbers, Dewey call numbers, Dewey decimal classification numbers, cutter numbers, DDC, DDC and LCSH, Library of Congress Subject Headings, Subject Headings. Relationship labels: B/N (Broad/Narrow), Ellipsis, Acronym, Coordination.]
An application: a topic map for a collection of Web resources
Another application: a terminology server
Mapping vocabulary to library classification schemes
• Explicit
– For each document in a collection, extract terminology using WordSmith.
– Assign Dewey Decimal Classification (DDC) numbers using Scorpion.
– Identify the highest associations between extracted terms and DDC numbers.
• Implicit
– Make both sources of subject information available in a user interface.
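The last step of the explicit approach can be sketched as a co-occurrence count. The code below assumes that term extraction and classification have already been run and supplied as plain Python data; it does not call WordSmith or Scorpion. The association measure (the share of a term's documents that fall in its most frequent class), the `min_docs` threshold, and the DDC numbers in the sample are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def best_ddc_for_terms(docs, min_docs=2):
    """docs: iterable of (terms, ddc_number) pairs, one per document.

    Returns {term: (most frequent DDC number, share of the term's docs)}.
    """
    term_class = defaultdict(Counter)
    for terms, ddc in docs:
        for term in set(terms):
            term_class[term][ddc] += 1
    mapping = {}
    for term, by_class in term_class.items():
        total = sum(by_class.values())
        if total < min_docs:
            continue  # too little evidence to map this term
        ddc, count = by_class.most_common(1)[0]
        mapping[term] = (ddc, count / total)
    return mapping


# Hypothetical input: extracted terms plus an assigned class per document.
docs = [
    ({"black holes", "galaxy"}, "523.8"),
    ({"black holes", "spiral galaxy"}, "523.8"),
    ({"galaxy", "metadata"}, "523.1"),
    ({"metadata", "dublin core"}, "025.3"),
    ({"metadata"}, "025.3"),
]
mapping = best_ddc_for_terms(docs)
for term, (ddc, share) in sorted(mapping.items()):
    print(f"{term}: {ddc} ({share:.2f})")
```

Terms seen in only one document are skipped, which reflects the point that mapping needs a large collection to be reliable.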
Terminology mapping works best when:
• The upstream processes for extracting terminology are clean.
• It operates on a large collection of domain-specific text.
• The classification scheme is simplified.
The Desire database of Web documents about engineering
[Screenshot labels: Science aspects; Social science aspects; Links to documents about other types of pollution]
In sum
• We can automatically extract useful terminology from full text.
• The terminology can be embedded in applications of varying complexity.
• There is a tradeoff between accuracy and technical sophistication.
For more information
Godby, Jean and Reighart, Ray. 1998. “The WordSmith indexing system.” Accessible at: http://www.oclc.org/oclc/research/publications/review98/godby_reighart/wordsmith.htm
Godby, Jean; Miller, Eric; and Reighart, Ray. 2000. “Automatically generated topic maps of World Wide Web resources.” Accessible at: http://www.oclc.org/oclc/research/publications/review99/godby/topicmaps.htm
Godby, Jean and Reighart, Ray. 2001. “Terminology identification in a collection of Web resources.” In: K. Calhoun and J. Riemer, eds. CORC: New tools and possibilities for electronic resource description. New York: The Haworth Press, Inc., 49-66.