Methods for the Automatic Construction of Topic Maps
description
Transcript of Methods for the Automatic Construction of Topic Maps
![Page 1: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/1.jpg)
Methods for the Automatic Construction of Topic Maps
Eric Freese, Senior Consultant
ISOGEN International
![Page 2: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/2.jpg)
Agenda
• Automatic Construction from Structured Documents
• Automatic Construction from Unstructured Documents
![Page 3: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/3.jpg)
Contextual Harvesting
• Markup can provide clues about the information within a document
• Largely dependent on semantic markup• Takes advantage of nesting within elements • Rules can be developed for harvesting data
to build topic map constructs– Rules could then be applied to similar types of
documents
![Page 4: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/4.jpg)
DTD/Schema Development to Support Harvesting
• DTDs like HTML are mostly useless to a harvesting system
• Flat structures make associations between elements more difficult
• New DTD/schema development should take possible knowledge harvesting into account
![Page 5: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/5.jpg)
Content-Based Harvesting
• Combination of contextual and natural language harvesting
• Text is parsed and clues within the text are used to harvest knowledge.
• HTML documents where labels are included in the text could be processed this way
![Page 6: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/6.jpg)
NLP Strategies
• Named Entity recognition– A list of entities (people, companies,
places, etc.) is defined– Programs parse a corpus of information to
identify entities – Limited to the completeness of the entity
list
![Page 7: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/7.jpg)
NLP Strategies – cont.
• Concept extraction– A list of key words can be defined much
like the named entity strategy– Common strings may also be identified and
suggested as new concepts– More processing intensive than named
entity
![Page 8: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/8.jpg)
NLP Strategies – cont.
• Taxonomic classification– Documents are analysed and classified
according to a human-defined taxonomy– Specialized programs must be developed
that are able to understand the taxonomy• Must also be able to process synonyms and
related concepts
![Page 9: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/9.jpg)
NLP Strategies – cont.
• Discourse analysis– Programs are developed that attempt to
understand the meaning of a text– Analyze the parts of speech using a
lexicons and rules to attempt to derive the meanings and usages of words
![Page 10: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/10.jpg)
Steps in Processing Natural Language
• Tokenization
• Part of Speech Tagging
• Bracketing
• Identification of useful structures
![Page 11: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/11.jpg)
Tokenization
• GOAL: Prepare text for processing by a natural language processing system
• Flowing paragraph text is broken into sentence units
• Sentence units are broken into word tokens• Word tokens are prepared for part of speech
processing– Contractions and other constructs receive special
processing– n’t, ’s
![Page 12: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/12.jpg)
Part of Speech (POS) Tagging
• GOAL: Identify the part(s) of speech for each word token
• Lexicon – list of words with the possible POS tags for each word
• EXAMPLE: – sound - NNP JJ NN VB
• The Sound of Music, Puget Sound• a sound decision• the sound of silence• sound the alarm
![Page 13: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/13.jpg)
POS Tagging – cont.
• POS tagging is difficult
• Exception processing often required for phrases and grouped words
• EXAMPLE:– Time flies like an arrow.– Fruit flies like an apple.
![Page 14: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/14.jpg)
Bracketing
• GOAL: Identify groupings of words into phrases and the hierarchical relationship of phrases to one another
• A set of rules is used to identify how different parts of speech and phrases can be combined to form larger phrases.
• EXAMPLE:– [NP, DT, JJ, NN]– Noun phrase can consist of a determiner followed by
adjective followed by a noun
![Page 15: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/15.jpg)
Benefits of the phased approach
• The separation of functions allows this approach to be applied to any language.– A lexicon is developed for the language– Rules for language construction are defined– Generic engine is able to process the data
• The separation of the lexicon and the rules base allows the model to be modified/improved as the corpus of text grows.
![Page 16: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/16.jpg)
Putting it all together
• EXAMPLE: The red ball rolled down the hill.
• Tokenization and POS tagging– The DT– red JJ NNP– ball NN– rolled VBD VBN– down RB IN RBR VBP JJ NN RP– hill NN
![Page 17: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/17.jpg)
Putting it all together – cont.• EXAMPLE: The red ball rolled down the hill.• Bracketing rules
– [S, NP, VP]– [NP, DT, JJ, NN]– [VP, VBD, PP]– [PP, IN, NP]– [NP, DT, NN]
• RESULT (using XML for bracketing):<S><NP><DT>The</DT><JJ>red</JJ><NN>ball</NN></NP><VP><VBD>rolled</VBD><PP><IN>down</IN><NP><DT>the</DT><NN>hill</NN></NP></VP></S>
![Page 18: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/18.jpg)
Harvesting Considerations
• GIGO rule in effect• The harvesting process only hastens topic map
construction• Only some of the topic map merging rules are
applicable– Limited prospect of meaningful subject identities
• Humans must still participate in the process of knowledge organization in order to maintain quality– Selective inclusion in the topic map/knowledge base
![Page 19: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/19.jpg)
Q & A
Questions or comments welcome at:
ISOGEN International
1611 W. County Road B, Suite 204
St. Paul, MN 55113 USA
Voice: 1.651.636.9180 - Fax: 1.651.636.9191
www.isogen.com
![Page 20: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/20.jpg)
Demonstrations
• SemanText – open source
• Ontopia Knowledge Suite – commercial
![Page 21: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/21.jpg)
Harvesting Knowledge using SemanText
• Contextual Harvesting
• Natural Language-Based Harvesting
![Page 22: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/22.jpg)
Contextual Harvesting in SemanText
• XML markup is used as in NLP to identify document structures
• Users write rules for harvesting information into topic maps structures
![Page 23: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/23.jpg)
Harvesting XML Content
• Data can be harvested from any XML document (including RDF) into a topic map without the need to develop specialized programs
• Specification language is based on Xpath• In future, will also record the location from
which the data was harvested as an occurrence
![Page 24: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/24.jpg)
Harvesting RDF Content• RDF structures can be harvested in order
to build topic maps• Demonstrates the possibility of
interoperability between the two models• Rules can be established for flavors of RDF
(e.g. Dublin Core, DAML/OIL)– Allows any document using the tagging
scheme to be harvested
• Only binary associations can be generated
![Page 25: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/25.jpg)
NLP in SemanText
• The initial lexicon included within SemanText is based on a lexicon derived from the Penn Treebank tagging of the Brown corpus (1 million+ words) and a very large sample from the Wall Street Journal (approx. 3 million words)
• SemanText provides the ability to identify and add new words to the lexicon through a GUI.
![Page 26: Methods for the Automatic Construction of Topic Maps](https://reader036.fdocuments.in/reader036/viewer/2022062723/56813b4d550346895da43e8b/html5/thumbnails/26.jpg)
NLP in SemanText – cont.
• SemanText uses a public domain parser to tokenize flowing text and identify the appropriate POS for each word token based on a set of bracketing rules.
• XML is used to denote the bracketing• This XML markup allows SemanText to
process natural language using its contextual-based harvesting capability