Overview of Research - Computational Terminology - Knowledge extraction from Text

Page 1

CLiNG - May 24 2002

Overview of Research

- Computational Terminology
  - Knowledge extraction from Text
  - Study of causal relation
  - Corpus building
  - Uncertainty

- Computer Assisted Language Learning (CALL)
  - Interdisciplinary project on French Second Language

- Text understanding
  - From speech to sentence

Page 2

SeRT - a tool for knowledge extraction from text

Caroline Barrière

School of Information Technology and Engineering
University of Ottawa

Ottawa, Ontario

Page 3

A few questions...

- Why knowledge extraction from text?
  For building a Knowledge Base...

- What’s a Knowledge Base?
  It depends on who defines it...

- From a terminological standpoint: a static repository of domain-specific knowledge, giving the important concepts and their relations.

- What kind of relations?
  Hyperonymy (is-a), meronymy (part-of), synonymy, function, definition, causality

- Why start from text?
  What are the alternatives?

Page 4

Semantic Relations in Text (SeRT)

- Goal: Starting from a corpus of texts on a specific domain, capture and store the important concepts (terms) of that domain, as well as their relations.

- Hypotheses
  - definitions can be derived from text analysis
  - text is used as language and meta-language
  - paradigmatic relations can be found in texts by pattern search
  - present knowledge representation formalisms allow the representation of this information

Page 5

In clay soils, organic materials such as compost and pine bark increase drainage and airspace.

Some yard wastes, such as wood chips, are very difficult to compost fully and are therefore not suitable for incorporation into the soil.

Grass clippings and other green vegetation tend to have a higher proportion of nitrogen (and therefore a lower C/N ratio) than brown vegetation such as dried leaves or wood chips.

To help meet that requirement, North Carolina passed a law that prohibits depositing organic yard wastes such as leaves, grass clippings, or tree trimmings in the state's landfills.

Table 2. Semantic relation hypernym found through the patterns "such as" and "and other"

Example of a pattern search for hyperonymy (Corpus on Composting)
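The pattern search behind this table can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not SeRT's actual code: the class `SurfacePatternSearch`, its `Hit` type and the `search` method are hypothetical names, and plain case-insensitive phrase matching stands in for the tool's richer search options. Each hit keeps the text on both sides of the pattern, which is where a user would pick the two terms of the relation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: find sentences containing a surface pattern such as "such as".
final class SurfacePatternSearch {

    // One occurrence of the pattern, with the context on each side of it.
    static final class Hit {
        final String left;
        final String right;

        Hit(String left, String right) {
            this.left = left;
            this.right = right;
        }
    }

    // Returns one Hit per occurrence of the pattern, matched as a whole phrase.
    static List<Hit> search(List<String> sentences, String surfacePattern) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(surfacePattern) + "\\b",
                                    Pattern.CASE_INSENSITIVE);
        List<Hit> hits = new ArrayList<>();
        for (String sentence : sentences) {
            Matcher m = p.matcher(sentence);
            while (m.find()) {
                hits.add(new Hit(sentence.substring(0, m.start()).trim(),
                                 sentence.substring(m.end()).trim()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two sentences taken from the composting examples above.
        List<String> corpus = Arrays.asList(
            "In clay soils, organic materials such as compost and pine bark increase drainage and airspace.",
            "Some yard wastes, such as wood chips, are very difficult to compost fully and are therefore not suitable for incorporation into the soil.");
        for (Hit hit : search(corpus, "such as")) {
            System.out.println("[" + hit.left + "] such as [" + hit.right + "]");
        }
    }
}
```

The sketch only locates the pattern; choosing which words around it form the hypernym tuple is left to the user, in line with the tool's interactive approach.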

Page 6

SeRT - Features

- parallel search of terms and relations
  - term extraction
  - search for surface patterns leading to semantic relations

- focus on user interaction (nothing fully automatic)
  - term selection and validation
  - user definition of surface patterns corresponding to semantic relations
  - user selection of the concepts (tuple) involved in the semantic relation

- raw text used (no preprocessing necessary)

- easy access to KB: save and retrieval

- to be used in “bootstrapping” mode

Page 7

Term extraction

- Usage of a stop list: a, able, about, above, according, accordingly, across, actually …

- appropriate method for English (but maybe not for French)
  - satellite link - liaison par satellite
  - laser printer - imprimante au laser
  - communication network - réseau de communication

- no syntactic analysis

- different from:
  - Daille 1994: linguistic patterns (French)
  - Bourigault 1994: morpho-syntactic markers (French)

- lemmatization: 'moving quickly' → 'mov[ing] quick[ly]' → 'mov* quick*'
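A stop-list approach to term extraction could look like the Java sketch below. It assumes that candidate terms are the maximal runs of consecutive non-stop words, which the slides do not state explicitly; the class `StopListTermExtractor` and its methods are hypothetical names, and the stop list is a small stand-in for the real, longer one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: candidate terms are runs of words not found in a stop list.
final class StopListTermExtractor {

    // The first eight words come from the slide; the rest are assumed for the demo.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "able", "about", "above", "according", "accordingly", "across", "actually",
        "the", "is", "to", "of", "and", "in"));

    // Counts every maximal run of consecutive non-stop words in the text.
    static Map<String, Integer> extract(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Punctuation also ends a candidate term, so split into clauses first.
        for (String clause : text.toLowerCase(Locale.ROOT).split("[.,;:!?()]+")) {
            StringBuilder run = new StringBuilder();
            for (String word : clause.trim().split("\\s+")) {
                if (word.isEmpty() || STOP_WORDS.contains(word)) {
                    flush(run, counts);
                } else {
                    if (run.length() > 0) {
                        run.append(' ');
                    }
                    run.append(word);
                }
            }
            flush(run, counts);
        }
        return counts;
    }

    // Records the current run (if any) as one term occurrence and resets it.
    private static void flush(StringBuilder run, Map<String, Integer> counts) {
        if (run.length() > 0) {
            counts.merge(run.toString(), 1, Integer::sum);
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract("The compost pile is in the bin. Turn the compost pile."));
    }
}
```

Under this assumption, French terms such as "liaison par satellite" would be cut at the preposition, which is one plausible reading of why the slide calls the method less suitable for French.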

Page 8

Results

- Corpus on “composting”

- Terms

503  compost         402  compost         402  compost
373  pile            369  pile            369  pile
258  composting      199  soil            295  materi*
202  soil            187  composting      260  compost*
170  materials       149  material        199  soil
155  material        146  materials       133  nitrogen
142  nitrogen        133  nitrogen        105  compost pile
110  compost pile    105  compost pile    105  temperatur*
103  water           102  bin             102  bin
102  bin              96  time             96  time
100  time             95  water            95  leav*
 92  leaves           94  Compost          95  water
 83  bacteria         85  leaves           94  Compost

Page 10

Search for patterns indicating semantic relations

- pre-encoded patterns (earlier work - Barrière 1997)
  - find list from all other authors

- pattern search has multiple possibilities:
  - string matching
  - lemmatized token matching
  - part of speech matching
  - inclusion of a dictionary look-up (derived from Collins + morphological rules added)

- possibility of searching for a pattern around 1 term
  - usually what Computational Terminologists want to do

- display limited or enlarged context

Page 11

Example of search patterns

Hyperonymy
    such as                       (string matching)
    and other *|n                 (string + POS)
    includ* *|n                   (lemmatized string + POS)
    *|n is a *|a of [~part]       (negative filter)
    *|y organic materi* [mostly, especially, specifically]   (positive filter) + (search with specific term)

Synonymy
    known as                      (string matching)
    also called                   (string matching)

Meronymy
    contains *|n                  (string + POS)
    is a *|a part of              (string + POS)
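One way to read the pattern notation above is token by token: a plain word must match exactly, a form ending in `*` matches any word with that prefix, and `*|n` matches any word the dictionary lists with part of speech `n`. The Java sketch below implements only that reading. It is an assumption about the notation rather than SeRT's matcher, the names `PatternTokenMatcher`, `tokenMatches` and `find` are hypothetical, and the bracketed positive and negative filters are not handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of token-level matching for patterns like "includ* *|n".
final class PatternTokenMatcher {

    // word -> POS codes, in the "word,codes" style of a dictionary entry (toy data).
    private final Map<String, String> lexicon;

    public PatternTokenMatcher(Map<String, String> lexicon) {
        this.lexicon = lexicon;
    }

    // True if a single pattern token accepts a single (lower-cased) word.
    public boolean tokenMatches(String patternToken, String word) {
        if (patternToken.startsWith("*|")) {            // POS wildcard, e.g. *|n
            String codes = lexicon.get(word);
            return codes != null && codes.contains(patternToken.substring(2));
        }
        if (patternToken.endsWith("*")) {               // truncated form, e.g. includ*
            return word.startsWith(patternToken.substring(0, patternToken.length() - 1));
        }
        return patternToken.equals(word);               // plain string
    }

    // Returns the word index of every position where the whole pattern matches.
    public List<Integer> find(String pattern, String sentence) {
        String[] p = pattern.toLowerCase(Locale.ROOT).trim().split("\\s+");
        String[] w = sentence.toLowerCase(Locale.ROOT)
                             .replaceAll("[^\\p{L}\\p{N}\\s'-]", " ")   // drop punctuation
                             .trim().split("\\s+");
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + p.length <= w.length; i++) {
            boolean ok = true;
            for (int j = 0; j < p.length && ok; j++) {
                ok = tokenMatches(p[j], w[i + j]);
            }
            if (ok) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        Map<String, String> lexicon = new HashMap<>();
        lexicon.put("tannins", "n");   // toy entry, assumed for the demo
        PatternTokenMatcher matcher = new PatternTokenMatcher(lexicon);
        System.out.println(matcher.find("contain* *|n",
            "Leaves from eucalyptus, walnuts, and laurel trees contain tannins."));
    }
}
```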

Page 13

regular dictionary: 77,000 (1046 kb)
    aback,y
    abactinally,y
    abashedly,y
    abdominally,y
    abed,y
    abhorrently,y

irregular dictionary: 26,000 (387 kb)
    a',a
    ablebodied,a
    ablebodieder,a
    ablebodiedest,a
    abranchial,a
    abranchialer,a
    abranchialest,a

entries with multiple POS: 94,000 (333 kb)
    roughcast,nv
    huggermugger,anvy
    broadcast,anvy
    ground,anv
    like,acnrvy
    cut,anv
    draft,nv

TOTAL: 197,000 entries (1766 kb)

Page 14

public String[][] inflect = {
    // plural nouns
    { "", "s" }, { "", "es" }, { "y", "ies" }, { "an", "en" },
    { "um", "a" }, { "", "e" }, { "us", "i" }, ...
    // comparative adjectives
    { "", "er" }, { "e", "er" }, { "y", "ier" }, { "c", "caler" },
    { "", "der" }, ...
};
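A table like this can be used to map an inflected word back to a dictionary entry. The sketch below assumes each pair reads { base ending, inflected ending }, which the slide does not spell out, and uses only the plural-noun pairs visible above. The class `InflectionLookup`, the method `baseForm` and the toy dictionary are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: undo an inflection rule, then look the result up in a dictionary.
final class InflectionLookup {

    // Assumed reading: { base ending, inflected ending }. Only the plural-noun
    // pairs shown on the slide are listed; the slide elides the rest.
    static final String[][] PLURAL_RULES = {
        { "", "s" }, { "", "es" }, { "y", "ies" }, { "an", "en" },
        { "um", "a" }, { "", "e" }, { "us", "i" }
    };

    // Returns the dictionary base form of the word, or null if no rule leads to one.
    static String baseForm(String word, Set<String> dictionary) {
        if (dictionary.contains(word)) {
            return word;
        }
        for (String[] rule : PLURAL_RULES) {
            String baseEnding = rule[0];
            String inflectedEnding = rule[1];
            if (word.length() > inflectedEnding.length() && word.endsWith(inflectedEnding)) {
                // Strip the inflected ending and restore the base ending.
                String candidate =
                    word.substring(0, word.length() - inflectedEnding.length()) + baseEnding;
                if (dictionary.contains(candidate)) {
                    return candidate;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Toy dictionary, assumed for the demo.
        Set<String> dictionary = new HashSet<>(Arrays.asList("material", "bacterium", "property"));
        for (String word : Arrays.asList("materials", "bacteria", "properties")) {
            System.out.println(word + " -> " + baseForm(word, dictionary));
        }
    }
}
```

Checking every candidate against the dictionary keeps over-generation in check: a rule only fires when it yields a known base form.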

Page 16

Information storage in the TKB

- transfer of info found at previous step

- user selects the terms (concepts) around the pattern

- semantic relation / pattern / tuple are stored in the TKB

- an uncertainty factor can also be added to the tuple
  - research on causal relation has led us to realize the necessity of this information
  - applies to different relations
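The stored information (semantic relation, pattern, tuple, optional uncertainty factor) could be represented roughly as follows. This is an in-memory sketch only: the class `TkbSketch`, its `Entry` type and the numeric uncertainty scale are assumptions, and the slides do not describe how the TKB actually stores its entries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of TKB entries; all names and the uncertainty scale are assumptions.
final class TkbSketch {

    // One stored fact: the relation, the pattern that revealed it, and the tuple.
    static final class Entry {
        final String relation;     // e.g. "meronym"
        final String pattern;      // surface pattern that produced the entry
        final String place1;       // first place of the tuple
        final String place2;       // second place of the tuple
        final double uncertainty;  // assumed scale: 0.0 = certain, 1.0 = highly uncertain

        Entry(String relation, String pattern, String place1, String place2,
              double uncertainty) {
            this.relation = relation;
            this.pattern = pattern;
            this.place1 = place1;
            this.place2 = place2;
            this.uncertainty = uncertainty;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Saves one validated entry.
    public void store(Entry entry) {
        entries.add(entry);
    }

    // Retrieves all entries stored for a given semantic relation.
    public List<Entry> byRelation(String relation) {
        List<Entry> result = new ArrayList<>();
        for (Entry entry : entries) {
            if (entry.relation.equals(relation)) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TkbSketch tkb = new TkbSketch();
        // The uncertainty values are illustrative, not taken from the source.
        tkb.store(new Entry("meronym", "contain", "tannins", "leaves from laurel tree", 0.0));
        tkb.store(new Entry("hypernym", "such as", "compost", "organic material", 0.0));
        for (Entry entry : tkb.byRelation("meronym")) {
            System.out.println(entry.relation + "(" + entry.place1 + ", " + entry.place2
                + ") via \"" + entry.pattern + "\", uncertainty " + entry.uncertainty);
        }
    }
}
```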

Page 17

Semantic relation extraction

Page 18

Fresh, young weeds from your irrigated garden can contain 60-70% moisture - no need to add water to them.

Leaves from eucalyptus, walnuts, and laurel trees contain tannins.

Every piece of organic material contains carbon and nitrogen in differing ratios.

Most compost also contains as much as 2 percent calcium.

Table 1. Semantic relation meronym found through the pattern contain

Results - semantic relations

- Exploration of a few patterns
  - contain? (meronymy)
  - such as & and other (hypernymy)

Page 19

In clay soils, organic materials such as compost and pine bark increase drainage and airspace.

Some yard wastes, such as wood chips, are very difficult to compost fully and are therefore not suitable for incorporation into the soil.

Grass clippings and other green vegetation tend to have a higher proportion of nitrogen (and therefore a lower C/N ratio) than brown vegetation such as dried leaves or wood chips.

To help meet that requirement, North Carolina passed a law that prohibits depositing organic yard wastes such as leaves, grass clippings, or tree trimmings in the state's landfills.

Table 2. Semantic relation hypernym found through the patterns "such as" and "and other"

Page 20

relation <meronym>

tuple (place 1)      tuple (place 2)
60-70% moisture      young weeds
tannins              leaves from eucalyptus tree
tannins              leaves from walnut tree
tannins              leaves from laurel tree
carbon               organic material
nitrogen             organic material
calcium              compost

relation <hypernym>

tuple (place 1)      tuple (place 2)
compost              organic material
pine bark            organic material
wood chips           yard wastes
grass clippings      green vegetation
dried leaves         brown vegetation
wood chips           brown vegetation
leaves               organic yard wastes
grass clippings      organic yard wastes
tree trimmings       organic yard wastes

Table 3. Possible relations extracted from a text

Page 21

Could we infer is-a relations and extend the type hierarchy?

Page 22

SeRT use

- Parallel mode
  - searching on patterns can suggest terms to be explored
  - searching on terms can suggest patterns around them

- Bootstrapping mode for relations
  - start with one pattern: enhance
  - the tuple compost/soil, once found, is used to find other patterns
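The second bootstrapping step, going from a known tuple back to new candidate patterns, can be sketched as follows. The assumption here is that the words found between the two terms of the tuple are offered to the user as candidate patterns; the class `BootstrapSketch` and the method `contextsBetween` are hypothetical, and the sketch only handles the case where the first term precedes the second.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: use a known tuple (e.g. compost/soil) to surface new patterns.
final class BootstrapSketch {

    // For each sentence where `first` is followed by `second` (both as whole words),
    // returns the text between them as a candidate pattern.
    static List<String> contextsBetween(List<String> sentences, String first, String second) {
        Pattern p = Pattern.compile(
            "\\b" + Pattern.quote(first) + "\\b(.*?)\\b" + Pattern.quote(second) + "\\b",
            Pattern.CASE_INSENSITIVE);
        List<String> contexts = new ArrayList<>();
        for (String sentence : sentences) {
            Matcher m = p.matcher(sentence);
            if (m.find()) {
                contexts.add(m.group(1).trim());
            }
        }
        return contexts;
    }

    public static void main(String[] args) {
        // Two sentences from the composting corpus that contain the tuple compost/soil.
        List<String> corpus = Arrays.asList(
            "How does compost help soil structure?",
            "When worm compost is added to soil, it boosts the nutrients available to plants and enhances soil structure and drainage.");
        for (String context : contextsBetween(corpus, "compost", "soil")) {
            System.out.println("candidate pattern: " + context);
        }
    }
}
```

Candidates found this way would still need user validation before being added as patterns, since many co-occurrences of the two terms express no stable relation.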

Page 23

As a soil amendment, compost is thought to enhance the physical, chemical, and biological properties of soils.

When worm compost is added to soil, it boosts the nutrients available to plants and enhances soil structure and drainage.

This discussion is an attempt to enhance your understanding of the conditions which can lead to odor formation, in the hopes that they can be avoided or at least minimized in the future.

No matter your soil type, your climatic zone, or your choice of crops, composting will enhance your garden soil, resulting in stronger plants and healthier produce.

Table 4. Sentences containing the verbal pattern enhance

Page 24

Before using compost, be sure to study a copy of any soil or waste chemical nutrient analyses, pesticide and heavy metal analyses, and stability tests that the producer of the compost performed.

When worm compost is added to soil, it boosts the nutrients available to plants and enhances soil structure and drainage.

How does compost help soil structure?

Some people get around the problem of nitrogen loss by adding bloodmeal to the soil before they bury the compost materials.

Composting is really quite simple, inexpensive, ecologically sound, and utterly failproof - no matter what you do, your pile will eventually rot into soil-enriching compost!

While compost is a panacea for all garden soils, poor soils especially will benefit from consistent applications.

Table 5. Some examples of the tuple "compost/soil" in the corpus

Page 25

Future work: short term (tool itself)

- Add list of predefined relations & patterns

- Add flexibility in pattern search
  - toward a mix of semantic and syntactic search

- Construction of a graphical representation of the semantic network built

Page 26

Future work: long term (tool + theoretical background)

- Work on compound nouns
  - much implicit information that could be made explicit in the KB

- Work on representational scheme
  - the relational database is too limiting
  - causal relation requires a different type of representation
  - contexts for expressing the relation (possibly nested)
  - uncertainty factors
  - inferencing

- Explore pattern search in French

- Batch mode extraction (no user)
  - automatic selection of terms around patterns
  - after certain terms and patterns have been identified
  - need an integration of confidence levels on patterns