Programme doctoral CUSO 2007 en Sciences du Langage - Leysin, 19-22 mars 2007

The Conceptual Aspects of Terminographic Definitions: Towards Automatic Genus Tagging

Selja Seppälä

Terminology

Multilingual Information Processing Department

Modules M1-M2, Approches formelles et cognitives du langage

Outline

• Definitions
• Background and open questions
• Methodological approach
• Case study: towards automatic genus tagging
• Conclusions and perspectives

What is terminology?

• An applied activity: writing mono- or multilingual dictionaries for specialised domains (sciences, activities, practices, etc.)

• A scientific discipline: study of terminological phenomena on a linguistic or a conceptual level, or both

What is a terminographic definition?

• A linguistic representation of a concept of a specialised domain

peptidyltransferase = An enzyme located on the large ribosomal subunit that catalyzes peptide bond formation.

• A synthesis of knowledge-rich contexts (Meyer, 2001)

• Reflects the structure of the concept

• Is subject to some formal restrictions

What is a genus?

• Relates defined & superordinate concepts (peptidyltransferase = An enzyme…)

• Different types:
  – Conceptual relation to defined concept
    • IS_A (Rebeyrolle 2000)
    • PART_OF (Iris et al. 1988; L'Homme 2003)
    • SET_OF (L'Homme 2003)
  – Conceptual category (An enzyme = MATERIAL_ENTITY)

• Different forms:
  – simple unit (Protein…)
  – complex unit (A chemical…, The set of modifications…)

• In French, often the 1st word (65% of test corpus)

Open questions for work

• What makes one context more relevant than another for defining a given concept?

• Since a definition is limited in length, which characteristics of the concept should it include?

Hypotheses

• Property selection related to:
  – type of conceptual category
  – type of domain
  – language

• Can be studied through the conceptual structure of definitions

Conceptual structure of a definition

peptidyltransferase = {An enzyme} [located on the large ribosomal subunit] [that catalyzes peptide bond formation.]

[PEPTIDYLTRANSFERASE] IS_A {GEN: MATERIAL_ENTITY} [SPE: LOCATION] [SPE: FUNCTION]
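
As a rough illustration of the structure above, the genus and the specifics can be held as labelled segments of the definition. This is only a sketch: the class names `Segment` and `ConceptualStructure` and the `schema` method are hypothetical, and Python is used here for readability (the talk's own implementation is in Perl).

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    role: str      # "GEN" (genus) or "SPE" (specific)
    text: str      # surface span taken from the definition
    category: str  # conceptual label, e.g. MATERIAL_ENTITY


@dataclass
class ConceptualStructure:
    term: str
    relation: str  # relation between the defined concept and its genus
    segments: list = field(default_factory=list)

    def schema(self) -> str:
        """Render the abstract pattern of the definition on one line."""
        parts = []
        for seg in self.segments:
            if seg.role == "GEN":
                parts.append("{GEN: %s}" % seg.category)
            else:
                parts.append("[%s: %s]" % (seg.role, seg.category))
        return "[%s] %s %s" % (self.term.upper(), self.relation, " ".join(parts))


example = ConceptualStructure(
    term="peptidyltransferase",
    relation="IS_A",
    segments=[
        Segment("GEN", "An enzyme", "MATERIAL_ENTITY"),
        Segment("SPE", "located on the large ribosomal subunit", "LOCATION"),
        Segment("SPE", "that catalyzes peptide bond formation.", "FUNCTION"),
    ],
)
print(example.schema())
```

Keeping the surface text next to the conceptual label makes it possible to compare definitions by their abstract pattern while still tracing each label back to its wording.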

Methodological approach

• Corpus study of terminographic definitions
  – complying with generally accepted definition-writing rules
  – conceptually annotated
    • conceptual category (A peptidyltransferase = MATERIAL_ENTITY)
    • conceptual relations
      – between the genus and the defined concept (peptidyltransferase IS_A enzyme)
      – between the specific and the genus (that catalyzes peptide bond formation = FUNCTION)

• Requires conceptual parsing

Case study: Identification of the task

• Find the genus element in a definition (semi-)automatically

• Mark it with an XML tag including the corresponding relation to the defined concept

peptidyltransferase = <GEN relation_VE="IS_A">An enzyme</GEN> located on the large ribosomal subunit that catalyzes peptide bond formation.
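
The tagging step itself can be sketched as follows, assuming the genus span has already been located. The function name `tag_genus` and its character-offset interface are hypothetical; only the `GEN` element and the `relation_VE` attribute come from the example above. Python stands in for the Perl used in the case study.

```python
from xml.sax.saxutils import escape, quoteattr


def tag_genus(definition: str, start: int, end: int, relation: str) -> str:
    """Wrap definition[start:end] in a <GEN> element carrying the relation."""
    if not 0 <= start < end <= len(definition):
        raise ValueError("genus span is outside the definition")
    # Escape the text so that characters such as '<' or '&' stay valid XML.
    before = escape(definition[:start])
    genus = escape(definition[start:end])
    after = escape(definition[end:])
    return "%s<GEN relation_VE=%s>%s</GEN>%s" % (
        before, quoteattr(relation), genus, after)


definition = ("An enzyme located on the large ribosomal subunit "
              "that catalyzes peptide bond formation.")
tagged = tag_genus(definition, 0, len("An enzyme"), "IS_A")
print(tagged)
```

Separating span detection from tag insertion keeps the markup logic trivial, so the extraction rules can be changed without touching the output format.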

Genus extraction: State of the art

Objectives: information retrieval to build:
• lexical resources for NLP, ontologies, knowledge bases… (Alshawi 1987, Ide & Véronis 1993, Markowitz et al. 1986)
• terminological resources (Pozzi & Medina 2005, Rebeyrolle 2000)

Text types:
• lexicographic definitions (Barnbrook 2002, Ide & Véronis 1993, Markowitz et al. 1986)
• terminological definitions (L'Homme 2003, Pozzi & Medina 2005)
• free text (Auger 1997, Cartier 2004, Copestake 1990, Rebeyrolle 2000)

Methods:
• based on (boundary) markers (Barnbrook 2002, Rebeyrolle 2000)
• statistical methods (Pozzi & Medina 2005)
• use of external resources (POS tagger…) (Vossen et al. 1989)

Case study: How to find a genus?

Search for its boundaries on the basis of:
• fixed phrases (part of, group of...)
• morphosyntactic patterns (present & past participles, relative pronouns...)

Finding regularities to form extraction patterns:
• in the literature (Iris et al. 1988; L'Homme 2003; Rebeyrolle 2000…)
• with a concordancer (GEN + end of following word; "Type of " + 1 to 3 words…)

By taking advantage of:
1. the definition sublanguage (Barnbrook 2002)
  – fixed phrases
  – present & past participles
  – characteristic morphosyntactic & lexical items
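
The concordancer query "GEN + end of following word" can be approximated with a few lines of code. The sketch below assumes a small set of hand-tagged definitions (the `TAGGED` list is built from fragments quoted in this talk, not from the actual training corpus) and counts which words, and which word endings, follow the closing genus tag. Function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative hand-tagged fragments; a real study would load the training corpus.
TAGGED = [
    "<GEN>An enzyme</GEN> located on the large ribosomal subunit",
    "<GEN>A molecule</GEN> formed by nucleotides",
    "<GEN>A protein</GEN> that acts",
    "<GEN>The set of rules</GEN> used for",
]

FOLLOWING_WORD = re.compile(r"</GEN>\s*(\w+)")


def following_words(lines):
    """Count the word that comes right after each closing </GEN> tag."""
    counts = Counter()
    for line in lines:
        match = FOLLOWING_WORD.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


def following_endings(lines, size=2):
    """Count the last `size` letters of the word after </GEN>."""
    counts = Counter()
    for word, n in following_words(lines).items():
        counts[word[-size:]] += n
    return counts


print(following_endings(TAGGED).most_common())
```

On these four fragments the ending "ed" dominates, which is the kind of regularity (past participles closing the genus) that can then be turned into an extraction pattern.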

Case study: How to find a genus?

By taking advantage of:
2. the hierarchical structure of the domain

molecule = A structural unit of matter consisting of one or more atoms.

nucleic acid = A molecule formed by nucleotides located on one strand or two.

Search for terms of the domain (genus proximus)

Creating a list of terms
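
A minimal sketch of the term-list idea: because the genus of one entry is often itself a term defined elsewhere in the same domain (molecule, then nucleic acid), a list of domain terms can be matched at the start of each definition. The list contents, the function name `find_domain_genus` and the handling of articles are assumptions made for illustration.

```python
import re

# Hypothetical term list; in practice it would be drawn from the termbase headwords.
DOMAIN_TERMS = ["molecule", "nucleic acid", "enzyme",
                "eukaryotic cell", "exon", "elongation"]


def find_domain_genus(definition, terms=DOMAIN_TERMS):
    """Return (start, end) of an optional article plus a domain term
    at the beginning of the definition, or None if no term matches."""
    # Longest terms first, so multi-word terms win over shorter ones.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(r"^(?:(?:An?|The)\s+)?(?:%s)\b" % alternatives,
                         re.IGNORECASE)
    match = pattern.match(definition)
    return match.span() if match else None


d1 = "A molecule formed by nucleotides located on one strand or two."
d2 = "A structural unit of matter consisting of one or more atoms."
print(find_domain_genus(d1))  # span covering "A molecule"
print(find_domain_genus(d2))  # None: no domain term at the start
```

The second definition shows the limit of this step: when the genus is not a term of the domain, other cues such as boundary markers are needed.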

Case study: Implementation

• Training corpus: 500 definitions from different domains

• Test corpus: 92 definitions from the terminology of protein biosynthesis in eukaryotic cells (Bourjault, 2005)

• Perl program

• Regular expressions

Case study: The processing method

Four ordered steps:

1. Insert opening tag ⇒ Search for probable specific elements preceding the genus

<SPE>The first</SPE> <GEN>phase of translation</GEN> that…

2. Find GEN including terms from the domain

<GEN>An enzyme</GEN> located on…

Term list (excerpt): … exon, eukaryotic cell, enzyme, elongation …

Case study: The processing method

3. Search for closing boundary markers:

• specific rules: fixed phrases

<GEN>The set of rules</GEN> used for…

• general rules: morphosyntactic & lexical markers

<GEN>An enzyme</GEN> located on…
<GEN>A protein</GEN> that acts…

4. Tag 1st word of unmarked definitions
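
The four ordered steps can be sketched end to end as below. This is not the talk's Perl program: it is a Python approximation, and every pattern in it (the ordinals, the fixed phrases, the marker list) is an assumption chosen only to reproduce the examples quoted on these slides.

```python
import re

# Excerpt of a domain term list; a real list would be much longer.
DOMAIN_TERMS = ["exon", "eukaryotic cell", "enzyme", "elongation"]

# Illustrative patterns (assumptions, not the rules used in the case study).
PRE_GENUS_SPE = re.compile(
    r"^(?:The|An?)\s+(?:first|second|third|last)\b", re.IGNORECASE)
TERM_GENUS = re.compile(
    r"^(?:(?:The|An?)\s+)?(?:%s)\b"
    % "|".join(re.escape(t) for t in sorted(DOMAIN_TERMS, key=len, reverse=True)),
    re.IGNORECASE)
FIXED_PHRASE = re.compile(
    r"^(?:(?:The|An?)\s+)?(?:set|group|type|part)\s+of\s+\w+", re.IGNORECASE)
GENERAL_MARKER = re.compile(
    r"\s+(?:that|which|located|formed|used|consisting)\b", re.IGNORECASE)


def tag_definition(definition):
    """Apply the four steps in order and return the tagged definition."""
    if not definition.strip():
        return definition
    text, prefix = definition, ""

    # Step 1: a specific element before the genus fixes the opening tag.
    m = PRE_GENUS_SPE.match(text)
    if m:
        prefix = "<SPE>%s</SPE> " % m.group(0)
        text = text[m.end():].lstrip()

    # Step 2: genus made of a domain term; step 3a: fixed phrase.
    m = TERM_GENUS.match(text) or FIXED_PHRASE.match(text)
    if m:
        end = m.end()
    else:
        # Step 3b: general closing markers (relative pronouns, participles).
        m = GENERAL_MARKER.search(text)
        if m:
            end = m.start()
        else:
            # Step 4: fall back to tagging the first word.
            end = re.match(r"\S+", text).end()
    return "%s<GEN>%s</GEN>%s" % (prefix, text[:end], text[end:])


for d in ["The first phase of translation that…",
          "An enzyme located on…",
          "The set of rules used for…",
          "A protein that acts…"]:
    print(tag_definition(d))
```

The order matters: the more specific rules run first, and the first-word fallback only applies when nothing else has matched, which mirrors the baseline being used as a last resort.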

Case study: Performance evaluation

• Baseline: tag the 1st word (65%)

• Performance of the method: 78/92 (85%)

• Improved mainly by:
  – term search
  – fixed phrases

• Errors due to:
  – absence of fixed phrases indicating: 10/14
    • PART relation: 9/10 (Branch of… [a science])
    • WHOLE relation: 1/10 (An assembly of…)
  – inexact extraction patterns: 4/14 (too greedy)
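
The figures above can be cross-checked with simple arithmetic. The snippet only uses the counts reported on this slide; the variable names are ad hoc.

```python
total = 92          # definitions in the test corpus
correct = 78        # correctly tagged genus elements
baseline = 0.65     # share obtained by tagging the first word

errors = total - correct
missing_fixed_phrases = 10   # of which PART: 9, WHOLE: 1
inexact_patterns = 4

accuracy = correct / total
gain_points = (accuracy - baseline) * 100

print("errors:", errors)
print("accuracy: {:.1%}".format(accuracy))
print("gain over baseline: {:.1f} points".format(gain_points))
print("errors from missing fixed phrases: {:.0%}".format(
    missing_fixed_phrases / errors))
```

The error categories add up to the 14 misses, and the rounded accuracy matches the 85% reported, roughly 20 percentage points above the first-word baseline.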

Case study: Main challenges

• Adequately refining the extraction patterns (avoiding greediness, finding multiple genera…)

• Finding the best ordering of the rules

• Discriminating lexicalized compound words from free word sequences (A polypeptide chain… vs. A linear chain…)

• Collecting all forms of fixed phrases for each relation (can be domain specific)
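
One common way a regular-expression pattern becomes "too greedy" is through greedy quantifiers, which extend the match to the last possible boundary marker instead of the first. Whether this was the exact cause of the errors in the case study is not stated; the sentence below is invented purely to show the effect.

```python
import re

# Invented example with two occurrences of the boundary marker "that".
definition = "A protein that binds the mRNA that is translated"

# Greedy: '.+' grabs as much as possible, so the genus runs to the LAST "that".
greedy = re.match(r"(.+)\s+that\b", definition)

# Lazy: '.+?' stops at the FIRST "that", giving the shorter genus.
lazy = re.match(r"(.+?)\s+that\b", definition)

print(greedy.group(1))
print(lazy.group(1))
```

Switching to the lazy quantifier is a one-character change, but it also interacts with rule ordering, so each refinement needs to be re-checked against the training corpus.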

Conclusion

• Manually created and classified rules perform well

• Advantage: does not require external NLP resources

• Method can be used in other languages

• Basis for automating:
  – rule definition
  – rule classification

Perspectives

• Adapt GEN processing method to SPEs

• Apply method to other languages

• Study conceptual regularities in definitions to find patterns according to:
  – concept type
  – domain
  – language

Thank you for your attention!

References

• Alshawi, H. (1987), "Processing dictionary definitions with phrasal pattern hierarchies", in Computational Linguistics, vol. 13, n°3-4, pp. 195-202.

• Auger, A. (1997), Repérage des énoncés d'intérêt définitoire dans les bases de données textuelles, Université de Neuchâtel, Faculté des Lettres.

• Barnbrook, G. (2002), Defining Language: A local grammar of definition sentences, John Benjamins, Amsterdam, Philadelphia.

• Bourjault, A. (2005), Terminologie de la biosynthèse des protéines chez les cellules eucaryotes : anglais-français, Mémoire de DESS en terminologie, Université de Genève.

• Cartier, E. (2004), Repérage automatique des expressions définitoires : modélisation de l'information définitoire, méthode de repérage automatique, méthodologie de développement des ressources linguistiques, description des expressions du français contemporain, mise en oeuvre informatique, Thèse de Doctorat, Paris-IV Sorbonne (décembre 2004).

• Copestake, A. (1990), An approach to building the hierarchical element of a lexical knowledge base from a machine readable dictionary, Paper presented at the First International Workshop of Inheritance in NLP, Tilburg.

• Ide, N. and Véronis, J. (1993), "Extracting knowledge bases from machine-readable dictionaries: Have we wasted our time?", in KB&KS'93 Workshop, Tokyo.

References 2

• Iris, M. A. et al. (1988), "Problems of the part-whole relation", in Evens M. W., Relational Models of the Lexicon: Representing Knowledge in Semantic Networks, Cambridge University Press, Cambridge.

• L'Homme, M.-C. (2003), "Indices de relations conceptuelles dans les définitions terminologiques. Application au domaine de l'informatique", in Bach C. and Martí J., I Jornada Internacional sobre la Investigación en Terminología y Conocimiento Especializado, Barcelona, IULA.

• Markowitz, J. et al. (1986), "Semantically significant patterns in dictionary definitions", in Proceedings of the 24th conference on Association for Computational Linguistics, pp. 112-119.

• Pozzi, M. and Medina, A. (2005), "Towards the Establishment of Criteria for Automatic Genus Extraction in Specialised Dictionaries", in TKE Workshop: "Working with Terminology and KMS", TKE2005 - 7th International conference on Terminology and Knowledge Engineering, August 19, 2005, Copenhagen.

• Rebeyrolle, J. (2000b), "Utilisation de contextes définitoires pour l'acquisition de connaissances à partir de textes", in Tchounikine P., IC'2000 - Actes de la conférence Journées francophones d'Ingénierie des Connaissances, Toulouse, IRIT, 10-12 mai 2000.

• Vossen, P. et al. (1989), "Meaning and structure in dictionary definitions", in Boguraev B. and Briscoe T., Computational Lexicography for Natural Language Processing, Longman, London, New York.