Closing the Gap between Corpora and Termbases, CHAT2013


Presenter: Kara Warburton (ISO TC 37 Chair, Termologic). This presentation is part of the TaaS project, funded by the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no. 296312.

© 2013 by Termologic. All rights reserved. 1

Developing effective terminological resources for commercial use

or

Narrowing the gap between termbases and corpora in commercial environments

Kara Warburton

CHAT 2013

1. Motivation of my research

2. Does commercial language contain terminology?

3. What is a term?

4. Our assumption about termbases and corpora

5. Aim and methodology of our research

6. Corpus-valid terms

7. The gap between termbases and corpora

8. Causes of the gap

9. Keywords and their potential

10. Conclusions

Personal motivation

● The established principles and methodologies of terminology management don't seem to “fit” the needs of commercial uses of terminology

How do I resolve this apparent conflict?

● Study how terminology is managed in commercial settings

● Identify key issues, gaps with mainstream methodology and theory

Mainstream theory and practice | Commercial needs
Strict ties to translation, restricted focus of termbases | Polyvalent
Normative | Prescriptive and descriptive
Onomasiological | Largely semasiological
Thematic | Ad-hoc
Univocal | Multivocal
Objectivist, concept focus | Communicative, language focus
Philosophical, social concern | Commercial concern

What is terminology?

● (Terminology is) the science studying the structure, formation, development, usage and management of terminologies in various subject fields

● (A terminology is a) set of designations belonging to one special language.

● (A special language is) a language used in a subject field and characterized by the use of specific linguistic means of expression.

● (A subject field is) a field of special knowledge.

(ISO 1087-1, 2000)

According to these definitions

● An LSP (special language) contains terminology.

● Key criteria for LSP:

– Subject field

– Specific linguistic means of expression

● Therefore:

– If commercial language is an LSP, then it contains terminology.

– Commercial language is an LSP if it:

● can be viewed as a type of subject field

● has specific linguistic means of expression

What is a subject field? What is “special” knowledge?

● Pure and applied sciences, techniques, technologies, specialized activities

● Professional activities carried out in business, industry, companies, and professional settings

● Any specialized activity carried out by humans

What are “specific linguistic means of expression”?

● Textual characteristics

– concision, precision, depersonalisation, economy, referentiality, preponderance of nominal structures, dominance of written form

● Communicative situation: formal, professional

● Communicative purpose

– inform, educate

– objective, precise, concise, and unambiguous exchange of information

● Conscious acquisition

Commercial language is an LSP

● Describes tangible products, services, and activities, often within one vertical industrial or economic sector, which could be viewed as a subject field

● Adheres to specific linguistic rules and styles: many companies have a style guide, and some enforce its rules automatically through controlled authoring software

● Written form predominates

● Informative purpose

What is a term?

● General theory: the designation of an object the conceptualization of which can be classified into a system of concepts

● Socio-cognitive theory: a natural language representation of a unit of understanding, considered relevant to given purposes, applications, or groups of users

● Lexico-semantic theory: a construct that takes shape through an analysis which gives consideration to corpus evidence, subject-matter relevance, and the purpose of the terminographical product

● Textual theory: a semantically-charged linear structure that contributes to texture (coherence and cohesion) in an LSP text

● Communicative theory: all the above

What is a term for commercial terminography?

● Semantic membership in a subject field is a guiding criterion

● But bringing benefit to the company is the primary criterion

● Companies have diverse needs, requiring diverse types of terminological resources

● ......

Applications of terminology

● computer-assisted translation

● controlled authoring

● content management, automatic content classification

● product classification

● indexing, SEO, keyword management, etc....

EACH of these applications requires a HIGH LEVEL of correspondence between the termbase and the company corpus.

A term is...

● ANY lexical unit that can bring benefit to the company by being “managed” is a candidate “term”. This MAY include:

– General lexicon words

– Phrases

– TM segments

– Proper nouns

– Variants

– Non-nouns, especially verbs

Aim of our research

● Compare termbases and corpora in four IT companies to see how well the termbase terms correspond to the terms used in the corpus

● Establish the scope of the gap

● Explain the gap

● Identify ways to reduce the gap

Methodology of the research

● Obtain termbases in export files from 4 different systems

● Convert to TBX

● Import to TermWeb

● Apply necessary filters for different evaluations

● Obtain and prepare corpora for analysis

● Export corpus-valid terms from termbase

● Run batch concordance of termbase terms

● Statistically analyze results

● Identify patterns, investigate solutions, including keywords and DICE ranking
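The batch concordance step above can be sketched roughly as follows. This is a minimal illustration, not the tooling used in the research: the function name `batch_concordance`, the sample corpus, and the whole-word, case-insensitive matching rule are all assumptions, and no lemmatization is applied, so inflected forms such as plurals are not counted.

```python
import re
from collections import Counter

def batch_concordance(terms, corpus_text):
    """Count whole-word, case-insensitive occurrences of each term in the corpus."""
    counts = Counter()
    lowered = corpus_text.lower()
    for term in terms:
        # \b keeps "cluster" from matching inside "clustering"
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        counts[term] = len(re.findall(pattern, lowered))
    return counts

# Hypothetical mini-corpus and termbase export
corpus = "Select the data source. The active server uses one data source per cluster."
terms = ["data source", "cluster", "bad cluster"]

hits = batch_concordance(terms, corpus)
# Termbase terms that never occur in the corpus
absent = [t for t in terms if hits[t] == 0]
```

Terms that end up in `absent` correspond to termbase entries with no corpus evidence at all; the frequency bands used in the study itself are not reproduced here.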

Profile of the companies

● HQ in USA but global presence, all in IT sector

● Company A

– Field: statistics

– 330 employees

– Across language server, CrossDesk, CrossAuthor, xMetal, CrossTerm

● Company B

– Field: business analytics

– 13,000 employees

– Acrolinx IQ, in-house CAT tools, TermWeb

Profile of the companies

● Company C

– Field: information security, storage, management

– 18,500 employees

– SDL WorldServer, Acrolinx IQ

● Company D

– Field: hardware (PCs, servers, printers, networking), software

– 330,000 employees

– SDL Trados and MultiTerm

Size of data

Company | Corpus size in tokens | Terms from termbase | Size of corpus in relation to termbase (tokens per term)
1 | 3,973,265 | 1,777 | 2,236
2 | 19,808,928 | 6,441 | 3,075
3 | 22,136,564 | 4,195 | 3,074
4 | 400,777 | 4,385 | 91

The gap between termbases and corpora

[Chart: Range 0 + A for Company A, Company B, Company C, and Company D; values shown: 35%, 63%, 73%, 76%]

Causes of the gap

A) Under-performing termbase terms

– termbase terms that are absent or are infrequent in the corpus (generally, redundant terms)

B) Under-documented corpus terms

– corpus terms that are either entirely missing from the termbase (nonextant terms) or are in insufficient number in the termbase (infrequent terms).

Under-performing termbase terms

● Upper-case terms

● Excessively long terms

● TM segments

● Terms with unessential modifiers (boundary setting problem)

● Terms with proper name modifiers

Term boundary problems

Nonextant term | Adjusted term | Frequency
bad cluster | cluster | 8,490
automatic incremental backup | incremental backup | 521
sequential mean squares | mean squares | 129
absolute correlation coefficient | correlation coefficient | 330
individual fitted values | fitted value | 270
active data source | data source | 7,201
critical success factor component | critical success factor | 540
printhead failure | printhead | 275
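One way to look for an adjusted term automatically is to try shorter contiguous sub-phrases of a nonextant term and keep the longest one that actually occurs in the corpus. The sketch below is only an illustration of that idea under stated assumptions: `adjust_boundaries` and `count_occurrences` are hypothetical helpers, the sample text is invented, and there is no lemmatization, so a pair such as "individual fitted values" and "fitted value" would not be matched.

```python
import re

def count_occurrences(phrase, text):
    """Whole-word occurrence count of a phrase in already lower-cased text."""
    return len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))

def adjust_boundaries(term, corpus_text):
    """Return (sub_phrase, frequency) for the longest contiguous sub-phrase
    of `term` found in the corpus, or None if no sub-phrase occurs."""
    text = corpus_text.lower()
    words = term.lower().split()
    n = len(words)
    for length in range(n - 1, 0, -1):          # try longer candidates first
        best = None
        for start in range(n - length + 1):
            candidate = " ".join(words[start:start + length])
            freq = count_occurrences(candidate, text)
            if freq and (best is None or freq > best[1]):
                best = (candidate, freq)
        if best:
            return best
    return None

# Hypothetical sample text
sample = ("Run an incremental backup nightly; each incremental backup is verified. "
          "A critical success factor is uptime.")

first = adjust_boundaries("automatic incremental backup", sample)
second = adjust_boundaries("critical success factor component", sample)
```

Any candidate found this way would still need human review, since a frequent sub-phrase is not necessarily the right term boundary.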

The cost of redundant terms

● IT cost

● Reduced efficiency due to dilution of “good” entries

● Cost of creating and maintaining the entries

Under-documented corpus terms

● Variants

● Non-nouns (particularly verbs)

● Homographs

● Terms with optimally-set boundaries

● Multi-word terms containing a keyword

● Adjectives that are productive in forming MWTs

Keywords

● A word that is unusually frequent, therefore, likely a domain-specific unigram term

● Determined by comparing word-frequency lists from a domain corpus and a general purpose (reference) corpus

● Good indicators of the key topics of a corpus
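The comparison of word-frequency lists can be sketched as below. The slides do not say which keyness statistic was used, so this example assumes a simple smoothed ratio of relative frequencies; the function names, the `smoothing` parameter, and the two tiny corpora are hypothetical. Corpus tools often use other statistics, such as log-likelihood, for the same purpose.

```python
import re
from collections import Counter

def word_freqs(text):
    """Lower-cased word-frequency list for a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def keyness(domain_text, reference_text, smoothing=1.0):
    """Rank domain words by smoothed relative frequency in the domain corpus
    divided by smoothed relative frequency in the reference corpus."""
    dom, ref = word_freqs(domain_text), word_freqs(reference_text)
    dom_total, ref_total = sum(dom.values()), sum(ref.values())
    scores = {}
    for word, freq in dom.items():
        dom_rate = (freq + smoothing) / dom_total
        # Counter returns 0 for words missing from the reference corpus
        ref_rate = (ref[word] + smoothing) / ref_total
        scores[word] = dom_rate / ref_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical corpora
domain = "the server stores the data and the server plots the data"
reference = "the cat sat on the mat and the dog saw the cat"

ranked_keywords = keyness(domain, reference)
top_two = {word for word, _ in ranked_keywords[:2]}
```

In this toy data, words that are frequent in the domain text but missing from the reference text rise to the top, while function words shared by both corpora score close to 1.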

Keyword categorization

● High-ranking - highly domain-specific

– data, plot, syntax, command, string, server

● Mid- and low-ranking - potential for domain-specific homographs

– worm, cloud, wizard, key, boot

● Keywords that are absent from or are extremely rare in the reference corpus - less frequent but also highly domain-specific

– dotplot, ODBC, toolbar, widget, spyware, phishing

Keywords as nodes of MWTs

● Highly productive in forming domain-specific multi-word terms (MWTs)

● Have been successfully leveraged in term extraction research

● Successful search techniques include raw collocate frequency and DICE collocate relationship measure
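As a rough illustration of the DICE measure, the sketch below ranks the words that immediately precede a keyword by the Dice coefficient, 2 * f(x, y) / (f(x) + f(y)). The function name `dice_collocates`, the restriction to left-adjacent collocates, and the sample tokens are assumptions made for this example; the collocation window used in the research is not stated in the slides.

```python
from collections import Counter

def dice_collocates(tokens, node):
    """Rank words immediately preceding `node` by the Dice coefficient:
    2 * f(left, node) / (f(left) + f(node))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (left, right), pair_freq in bigrams.items():
        if right == node:
            scores[left] = 2 * pair_freq / (unigrams[left] + unigrams[node])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical token stream
tokens = "the backup server and the mail server use the backup server".split()

ranked = dice_collocates(tokens, "server")
```

Here "backup" outranks "mail" because it co-occurs with the keyword more often relative to its own frequency, which is the property that makes the measure useful for proposing multi-word term candidates.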

Verbs

assess, overlay, remove, replicate, save, fail, choose, enter

display, customize, gage, delete, calculate, censor, edit, forecast, plot

run, return, enable, update, compute, tick, delete, design

specify, select, create, click, plot, display, access, test

Key findings

● Unigrams and bigrams make up the vast majority of termbase terms that occur frequently in the corpus.

● Homonymous terms are important to document in a termbase.

Key findings

● Verbs and adjectives are under-documented in termbases.

● Termbases are under-optimized when it comes to documenting frequent domain-specific terms. Only three to eight percent of the termbase terms occur very frequently, and only 13 to 17 percent of their termbase terms occur frequently. Only one company managed to include a moderate level of frequent terms in its termbase (37 percent).

Conclusion

● For commercial terminography, the notion of termhood needs to take into account not only the traditional semantic criteria but also pragmatic and purpose-driven criteria.

● Terminography serving commercial purposes needs to be more corpus-driven.

● This is not “terminology” in the traditional sense. It includes various types of lexical resources.

Backup slides

In-house practices

● All 4 companies are interested in using terms in controlled authoring (CA); 3 companies are doing so already.

● Only one company (A) maintains all its data in a single termbase. The other companies maintain separate termbases for various purposes, such as CA and CAT. Company D has 15 termbases.

● Company B maintains 3 separate termbases: CA, CAT, and Authoring/Publishing aid.

● Company C uses automatic term extraction.

● Company D imports TM segments into the termbase to compensate for technical limitations of TM matching.

Most common termbase data categories

● Definition

● Part of speech

● Process status

● Usage status

● Term type

Most common termbase problems

● None of the companies use subject fields

● Only 2 companies consistently mark the part of speech

● There are widespread violations of:

– Term autonomy

– Concept orientation

– Data elementarity/granularity

Corpus-valid termbase terms

● A notion defined strictly for the purposes of our research

● Terms that we “count” to measure the gap between the termbase and the corpus

● Terms in the termbase that can reasonably be expected to occur in the corpus

● Does not include terms with negative usage markers (do not use, deprecated, etc.)

● Does not include general lexicon words, due to application specificity for controlled authoring, reduced terminological “interest”, and high expected number of concordances
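The two exclusion criteria above could be expressed as a simple filter over exported termbase entries. This is a hedged sketch only: the entry structure, the field names `usage_status` and `general_lexicon`, and the status values are hypothetical and do not reflect the data model of any particular termbase system or the filter actually used in the research.

```python
# Hypothetical negative usage markers; real termbases use their own value sets
NEGATIVE_STATUSES = {"do not use", "deprecated"}

def corpus_valid(entries):
    """Keep terms that can reasonably be expected to occur in the corpus:
    no negative usage marker and not flagged as a general lexicon word."""
    return [
        entry["term"]
        for entry in entries
        if entry.get("usage_status", "").lower() not in NEGATIVE_STATUSES
        and not entry.get("general_lexicon", False)
    ]

# Hypothetical termbase export
entries = [
    {"term": "data source", "usage_status": "preferred"},
    {"term": "hit the button", "usage_status": "do not use"},
    {"term": "click", "usage_status": "preferred", "general_lexicon": True},
    {"term": "printhead"},
]

valid_terms = corpus_valid(entries)
```

Entries with no usage status are kept, on the assumption that an unmarked term is not prohibited.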

Example of a filter for corpus-valid terms

© 2013 by Termologic. All rights reserved. 38

● Termbase verbs occur 26,000 times

● Keyword verbs occur 90,000 times