Text Analysis - scss.tcd.ie · 1 Text Analysis Khurshid Ahmad, Professor of Computer Science...


Text Analysis

Khurshid Ahmad, Professor of Computer Science

Department of Computer Science

Trinity College Dublin-2, IRELAND

PREAMBLE: MODELS FOR TEXT TECHNOLOGY?

Empirical Models: distribution of linguistic patterns (words, phrases, sentences); collocation; semantic prosody; synchronic/diachronic studies.

Psychological/Observational Models: distribution of conceptual categories; acquisition; degeneration; diachronic studies.

Intuitive Models: distribution of grammatical categories; constituency; governance; synchronic studies.


CORPUS LINGUISTICS

The aim of corpus linguistics is ‘to base accounts of language on corpora derived from systematic recordings of conversations and real discourse of other kinds, as opposed to examples obtained by introspection, by judgement of grammarians, or by haphazard observation’; and a corpus is defined ‘as any systematic collection of speech or writing in a language or variety of a language’ (Matthews 1997:78).

Matthews, P. H. (1997). Oxford Concise Dictionary of Linguistics. Oxford & New York: Oxford University Press.

Representative Corpora: The BNC

The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written.

http://www.natcorp.ox.ac.uk/corpus/index.xml

The BNC is a sample, general, synchronic and monolingual corpus.


Representative Corpora: The BNC

The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text.

http://www.natcorp.ox.ac.uk/corpus/index.xml

Characteristic: The British National Corpus

Sample: COMPRISES written sources: (a) samples of 45,000 words are taken from various parts of single-author texts; (b) shorter texts up to a maximum of 45,000 words, or multi-author texts such as magazines and newspapers, are included in full. Sampling allows for a wider coverage of texts within the 100 million limit, and avoids over-representing idiosyncratic texts.

General: INCLUDES many different styles and varieties, and is not limited to any particular subject field, genre or register. In particular, it contains examples of both spoken and written language.

Synchronic: COVERS British English of the late twentieth century, rather than the historical development which produced it.

Monolingual: DEALS with modern British English, not other languages used in Britain. However, non-British English and foreign-language words do occur in the corpus.

Representative Corpora: The BNC

http://www.natcorp.ox.ac.uk/corpus/index.xml

Token | Cumulative relative frequency | No. of OCW
the, of, and, a, in, to, it, is, was, to | 21.28% | 0
i, for, you, he, be, with, on, that, by, at | 6.66% | 0
are, not, this, but, 's, they, his, from, had, she | 4.35% | 0
which, or, we, an, n't, 's, were, that, been, have | 3.25% | 0
their, has, would, what, will, there, if, can, all, her | 2.42% | 0
as, who, have, do, that, one, said, them, some, could | 1.90% | 0
him, into, its, then, two, when, up, time, my, out | 1.57% | 1
so, did, about, your, now, me, no, more, other, just | 1.37% | 0
these, also, people, any, first, only, new, may, very, should | 1.18% | 1
as, like, her, than, as, how, well, way, our, as | 1.02% | 0
TOTAL | 45.00% | 2

Distribution of the first 100 most frequent tokens in the BNC according to the cumulative frequency of ten tokens at a time.
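The grouping used in this table can be reproduced on any text. The following is a minimal sketch, not the procedure used to build the BNC figures: the function name `block_frequencies`, the regular-expression tokeniser and the toy sample sentence are all assumptions made for illustration.

```python
from collections import Counter
import re

def block_frequencies(text, block_size=10, blocks=10):
    """Group the most frequent tokens into ranked blocks and return,
    for each block, its tokens and their combined relative frequency."""
    # Crude tokeniser (an assumption): lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    ranked = Counter(tokens).most_common(block_size * blocks)
    result = []
    for start in range(0, len(ranked), block_size):
        block = ranked[start:start + block_size]
        share = sum(count for _, count in block) / total
        result.append(([tok for tok, _ in block], share))
    return result

# Toy input; a real run would read a corpus file instead.
sample = "the cat sat on the mat and the dog sat on the log"
for words, share in block_frequencies(sample, block_size=3, blocks=2):
    print(f"{share:6.2%}  {', '.join(words)}")
```

With `block_size=10` and `blocks=10` the function yields ten rows of ten tokens each, which is the layout of the table above.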


Representative Corpora: The BNC

http://www.natcorp.ox.ac.uk/corpus/index.xml

Distribution of the first 200 most frequent tokens in the BNC according to the cumulative frequency of ten tokens at a time.

Token | Cumulative relative frequency | No. of OCW
between, years, er, many, those, there, 've, being, because, do | 0.88% | 1
're, yeah, three, down, such, back, good, where, year, through | 0.77% | 4
'll, must, still, even, know, too, here, get, own, does | 0.70% | 1
oh, last, no, more, 'm, going, so, erm, after, us | 0.65% | 1
government, might, same, much, see, yes, go, make, day, man | 0.60% | 4
another, world, see, got, work, however, life, again, against, think | 0.57% | 4
never, under, one, most, old, over, know, something, mr, take | 0.54% | 2
why, each, while, part, on, number, out_of, made, different, really | 0.49% | 3
went, ', came, after, children, always, four, without, one, within | 0.46% | 3
system, local, during, most, although, next, small, case, great, things | 0.43% | 6
TOTAL | 6.09% | 29

LANGUAGE AS A SYSTEM: INPUTS & OUTPUTS

Interpertament

Texts are responses to previous texts, and the texts are then responded to in turn, and the cycle continues; hence the diachronic dimension.


LANGUAGE AS A SYSTEM: The moonlighting terms - Lexicogenesis?

Year | The orthodoxy
1477 | An atom is a hypothetical body, so small as to be incapable of further division; and thus to be one of the ultimate particles of nature.
1650 | Physical Atoms: The supposed ultimate particles in which matter actually exists (without reference to its stability).
1819 | Chemical Atoms: The smallest particles in which the elements combine, or are known to possess the properties of a particular element.

NYE, M.J. (1986). The Question of the Atom-From the Karlsruhe Congress to the 1st Solvay Congress. A compilation of primary sources. Los Angeles: Tomash Publishers.

LANGUAGE AS A SYSTEM: The moonlighting terms - Lexicogenesis?

Year | The orthodoxy
1899 | Thomson's atomic structure based on Aepinus' one-fluid theory of electricity
1904 | Nagaoka's 'Saturnian' atom, reinterpreting Maxwell's observations about the planet
1906 | Rayleigh's infinite electron atom: an elaboration of Thomson's atomic structure
1909 | Rutherford's 'nucleus' theory: an experimentalist interpreting Nagaoka's and Crookes' observations
1913 | Bohr's theory of atomic structure: the one great fan of Rutherford's scattering experiments

Conn, G.K.T. and Turner, H.D. (1965). The Evolution of the Nuclear Atom. London: Iliffe Books Ltd; New York: American Elsevier Pub. Co.


LANGUAGE AS A SYSTEM: INPUTS & OUTPUTS

Languages are constantly in flux

The corpus linguist explores the discourse as a system that can be explained without referring to a discourse-external reality or to the mental state of the members of the discourse community.

Teubert, Wolfgang (2003). Writing, hermeneutics and corpus linguistics. Logos and Language Vol. IV (no. 2), pp. 1-17.

LANGUAGE AS A SYSTEM: INPUTS & OUTPUTS

Interpertament

Where will you find the evidence of use, definition, and elaboration of terms like:

• inclusive learning environment (e-Learning)
• Borromean Halo Nuclei (Radioactive Nuclear Beam Physics)
• honeycombed catalytic converter (Automotive Engineering)
• individualist weak supervenience (Philosophy of Science)
• indoor blood videotaping (Forensic Science)

EXCEPT IN A TEXT CORPUS?


Language as a System: The moonlighting terms – Lexicogenesis?

Term/‘Concept’ | The old ‘truth’ | The new ‘truth’
Motion: Objects move because of | an in-built tendency to move (Aristotle) | something exerts 'attraction' (Galileo)
Solar Cycle: Sunrise is caused by | a rising Sun (Brahe) | a turning earth (Kepler)
Combustion: The burning of an object means | that the mass of the object decreases by losing phlogiston to air (Priestley) | that the mass of the object increases by gaining oxygen from air (Lavoisier)
Heartbeat: Blood circulation is caused by | an explosion during diastole of the heart (Descartes) | a compression during systole of the heart (Harvey)
Species: The distinction between species | an absolute phenomenon that has been determined in the past (Linnaeus) | a contemporaneous phenomenon with borders between the species (Darwin)

Verschuuren, G. M. N. (1986). Investigating the Life Sciences: An Introduction to the Philosophy of Science. Oxford: Pergamon Press.

LANGUAGE & CHANGE

DEVELOPMENT OF CONCEPTS: ATOM

I. In philosophical and scientific use.

In senses 2 and 3 now generally held to consist of a positively charged nucleus, in which is concentrated most of the mass of the atom, and round which orbit negatively charged electrons.

1. A hypothetical body, so infinitely small as to be incapable of further division; and thus held to be one of the ultimate particles of matter, by the concourse of which, according to Leucippus and Democritus, the universe was formed.

2. In Nat. Phil. physical atoms: the supposed ultimate particles in which matter actually exists (without reference to their divisibility or the contrary), aggregates of which held in their places by molecular forces, constitute all material bodies.

3. chemical atoms: a. The smallest particles in which the elements combine either with themselves, or with each other, and thus the smallest quantity of matter known to possess the properties of a particular element. b. The smallest quantity in which a group of elements, called a radical, forms a compound corresponding to one formed by a simple element, or behaves like an element; thus the smallest known quantity of a chemical compound.

Entry printed from Oxford English Dictionary Online © Oxford University Press 2001

II. In popular use.

4. From sense 1, as the nearest popular conception to the atoms of the philosophers: One of the particles of dust which are rendered visible by light; a mote in the sunbeam. arch. or Obs.

1784 COWPER Task I. 361 The rustling straw sends up a frequent mist of atoms. 1821 BYRON Two Foscari III. i, Moted rays of light Peopled with dusty atoms.

5. The smallest conceivable portion or fragment of anything; a very minute portion or quantity, a particle, a jot: a. of matter.

c1630 DRUMMOND OF HAWTHORNDEN Poems (1633) 166 Like tinder when flints atoms on it fall. 1644 DIGBY Nat. Bodies vi. (1658) 54 Little attoms of oyl..ascend apace up the week of a burning candle. 1835 SIR J. ROSS N.-W. Pass. xxxiv. 477 There was not an atom of water.

b. of things immaterial. logical atom: one of the essential and indivisible elements into which some philosophers hold that statements can be analysed.

1873 C. S. PEIRCE in Mem. Amer. Acad. Arts & Sci. IX. II. 343 The logical atom, or term not capable of logical division, must be one of which every predicate may be universally affirmed or denied... 1918 [see ATOMISM 1b]. 1958 G. J. WARNOCK Eng. Philos. since 1900 v. 54 Russell's world of indefinitely numerous, independent logical atoms is the metaphysical opposite of Bradley's Absolute.


LANGUAGE & CHANGE

DEVELOPMENT OF CONCEPTS: NUCLEUS

Pl. nuclei and nucleuses. [a. L. nucleus (nuculeus) kernel, inner part, f. nucula or nuc-, nux nut. So F. nucleus, It., Sp., and Pg. nucleo.]

I. 1. Astr. a. The more condensed portion of the head of a comet.

b. A more condensed, usu. brighter, central part of a galaxy or nebula.

2. A supposed interior crust of the earth. Obs.

3. A central part or thing around which other parts or things are grouped, collected, or compacted; that which forms the centre or kernel of some aggregate or mass.

a. Of material (esp. more or less solid) things.

b. Of communities or groups of persons.

c. Of immaterial things.

d. Of places, buildings, etc.

e. Of collections of things.

4. Archæol. A block of flint or other stone from which early implements have been made.

Entry printed from Oxford English Dictionary Online

© Oxford University Press 2001

LANGUAGE & CHANGE

DEVELOPMENT OF CONCEPTS: NUCLEUS


II. 5. Botany a. The kernel of a nut. Now rare or Obs.

b. The kernel of a seed (see quots.).

c. The central part of an ovule.

d. In Lichens: (see quot. 1832).

e. In Fungi: (see quots.).

f. The hilum of a starch-granule.

Entry printed from Oxford English Dictionary Online

© Oxford University Press 2001


LANGUAGE & CHANGE

DEVELOPMENT OF CONCEPTS: NUCLEUS

II. 6. a. The rudiments of the shell in certain molluscs.

b. Any discrete mass of grey matter in the central nervous system. The term is used in numerous English and mod.L. combs. distinguishing the various different nuclei.

7. Biol. A cell organelle present in most of the cells of all organisms except the most primitive, usu. as a single subspherical structure, and consisting (except when undergoing division) of a membrane enclosing a ground substance (the nuclear sap) in which lie the chromosomes, one or more nucleoli, etc., and functioning as the repository of genetic information and as the director of metabolic and synthetic activity of the cell.

Entry printed from Oxford English Dictionary Online

© Oxford University Press 2001

LANGUAGE & CHANGE

DEVELOPMENT OF CONCEPTS: NUCLEUS

II. 8. Chem. An arrangement of atoms, esp. a ring structure, characteristic of a number of organic compounds.

9. A particle on which crystals, droplets, or bubbles can form in a fluid.

10. A small group of bees, including a queen, used as the foundation of a new colony.

Entry printed from Oxford English Dictionary Online

© Oxford University Press 2001


KNOWLEDGE & CHANGE

DEVELOPMENT OF CONCEPTS: NUCLEUS

II. 11. Physics. The positively charged central constituent of the atom, comprising nearly all its mass but occupying only a very small part of its volume and now known to be composed of protons and neutrons.

In Rutherford's 1911 paper called merely a ‘central charge’. In the examples in the first paragraph nucleus is used for various speculative notions concerning the atom.

Entry printed from Oxford English Dictionary Online

© Oxford University Press 2001

KNOWLEDGE & CHANGE

DEVELOPMENT OF CONCEPTS: NUCLEUS

II. 12. a. Phonetics. The syllable of a word (spoken in isolation) that bears the primary accent; in an utterance, the syllable or syllables given particular emphasis.

b. Linguistics. The main word or words in a combination, phrase, or sentence; also = KERNEL n.1 8b.

Hence nucleus v. trans., to make into a nucleus, to concentrate.

1899 KIPLING Stalky 252 They'd withdrawn all the troops they could, but I nucleused about forty Pathans.

Entry printed from Oxford English Dictionary Online

© Oxford University Press 2001


LEXICAL SIGNATURE?

– The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.

– The same concept may be referred to by different names.

– The frequency of words in a text carries a signature: if the text is specialist, then a select few terms are used repeatedly.

– Everyday, general language texts seldom carry a signature.

TEXT, TEXTURE, TEXTUALITY

Etymologically, text comes from a metaphorical use of the Latin verb texere (to weave), suggesting a sequence of sentences or utterances ‘interwoven’ structurally and semantically.

A text can be regarded as a sequential collection of sentences or utterances which form a UNITY by reason of their linguistic COHESION and semantic COHERENCE. However, it is possible for a text to comprise a single sentence, e.g. a road sign.


TEXT, TEXTURE, TEXTUALITY

New definitions appended from the OED 1993

text, n.1

Add: [1.] e. Short for TEXT-BOOK n. 2.

[2.] d. Linguistics. (A unit of) connected discourse whose function is communicative and which forms the object of analysis and description. Cf. text-frequency, linguistics

TEXT, TEXTURE, TEXTUALITY: COHESION IN TEXT

Cohesion refers to the means by which sentences in a text are linked to each other to form a paragraph. This linkage leads to larger units as well: paragraphs in a chapter, chapters in a book. The sentences are made to stick together.

The sentences themselves are words stuck together through the use of and, but, not and so on. Sometimes we use he, she, it, they and similar words for sticking sentences together. At other times we repeat words: the same word, words related to the word, sounds related to the word, or substitutes for the word.


TEXT, TEXTURE, TEXTUALITY: COHESION IN TEXT

Two kinds of words which glue a text:

GRAMMATICAL WORDS:

Conjunctions: and, but, if, then

Pronouns: he, she, it, they, them

Prepositions: on, in, of

Auxiliary and modal verbs: e.g. forms of the verb to be

LEXICAL WORDS (Repetition):

Nouns: names of persons, places and things

Adjectives: words that qualify a noun

Adverbs: modifiers of a verb

INFLECTIONS & DERIVATIONS:

Markers for plurals (car → cars)

Nominalisation: nouns from verbs (react → reaction)

TEXT, TEXTURE, TEXTUALITY: COHESION IN TEXT

LEXICAL WORDS (Repetition)

Simple repetition (the word form + plurals): Inflection: reaction → reactions

Complex repetition: Derivation: react → reaction; reactant; motor → motoring; crime → criminal

Paraphrase: Genus/Species/Instance: Electrons/Protons → Particles → Building blocks of the Universe

Compounding: {stable, unstable, trans-Uranic, halo, compound, ..} + nucleus

forensic + {analysis, laboratory, technician, science…}
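The compounding pattern, a set of modifiers attached to a head such as nucleus, can be collected from a text by counting the word that immediately precedes the head. The sketch below is illustrative only: the function name `left_modifiers`, the tokeniser and the sample sentences are assumptions, and the word forms of the head have to be supplied by hand.

```python
from collections import Counter
import re

def left_modifiers(text, head_forms):
    """Count the words that immediately precede any of the given head forms."""
    # Tokens are lower-case words, with internal hyphens kept (trans-uranic).
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    # Pair each token with its successor; sentence boundaries are ignored,
    # which is a simplification made for this sketch.
    return Counter(prev for prev, word in zip(tokens, tokens[1:]) if word in head_forms)

# Toy sentences written for this example, not taken from a corpus.
text = ("A stable nucleus differs from an unstable nucleus. "
        "The halo nucleus and the compound nucleus were studied; "
        "stable nuclei are common.")
print(left_modifiers(text, {"nucleus", "nuclei"}).most_common())
```

The same function covers the second pattern, forensic + {analysis, laboratory, ...}, if the pairs are read the other way round, i.e. by counting the word that follows the modifier.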


TEXT, TEXTURE, TEXTUALITY: COHERENCE IN TEXT

If a text makes sense, then we can identify it as such. Sentences may be connected together because they refer to the same person, place, event or thing. The connectivity, or sticking together, is provided by the content or the meaning.

Coherence can be understood better if one looks at literary texts: terms like plot, narrative, and narration are often used to describe the unity of a given literary text.

TEXT, TEXTURE, TEXTUALITY

Textuality is a term used to denote the various standards that a text (a collection of linguistic units) should meet in order to be regarded as a text.

There are many such standards, cohesion and coherence being the most prominent. Authors and readers typically have a plan or purpose when they respectively write and read a text: this is called intentionality. Acceptability is a standard which refers to the possible use a text may have for its readers. A text is generally expected to comprise new information (informativity). A text is typically related to other texts, and its readers usually expect this to be the case (intertextuality). A text is expected to have relevance to the context (situationality).


KNOWLEDGE & COMMUNICATION

Communication is, broadly, the process of exchanging information or messages; human language, in speech and writing, is the most significant and most complex communication system.

• A human language-based communications system is comparable to a machine-based (e.g. computer-based) communications system. In 1949, Shannon and Weaver introduced an elegant theory of communication. Messages in Shannon and Weaver's system are transmitted as signals from transmitter or sender to receiver via the medium of speech, for example, along the channel of sound waves.

• The human transmitter, however, is (usually) also the creator of the message; and what is communicated may be not only factual, or even verbal, but also attitudinal, social or cultural information. Indeed, humans can communicate with each other when they are silent.

KNOWLEDGE & (Co-OPERATIVE) COMMUNICATION

• Language can be viewed as 'a communicative process based on knowledge'. Generally, when humans use language, the producer and comprehender are processing information, making use of their knowledge of the language and of the topics of conversation. Language is a process of communication between intelligent active processors, in which both the producer and the comprehender(s) perform complex cognitive tasks.


KNOWLEDGE & (Co-OPERATIVE) COMMUNICATION

• The producer has communicative goals, including:
effects to be achieved,
information to be conveyed, and
attitudes to be expressed.

• The comprehender attempts to understand (the meaning of the producer's communicative goals):
by reacting (verbally or non-verbally),
by inferring new information,
by updating existing data about processes or devices,
by focusing attention on something or some of its properties, or
by preparing for subsequent utterances of the producer.

KNOWLEDGE & (Co-OPERATIVE) COMMUNICATION

[Figure: A model of co-operative communication. The Producer (current goals, cognitive processing, and a knowledge base comprising knowledge of the language, of the situation and of the world) is linked through the Medium (speech or writing) to the Comprehender (understood meaning, cognitive processing, and a knowledge base comprising knowledge of the language, of the situation and of the world).]


An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION

A. Input corpusGL /* a general language corpus comprising NGL individual words*/

B. Input corpusSL /* a corpus of specialist texts comprising NSL individual words*/

C. Conduct a uni-variate analysis of the contrastive distribution of linguistic tokens in the two corpora: extract terminology, ontology

D. Conduct a multi-variate analysis of the tokens within specialist texts to find keywords by the extent to which each keyword accounts for the variance in texts

An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION

UNIVARIATE ANALYSIS

• Input corpusGL /* a general language corpus comprising NGL individual words */
• Input corpusSL /* a corpus of specialist texts comprising NSL individual words */
• Contrast the distribution of words in corpusGL and corpusSL
• Select single words based on z-scores for relative frequency and weirdness (equivalent to tfidf)
• Find collocation patterns for selected single words
• Find hyponymic patterns using textual markers
• Construct a local grammar using collocation and hyponymic links
• Generate a recursive transition network based on local grammars.
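The single-word selection step can be sketched in a few lines. This is a hedged illustration, not the author's implementation: the slides do not spell out the weirdness formula, so it is taken here to be the ratio of a word's relative frequency in corpusSL to its relative frequency in corpusGL, (f_SL / NSL) / (f_GL / NGL). The add-one smoothing, the threshold of zero, the function names and the toy corpora are further assumptions.

```python
from collections import Counter
from statistics import mean, pstdev
import re

def tokenize(text):
    """Lower-case a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def weirdness(special_tokens, general_tokens):
    """Ratio of each word's relative frequency in the specialist and general corpora."""
    f_sl, f_gl = Counter(special_tokens), Counter(general_tokens)
    n_sl, n_gl = len(special_tokens), len(general_tokens)
    # Add-one smoothing is an assumption made here so that words absent
    # from the general corpus do not cause a division by zero.
    return {w: (c / n_sl) / ((f_gl[w] + 1) / (n_gl + 1)) for w, c in f_sl.items()}

def z_scores(values):
    """Standardise a {word: value} mapping; all zeros if there is no spread."""
    data = list(values.values())
    mu, sigma = mean(data), pstdev(data)
    return {w: (v - mu) / sigma if sigma else 0.0 for w, v in values.items()}

def candidate_terms(special_text, general_text, threshold=0.0):
    """Words whose z-scores for relative frequency and weirdness both exceed the threshold."""
    sl, gl = tokenize(special_text), tokenize(general_text)
    if not sl:
        return []
    rel_freq = {w: c / len(sl) for w, c in Counter(sl).items()}
    z_freq = z_scores(rel_freq)
    z_weird = z_scores(weirdness(sl, gl))
    return sorted(w for w in rel_freq if z_freq[w] > threshold and z_weird[w] > threshold)

# Toy corpora written for this example; real corpora would be far larger.
special = "the nucleus the nucleus the nucleus the atom of"
general = "the man said the time the room the man"
print(candidate_terms(special, general))
```

Requiring both z-scores to be high filters out grammatical words such as the, which are frequent in the specialist corpus but just as frequent in general language, and rare words, which may be weird but are not used repeatedly.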


SPECIAL LANGUAGE

• The special language of focused, single-minded pursuits: science, technology, sports, politics, philosophy, ...

• A natural language privileges persons; in contrast, the “splinter of ordinary language” that we call [specialised] scientific discourse privileges a world of objects, processes, happenings, events.

• The ‘identificatory force’ of subject position in grammar [of Indo-European languages] is reserved for speakers and their fellow creatures.

• The ‘identificatory force’ of subject position in grammar of specialist discourse is reserved for objects, processes, happenings, events.

SPECIAL LANGUAGE


GENERAL LANGUAGE

• The ‘identificatory force’ of subject position in grammar [of Indo-European languages] is reserved for speakers and their fellow creatures.

The Trial, Franz Kafka (1916)

Chapter One: Arrest - Conversation with Mrs. Grubach - Then Miss Bürstner

Someone must have been telling lies about Josef K., he knew he had done nothing wrong but, one morning, he was arrested. Every day at eight in the morning he was brought his breakfast by Mrs. Grubach's cook - Mrs. Grubach was his landlady - but today she didn't come. That had never happened before. K. waited a little while, looked from his pillow at the old woman who lived opposite and who was watching him with an inquisitiveness quite unusual for her, and finally, both hungry and disconcerted, rang the bell. There was immediately a knock at the door and a man entered. He had never seen the man in this house before. He was slim but firmly built, his clothes were black and close-fitting, with many folds and pockets, buckles and buttons and a belt, all of which gave the impression of being very practical but without making it very clear what they were actually for. "Who are you?" asked K., sitting half upright in his bed. The man, however, ignored the question as if his arrival simply had to be accepted, and merely replied, "You rang?" "Anna should have brought me my breakfast," said K.

http://www.gutenberg.org/dirs/etext05/ktria11.txt Translation Copyright (C) by David Wyllie

• The ‘identificatory force’ of subject position in grammar [of Indo-European languages] is reserved for speakers and their fellow creatures.

GENERAL LANGUAGE

http://www.gutenberg.org/dirs/etext05/ktria11.txt Translation Copyright (C) by David Wyllie

Total word count: 718
Number of different words: 429
Complexity factor (Lexical Density): 59.7%
Readability (Gunning-Fog Index) (6-easy 20-hard): 8.2
Total number of characters: 7901
Number of characters without spaces: 4048
Average Syllables per Word: 1.44
Sentence count: 82
Average sentence length (words): 17.83
Max sentence length (words): 106
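Several of these figures are simple counts. The word counts reported here are consistent with the complexity factor being the number of different words divided by the total word count (429 / 718 ≈ 59.7%), so the sketch below computes that ratio alongside the sentence measures. The tool that produced the slide's figures is not named, so the tokeniser, the sentence splitter and the function name `text_statistics` are assumptions, and the values will not necessarily match that tool's output.

```python
import re

WORD = re.compile(r"[A-Za-z']+")

def text_statistics(text):
    """Return basic word and sentence counts for a text."""
    # Naive sentence splitter (an assumption): abbreviations such as "K."
    # or "Mrs." will be counted as sentence ends.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = WORD.findall(text)
    distinct = {w.lower() for w in words}
    return {
        "total_words": len(words),
        "different_words": len(distinct),
        "type_token_ratio": len(distinct) / len(words) if words else 0.0,
        "sentence_count": len(sentences),
        "average_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "max_sentence_length": max((len(WORD.findall(s)) for s in sentences), default=0),
    }

# Toy input; a real run would read the full text file.
stats = text_statistics("He rang the bell. The man entered the room.")
for name, value in stats.items():
    print(f"{name}: {value}")
```

A high ratio of different words to total words, as in the Kafka extract, indicates little repetition; a specialist text that keeps returning to a few terms would be expected to score lower.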


• The ‘identificatory force’ of subject position in grammar [of Indo-European languages] is reserved for speakers and their fellow creatures.

GENERAL LANGUAGE

Word | Occurrences | Frequency | Rank
you | 20 | 2.80% | 1
said | 14 | 1.90% | 2
him | 12 | 1.70% | 3
what | 10 | 1.40% | 4
them | 9 | 1.30% | 5
man | 9 | 1.30% | 5
looked | 7 | 1.00% | 6
time | 7 | 1.00% | 6
room | 7 | 1.00% | 6
much | 6 | 0.80% | 7

http://onlinebooks.library.upenn.edu/webbin/gutbook/lookup?num=7849

The ‘identificatory force’ of subject position in grammar of specialist discourse is reserved for objects, processes, happenings, events

SPECIAL LANGUAGE

Total word count: 681
Number of different words: 324
Complexity factor (Lexical Density): 47.60%
Readability (Gunning-Fog Index) (6-easy 20-hard): 8.4
Total number of characters: 8276
Number of characters without spaces: 4690
Average Syllables per Word: 1.73
Sentence count: 99
Average sentence length (words): 14.57
Max sentence length (words): 55
Readability (Alternative) beta (100-easy 20-hard, optimal 60-70): 45.8


Word | Occurrences | Frequency | Rank
nucleus | 26 | 3.8% | 1
number | 20 | 2.9% | 2
charge | 18 | 2.6% | 3
atom | 17 | 2.5% | 4
atomic | 12 | 1.8% | 5
scattering | 12 | 1.8% | 5
electrons | 10 | 1.5% | 6
nuclear | 9 | 1.3% | 7
elements | 8 | 1.2% | 8
particle | 8 | 1.2% | 8

Nuclear Constitution of Atoms. Bakerian Lecture by SIR E. RUTHERFORD, F.R.S., Cavendish Professor of Experimental Physics, University of Cambridge. The Proceedings of the Royal Society, A, 97, 1920, pp. 374-400

The ‘identificatory force’ of subject position in grammar of specialist discourse is reserved for objects, processes, happenings, events

SPECIAL LANGUAGE

Mäki, Uskali. (2001). (Ed.) The Economic World View: Studies in the Ontology of Economics. Cambridge: Cambridge University Press.

Terminology, Ontology and Semantics: Theories and Things

Ontology and Metaphysics


Mäki, Uskali. (2001). (Ed.) The Economic World View: Studies in the Ontology of Economics. Cambridge: Cambridge University Press.

Terminology, Ontology and Semantics:

Lexical Signature from Prof Amazon?

account action agents argument assumption behavior behaviour

beliefs between cambridge cannot case causal causes change

choice claim conditions constraint different does

economic economists economy empirical even

evolutionary example expectations explanation fact factors firm first

game general good however idea individual interest kind laws

level machines macroeconomics market matter may means

metaphysical might model must natural nature neoclassical new

ontological ontology part particular patterns people point possible

preferences press price principle probability problem properties

question rather rational real reason rosenberg sargent say science

see seems sense set should social terms theory things thus

two university use value view whether work world

Frequency           Font Size
>100 AND <250       12
>250 AND <500       14
>1000 AND <1500     16

Special language is a language used in a subject field and characterized by the use of specific linguistic means of expression.

http://stats.oecd.org/glossary/detail.asp?ID=6151

SPECIAL LANGUAGES


Functionality and special language

Many researchers have purported to demonstrate that certain languages display particular features which are distinctly suited to serving a purpose required of that language or that have been moulded by its use.

Type                    Exemplar                       Reference
natural languages       Sacapultec Maya                John du Bois (1987)
specialist languages    language of science            Sager, Dungworth and McDonald 1981
                        legal language                 Swales and Bhatia 1983
                        language of commerce           Hoffman (1984)
                        language of military; police   Wikipedia
operational languages   Seaspeak                       Strevens 1984

SPECIAL LANGUAGES: TEXT TYPES

Imaginative Texts;

Informative Texts;

{Hortatory Texts;}

{Instructive Texts;}


SPECIAL LANGUAGES: WORD CLASSES

Open Class;

Closed Class;

Additional Class: Numerals & Interjections

SPECIAL LANGUAGE

                                   Automotive    Nuclear    Theoretical
                                   Engineering   Physics    Linguistics
Text Size
  Total No. of Tokens              326621        472108     688733
  Total Number of Texts            132           158        68
Language Variety
  British English Texts            95            59         29
  American English Texts           36            37         37
  Other English Texts              1             62         2
Register
  Imaginative: Adverts             38            0          0
  Imaginative: Popular Science     5             31         1
  Imaginative: News                24            8          0
  Imaginative: Letters             0             8          3
  Informative: Journal Papers      53            73         30
  Informative: Book - Theses       0             0          23
  Informative: Book - Monographs   4             22         8
  Informative: Official            1             9          0
  Instructive: Manuals             7             0          0
  Instructive: Reports             0             7          3


SPECIAL LANGUAGES: CHARACTERISTICS

Characteristics of special languages

Interlocking definitions;

Technical Taxonomies;

Special Expressions;

Lexical Density;

Syntactic Ambiguity;

Grammatical Metaphor;

Semantic Discontinuity

Halliday and Martin (1993:71-84)

SPECIAL LANGUAGES: CHARACTERISTICS

No. of OCW   Cumulative Rel. Frequency   Token
0            24.47%                      the, of, in, a, and, is, to, that, as, for
0            6.92%                       are, be, this, we, it, which, with, by, not, i
2            3.81%                       on, gender, an, from, have, s, can, or, there, nouns
2            2.67%                       one, but, agreement, noun, at, has, o, other, these, n
4            2.19%                       will, form, case, such, if, no, language, two, all, structure
2            1.99%                       some, they, may, between, more, example, e, c, only, mor
5            1.74%                       would, singular, also, semantic, theory, languages, p, its, forms, see
5            1.51%                       b, morphology, plural, same, t, number, where, class, stem, so
7            1.37%                       rules, masculine, then, word, syntactic, given, x, feature, binding, feminine
3            1.28%                       first, second, than, lexical, hierarchy, subject, when, like, however, different
30           47.95%                      TOTAL

Frequency distribution of the first 100 most frequent words in a linguistics corpus. Open class words (OCWs) are indicated through the use of bold type face.


SPECIAL LANGUAGES: CHARACTERISTICS

Frequency distribution of the first 100 most frequent words in nuclear physics corpus. Open class words (OCWs) are indicated through the use of bold type face.

No. of OCW   Cumulative Relative Frequency   Token
0            27.44%                          the, of, in, and, a, to, is, for, that, be
0            6.25%                           with, are, by, this, I, as, from, which, we, on
4            3.65%                           at, energy, it, an, nuclei, nucleus, have, r, these, nuclear
2            2.64%                           s, c, can, neutron, p, electrons, not, will, has, was
4            2.19%                           b, two, scattering, been, phys, one, particle, particles, e, between
3            1.98%                           number, n, such, target, atom, j, cross, m, or, potential
3            1.76%                           h, k, state, where, only, atoms, also, model, very, but
8            1.61%                           electron, mev, states, calculations, structure, than, nucleon, data, q, mass
4            1.45%                           density, more, core, f, d, section, if, theory, order, may
5            1.31%                           t, elements, range, about, first, system, other, however, interaction, matter
33           50.98%                          TOTAL

SPECIAL LANGUAGES: CHARACTERISTICS

Frequency distribution of the first 100 most frequent words in an automotive engineering corpus. Open class words (OCWs) are indicated through the use of bold type face.

No. of OCW   Cumulative Relative Frequency   Token
0            24.11%                          the, of, and, to, in, a, is, for, with, on
1            5.93%                           as, be, are, by, that, this, at, it, which, engine
6            3.70%                           from, fuel, was, system, emissions, catalyst, control, an, exhaust, or
5            2.79%                           have, not, can, vehicle, cars, s, has, air, emission, test
6            2.36%                           vehicles, speed, will, wheel, car, pressure, were, all, brake, been
2            1.91%                           these, than, fig, more, no, but, catalytic, also, only, when
6            1.72%                           high, unleaded, co, braking, new, temperature, european, g, conditions, one
4            1.55%                           abs, use, standards, gas, up, nox, its, hc, if, time
5            1.38%                           sensor, valve, road, systems, engines, during, diesel, they, would, used
2            1.28%                           so, however, two, into, driving, converter, three, low, other, between
39           46.72%                          Cumulative Frequency


SPECIAL LANGUAGE TERMINOLOGY

Terminology.

1. refers to the usage and study of terms, that is to say words and compound words generally used in specific contexts.

2. refers to a more formal discipline which systematically studies the labelling or designating of concepts particular to one or more subject fields or domains of human activity, through research and analysis of terms in context, for the purpose of documenting and promoting correct usage. This study can be limited to one language or can cover more than one language at the same time (multilingual terminology, bilingual terminology, and so forth).

http://en.wikipedia.org/wiki/Terminology


Terminology is a subject in its own right with its theoretical formalism, methods, techniques and tools.

Terminologists:

• analyze the concepts and concept structures used in a field or domain of activity

• identify the terms assigned to the concepts

• establish correspondences between terms in the various languages

• compile the terminology, on paper or in databases

• manage terminology databases

• create new terms


Simple Methodology

• Extract nouns and verbs from a source text

• Find classes in SUMO for the nouns and verbs

• Record a mapping as being either equal, subsuming or instance.

– type a single word that relates to the UBL term in the "SUMO term" or "English Word" text areas in the SUMO browser

• Create a subclass of SUMO if it's a subsuming mapping

• Add properties to the subclass

– reusing SUMO properties

– extending SUMO properties by creating a &%subrelation of an existing property

• Add English definition to the class

– define constraints that express how the subclass is more specific than the superclass

• Express the classes and properties in KIF and begin creating axioms, based on the English definitions created previously

Permission to reuse granted so long as this notice is not altered – Author: Adam Pease [email protected], 2003

Suggested Upper Merged Ontology

Simple Methodology

an ontology is a data model that represents a domain and is used to reason about the objects in that domain and the relations between them.

http://en.wikipedia.org/wiki/Ontology_%28computer_science%29#Domain_ontologies_and_upper_ontologies

Individuals: the basic or "ground level" objects

Classes: sets, collections, or types of objects

Attributes: properties, features, characteristics, or parameters that objects can have and share

Relations: ways that objects can be related to one another
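The four components can be pictured as a small data model. The sketch below is only an illustration under stated assumptions: the class names (`OntClass`, `Individual`, `Relation`), the helper `is_instance_of` and the physics examples are hypothetical, and this is not how SUMO or any particular ontology language is implemented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntClass:
    """A class: a set, collection or type of objects."""
    name: str
    parent: Optional["OntClass"] = None  # superclass, if any


@dataclass
class Individual:
    """A basic, 'ground level' object belonging to a class."""
    name: str
    cls: OntClass
    attributes: dict = field(default_factory=dict)  # properties it has


@dataclass
class Relation:
    """A named way in which two individuals are related."""
    name: str
    source: Individual
    target: Individual


def is_instance_of(individual: Individual, cls: OntClass) -> bool:
    """Simple reasoning step: walk up the superclass chain."""
    current = individual.cls
    while current is not None:
        if current is cls:
            return True
        current = current.parent
    return False


# Hypothetical example data, chosen only for illustration.
particle = OntClass("Particle")
nucleus = OntClass("Nucleus", parent=particle)
atom = OntClass("Atom")
helium_nucleus = Individual("helium nucleus", nucleus, {"charge": 2})
helium_atom = Individual("helium atom", atom)
part_of = Relation("part_of", helium_nucleus, helium_atom)

print(is_instance_of(helium_nucleus, particle))
```

The subclass walk is the simplest kind of reasoning such a model supports; real ontology languages add axioms and constraints on top of this structure.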


PREAMBLE:

Special Language

A note on creativity or terminicide

Source              Lexical Difficulty
Nature              55.5
Science             44.8
Quality Newspaper   0.0
New Scientist       4.0
Physics Today       13.3
Cell                31.6

Donald Hayes (1992) ‘The growing inaccessibility of science’. Nature. Vol 356, pp 739-740

‘That science has become more difficult for nonspecialists to understand is a truth universally acknowledged’. The choice of words in a journal paper is very different from that in a quality newspaper, and this obscures the work of the scientists.

Lexicogenesis: Diachronic Semantic Change

The establishment of the unstable nucleus (1990’s)


LEXICAL SIGNATURE?

– The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.

– The same concept may be referred to by different names;

– The frequency of words in a text carries a signature: if the text is specialist then a select few terms are repeatedly used;

– Everyday, general language texts seldom carry a signature.


Texts in forensic science can be identified by the signature:

SINGLE TERMS: evidence, crime, scene, forensic, police, identification, case, court, analysis, time, information, blood

& COMPOUND TERMS: crime scene, forensic evidence, court case, blood analysis, earprint, fingerprint, crime scenes


In texts in all specialist domains, a few repeatedly used terms form the SIGNATURE. These terms are used PRODUCTIVELY: in plural form, as (heads of) compounds, and in derivative forms.

nucleus → nuclei (PL.), nuclear (Adjective); stable/unstable nuclei; halo/closed shell nuclei; nuclear force/reaction; nuclear matter

crime → crime, criminal, crimes, criminals, criminalistics, criminology, criminalist(s), criminological, criminality; crime scene; crime of passion; property crime


BUILDING A THESAURUS


Word   SFSC: Relative Frequency   BNC: Relative Frequency   SFSC/BNC: WEIRDNESS
the    6.8%                       6.2%                      1.1
of     3.7%                       2.9%                      1.2
and    2.7%                       2.7%                      1.0
to     2.5%                       2.6%                      1.0
a      2.4%                       2.1%                      1.1

British National Corpus (BNC) = 100 Million words;

Surrey Forensic Science Corpus (SFSC) = 0.58 Million words;

The 5 words have about the same distribution in the two corpora. These are the so-called closed class words, or grammatical words, and one may expect to find them with about the same frequency because both corpora comprise English language texts. There is no weirdness in the use of these words in the Forensic Science corpus.

BUILDING A THESAURUS

Word       SFSC: Relative Frequency   BNC: Relative Frequency   SFSC/BNC: WEIRDNESS
evidence   0.47%                      0.021%                    22
crime      0.40%                      0.007%                    57
scene      0.27%                      0.007%                    40
forensic   0.25%                      0.001%                    473
police     0.25%                      0.028%                    9

British National Corpus (BNC) = 100 Million words;

Surrey Forensic Science Corpus (SFSC) = 0.58 Million words;

The 5 words do not have the same distribution in the two corpora. These are the so-called open class words, or lexical words. Relative to corpus size, for every 22 instances of evidence in the Surrey corpus there is only one instance of this word in the BNC. And forensic is the most weird: 473 instances in the Surrey Corpus as opposed to only one in the BNC.
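The weirdness figures are ratios of a word's relative frequency in the special-language corpus to its relative frequency in the general-language corpus. The sketch below recomputes them from the rounded percentages given on this slide. The function name and the choice to return infinity for a word absent from the general corpus are assumptions made for illustration, not the author's implementation; because the inputs are rounded, the results only approximate the slide's values (most visibly for forensic).

```python
def weirdness(rel_freq_special: float, rel_freq_general: float) -> float:
    """Ratio of relative frequencies: special corpus over general corpus."""
    if rel_freq_general == 0:
        # Assumed convention for words that never occur in the general corpus.
        return float("inf")
    return rel_freq_special / rel_freq_general


# Relative frequencies in percent, as printed on the slide.
sfsc = {"evidence": 0.47, "crime": 0.40, "scene": 0.27,
        "forensic": 0.25, "police": 0.25}
bnc = {"evidence": 0.021, "crime": 0.007, "scene": 0.007,
       "forensic": 0.001, "police": 0.028}

# Rank the words from most to least weird.
for word in sorted(sfsc, key=lambda w: weirdness(sfsc[w], bnc[w]), reverse=True):
    print(f"{word:10s} {weirdness(sfsc[word], bnc[word]):8.1f}")
```

Closed class words such as the or of would score close to 1 under the same calculation, which is why a threshold on weirdness separates candidate terms from grammatical words.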

BUILDING A THESAURUS


Word         SFSC: Relative Frequency   BNC: Relative Frequency   SFSC/BNC: WEIRDNESS
bitemark     0.0187%                    0%
earprint     0.0137%                    0%
accelerant   0.0115%                    0%
pyrolysis    0.0139%                    0.00001%                  634
ballistics   0.0146%                    0.00002%                  1263

British National Corpus (BNC) = 100 Million words;

Surrey Forensic Science Corpus (SFSC) = 0.58 Million words;

The first three words DO NOT EXIST in the BNC: these are the so-called neologisms, or new words. Pyrolysis and ballistics are both also rarely used words in the BNC.

BUILDING A THESAURUS


Collocation patterns – semantic prosody in Surrey Forensic Science Corpus


An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION

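\[ \mathrm{weirdness} = \frac{N_{GL}\, f_{SL}}{(1 + f_{GL})\, N_{SL}} \]

\[ \text{Coll\_Strength:}\quad k_{i} = \frac{f_{i} - \bar{f}}{\sigma} > k_{0} \]

\[ \text{Coll\_Spread:}\quad U_{i} = \frac{\sum_{j=1}^{10} \left(p_{i}^{j} - \bar{p}_{i}\right)^{2}}{10} > U_{0} \]

\[ \text{Peakedness:}\quad p_{i}^{j} \geq \bar{p}_{i} + k_{1}\sqrt{U_{i}} \]

\[ \text{Metrics:}\quad (k_{0}, k_{1}, U_{0}) \equiv (1, 1, 10) \]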

Ahmad, Khurshid, and Rogers, Margaret A. (2001). ‘Corpus Linguistics and Terminology Extraction’. In (Eds.) Sue-Ellen Wright and Gerhard Budin. Handbook of Terminology Management (Volume 2). Amsterdam & Philadelphia: John Benjamins Publishing Company. pp 725-760.

Smadja, Frank. (1994). ‘Retrieving Collocations from Text: Xtract’. In (Ed.) Susan Armstrong(-Warwick). Using Large Corpora. Cambridge, Massachusetts & London, England: MIT Press. pp 143-177.

British National Corpus (BNC) = 100 Million words;

Surrey Nanotube Corpus (SFSC) = 1.09 Million words;

BUILDING A THESAURUS

Collocate       Freq   -5  -4  -3  -2  -1    1   2   3   4   5
nanotubes       690     8   8   9   2   0  647   6   0   7   3
nanotube        252     3   2   2   0   0  229   2   1   5   8
single-walled    77     0   0   1   1  75    0   0   0   0   0
aligned          94     1   1   3   5  74    0   1   1   3   5
multiwalled      70     1   1   2   0  59    0   0   1   5   1
amorphous        58     1   1   6   0  46    0   1   1   0   2
atoms            51     1   2   0   1   0   42   0   1   3   1

Collocations with carbon (frequency of 1506) in the Surrey Nanoscale science corpus.
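A table of this kind can be built by counting, for every occurrence of the node word, which words appear at each offset from -5 to +5. The sketch below shows that counting step together with a spread measure in the style of Smadja (1994), taken here as the variance of the ten positional counts. The function names and the whitespace tokenisation are assumptions for illustration, not the procedure actually used to build the Surrey corpus tables.

```python
from collections import Counter, defaultdict


def positional_collocates(tokens, node, span=5):
    """Map each collocate of `node` to a Counter of offsets (-span..+span)."""
    table = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            table[tokens[j]][offset] += 1
    return table


def spread(position_counts, span=5):
    """Variance of the counts over the 2*span positions around the node."""
    counts = [position_counts.get(o, 0)
              for o in range(-span, span + 1) if o != 0]
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)


# Toy text, tokenised on whitespace (an assumption of this sketch).
tokens = "carbon nanotubes are aligned carbon nanotubes".split()
table = positional_collocates(tokens, "carbon")
print(dict(table["nanotubes"]))

# Spread for the 'nanotubes' row of the table above.
nanotubes_row = dict(zip([-5, -4, -3, -2, -1, 1, 2, 3, 4, 5],
                         [8, 8, 9, 2, 0, 647, 6, 0, 7, 3]))
print(spread(nanotubes_row))
```

A collocate whose occurrences pile up at one position, as nanotubes does at +1, has a very large spread, whereas a word scattered evenly around the node has a spread near zero. That contrast is what makes the positional histogram useful for picking out compound terms such as carbon nanotubes.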


British National Corpus (BNC) = 100 Million words;

Surrey Nanotube Corpus (SFSC) = 1.09 Million words;

BUILDING A THESAURUS

Collocations with carbon nanotubes (frequency of 647) in the Surrey Nanoscale science corpus.

Collocate       Frequency   -5  -4  -3  -2  -1   1   2   3   4
single-walled   73           0   0   1   1  71   0   0   0   0
aligned         63           1   1   1   5  48   0   0   2   4
multiwalled     53           0   0   1   0  46   0   0   5   1
properties      60           1   4  15  32   0   0   0   6   2
multiwall       34           0   1   0   1  30   0   2   0   0

No.   Potential ‘Hyponymic’ Patterns
1     NP0 such as {NP1, NP2, ..., (and|or) NPn}
2     such NP0 as {NP1, NP2, ..., (and|or) NPn}
3     {NP1, NP2, ..., NPn} (and|or) other NP0
4     NP0 (including|especially) {NP1, NP2, ..., (and|or) NPn}

Examples: injury including broken bone; the bow lute, such as the Bambara ndang
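Patterns of this kind can be approximated with regular expressions. The sketch below covers only patterns 1 and 4 and rests on strong simplifying assumptions: the hypernym NP0 is taken to be the single word immediately before the cue phrase, and the hyponym list is whatever follows up to the next full stop or semicolon. The names `PATTERN` and `hyponym_pairs` are hypothetical; a realistic extractor would use a part-of-speech tagger or chunker to find noun phrases.

```python
import re

# Cue phrases for patterns 1 and 4 above.
CUES = r"such as|including|especially"

# Assumption: NP0 is approximated by the one word preceding the cue.
PATTERN = re.compile(rf"(?P<hyper>[\w-]+),? (?:{CUES}) (?P<hypos>[^.;]+)")


def hyponym_pairs(sentence: str):
    """Return (hypernym, hyponym) candidate pairs found in a sentence."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        # Split the list on commas and on 'and' / 'or'.
        items = re.split(r",\s*(?:and |or )?|\s+(?:and|or)\s+",
                         match.group("hypos"))
        pairs.extend((match.group("hyper"), item.strip())
                     for item in items if item.strip())
    return pairs


print(hyponym_pairs("injury including broken bone"))
print(hyponym_pairs(
    "This method has been applied in the synthesis of various metal "
    "nanostructures such as nanowires, nanorods, and nanoparticles."
))
```

The output pairs are candidates only. Trailing material after the last list item (for example "in solution") is not trimmed by this sketch, so results would normally be filtered, for instance against the list of high-weirdness terms.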

An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION


• This method has been successfully applied in recent years in the synthesis of various metal nanostructures such as nanowires, nanorods, and nanoparticles.

• Occasional multiwall carbon nanotubes and other carbon nanostructures were also found following annealing at higher (> °C) temperatures.

• The present method will be extended to find and fix nanoparticles including polymers, colloids, micelles, and hopefully biological molecules/tissues in solution.

• This technique is promising because many different types of nanowires, like nanotubes or semiconductor nanowires, are now synthetically available.

British National Corpus (BNC) = 100 Million words;

Surrey Philosophy of Science Corpus = 1.042 Million words; 164 Texts, 1990-2000;

Journal Papers; Letters, Conference Announcements, Courses

BUILDING A THESAURUS & ONTOLOGY


PREAMBLE:

Special Language

A note on creativity or terminicide

Source                                Lexical Difficulty
Popular Science (Discover)            -4.7
Farm-workers talking to cows          -63.8
Children’s fiction                    -27.4
Nat. History magazine (Ranger Rick)   -22.6
Fiction                               -19.3
Quality Newspaper                     0.0

Lexicogenesis: Diachronic Semantic Change

The establishment of the nuclear atom (1890-1930)

Bohr