Text Processing with Finite State Transducers in Unitex

ANALYSIS OF IMAGES, SOCIAL NETWORKS, AND TEXTSApril, 9-11th, 2015, Yekaterinburg

Text Processing with Finite StateTransducers in Unitex

Artem Lukanin

This work is partially supported by the RFH grant #13-04-12020“New open electronic thesaurus for Russian”.

http://about.me/alukanin

What is Unitex?• An open-source corpus processor, based on automata-oriented

technology

• mainly developed by Sébastien Paumier at the Institut Gaspard-Monge

(IGM), University of Paris-Est Marne-la-Vallée (France)

• It works on Windows, Linux, Mac OS and other systems

• It has lexical resources for French, English, Greek, Portuguese, Russian,

Thai, Korean, Italian, Spanish, Norwegian, Arabic, German and more

• http://www-igm.univ-mlv.fr/~unitex/

2

http://www-igm.univ-mlv.fr/

http://www-igm.univ-mlv.fr/~unitex/

What is corpus?A corpus is a collection of pieces of language text in electronic form, selected

according to external criteria to represent, as far as possible, a language or

language variety as a source of data for linguistic research.

Sinclair 2005“

3

What is Finite State Transducer (FST)?FST, is a type of finite automaton which maps between two sets of symbols.

We can visualize an FST as a two-tape automaton that recognizes or

generates pairs of strings. Intuitively, we can do this by labeling each arc in

the finite-state machine with two symbol strings, one from each tape.

Jurafsky 2000

“4

Simple sentence splitting FST

... в четвертичном периоде. Достигали высоты ...

... в четвертичном периоде. {S} Достигали высоты ...

5

Get your corpus from a text file in Unitex1. Run Unitex

• If you are working on Windows, the program will ask you to choose a

personal working directory, which you can change later in

Info>Preferences...>Directories .

2. Select Russian as your working language

• For each language that you will be using, for the first time the

program will copy the root directory of that language to your

personal directory, except the dictionaries.

6

Get your corpus from a text file in Unitex3. Open corpus-ru-dbpedia-short-dea-1000.csv from the

Corpus subfolder: Text > Open...

4. Preprocess the text

• Apply Sentence.grf in MERGE mode

• Apply Replace.grf in REPLACE mode

• Tokenize the text

• Apply all default dictionaries

• Analyze unknown words as free compound words

7

Preprocessing• Sentence.grf splits the text into sentences, adding {S} tag before

the next sentence (language dependent)

• Replace.grf removes ¬ (soft hyphen) and converts no-break spaces

to spaces

• The standard separators (the space, the tab and the newline characters)

are normalized

8

Tokenization• is language (alphabet) dependent

• Newlines in a text are replaced by spaces

• A token can be:

• the sentence delimiter {S}

• the stop marker {STOP} to delimit texts

• a lexical tag, e.g. {ЮУрГУ,.N+ORG+gen(M)}

• a contiguous sequence of letters (from alphabet.txt )

• one (and only one) non-letter character, e.g. a digit

9

Applying dictionaries• consists of building the subset of dictionaries consisting only of forms

that are present in the text

• The corpus becomes "tagged", i.e. every token is assigned all possible

grammatical forms

• e.g. семью assigned these lexical tags:

семью,семья.N+anim(j)+gen(F):aeF

семью,.ADV

семью,семь.NUM+plur:t

10

Hyponyms and hypernymsUnlike synonymy and antonymy, which are lexical relations between word

forms, hyponymy/hypernymy is a semantic relation between word meanings:

e.g., {maple} is a hyponym of {tree} , and {tree} is a hyponym of {plant} .

Much attention has been devoted to hyponymy/hypernymy (variously called

subordination/superordination, subset/superset, or the ISA relation)...

“11

Hyponyms and hypernymsA concept represented by the synset {x, x′,...} is said to be a hyponym of the

concept represented by the synset {y, y′,...} if native speakers of English accept

sentences constructed from such frames as An x is a (kind of) y. The relation

can be represented by including in {x, x′,...} a pointer to its superordinate, and

including in {y, y′,...} pointers to its hyponyms.

Miller 1993

“12

Hyponym and hypernym mining fromRussian textsМамонты — вымерший род млекопитающих из семейства

слоновых, живший в четвертичном периоде.{S} Достигали

высоты 5,5 метров и массы тела 10—12 тонн.{S}

Таким образом, мамонты были в два раза тяжелее самых

крупных современных наземных млекопитающих —

африканских слонов .

13

IndicatorsМамонты — вымерший род млекопитающих из семейства

слоновых, живший в четвертичном периоде.{S}

1. Text > Locate pattern...

2. Type род into Regular expression

3. Select Index all utterances in text in Search limitation

4. Click Search

14

Concordance

• hyponyms and hypernyms are nouns

• вымерший (participle) and широколиственных (adjective) can be

omitted

15

Patterns in Unitex1. Text > Locate pattern...

2. Regular expression <N> — <V:S>* род (<A>+<!DIC>)* <N>

3. Click Search

2 matches

Мамонты — вымерший род млекопитающих из семейства слоновых

Бук — род широколиственных деревьев семейства Буковые

01.

02.

16

Lexical masks• <род> : matches all the entries that have род as canonical form

• <стать.V> : matches all entries having стать as canonical form and

the grammatical code V

• <V> : matches all entries having the grammatical code V

• {стану,стать.V} or <стану,стать.V> : matches all the entries

having стану as inflected form, стать as canonical form and the

grammatical code V

17

Lexical masks. Special symbols• <E> : the empty word or epsilon. Matches the empty string

• <TOKEN> : matches any token, except the space; used by default for

morphological filters

• <MOT> : matches any token that consists of letters

• <MIN> : matches any lower-case token

• <MAJ> : matches any lower-case token

• <PRE> : matches any token that starts with a capital letter

18

Lexical masks. Special symbols• <DIC> : matches any word that is present in the dictionaries of the text

• <SDIC> : matches any simple word in the text dictionaries

• <CDIC> : matches any composed word in the dictionaries of the text

• <TDIC> : matches any tagged token like {XXX,XXX.XXX}

• <NB> : matches any contiguous sequence of digit (1234 is matched but

not 1 234)

• <#> : prohibits the presence of space

19

Graphs in Unitex• can match text (Finite State Automata)

• can produce new output text (Finite State Transducers)

• in MERGE mode combine the matched input text and the output text

(useful fot tagging)

• in REPLACE mode convert the matched input text into the output

text

20

1. FSGraph > New

2. Click on the initial state (arrow), click inside the empty place while

holding Ctrl to create a new box, connected to the initial state, type <N> ,

press Enter

21

A graph for matching text3. Create a — box, connected to the <N> box

4. Create a род box, connected to the — box

5. Create a <N> box, connected to the род box

6. Click on the second <N> box, click on the final state (a circle with a

square inside) to connect these 2 boxes

7. Create a <V:S> box between the — and род boxes

8. Create a <A>+<!DIC> box between the род and <N> boxes

9. Save the graph as Graphs/match-hyponyms.grf : FSGraph > Save

22

A graph for matching text

Text > Locate Pattern... , Locate pattern in the form of: Graph, Set

match-hyponyms.grf , Search

23

Transducers in Unitex1. Click on the first <N> box (hyponym) and change it to <N>/{[ to add

{[ before the matched noun, when the graph is applied in the MERGE

mode

2. Click on the <N>/{[ and click on the — box to disconnect these boxes

3. Create a <E>/]=HYPONYM} box between the <N>/{[ and — boxes.

It will add ]=HYPONYM} after the matched noun

4. Modify the second <N> box for adding a HYPERNYM tag to it

24

Transducers in Unitex5. Save the graph as tag-hyponyms.grf

25

Tagging hyponyms and hypernyms1. Text > Locate pattern...

2. Set tag-hyponyms.grf

3. Select Merge with input text in Grammar outputs

4. Click Search

5. Build concordance

• The matched and tagged texts are stored in the concord.ind

file in the corpus folder

corpus-ru-dbpedia-short-dea-1000_snt

26

Tagging hyponyms and hypernyms{[Мамонты]=HYPONYM} — вымерший род

{[млекопитающих]=HYPERNYM} из семейства слоновых

{[Бук]=HYPONYM} — род широколиственных

{[деревьев]=HYPERNYM} семейства Буковые

• We can then use some script to extract tagged hyponyms and

hypernyms...

• or mine them right in Unitex in the REPLACE mode

01.

02.

27

Mining hyponyms and hypernyms1. Open match-hyponyms.grf : FSGraph > Open...

2. Click on the first <N> box, right-click on it and select

Surround with > Morphological mode

3. Click on the first <N> box and change it to <N>/$hyponym$ to store

the matched noun with all morphological information in the

$hyponym$ variable

28

Mining hyponyms and hypernyms4. Modify the second <N> box to store the matched noun in variable

$hypernym$ in the morphological mode

5. Add <E>/$hypernym.LEMMA$: $hyponym.LEMMA$ before the

final state

6. Save this graph as mine-hyponyms.grf

7. In Info > Preferences... > Morphological dictionaries add

Dela/CISLEXru_igrok.bin

29

Mining hyponyms and hypernyms

30

Mining hyponyms and hypernyms1. Set this graph in Text > Locate pattern...

2. Select Replace recognized sequences in Grammar outputs

3. Click Search

млекопитающее: мамонт

дерево: бук

дерево: бука

дерево: Бук

01.

02.

03.

04.

31

Mining hyponyms and hypernyms1. Why so many Бук outputs? Let's see in the dictionary: DELA >

Lookup... , select CISLEXru_igrok.bin and enter this word

Бук,.N+FAMN+PN+anim(o)+gen(M):neM

Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:dm

бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom

бук,.N+anim(j)+gen(M):neM:ajeM

32

Mining hyponyms and hypernyms2. Let's modify mine-hyponyms.grf to remove ambiguous outputs:

change the first <N> box to <N~PN:n>

2 outputs

млекопитающее: мамонт

дерево: бук

01.

02.

33

References1. Jurafsky, D., & James, H. (2000). Speech and language processing an

introduction to natural language processing, computational linguistics,

and speech.

2. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990).

Introduction to wordnet: An on-line lexical database*. International

journal of lexicography, 3(4), 235-244.

34

References3. Paumier, S. (2015). Unitex 3.1.beta User Manual. Université Paris-Est

Marne-la-Vallée. January 15, 2015,

http://igm.univ-mlv.fr/~unitex/UnitexManual3.1.pdf

4. Sinclair, J. (2005). "Corpus and Text - Basic Principles" in Developing

Linguistic Corpora: a Guide to Good Practice, ed. M. Wynne. Oxford: Oxbow

Books: 1-16. Available online from

http://ahds.ac.uk/linguistic-corpora/ [Accessed 2015-04-01].

35

http://igm.univ-mlv.fr/~unitex/UnitexManual3.1.pdf

http://ahds.ac.uk/linguistic-corpora/

Text Processing in Unitex• PatternSim (github.com/cental/PatternSim) — a tool for calculation

semantic similarity between words from a text corpus based on lexico-

syntactic patterns

• Normatex (github.com/avlukanin/normatex) — Russian text normalization

for speech synthesis, machine translation and other natural language

processing tasks

• Unitext Tutorial (github.com/avlukanin/unitextutorial) — the slides and

source files used in this tutorial

36

https://github.com/cental/PatternSim

https://github.com/avlukanin/normatex

https://github.com/avlukanin/normatex

Text Processing with Finite StateTransducers in UnitexArtem Lukanin

• about.me/alukanin

• @avlukanin

• [email protected]

Slides: artyom.ice-lc.com/slides/unitextutorial

37

http://about.me/alukanin

http://twitter.com/avlukanin

mailto:[email protected]

http://artyom.ice-lc.com/slides/unitextutorial/

Text Processing with Finite State Transducers in Unitex

Education

Transcript of Text Processing with Finite State Transducers in Unitex