Monday Afternoon

Review

Introduction to Natural-Language Morphology

Relations and Transducers

Introduction to xfst

Basic Review

What is a Symbol?

What is an Alphabet?

What is a string?

What is a word?

What is a Language?

What basic operations can be performed on Sets?

What basic operations can be performed on Languages?

Formal Languages and Natural Languages

Any set of strings is a language (in the formal sense).

L1 = { “a”, “aa”, “aaa”, “aaaa”, “aaaaa”, …}

L2 = { “zzmy”, “niwhiuhew”, “sjehuiwheu” }

L3 = { “dog”, “cat”, “elephant”, “zebra” }

The systems that we write will “accept” or “map” words in a formal language.

In practical natural-language processing, we try to make these formal languages as close as possible to a natural language, e.g. Swahili. That is, we try to model a natural language as accurately as possible.

We write grammars and rules that model a natural language, using lexc and xfst.

Concatenation can form “Real” Words

Root Language: work talk walk

Suffix Language: 0 ing ed s

The concatenation of the Suffix language after the Root language:

work working worked works talk talking talked talks walk walking walked walks
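The concatenation of the Root and Suffix languages can be sketched outside xfst with plain Python sets. This is only an illustration under stated assumptions: the names `root`, `suffix` and `concatenate` are hypothetical, and the empty string stands in for the empty suffix 0.

```python
# Finite languages modelled as Python sets of strings (an illustration, not xfst).
root = {"work", "talk", "walk"}
suffix = {"", "ing", "ed", "s"}  # "" plays the role of the empty suffix 0


def concatenate(first, second):
    """Return every string made of a string from `first` followed by one from `second`."""
    return {x + y for x in first for y in second}


# Suffix language concatenated after the Root language.
words = concatenate(root, suffix)
print(sorted(words))
```

Because concatenation is ordered, `concatenate(suffix, root)` would yield a different language (strings such as "ingwork").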

Concatenation can also form Bad Words

Root Language: try plot wiggle

Suffix Language: ing ed s

Raw Concatenation Result/Level/Language:

*trys *tryed trying plots *ploted *ploting wiggles *wiggleed *wiggleing

Desired Final Result/Level/Language:

tries tried trying plots plotted plotting wiggles wiggled wiggling

We call the discrepancies between the levels “alternations”.

Inuktitut

Paris+mut+nngau+juma+niraq+lauq+si+ma+nngit+junga

“I never said that I wanted to go to Paris”

Pari mu nngau juma nira lauq si ma nngit tunga

Paris ‘Paris’

mut terminalis-case

nngau direction-to

juma want

niraq declare that

lauq past

si perfective

ma resulting state

nngit negative

junga 1P pres. indic

Concatenative-Agglutinative (Aymara)

“also they are in your house”

Lexical: uta+ma+na-ka+pxa+raki-i+wa

Surface: uta ma n ka pxa rak i wa

uta = house (noun stem)

+ma = 2nd person possessive (your)

+na = in (case suffix)

-ka = locative (also verbalizes)

+pxa = plural

+raki = also

-i = 3rd person present tense

+wa = topic (primary emphasis)

Morphology

In most languages, morphemes are just concatenations of symbols from the alphabet of the language.

In most languages, words are just concatenations of morphemes.

But raw concatenation often gives us abstract, morphophonemic, not-yet-correct words.

There are alternations between the raw concatenations and the desired final words.

There are two challenges in natural-language Morphology:

Morphotactics: the study of word-formation

Alternation: mapping between raw concatenations and final forms

We claim that both can be modeled and computed using finite-state methods. We use lexc for morphotactics and xfst for alternation rules.

Transducers

Recall that finite-state transducers can “map” from one string of symbols to a different string of symbols.

c a n t a r [Verb] [PInd] [1P] [Sg]

c a n t ε ε ε o ε ε

We can also use transducers to map between abstract, not-yet-correct forms (usually built by simple concatenation) and correct surface forms.

w i g g l e i n g

w i g g l ε i n g

Regular Relations

A Regular Language is a set of strings, e.g. { “cat”, “fly”, “big” }.

An ordered pair of strings, notated <“upper”, “lower”>, relates two strings, e.g. <“wiggleing”, “wiggling”>.

A Regular Relation is a set of ordered pairs of strings, e.g.

{ <“cat+N”, “cat”> , <“fly+N”, “fly”> , <“fly+V”, “fly”>, <“big+A”, “big”> }

The set of upper-side strings in a relation is a Regular Language.

The set of lower-side strings in a relation is a Regular Language.

A Regular Relation is a “mapping” between two Regular Languages. Each string in one of the languages is “related” to one or more strings of the other language.

A Regular Relation is encoded in a Finite-State Transducer (FST).

Relations, Analysis and Generation

Given a transducer (relation), and a string, we can see the mappings of the relation via Analysis and Generation:

c a n t a r [Verb] [PInd] [1P] [Sg]

c a n t ε ε ε o ε ε

Upper-side string: c a n t a r [Verb] [PInd] [1P] [Sg]

Lower-side string: c a n t o

Apply the transducer in a downward direction to the upper-side string to perform Generation.

Apply the transducer in an upward direction to the lower-side string to perform Analysis.
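For a small finite relation, analysis and generation can be mimicked in plain Python by treating the relation as a set of (upper, lower) pairs. This is a sketch, not how xfst represents networks; the function names `apply_up` and `apply_down` are borrowed from the xfst commands purely for readability, and the relation is the cat/fly/big example given earlier.

```python
# A finite relation as a set of (upper, lower) string pairs (illustration only).
relation = {
    ("cat+N", "cat"),
    ("fly+N", "fly"),
    ("fly+V", "fly"),
    ("big+A", "big"),
}


def apply_down(rel, upper_string):
    """Generation: return all lower-side strings related to `upper_string`."""
    return {lower for upper, lower in rel if upper == upper_string}


def apply_up(rel, lower_string):
    """Analysis: return all upper-side strings related to `lower_string`."""
    return {upper for upper, lower in rel if lower == lower_string}


# The upper and lower languages are the two projections of the relation.
upper_language = {upper for upper, _ in relation}
lower_language = {lower for _, lower in relation}

print(apply_up(relation, "fly"))
print(apply_down(relation, "cat+N"))
```

A string that is not in the relevant language, such as "dog" here, simply yields an empty set of results.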

Transducers Encode Finite-State Relations

Let a Relation X include the ordered string pairs

{ <“cantar[Verb][PInd][1P][Sg]”, “canto”>,

<“canto[Noun][Masc][Sg]”, “canto”> }

What is the upper-side Language of this Relation?

What is the lower-side Language of this Relation?

How can such a relation be encoded?

What do you get when you analyze the string “canto”?

What do you get when you generate from the string “cantar[Verb][PInd][1P][Sg]”?

Rules and Infinite Relations

One or both of the Languages related by a Finite-State Relation can be infinite, e.g. the relation that relates lower-case words to their upper-case versions:

{ <“a”, “A”>, <“aa”, “AA”>, <“dog”, “DOG”>, … }

a:A

b:B

c:C

d:D

Etc. (assume arcs for all other symbols in the alphabet)

Apply this network in a downward direction to the input string “cad”. What is the output?
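A network of this shape, with one state and one arc per letter, can be imitated in plain Python to show why the relation is infinite even though the machine is tiny. The sketch below is an assumption-laden illustration: the alphabet is taken to be the 26 ASCII lower-case letters, and `arcs` and `apply_down` are hypothetical names.

```python
import string

# One arc per lower-case letter: upper-side symbol -> lower-side symbol (a:A, b:B, ...).
arcs = {letter: letter.upper() for letter in string.ascii_lowercase}


def apply_down(word):
    """Follow one arc per input symbol; return None if some symbol has no arc."""
    output = []
    for symbol in word:
        if symbol not in arcs:
            return None  # the network does not accept this string
        output.append(arcs[symbol])
    return "".join(output)


print(apply_down("dog"))
```

The same looping state handles input of any length, which is how a finite network encodes infinitely many string pairs.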

Alternation Rules

We will write finite-state replace rules to describe alternations between abstract morphophonemic words and well-formed surface words.

These replace rules look a lot like traditional “rewrite rules”

b -> p / _ [ t | s ]

These rules compile into finite-state transducers (relations) that can be used to compute these mappings.

Typically the upper language of a rule FST is the Universal Language, the set of all possible strings.

Typically the lower language is like the upper language, except for the alternations controlled by the rule.

Strings that don’t match the rule are mapped unchanged.
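The downward effect of a rewrite rule like b -> p / _ [ t | s ] can be approximated with Python's `re` module, using a lookahead for the right context. This only mimics the input-to-output mapping for this one simple rule; it does not build a transducer, and the function name `devoice_b` is hypothetical.

```python
import re


def devoice_b(word):
    """Rewrite b as p when the next symbol is t or s; leave everything else alone."""
    # (?=[ts]) checks the right context without consuming it.
    return re.sub(r"b(?=[ts])", "p", word)


for test_word in ["abs", "obtain", "cab"]:
    print(test_word, "->", devoice_b(test_word))
```

Words with no b before t or s come out unchanged, which mirrors the point that strings not matching the rule are mapped to themselves.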

A Bit of History

C. Douglas Johnson (1972), “Formal Aspects of Phonological Description”

Kaplan & Kay (1981), “Phonological Rules and Finite-State Transducers”

K. Koskenniemi (1983) “Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production”

1980 to 2004 Xerox work on finite-state algorithms, programming languages

Other finite-state implementations:

AT&T/Lucent Library and Lextools

Fsa Utils 6 (Groningen Univ.)

Daciuk’s Utils (related to Paris VII work)

Jacaranda

Grail

Rule Application = Composition

Composition is an operation that merges two transducers “vertically”. Let X be a transducer that contains the single ordered pair <“dog”, “chien”>. Let Y be a transducer that contains the single ordered pair <“chien”, “Hund”>. The composition of X over Y, notated X .o. Y, is the relation that contains the ordered pair <“dog”, “Hund”>.

Composition merges any two transducers “vertically”. If the shared middle level has a non-empty intersection, then the result will be a non-empty relation.

Rule application is done via composition.
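For finite relations, composition can be sketched in plain Python as a join on the shared middle string. This is an illustration of the idea only, not the xfst algorithm; `compose`, `X` and `Y` are hypothetical names standing in for X .o. Y.

```python
def compose(x, y):
    """Pair each upper string of x with each lower string of y that shares a middle string."""
    return {
        (upper, lower)
        for upper, middle in x
        for middle_again, lower in y
        if middle == middle_again
    }


X = {("dog", "chien")}
Y = {("chien", "Hund")}

print(compose(X, Y))
```

If the lower side of the first relation and the upper side of the second share no string, the result is the empty relation.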

Composition is a difficult topic that we will return to many times.

Read the rest of Chapter 1 and do exercises 1.10.3 and 1.10.4

Review: Basic Concepts

Language = a set of strings/words

Regular Language = a set of strings/words that can be generated using concatenation, union, iteration and similar operations

Finite-State Machine (FSM) or Finite-State Automaton = an abstract machine with a finite number of states that accepts/recognizes a regular language

Regular Relation = a mapping between two regular languages

Finite-State Transducer (FST) = a two-level finite-state automaton that maps between two regular languages (performs look-up and generation)

About to Take a Break

Questions about the Big Picture?

Questions about terminology?

Questions about applications?

The Arithmetic Expression Language

Primitive arithmetic expressions (numerals): 1 5 234 12.345 etc.

If X is an arithmetic expression, then (X) is an arithmetic expression

If X is an arithmetic expression, then +X and -X are arithmetic expressions

If X and Y are arithmetic expressions, then X + Y, X - Y, X * Y and X / Y are arithmetic expressions

Operands and operators

Conventions about precedence: 2 + 3 * 5 - 7 * 2

Think of arithmetic expressions as a Language (a set of strings)

Write some examples of strings in the Arithmetic Expression Language

Write some examples of strings that are not in the Language

A valid arithmetic expression has a value (an evaluation)

The Language of Regular Expressions

The “Regular” in Regular Expressions is a technical term. Also called “Rational Expressions”.

Another kind of language, but with different operands and operators

A regular expression is a compact formula for describing a regular language or regular relation. The regular-expression language is a metalanguage.

Think of regular expressions as the primary “programming language” of xfst. We will have to learn this regular-expression notation.

Each implementation of regular expressions is slightly different (Python, Perl, emacs, …)

We will have to learn the Xerox flavor of regular expressions as used in xfst.

Regular Expressions Denoting a Language

[Diagram: a Regular Expression describes a Regular Language and compiles into a Finite-State Automaton (“acceptor”); the automaton accepts/recognizes the Regular Language.]

Regular Expression Denoting a Relation

[Diagram: a Regular Expression describes a Regular Relation and compiles into a Finite-State Transducer; the transducer maps the Regular Relation.]

Introduction to xfst

xfst is an interface giving access to the finite-state operations (algorithms such as Union, Concatenation, Iteration, Intersection).

xfst includes a powerful and efficient regular-expression compiler.

xfst includes the lookup operation (‘apply up’) and the generation operation (‘apply down’) so that we can test our networks. For small examples, we can also print out all the words in the language using the command ‘print words’.

We have to learn the Xerox regular-expression metalanguage and simple commands to control xfst.

Xerox Regular-Expression Operators I

a a simple symbol

c a t a CONCATENATION of three simple symbols

[ c a t ] grouping brackets

? denotes any single symbol

“[Noun]” or %+Noun or “+Noun”

“[Verb]” or %+Verb or “+Verb”

“[Adj]” or %+Adj or “+Adj”

single symbols with multicharacter names (“multicharacter symbols”)

cat Beware: will be compiled by xfst as a single multicharacter symbol

{cat} explosion brackets: equivalent to [ c a t ]

Xerox Regular Expression Operators II

[] 0 two ways to denote the empty (zero-length) string

Now, where A and B are arbitrarily complex regular expressions:

[A] bracketing; equivalent to A

A | B union

(A) optional; equivalent to [ A | 0 ]

A & B intersection

A B concatenation (N.B. the space between A and B)

A - B subtraction

Xerox Regular-Expression Operators III

A* Kleene star; zero or more iterations of A

A+ Kleene plus; one or more iterations of A

?* The Universal Language

~A The complement of language A; equivalent to [ ?* - A]

~[?*] The empty language (i.e. it contains no strings at all, not even the zero-length string)

%+ or “+”, the literal plus-sign symbol

%* or “*”, the literal asterisk symbol

and similarly for %?, %(, %), %~, etc.

Operators Denoting Relations

A .x. B the “cross-product”; relates every string in A to every string in B, and vice versa; e.g.

[ g o .x. w e n t ] relates “go” and “went”

a:b shorthand for [ a .x. b ]

“[Pl]”:s shorthand for [ “[Pl]” .x. s ]

“[Past]”:{ed} shorthand for [ “[Past]” .x. e d ]

“[Prog]”:{ing} shorthand for [ “[Prog]” .x. i n g ]

Some Useful Abbreviations

$A denotes the language of all strings that contain A; equivalent to [ ?* A ?* ], e.g.

$b denotes the language of all strings that contain a ‘b’ anywhere

A/B denotes the language of all strings in A with strings from B freely interspersed (the B material is “ignored”), e.g.

a*/b contains “a”, “aa”, “aaa”, … “ba”, “ab”, “aba”, ...

\A denotes any single symbol, minus strings in A; i.e. [ ? - A ], e.g.

\b denotes any single symbol, except a ‘b’

Beware: \A is NOT to be confused with

~A the complement of A; i.e. [ ?* - A ]
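The difference between $b, \b and ~b can be made concrete with membership tests in plain Python. This is a sketch under the assumption that every symbol is a single character; the three function names are hypothetical, and Python's `re` syntax is used only as a stand-in for the Xerox notation.

```python
import re


def in_contains_b(s):
    """$b, i.e. [ ?* b ?* ]: any string with a b somewhere in it."""
    return re.fullmatch(r".*b.*", s, flags=re.DOTALL) is not None


def in_term_complement_b(s):
    """\\b, i.e. [ ? - b ]: exactly one symbol, and that symbol is not b."""
    return re.fullmatch(r"[^b]", s) is not None


def in_complement_b(s):
    """~b, i.e. [ ?* - b ]: every string of any length except the string "b"."""
    return s != "b"


for candidate in ["", "a", "b", "aa", "abc"]:
    print(repr(candidate), in_contains_b(candidate),
          in_term_complement_b(candidate), in_complement_b(candidate))
```

Note that "aa" and the empty string belong to ~b but not to \b, which is the confusion the warning above is about.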

Basic xfst interface commands

TerminalPrompt% xfst

xfst> help

xfst> help union net

xfst> exit

xfst> read regex [ d o g | c a t ] ;

xfst> read regex < myfile.regex

xfst> apply up dog

xfst> apply down dog

xfst> pop stack

xfst> clear stack

xfst> save stack myfile.fsm

xfst saves networks in a LIFO stack

Typing

xfst> read regex [ d o g | c a t ] ;

or

xfst> read regex < myfile.regex

causes the compiled network to be “pushed” onto the stack. When you type

xfst> pop stack

the top network is popped off the stack and discarded. When you type

xfst> apply up dog

the top network on the stack is applied in an upward direction (lookup) on the string “dog”, and the related string or strings are printed. When you type

xfst> clear stack

the entire stack is popped and left empty. When you type

xfst> save stack myfile.fsm

the contents of the stack are written in binary (compiled) form to the indicated file.

Setting Variables in xfst

xfst> define Myvar

pops the top network off of the stack and saves it as the value of Myvar, which can be used in subsequent regular expressions

xfst> define Myvar2 [ d o g | c a t ] ;

assigns a value to Myvar2 without modifying the stack. It is equivalent to the two commands

xfst> read regex [ d o g | c a t ] ;

xfst> define Myvar2

xfst> undefine Myvar

undefines Myvar and recycles the memory

Using Variables in Regular Expressions

xfst> define var1 [ b i r d | f r o g | d o g ] ;

xfst> define var2 [ d o g | c a t ] ;

You can now use var1 and var2 in subsequent regular expressions:

xfst> define var3 var1 | var2 ;

xfst> define var4 var1 var2 ;

xfst> define var5 var1 & var2 ;

xfst> define var6 var1 - var2 ;
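Because var1 and var2 denote small finite languages, the four definitions above can be checked by hand against a plain-Python sketch in which each language is a set of strings. This is an analogy only; xfst compiles these expressions into networks rather than enumerating sets.

```python
# Finite languages as Python sets (an analogy for the xfst definitions, not xfst itself).
var1 = {"bird", "frog", "dog"}
var2 = {"dog", "cat"}

var3 = var1 | var2                          # union:          var1 | var2
var4 = {x + y for x in var1 for y in var2}  # concatenation:  var1 var2
var5 = var1 & var2                          # intersection:   var1 & var2
var6 = var1 - var2                          # subtraction:    var1 - var2

for name, language in [("var3", var3), ("var4", var4), ("var5", var5), ("var6", var6)]:
    print(name, sorted(language))
```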

Performing network operations on the stack


xfst> read regex [ m o u s e | r a t ] ;

xfst> read regex [ d e e r | s q u i r r e l ] ;

xfst> union net

‘union net’ will pop its arguments off of the stack one at a time, perform the union operation, and push the result back onto the stack, leaving just one network on the stack. Enter the command ‘words’ to see the resulting language.

The xfst Stack, hard to visualize

Assume that two networks have already been pushed onto the stack.

If we then invoke a stack-based operation like ‘union net’, the xfst algorithm pops its first argument from the top of the stack, then the second argument, and performs the union:

NetA | NetB

NetA

NetB

Remember that the stack is last-in, first-out (LIFO)

Ordered operations like ‘minus net’ and ‘compose net’ are often difficult to get right. For example, assume that we want to compute A - B on the stack. Try this:

xfst> define A [ d o g | c a t | m o u s e | r a t ] ;

xfst> define B [ d o g | m o u s e | e l e p h a n t ] ;

Now push the arguments onto the LIFO stack in the right order and invoke ‘minus net’. If you have a defined variable X, you can push its value onto the stack using

xfst> push X

or

xfst> read regex X ;

What is your answer? Type ‘words’ to see the language of the resulting network.

The xfst Stack and Ordered Operations

To perform NetA - NetB on the stack, the B net must be pushed onto the stack first, then the A net, so that they can be popped off in the reverse order.

When performing operations on the stack, try to visualize the stack itself.

NetA - NetB

NetA

NetB
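The push order described above can be rehearsed with a Python list standing in for the stack. This is a toy model, assuming (as stated above) that the first network popped becomes the left-hand argument; `push`, `minus_net`, `net_a` and `net_b` are hypothetical names, and the languages are small sets chosen for illustration.

```python
stack = []  # the end of the list is the top of the stack


def push(net):
    """Push a network (here: a set of strings) onto the top of the stack."""
    stack.append(net)


def minus_net():
    """Pop the top network, then the next one, and push (first popped) - (second popped)."""
    first = stack.pop()
    second = stack.pop()
    stack.append(first - second)


net_a = {"dog", "cat"}
net_b = {"cat", "bird"}

push(net_b)  # B goes on first ...
push(net_a)  # ... so that A is on top and is popped first
minus_net()

print(stack)
```

Pushing in the opposite order would compute net_b - net_a instead, which is the usual mistake with ordered stack operations.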

Good advice for most: use defined variables and avoid The Stack

A little concatenation example

xfst> define Root [ w a l k | t a l k | w o r k ] ;

xfst> define Prefix [ 0 | r e | o u t ] ;

xfst> define Suffix [ 0 | s | e d | i n g ] ;

xfst> read regex Prefix Root Suffix ;

xfst> words

xfst> apply up walking

Try to get the same result by starting with the same three definitions and then pushing them on the stack, invoking ‘concatenate net’ to perform the concatenation. Remember that concatenation is an ordered operation.

xfst file types

Regex files: contain only a regular expression, terminated with a semicolon and a newline

xfst> read regex < myfile.regex

Binary files: contain an already compiled network or networks, e.g.

xfst> save stack myfile.fsm

xfst> load stack myfile.fsm

Script files: contain a list of xfst commands; run them with ‘source’

xfst> source myfile.script

The Simplest Replace Rules

Replace rules are a very powerful extension to the regular-expression metalanguage. Here is the simplest kind needed for the kaNpat and Portuguese-pronunciation exercises. The arrow -> is typed as a hyphen followed by a right angle-bracket. The || operator consists of two vertical bars typed together. The _ is the underscore.

Rule Schema: upper -> lower

upper -> lower || leftcontext _ rightcontext

E.g.

xfst> read regex s -> z || [ a | e | i | o | u ] _ [ a | e | i | o | u ] ;

xfst> apply down casa

What is this rule intended to do? What comes out?
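The shape of a rule with both a left and a right context can be imitated in plain Python with a lookbehind and a lookahead. This is only a rough analogue of the downward application of the rule above, tried on a different word; `voice_s` and `VOWELS` are hypothetical names.

```python
import re

VOWELS = "aeiou"


def voice_s(word):
    """Rewrite s as z only when it stands between two vowels of the input."""
    # (?<=...) is the left context, (?=...) the right context; neither is consumed.
    pattern = rf"(?<=[{VOWELS}])s(?=[{VOWELS}])"
    return re.sub(pattern, "z", word)


for test_word in ["mesa", "sal", "pasta"]:
    print(test_word, "->", voice_s(test_word))
```

Both contexts are checked against the input string, so a word-initial s or an s next to a consonant is left alone.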

kaNpat example

Assume a language that joins morpheme kaN (with an underspecified nasal N) and morpheme pat into the underlying or morphophonemic form kaNpat. This language then has “alternation” rules that dictate that N, when followed by p, gets realized as m. And p, when preceded by m, gets realized as m. The derivation looks like

Underlying input: kaNpat

Rule1: N -> m || _ p

Output of Rule1: kampat

Rule2: p -> m || m _

Output of Rule2: kammat

The composition operation (.o.) reduces the derivational cascade of transducer networks into a single transducer network.
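The derivation above can be traced in plain Python by writing each rule as a function and chaining them, which plays the role that composition plays for transducers. This is a sketch of the cascade's downward behaviour only; `rule1`, `rule2` and `cascade` are hypothetical names, and nothing here builds an actual network.

```python
import re


def rule1(word):
    """N -> m || _ p : realize N as m when it is followed by p."""
    return re.sub(r"N(?=p)", "m", word)


def rule2(word):
    """p -> m || m _ : realize p as m when it is preceded by m."""
    return re.sub(r"(?<=m)p", "m", word)


def cascade(word):
    """Feed the output of rule1 into rule2, like applying Rule1 .o. Rule2 downward."""
    return rule2(rule1(word))


print(rule1("kaNpat"))
print(cascade("kaNpat"))
```

The order matters: rule2 can only fire on kaNpat after rule1 has created the m that serves as its left context.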

Your first cascade of rules

xfst> define Rule1 N -> m || _ p ;

xfst> define Rule2 p -> m || m _ ;

xfst> read regex Rule1 .o. Rule2 ;

xfst> apply down kaNpat

What is the output?

Now restart (with ‘clear stack’), define the two Rules as shown above, push them on the stack in the right order, and perform the composition on the stack using ‘compose net’. What is your result?

(Remember that the networks must be pushed in the right order.)

Rule Abbreviations

Multiple left-hand sides (several replacements sharing one context), separated by commas:

b -> p, d -> t, g -> k || _ .#.

Multiple right-hand sides (several contexts for one replacement), separated by commas:

e -> i || _ (s) .#. , .#. p _ r

Use .#. to refer to either the very beginning or the very end of a word.
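The first abbreviation, three replacements that share the word-final context _ .#. , can be approximated in plain Python with one pattern anchored at the end of the string. This is a sketch only; `DEVOICE` and `final_devoice` are hypothetical names, and `\Z` is Python's end-of-string anchor standing in for .#. on the right.

```python
import re

# b -> p, d -> t, g -> k, all in the context "_ .#." (end of word).
DEVOICE = {"b": "p", "d": "t", "g": "k"}


def final_devoice(word):
    """Replace a word-final b, d or g with p, t or k respectively."""
    return re.sub(r"[bdg]\Z", lambda match: DEVOICE[match.group()], word)


for test_word in ["hund", "tag", "baden"]:
    print(test_word, "->", final_devoice(test_word))
```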

Typing Accented Letters in emacs

The COMPOSE key is to the right of the space bar.

COMPOSE a " yields ä

COMPOSE a ' yields á

COMPOSE a ` yields à

COMPOSE a ^ yields â

COMPOSE a ~ yields ã

COMPOSE c , yields ç

For doing some of the exercises you will need to find out how to type accented Roman letters using your own text editor.

A Trick for Testing Multiple Words

The exercise will ask you to write a cascade of rules that map orthographical strings to something more like a phonemic notation.

1. Type the test words into a text file, e.g. “wordlist”, one word per line

2. Compile your rules, compose them, and push the result onto The Stack

3. Test all the words using the following syntax:

xfst[1]: apply down < wordlist

Assignment

Read Chapter 2 (The Systematic Introduction) when you can.

For hands-on practice right now start reading Chapter 3, doing the examples as you go along.

Do the kaNpat exercise in section 3.5.3 and the Southern Brazilian Portuguese exercise in 3.5.4.

Progress to the Bambona (p. 153) and Monish exercises (p. 162) if you get bored.

Don’t Panic

There is a lot to absorb, but the exercises progress step by step.

You are supposed to review the same material in the book.

You are encouraged to ask questions.

You can help each other.

This is not a contest.

Tuesday morning is almost all review.
