Tutorial - I
2nd September 2005
Problem 1: N-grams
Let C be a natural language corpus consisting of N tokens and V types $w_1, w_2, \ldots, w_V$. Let $p_i$ be the unigram probability of $w_i$ estimated from C. It is also given that $\forall i, j:\ i < j \Rightarrow p_i \ge p_j$.
a. Give an estimate for pi in terms of N, V, and i.
b. An artificial corpus C1 was generated stochastically on the basis of the unigram probabilities pi. Estimate the bigram probabilities pij = P(wi wj) for C1 in terms of N, V, i & j. [Hint: Use the expression for pi derived above]
c. Show that the bigram distribution of C1 does not follow Zipf’s law perfectly. For this, use the estimated expression for pij derived in (b).
d. It is known that natural languages exhibit a Zipfian distribution over n-grams for all n. Can you use this fact to show that the bigram characteristics of C1 are different from those of C?
e. Prove the generalization of (d), i.e. “for any finite n, a stochastically generated corpus Cn based on the n-gram estimates of C has different (n+1)-gram characteristics from C”. What can you infer from this about n-gram models for natural languages?
Problem 2: Problematic AND!
Given below is a toy grammar G for English.
S   → NP VP | S CNJ S
VP  → V NP | V S | VP CNJ VP
NP  → NP CNJ NP | N
CNJ → and
N   → John | Mary
V   → liked | said
a. Show that the sentence “John liked Mary and Mary liked John” is ambiguous for G. Point out the parse(s) that you think is/are semantically correct.
b. The sentence “John said John and Mary liked John” has the same structure as that of (a). Is the semantically valid parse for (a) also meaningful for (b)? Why or why not?
c. The ambiguity arises because and can connect noun and verb phrases as well as clauses. Can you suggest a method to resolve this (at least partially) by
i. Verb sub-categorization
ii. By introducing new POS categories (not for verbs) and augmenting G accordingly. [Assume that POS tagging is a step before parsing and the process is perfect]
Problem 3: Geo-Morph
Consider the following pairs of names of geographical locations and the corresponding terms for their dwellers. Let us call this system of morphology Geo-Morph.
Geo-root   Dweller     Geo-root   Dweller
Assam      Assamese    Georgia    Georgian
Burma      Burmese     Holland    Dutch
China      Chinese     India      Indian
Denmark    Danish      Japan      Japanese
Egypt      Egyptian    Korea      Korean
France     French      London     Londoner
a. Classify Geo-Morph as derivational/inflectional and linear/non-linear system of morphology.
b. Identify the set of affixes. Classify the examples as regular and irregular cases. Classify the regular cases further by the affixes.
c. Identify the different morphological paradigms. Can you classify the Geo-roots based on their graphemic/phonemic structure into these paradigms?
d. Design rewrite rules to capture orthographic changes for these paradigms.
e. Predict the dweller terms for the following Geo-roots based on the morphological system developed with the help of the paradigms and the rewrite rules (c-d). Which of them do you think are used in standard English?
o Sweden
o Oman
o Libya
o Vienna
o Europe
SOLUTIONS
Solution 1(a): N-grams
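a) The condition $\forall i, j:\ i < j \Rightarrow p_i \ge p_j$ implies that the $w_i$ are sorted in descending order of unigram probability, i.e. of frequency. In other words, the rank (by frequency) of $w_i$ is $i$.
According to Zipf’s law, frequency × rank = constant. The frequency of $w_i$ is $N p_i$ and its rank is $i$, so
$$
\begin{aligned}
i \cdot N p_i &= k \quad \text{(some constant)} \\
p_i &= \frac{k}{N i} \quad \text{(I)} \\
\sum_{i=1}^{V} p_i &= \frac{k}{N} \sum_{i=1}^{V} \frac{1}{i} = 1 \\
k &\approx \frac{N}{\ln V} \quad \text{(since } \textstyle\sum_{i=1}^{V} 1/i \approx \ln V \text{)} \\
p_i &\approx \frac{1}{i \ln V} \quad \text{(from I)}
\end{aligned}
$$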
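As a quick numerical check of the approximation used in 1(a), the following minimal Python sketch compares the exactly normalized Zipfian estimate $p_i = 1/(i H_V)$, with $H_V = \sum_{i=1}^{V} 1/i$, against the approximation $p_i \approx 1/(i \ln V)$. The function name `zipf_estimates` and the vocabulary size of 10,000 are illustrative assumptions, not part of the tutorial.

```python
import math

def zipf_estimates(V):
    """Return (exact, approx) Zipfian unigram probabilities for ranks 1..V."""
    harmonic = sum(1.0 / i for i in range(1, V + 1))  # H_V, the exact normalizer
    exact = [1.0 / (i * harmonic) for i in range(1, V + 1)]
    approx = [1.0 / (i * math.log(V)) for i in range(1, V + 1)]  # uses H_V ~ ln V
    return exact, approx

exact, approx = zipf_estimates(10_000)  # hypothetical vocabulary size
print(sum(exact))   # sums to 1 (up to rounding) by construction
print(sum(approx))  # slightly above 1, because H_V > ln V
```

The approximate probabilities overshoot a total of 1 by the factor $H_V / \ln V$, which shrinks only slowly as V grows.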
Solution 1(b): N-grams
b) Since C1 was generated stochastically based on the unigram probabilities only, any two consecutive tokens $t_s$ and $t_{s+1}$ in C1 were generated independently of each other. In other words, the events $t_s = w_i$ and $t_{s+1} = w_j$ are independent.
Therefore,
$$
\begin{aligned}
p_{ij} &= P(t_s = w_i \wedge t_{s+1} = w_j) \\
&= P(t_s = w_i)\, P(t_{s+1} = w_j) \\
&= p_i\, p_j \\
&\approx \frac{1}{ij \ln^2 V}
\end{aligned}
$$
Solution 1(c): N-grams
c) If the bigram distribution of C1 is to follow Zipf’s law, then bigram probability × bigram rank = constant (say $k'$).
We know that $p_{ij} \approx 1/(ij \ln^2 V)$. Therefore, the first few bigram probabilities in order of rank are
$p_{1,1},\ p_{1,2},\ p_{2,1},\ p_{3,1},\ p_{1,3},\ p_{4,1},\ \ldots$
From rank 1, $k' = p_{1,1} \times 1 = 1/\ln^2 V$. But then, at ranks 3, 4 and 5 respectively,
$p_{2,1} = 1/(2 \ln^2 V) \ne 1/(3 \ln^2 V)$
$p_{3,1} = 1/(3 \ln^2 V) \ne 1/(4 \ln^2 V)$
$p_{1,3} = 1/(3 \ln^2 V) \ne 1/(5 \ln^2 V)$
Thus, it does not follow Zipf’s law (or even Mandelbrot’s law).
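The argument in (c) can also be checked mechanically. The sketch below sorts all bigram probabilities $p_{ij} \propto 1/(ij)$ in descending order and multiplies each by its rank; the common factor $1/\ln^2 V$ is dropped, so under Zipf's law every product would equal 1. The helper name `rank_times_probability` and the vocabulary size of 50 are assumptions made only for illustration.

```python
from fractions import Fraction

def rank_times_probability(V, top=5):
    """Rank x (scaled) bigram probability for the `top` most probable bigrams.

    Probabilities are scaled by ln^2(V), i.e. p_ij is represented as 1/(i*j),
    and exact fractions are used to avoid floating-point noise.
    """
    weights = sorted(
        (Fraction(1, i * j) for i in range(1, V + 1) for j in range(1, V + 1)),
        reverse=True,
    )
    return [rank * w for rank, w in enumerate(weights[:top], start=1)]

products = rank_times_probability(50)  # hypothetical vocabulary size
print([str(p) for p in products])      # ['1', '1', '3/2', '4/3', '5/3']
```

The products are not constant, which is the same conclusion reached above from ranks 3, 4 and 5.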
Solution 1(d): N-grams
d) It follows from (c) that the bigram distribution of C1 does not follow Zipf’s law, whereas that of C does. Therefore, the bigram characteristics of the two corpora must be different.
We know that for C1, $p_{ij} \approx 1/(ij \ln^2 V)$.
However, just as in (a), we can estimate the bigram distribution of C from the Zipfian assumption. There are $V^2$ possible bigrams. Let $b_r$ be the probability of the bigram of rank $r$. Normalizing as in (a), with $\sum_{r=1}^{V^2} 1/r \approx \ln V^2 = 2 \ln V$, we get
$b_r \approx 1/(2 r \ln V)$
But this estimate may be quite erroneous. Why?
Solution 1(e): N-grams
e) Hint: Assume Zipf’s law for n-grams. Estimate the (n+1)-gram probabilities from the n-grams (as a product of two n-gram probabilities). Now show that the (n+1)-grams do not follow Zipf's law.
Try to prove the following (more general) results:
o Mandelbrot’s law, a generalization of Zipf’s law, says $(\text{frequency} + \rho) \times \text{rank}^{\alpha} = \text{constant}$. Prove (c), (d) and (e) when the distribution follows Mandelbrot’s law rather than Zipf’s law.
o For any finite-length corpus (i.e. when N is finite), we cannot have n-gram distributions that follow Mandelbrot’s law perfectly.
Solution 2(a): Problematic AND!
John liked Mary and Mary liked John
N V N CNJ N V N
NP V NP CNJ NP V NP
NP VP CNJ NP VP
S CNJ S
S
PARSE 1
Solution 2(a): Problematic AND!
John liked Mary and Mary liked John
N V N CNJ N V N
NP V NP CNJ NP V NP
NP V NP V NP
NP V NP VP
NP V S
NP VP
S
PARSE 2
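The two parses above can be confirmed by counting derivations mechanically. The following is a minimal sketch of a memoized span-based parse counter for the toy grammar G; the names `GRAMMAR`, `LEXICON` and `count_parses` are hypothetical and introduced only for this illustration. Every right-hand-side symbol must cover at least one word, which is what keeps the left-recursive rules from looping.

```python
from functools import lru_cache

# Toy grammar G: non-terminal -> list of right-hand sides.
GRAMMAR = {
    "S": [("NP", "VP"), ("S", "CNJ", "S")],
    "VP": [("V", "NP"), ("V", "S"), ("VP", "CNJ", "VP")],
    "NP": [("NP", "CNJ", "NP"), ("N",)],
}
LEXICON = {"and": "CNJ", "John": "N", "Mary": "N", "liked": "V", "said": "V"}

def count_parses(sentence, grammar=GRAMMAR, lexicon=LEXICON, start="S"):
    """Count the distinct parse trees of `sentence` rooted at `start`."""
    tags = [lexicon[word] for word in sentence.split()]

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # A POS tag matches exactly one word carrying that tag.
        if symbol not in grammar:
            return int(j == i + 1 and tags[i] == symbol)
        return sum(count_seq(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        # Number of ways the symbol sequence `rhs` can cover words i..j-1.
        if len(rhs) == 1:
            return count(rhs[0], i, j)
        total = 0
        # Leave at least one word for each remaining symbol.
        for k in range(i + 1, j - len(rhs) + 2):
            left = count(rhs[0], i, k)
            if left:
                total += left * count_seq(rhs[1:], k, j)
        return total

    return count(start, 0, len(tags))

print(count_parses("John liked Mary and Mary liked John"))  # 2
print(count_parses("John said John and Mary liked John"))   # 2
```

Both sentences receive exactly two parses under G, because G does not distinguish between the verbs or between phrasal and clausal uses of "and".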
Solution 2(b): Problematic AND!
John said John and Mary liked John
N V N CNJ N V N
NP V NP CNJ NP V NP
NP VP CNJ NP VP
S CNJ S
S
PARSE 1
Solution 2(b): Problematic AND!
John said John and Mary liked John
N V N CNJ N V N
NP V NP CNJ NP V NP
NP V NP V NP
NP V NP VP
NP V S
NP VP
S
PARSE 2
Solution 2(c): Problematic AND!
Verb sub-categorization: The verbs liked and said belong to subcategories 1 and 2 respectively, where
VP → V NP [for V in 1]
VP → V S [for V in 2]
POS category augmentation: Break CNJ into two categories, CNJP and CNJC, for phrasal and clausal conjunctions respectively. The grammar G is augmented as:
Solution 2(c): Problematic AND!
The new G for English.
S    → NP VP | S CNJC S
VP   → V NP | V S | VP CNJP VP
NP   → NP CNJP NP | N
CNJC → and
CNJP → and
N    → John | Mary
V    → liked | said
Solution 2(c): Problematic AND!
John liked Mary and Mary liked John
N V N CNJC N V N
NP V NP CNJC NP V NP
NP VP CNJC NP VP
S CNJC S
S
Parsing using the new grammar
Solution 2(c): Problematic AND!
John said John and Mary liked John
N V N CNJP N V N
NP V NP CNJP NP V NP
NP V NP V NP
NP V NP VP
NP V S
NP VP
S
Parsing using the new grammar
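Solution 2(c): Problematic AND!
John said John and Mary liked John
N V N CNJP N V N
NP V NP CNJP NP V NP
NP VP CNJP NP VP
S CNJP S
No rule reduces S CNJP S (the only clausal rule is S → S CNJC S), so this derivation fails. The sentence cannot be parsed otherwise.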
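Both remedies can be tried out on pre-tagged input, in line with the assumption that POS tagging is a perfect step before parsing. The sketch below counts parses over tag sequences for two variants of G: one with verb sub-categories (the tag names `V1` and `V2` are hypothetical labels for subcategories 1 and 2) and one with the CNJC/CNJP split. The names `SUBCAT_G`, `SPLIT_G` and `count_parses` are likewise assumptions made for illustration.

```python
from functools import lru_cache

# Variant 1: verb sub-categorization (V1 takes an NP, V2 takes a clause).
SUBCAT_G = {
    "S": [("NP", "VP"), ("S", "CNJ", "S")],
    "VP": [("V1", "NP"), ("V2", "S"), ("VP", "CNJ", "VP")],
    "NP": [("NP", "CNJ", "NP"), ("N",)],
}
# Variant 2: clausal (CNJC) versus phrasal (CNJP) conjunction tags.
SPLIT_G = {
    "S": [("NP", "VP"), ("S", "CNJC", "S")],
    "VP": [("V", "NP"), ("V", "S"), ("VP", "CNJP", "VP")],
    "NP": [("NP", "CNJP", "NP"), ("N",)],
}

def count_parses(tag_string, grammar, start="S"):
    """Count parse trees over a space-separated sequence of POS tags."""
    tags = tag_string.split()

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        if symbol not in grammar:  # POS tag: must match a single position
            return int(j == i + 1 and tags[i] == symbol)
        return sum(count_seq(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        if len(rhs) == 1:
            return count(rhs[0], i, j)
        return sum(
            count(rhs[0], i, k) * count_seq(rhs[1:], k, j)
            for k in range(i + 1, j - len(rhs) + 2)
        )

    return count(start, 0, len(tags))

# (a) John liked Mary and Mary liked John; (b) John said John and Mary liked John
print(count_parses("N V1 N CNJ N V1 N", SUBCAT_G))  # (a): 1
print(count_parses("N V2 N CNJ N V1 N", SUBCAT_G))  # (b): 1
print(count_parses("N V N CNJC N V N", SPLIT_G))    # (a): 1
print(count_parses("N V N CNJP N V N", SPLIT_G))    # (b): 1
```

Under either variant each sentence is left with a single parse, in contrast to the two parses that the original G assigns to both sentences.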
Solution (3ab): Geo-Morph
Derivational and linear. Irregulars are marked with *. Affixes: n, ese.
Nation     Dweller     Nation     Dweller
Assam      Assamese    Georgia    Georgian
Burma      Burmese     Holland    Dutch*
China      Chinese     India      Indian
Denmark    Danish*     Japan      Japanese
Egypt      Egyptian    Korea      Korean
France     French*     London     Londoner*
Solution (3cd): Geo-Morph
c. Based on the endings of the roots, we might try to classify them into 4 paradigms [C: consonants excluding y, V: vowels including y]:
o CVa, [V/a]CC* take n
o Ca, aC take ese
d. The rewrite rules:
o n → ian / C ^ _ $ (Egypt^n → Egyptian)
o a → Φ / C _ ^ ese (China^ese → Chinese, etc.)
Solution (3e): Geo-Morph
Root     Paradigm    Suffix concatenation   After rewrite   Standard form
Sweden   [V/a]CC*    Sweden^n               Swedenian       Swedish
Oman     aC          Oman^ese               Omanese         Omani?
Libya    CVa         Libya^n                Libyan          Libyan
Vienna   Ca          Vienna^ese             Viennese        Viennese
Europe   ***         ***                    ***             European
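The paradigm classification and rewrite rules of (c-d) can be sketched in a few lines of Python. The regular expressions below are one possible reading of the paradigm patterns, taking [V/a] to mean "a vowel other than a" and CC* to mean "one or more consonants"; that reading, and the names `PARADIGMS` and `dweller`, are assumptions rather than part of the tutorial.

```python
import re

C = "[b-df-hj-np-tv-xz]"  # consonants, excluding y
V = "[aeiouy]"            # vowels, including y
V_NOT_A = "[eiouy]"       # assumed reading of [V/a]

# (paradigm name, pattern on the end of the root, suffix)
PARADIGMS = [
    ("CVa", re.compile(C + V + "a$"), "n"),
    ("[V/a]CC*", re.compile(V_NOT_A + C + "+$"), "n"),
    ("Ca", re.compile(C + "a$"), "ese"),
    ("aC", re.compile("a" + C + "$"), "ese"),
]

def dweller(root):
    """Predict the dweller term for a Geo-root, or None if no paradigm fits."""
    low = root.lower()
    for _name, pattern, suffix in PARADIGMS:
        if pattern.search(low):
            break
    else:
        return None  # e.g. Europe: outside all four paradigms
    if suffix == "n" and re.search(C + "$", low):
        return root + "ian"        # n -> ian / C ^ _ $
    if suffix == "ese" and re.search(C + "a$", low):
        return root[:-1] + "ese"   # a -> (deleted) / C _ ^ ese
    return root + suffix

for name in ["Sweden", "Oman", "Libya", "Vienna", "Europe"]:
    print(name, dweller(name))
```

The sketch reproduces the "After rewrite" column, including the non-standard forms Swedenian and Omanese. It also overgenerates on the irregular roots (London, for example, falls into the [V/a]CC* paradigm), so the irregular forms would still need to be listed as exceptions.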
A Problem to Ponder
Try to design a complete set of morphological rules for English Geo-Morph. How many affixes, paradigms and exceptions do you expect? Is it possible to classify the Geo-roots based solely on their graphemic/phonemic forms?