1/32 Assignments Basic idea is to choose a topic of your own, or to take a study found in the...


1/32

Assignments

• Basic idea is to choose a topic of your own, or to take a study found in the literature

• Report is in two parts
  – Description of problem and Review of relevant literature (not just the study you are going to replicate, but related things too)
  – Description and discussion of your own results

• First part (1000-1500 words) due in Friday 25 April

• Second part (1500-2000 words) due in Friday 9 May

• No overlap allowed with LELA30122 projects
  – Though you are free to use that list of topics for inspiration
  – See LELA30122 WebCT page, “project report”


Church et al. 1991

K Church, W Gale, P Hanks, D Hindle (1991) Using Statistics in Lexical Analysis, in U Zernik (ed) Lexical Acquisition: Exploiting on-line resources to build a lexicon. Hillsdale NJ (1991): Lawrence Erlbaum, pp. 115-164.


3/32

Background

• Corpora were becoming more widespread and bigger

• Computers becoming more powerful

• But tools for handling them still relatively primitive

• Use of corpora for lexicology

• Written for the First International Workshop on Lexical Acquisition, Detroit 1989

• In fact there was no “Second IWLA”

• But this paper (and others in the collection) became much cited and well known


4/32

The problem

• Assuming a lexicographer has at their disposal a reference corpus of considerable size, …

• A typical concordance listing only works well with
  – words with just two or three major sense divisions
  – preferably well distinct
  – and generating only a pageful of hits

• Even then, the information you may be interested in may not be in the immediate vicinity


5/32


6/32

The solution

• Information Retrieval faces a comparable problem (overwhelming data), and suggests a solution

1. Choose an appropriate statistic to highlight information “hidden” in the corpus

2. Preprocess the corpus to highlight properties of interest

3. Select an appropriate unit of text to constrain the information extracted


7/32

Mutual Information

• MI: a measure of similarity

• Compares the joint probability of observing two words together with the probabilities of observing them independently (chance)

$$I(x;y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}$$

$$P(x) = \frac{f(x)}{N}$$

• If there is a genuine association, I(x;y) >> 0

• If no association, P(x,y) ≈ P(x)P(y), I(x;y) ≈ 0

• If complementary distribution, I(x;y) << 0
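A minimal sketch of the MI calculation from raw counts, assuming probabilities are estimated as relative frequencies, P(x) = f(x)/N and P(x,y) = f(x,y)/N. The function name and the toy counts are illustrative only and are not taken from the AP corpus.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """Return I(x;y) = log2( P(x,y) / (P(x) * P(y)) ) from raw counts.

    f_xy: number of times x and y are observed together
    f_x, f_y: individual word frequencies
    n: corpus size in words
    """
    if f_xy == 0:
        # log2(0) is undefined; a zero joint count gives no finite score.
        return float("-inf")
    # (f_xy / n) / ((f_x / n) * (f_y / n)) simplifies to f_xy * n / (f_x * f_y)
    return math.log2(f_xy * n / (f_x * f_y))


# Toy counts (hypothetical): the pair occurs 8 times as often as chance predicts.
score = mutual_information(f_xy=8, f_x=100, f_y=100, n=10_000)
print(score)
```

A score well above 0 suggests association; a score near 0 is what independence would produce.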


8/32

Top ten scoring pairs of strong y and powerful y

Data from AP corpus, N = 44.3m words


9/32

Mutual Information

• Can be used to demonstrate a strong association

• Counts can be based on immediate neighbourhood, as in previous slide, or on co-occurrence within a window (to left or right or both), or within same sentence, paragraph, etc.

• MI shows strongly associated word pairs, but cannot show the difference between, eg strong and powerful
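The choice between immediate neighbourhood and a wider window can be sketched as follows. This is only an illustration: the function name, the right-hand-only window, and the toy sentence are assumptions, not the procedure used in the paper.

```python
from collections import Counter


def window_cooccurrences(tokens, window=5):
    """Count ordered pairs (x, y) where y occurs within `window` tokens to the right of x."""
    pair_counts = Counter()
    for i, x in enumerate(tokens):
        # Slice covers the next `window` tokens after position i.
        for y in tokens[i + 1 : i + 1 + window]:
            pair_counts[(x, y)] += 1
    return pair_counts


# Toy text (hypothetical)
tokens = "strong tea and strong support for a powerful car".split()

adjacent = window_cooccurrences(tokens, window=1)  # immediate right neighbour only
wider = window_cooccurrences(tokens, window=3)     # up to three words to the right
```

With `window=1` only adjacent pairs such as ("strong", "tea") are counted; a wider window also picks up pairs such as ("tea", "support"), at the cost of more noise.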


10/32

t-test

• A measure of dissimilarity

• How to explain relative strength of collocations such as
  – strong tea ~ powerful tea
  – powerful car ~ strong car

• The less usual combination is either rejected, or has a marked contrastive meaning

• Use example of {strong|powerful} support because tea rather infrequent in AP corpus


11/32

{strong|powerful} support

• MI can’t help: very difficult to get value for I(powerful;support)<<0 because of size of corpus
  – Say x and y both occur about 10 times per 1m words in a corpus
  – P(x) = P(y) = $10^{-5}$ and chance P(x)P(y) = $10^{-10}$
  – I(powerful;support)<<0 means P(x,y) << $10^{-10}$
  – ie much less than 1 in 10,000,000,000
  – Hard to say with confidence
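The arithmetic behind this point can be checked directly. The sketch below takes the rates on this slide (10 occurrences per million words each) and the AP corpus size given earlier (44.3m words); the function name is an illustrative assumption.

```python
def expected_chance_cooccurrences(rate_x_per_million, rate_y_per_million, corpus_size):
    """Expected number of co-occurrences if x and y were independent: N * P(x) * P(y)."""
    p_x = rate_x_per_million / 1_000_000
    p_y = rate_y_per_million / 1_000_000
    return corpus_size * p_x * p_y


# Two words at 10 per million in a 44.3m-word corpus
expected = expected_chance_cooccurrences(10, 10, 44_300_000)
print(expected)
```

The chance expectation is far below one occurrence, so seeing the pair zero times is exactly what chance predicts and cannot demonstrate a negative association.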


12/32

Rephrase the question

• Can’t ask “what doesn’t collocate with powerful?”

• Also, can’t show that powerful support is less likely than chance: in fact it isn’t
  – I(powerful;support) = 1.74
  – 3 x greater than chance!

• Try to compare what words are more likely to appear after strong than after powerful

• Show that strong support relatively more likely than powerful support


13/32

t-test

• Null hypothesis (H0)
  – H0 says that there is no significant difference between the scores

• H0 can be rejected if
  – Difference of at least 1.65 sd’s
  – 95% confidence
  – ie the difference is real

$$t = \frac{P(x,y) - P(x)\,P(y)}{\sqrt{\sigma^2(P(x,y)) + \sigma^2(P(x)\,P(y))}} \approx \frac{\dfrac{f(x,y)}{N} - \dfrac{f(x)\,f(y)}{N^2}}{\dfrac{\sqrt{f(x,y)}}{N}}$$
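A minimal sketch of the t-score against chance, assuming the count-based approximation t ≈ (f(x,y)/N − f(x)f(y)/N²) / (√f(x,y)/N) and the 1.65 threshold mentioned on this slide. The function name and the toy counts are hypothetical.

```python
import math


def t_score_vs_chance(f_xy, f_x, f_y, n):
    """Approximate t-score for the pair (x, y) against the chance expectation.

    Multiplying numerator and denominator of
        (f_xy/n - f_x*f_y/n**2) / (sqrt(f_xy)/n)
    by n gives the equivalent form used here.
    """
    if f_xy == 0:
        raise ValueError("the approximation needs at least one observed co-occurrence")
    return (f_xy - f_x * f_y / n) / math.sqrt(f_xy)


# Toy counts (hypothetical): 4 observed co-occurrences where chance predicts 1
t = t_score_vs_chance(f_xy=4, f_x=100, f_y=100, n=10_000)
reject_h0 = abs(t) >= 1.65  # threshold quoted on the slide
print(t, reject_h0)
```

Here the pair is four times more frequent than chance, yet with so few observations the difference is not significant at the stated threshold.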


14/32

t-test

• Comparison of powerful support with chance is not significant

• t = 0.99 (less than 1 sd!)

• But if we compare powerful support with strong support, t = –13

• Strongly suggests there is a difference

$$t \approx \frac{\dfrac{f(x,w)}{N} - \dfrac{f(y,w)}{N}}{\sqrt{\dfrac{f(x,w)}{N^2} + \dfrac{f(y,w)}{N^2}}}$$
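The comparison of two near synonyms before the same word w can be sketched as below, assuming the count-based form t ≈ (f(x,w)/N − f(y,w)/N) / √(f(x,w)/N² + f(y,w)/N²), in which N cancels. The function name and counts are illustrative assumptions, not figures from the paper.

```python
import math


def t_score_between(f_xw, f_yw):
    """Approximate t-score contrasting how often w follows x versus y.

    N cancels out of the formula, leaving (f_xw - f_yw) / sqrt(f_xw + f_yw).
    Negative values mean w is relatively more likely after y than after x.
    """
    if f_xw + f_yw == 0:
        raise ValueError("w must follow at least one of the two words")
    return (f_xw - f_yw) / math.sqrt(f_xw + f_yw)


# Toy counts (hypothetical): w follows x 4 times and y 21 times
t = t_score_between(f_xw=4, f_yw=21)
print(t)
```

Unlike the comparison with chance, this contrast can reach a large magnitude even when one of the two counts is very small.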


15/32

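$$t = \frac{P(w \mid \mathit{strong}) - P(w \mid \mathit{powerful})}{\sqrt{\sigma^2(P(w \mid \mathit{strong})) + \sigma^2(P(w \mid \mathit{powerful}))}}$$

$$\sigma^2(P(w \mid \mathit{strong})) \approx \frac{f(\mathit{strong}, w)}{f(\mathit{strong})^2}, \qquad \sigma^2(P(w \mid \mathit{powerful})) \approx \frac{f(\mathit{powerful}, w)}{f(\mathit{powerful})^2}$$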


16/32

• MI and t-score show different things


17/32

How is this useful?

• Helps lexicographers recognize significant patterns

• Especially useful for learners’ dictionaries to make explicit the difference in distribution between near synonyms

• eg what is the difference between a strong nation and a powerful nation?
  – Strong as in strong defense, strong economy, strong growth
  – Powerful as in powerful posts, powerful figure, powerful presidency


18/32

Taking advantage of POS tags

• Looking at context in terms of POS rather than lexical items may be more informative

• Example, how can we distinguish to as an infinitive marker from to as a preposition?

• Look at words which immediately precede to
  – able to, began to, … vs back to, according to, …

• t-score can show that they have a different distribution
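Collecting the words that precede each tagged use of to might look like the sketch below. The corpus representation (a list of (word, tag) pairs), the tag labels "to" and "in", and the function name are all assumptions made for illustration.

```python
from collections import Counter


def preceding_word_counts(tagged_tokens, target_word, target_tag):
    """Count the words immediately before each occurrence of target_word/target_tag."""
    counts = Counter()
    # Pair every token with its successor and keep the predecessor when the successor matches.
    for (prev_word, _prev_tag), (word, tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if word == target_word and tag == target_tag:
            counts[prev_word] += 1
    return counts


# Toy tagged corpus (hypothetical words and tags)
tagged = [
    ("able", "jj"), ("to", "to"), ("go", "vb"),
    ("back", "rb"), ("to", "in"), ("town", "nn"),
    ("began", "vbd"), ("to", "to"), ("run", "vb"),
    ("able", "jj"), ("to", "to"), ("see", "vb"),
]

before_infinitive = preceding_word_counts(tagged, "to", "to")
before_preposition = preceding_word_counts(tagged, "to", "in")
```

The two counters give the pair of frequencies per preceding word that a t-score comparison would then contrast.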


19/32


20/32

• Similar investigation with subordinate conjunction that (fact that, say that, that the, that he) and demonstrative pronoun that (that of, that is, in that, to that)

• Look at both preceding and following word

• Distribution is so distinctive that this process can help us to spot tagging errors


21/32


22/32

subordinate conjunction                     demonstrative pronoun

t       w that/cs   w that/dt   w           t        w that/cs   w that/dt   w
14.19   227         2           so/cs       -12.25   1           151         of/in


23/32

If your corpus is parsed

• Looking for word sequences can be limiting

• More useful if you can extract things like subjects and objects of verbs

• (Can be done to some extent by specifying POS tags within a window, but that’s very noisy)

• Assuming you can easily extract, eg Ss, Vs, and Os …


24/32

What kinds of things do boats do?


25/32


26/32

What is an appropriate unit of text?

• Mostly we have looked at neighbouring words, or words within a defined context

• Bigger discourse units can also provide useful information

• eg taking entire text as the unit:
  – How do stories that mention food differ from stories that mention water?


27/32


28/32

• More subtle distinctions can be brought out in this way

• What’s the difference between a boat and a ship?

• Notice how immediately neighbouring words won’t necessarily tell much of a story

• But words found in stories that mention boats/ships help to characterize the difference in distribution, and give a clue as to the difference in meaning

• Notice that human lexicographer still has to interpret the data


29/32


30/32

Word-sense disambiguation

• The article also shows how you can distinguish two senses of bank
  – Identify words which occur in the same text as bank and river on the one hand, and bank and money on the other
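Using the whole text as the unit, the counting step can be sketched as below. The function name, the whitespace tokenisation, and the toy texts are assumptions; the resulting pairs of counts would then feed a t-score comparison.

```python
from collections import Counter


def words_in_texts_mentioning(texts, required):
    """For each word, count the texts in which it appears alongside all `required` words."""
    required = set(required)
    counts = Counter()
    for text in texts:
        vocabulary = set(text.split())  # each word counted at most once per text
        if required <= vocabulary:
            counts.update(vocabulary - required)
    return counts


# Toy texts (hypothetical)
texts = [
    "the boat left the river bank",
    "the bank lent money to the company",
    "money from the bank funds the loan",
    "fisherman on the river bank near water",
]

river_side = words_in_texts_mentioning(texts, {"bank", "river"})
money_side = words_in_texts_mentioning(texts, {"bank", "money"})
```

Case is left untouched here, so River and river (or Bank and bank) would be counted as separate words.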


31/32

bank (river) vs bank (money)

t      bank&river  bank&money  w             t       bank&river  bank&money  w
6.63   45          4           river         -15.95  6           467         money
4.90   28          13          River         -10.70  2           199         Bank
4.01   20          13          water         -10.60  0           134         funds
3.57   16          11          feet          -10.46  0           131         billion
3.46   23          39          miles         -10.13  0           124         Washington
3.44   21          32          near          -10.13  0           124         Federal
3.27   12          5           boat          -9.43   0           110         cash
3.06   14          16          south         -9.03   1           134         interest
2.83   8           1           fisherman     -8.79   1           129         financial
2.83   21          49          along         -8.79   0           98          Corp
2.76   11          12          border        -8.38   1           121         loans
2.74   17          35          area          -8.17   0           87          loan
2.72   9           6           village       -7.57   0           77          amount
2.71   7           0           drinking      -7.44   0           75          fund
2.70   16          32          across        -7.38   1           102         William
2.66   9           7           east          -7.31   1           101         company
2.58   7           2           century       -7.25   1           101         account
2.53   10          13          missing       -7.25   0           72          deposits


32/32

Bank vs bank

t       Bank   bank   w               t        Bank   bank   w
35.02   1324   24     Gaza            -36.48   1284   3362   bank
34.03   1301   36     Palestinian     -10.93   900    1161   money
33.60   1316   48     Israeli         -10.43   624    859    federal
33.18   1206   26     Strip           -9.59    586    786    company
32.98   1204   29     Palestinians    -8.47    282    430    accounts
32.68   1339   72     Israel          -8.26    544    693    central
31.56   4116   1284   Bank            -8.21    408    554    cash
31.13   1151   47     occupied        -8.21    675    816    business
30.79   1104   40     Arab            -7.74    546    676    loans
27.97   867    21     territories     -7.54    52     140    robbery