LING/C SC/PSYC 438/538
Lecture 27, Sandiway Fong
Administrivia
• 2nd Reminder: 538 Presentations
  – Send me your choices if you haven’t already
english26.pl
[Background: chapter 12 of JM contains many grammar rules]
• Subject of passive in by-phrase
  – the sandwich was eaten by John
• Questions
  – Did John eat the sandwich?
  – Is the sandwich eaten by John?
  – Was John eating the sandwich?
  – Who ate the sandwich?
  – What did John eat? (do-support)
  – Which sandwich did John eat?
  – Why did John eat the sandwich?
english26.pl
• Displacement rule: was John eating the sandwich?
  – derived from: John was eating the sandwich
  – schema: Aux_x [NP ] [VP Aux_x ] (the auxiliary is fronted and coindexed with its original VP-internal position)
english26.pl
• Yes-no question (without aux inversion)
  – blocked by the {Ending=root} constraint
english26.pl
• Passives and progressives: nested constructions …
[Passive [Progressive] ]
[Progressive [Passive] ]
english26.pl
• Nesting order forced by rule chaining
  – progressive  passive  VP_nptrace
  – passive  VP_nptrace
  – passive  progressive  …
Homework 5: English grammar programming
• Use english26.pl
• Add rules to handle the following sentences
• Part 1: Raising verbs
  – John seems to be happy
  – It seems that John is happy
  – *It seems that John be happy
  – *John seems is happy
• Part 2: PP attachment ambiguity
  – I saw the boy with a telescope
  – is ambiguous between two readings
  – your grammar should produce both parses
• Part 3: Recursion and relative clauses
  – I recognized the man
  – I recognized the man who recognized you
  – I recognized the man who recognized the woman who you recognized
• Explain your parse trees
• Submit your grammar and runs
Why can’t computers use English?
• Context
  – a linguist’s view:
    • a list of examples that are hard for computers to do
  – a computational linguist’s view (mine):
    • these actually aren’t very hard at all... armed with some DCG technology, we can easily write a grammar that makes the distinctions outlined in the pamphlet
    • you could easily write a grammar for these examples
Online parsers: Berkeley parser / Stanford parser, trained on the Penn Treebank
If computers are so smart, why can't they use simple English?
• Consider, for instance, the four letters read; they can be pronounced as either reed or red. How does the machine know in each case which is the correct pronunciation? Suppose it comes across the following sentences:
  (1) The girls will read the paper. (reed)
  (2) The girls have read the paper. (red)
• We might program the machine to pronounce read as reed if it comes right after will, and red if it comes right after have. But then sentences (3) through (5) would cause trouble.
  (3) Will the girls read the paper? (reed)
  (4) Have any men of good will read the paper? (red)
  (5) Have the executors of the will read the paper? (red)
• How can we program the machine to make this come out right?
If computers are so smart, why can't they use simple English?
• (6) Have the girls who will be on vacation next week read the paper yet? (red)
• (7) Please have the girls read the paper. (reed)
• (8) Have the girls read the paper? (red)
• Sentence (6) contains both have and will before read, and both of them are auxiliary verbs. But will modifies be, and have modifies read. In order to match up the verbs with their auxiliaries, the machine needs to know that the girls who will be on vacation next week is a separate phrase inside the sentence.
• In sentence (7), have is not an auxiliary verb at all, but a main verb that means something like ‘cause’ or ‘bring about’. To get the pronunciation right, the machine would have to be able to recognize the difference between a command like (7) and the very similar question in (8), which requires the pronunciation red.
Example
• (5) Have the executors of the will read the paper? (red)
Treebanks
• Treebank
  – a corpus of sentences
  – each sentence has been parsed
  – POS tags assigned
  – also labels for phrases
• A treebank
  – is also a grammar
  – we can extract the rules, and also frequency counts
• A consistently labeled treebank
  – might be called a “grammatical theory”
• Most popular treebank
  – Penn Treebank
  – available on CD from UA Library (search the catalog)
  – particularly the Wall Street Journal (WSJ) section
  – 50,000 sentences
  – used for training stochastic context-free grammars (PCFGs)
  – results: around the 90% mark on bracketed precision/recall
  – also contains traces and indices (typically not used for PCFGs)
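The "a treebank is also a grammar" point can be illustrated with a short sketch (my own code, not from the slides; `parse_sexp` and `rules` are hypothetical names): read one Penn-style bracketing, pull out the context-free rules it uses, and count them. Relative frequencies of these counts are what PCFG training estimates.

```python
from collections import Counter
import re

def parse_sexp(text):
    """Parse a Penn-style bracketing into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^()\s]+", text)
    def walk(i):
        assert tokens[i] == "("
        node = [tokens[i + 1]]
        i += 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = walk(i)
                node.append(child)
            else:
                node.append(tokens[i])  # a word
                i += 1
        return node, i + 1
    return walk(0)[0]

def rules(tree, counts):
    """Collect LHS -> RHS rule counts from a parsed tree (skipping POS -> word)."""
    if isinstance(tree, str):
        return
    kids = tree[1:]
    rhs = tuple(k if isinstance(k, str) else k[0] for k in kids)
    if not (len(kids) == 1 and isinstance(kids[0], str)):
        counts[(tree[0], rhs)] += 1
    for k in kids:
        rules(k, counts)

counts = Counter()
rules(parse_sexp("(S (NP (DT the) (NN board)) (VP (VBZ meets)))"), counts)
for (lhs, rhs), n in sorted(counts.items()):
    print(lhs, "->", " ".join(rhs), n)
# NP -> DT NN 1
# S -> NP VP 1
# VP -> VBZ 1
```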
Penn Treebank
• What is in it?
  – (v3) Four parsed sections:
    • one million words of 1989 Wall Street Journal (WSJ) material
    • ATIS-3 sample
    • Switchboard
    • Brown Corpus
• Example: wsj_0001.mrg
  ( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,)
       (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) )
     (VP (MD will)
       (VP (VB join) (NP (DT the) (NN board) )
         (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) ))
         (NP-TMP (NNP Nov.) (CD 29) )))
     (. .) ))
In the NLP literature, “Penn Treebank” usually refers to the WSJ section only
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
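A small observation worth illustrating (my own code, not from the slides): the word/POS leaves of a Penn Treebank bracketing are exactly the innermost `(TAG word)` pairs, so the tokenized sentence can be recovered with one regular expression.

```python
import re

wsj_0001 = """( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,)
  (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) )
  (VP (MD will) (VP (VB join) (NP (DT the) (NN board) )
  (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) ))
  (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))"""

# Each leaf looks like (TAG word), with no nested parentheses inside it.
leaves = re.findall(r"\(([^()\s]+)\s+([^()\s]+)\s*\)", wsj_0001)
words = [word for tag, word in leaves]
print(" ".join(words))
# Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
```

Note the result is tokenized as in the treebank: punctuation is split off as its own token.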
Penn Treebank
• What is in it?
  – Part-of-speech (POS) labels on words, numbers and punctuation using the 48-tag Penn tagset (a simplification of the 1982 Francis & Kučera Brown corpus tagset), e.g. NN, VB, IN, JJ
  – Constituents identified and labeled with syntactic categories, e.g. S, NP, VP, PP
– Additional sublabels to facilitate predicate-argument extraction, e.g. -SBJ, -CLR, -TMP
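As an illustration (my own sketch; `strip_function_tags` is a hypothetical helper, not Penn Treebank software), the sublabels ride on top of the basic category, so stripping everything after the first `-` or `=` recovers the plain syntactic label, as is commonly done before PCFG training:

```python
import re

def strip_function_tags(label):
    """NP-SBJ -> NP, PP-CLR -> PP, WHNP-2 -> WHNP; leave -NONE- alone."""
    if label == "-NONE-":      # the empty-category tag is a special case
        return label
    return re.split(r"[-=]", label, maxsplit=1)[0]

assert strip_function_tags("NP-SBJ") == "NP"
assert strip_function_tags("PP-CLR") == "PP"
assert strip_function_tags("NP-TMP") == "NP"
assert strip_function_tags("WHNP-2") == "WHNP"
```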
Penn Treebank
• WSJ section of the Penn Treebank– has become the standard training corpus and testbed for statistical NLP
• Other Penn treebanks– Arabic, Chinese and Korean
• Other formalisms
  – Combinatory Categorial Grammar (CCG) treebank
  – Dependency grammar
• http://en.wikipedia.org/wiki/Treebank– lists about 50 treebanks in 29 languages
Penn Treebank
• The formalism chosen (sorta) matters
  – Penn Treebank includes empty categories (ECs), including traces
  – CCG has slash categories
  – Dependency grammar-based treebanks don’t; they also don’t have node labels
Penn Treebank
• Example: wsj_0100.mrg
  ( (S (NP-SBJ (NNP Nekoosa) )
     (VP (VBZ has)
       (VP (VBN given) (NP (DT the) (NN offer) )
         (NP (NP (DT a) (JJ public) (JJ cold) (NN shoulder) ) (, ,)
           (NP (NP (DT a) (NN reaction) )
             (SBAR (WHNP-2 (-NONE- 0) )
               (S (NP-SBJ (NNP Mr.) (NNP Hahn) )
                 (VP (VBZ has) (RB n't)
                   (VP (VBN faced) (NP (-NONE- *T*-2) )
                     (PP-LOC (IN in)
                       (NP (NP (PRP$ his) (CD 18) (JJR earlier) (NNS acquisitions) ) (, ,)
                         (SBAR (WHNP-3 (DT all) (WHPP (IN of) (WHNP (WDT which) )))
                           (S (NP-SBJ-1 (-NONE- *T*-3) )
                             (VP (VBD were)
                               (VP (VBN negotiated) (NP (-NONE- *-1) )
                                 (PP-LOC (IN behind) (NP (DT the) (NNS scenes) )))))))))))))))) (. .) ))
Penn Treebank
• The formalism chosen (sorta) matters
  – Penn Treebank includes empty categories, including traces
• It is standard in the statistical NLP literature to first discard all the empty category information
  – both for training and evaluation
  – some exceptions:
    • Collins Model 3
    • post-processing to re-insert ECs
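Discarding the empty-category information can be sketched in a few lines (my own code, illustrating the preprocessing described above; the function names are mine): drop every `(-NONE- ...)` leaf, and then any constituent left with no children.

```python
import re

def parse(text):
    """Parse a Penn-style bracketing into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^()\s]+", text)
    def walk(i):
        node = [tokens[i + 1]]
        i += 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = walk(i)
                node.append(child)
            else:
                node.append(tokens[i])
                i += 1
        return node, i + 1
    return walk(0)[0]

def strip_empty(tree):
    """Return the tree without -NONE- leaves, or None if nothing is left."""
    if isinstance(tree, str):
        return tree
    if tree[0] == "-NONE-":
        return None
    kids = [k for k in (strip_empty(k) for k in tree[1:]) if k is not None]
    return [tree[0]] + kids if kids else None

# Both the WHNP trace (-NONE- 0) and the NP trace *T*-2 disappear,
# and the emptied WHNP-2 node disappears with them:
t = parse("(SBAR (WHNP-2 (-NONE- 0)) (S (NP (NNP Hahn)) (VP (VBD left) (NP (-NONE- *T*-2)))))")
print(strip_empty(t))
# ['SBAR', ['S', ['NP', ['NNP', 'Hahn']], ['VP', ['VBD', 'left']]]]
```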
Penn Treebank
• How is it used?
  – one million words of 1989 Wall Street Journal (WSJ) material
  – nearly 50,000 sentences (49,208), divided into 25 sections (0–24)
  – sections 2–21 contain 39,832 sentences
  – section 23 (2,416 sentences) is held out for evaluation
• Standard practice
  [Figure: sections 0–24 laid out in order, with sections 2–21 marked for training and section 23 marked for evaluation]
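A quick back-of-the-envelope check (mine, not from the slides) on the counts quoted above for the standard split; the remaining sections (0–1, 22, 24) are commonly used for development data.

```python
total_sents = 49_208   # WSJ sentences in the Penn Treebank
train_sents = 39_832   # sections 2-21 (training)
eval_sents = 2_416     # section 23 (held-out evaluation)

print(f"training share:   {train_sents / total_sents:.1%}")   # 80.9%
print(f"evaluation share: {eval_sents / total_sents:.1%}")    # 4.9%
```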
Treebank Software
• Tgrep2
  – by Doug Rohde
  – http://tedlab.mit.edu/~dr/TGrep2/
  – download and install for Linux (pre-compiled; works without compilation on your Linux if you’re lucky)
  – for Mac OS X, just re-compile (you will also need the DRUtils library)
  – described in the textbook
  – works on the command line
• Java package
  – Tregex from Stanford
  – broadly compatible with Tgrep2
  – http://nlp.stanford.edu/software/tregex.shtml
  – jar file (should run on all platforms)
  – has a graphical user interface
  – file run-tregex-gui.bat (batch file for Windows)
  – in that file, set max memory to 500m (or larger) to use with the entire treebank
• Also TIGERsearch
  – http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/
  – Windows explicitly supported