Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

26
EBERHARD-KARLS-UNIVERSITÄT TÜBINGEN SFB 441 Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies Ilona Steiner SFB 441, University of Tübingen Linguistic Evidence, 2-4 February 2006

description

Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies. Ilona Steiner SFB 441, University of Tübingen Linguistic Evidence, 2-4 February 2006. Overview. Previous work on the relationship between parsing preferences and corpus frequencies - PowerPoint PPT Presentation

Transcript of Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

Page 1: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Coordinate Structures: On the Relationship between Parsing Preferences

and Corpus Frequencies

Ilona SteinerSFB 441, University of Tübingen

Linguistic Evidence, 2-4 February 2006

Page 2: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

2

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Overview

Previous work on the relationship between parsing preferences and corpus frequencies

Study by Gibson & Schütze (1999)Study by Mitchell & Brysbaert (1998)Reanalysis by Desmet et al. (2002)

Corpus analysis I: Parallel-structure effectCorpus analysis II: Disambiguation preference NP vs. S coordinationDiscussion

Page 3: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

3

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Tuning hypothesis (Cuetos et al., 1996)

People prefer the most frequently occurring resolution of an ambiguity:

Initial parsing preferences in syntactically ambigous sentences are determined by people's previous exposure to similar structures.

→ Parsing preferences and corpus frequencies should be correlated, i.e., the preferred construction should occur more frequently in corpora than the unpreferred construction

Page 4: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

4

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Study by Gibson & Schütze (1999)

Disambiguation preferences in English noun phrase conjunction: „NP1 Prep NP2 Prep NP3 and NP4“

(1) The talkshow host told [NP1: a joke] about [NP2: a man] with [NP3: an umbrella] and …

Preference in experiments: NP3 > NP1 > NP2 Preference in corpus data: NP3 > NP2 > NP1

→ No correlation

Conclusion: „…the sentence comprehension mechanism is not using corpus frequencies in arriving at its preference in this ambiguity and hence the decision principles of sentence comprehension and sentence production must be partially distinct.“

Page 5: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

5

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Study by Mitchell & Brysbaert (1998)

Attachment preference in Dutch: „NP1 Prep NP2 Relative Clause“

(2) Someone shot [NP1: the servant] of [NP2: the actress] [RC: who was on the balcony]

Preference in experiments: NP1 > NP2

Preference in corpus data: NP2 > NP1

→ No correlation

Were the experimental stimuli representative for the sentences in the corpus?

Page 6: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

6

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Desmet et al. (2002): Reanalysis of Mitchell & Brysbaert's (1998) corpus

„NP1 of NP2 RC“ Experimental stimuli: both NPs referring to humansCorpus data: very few sentences with both NPs

referring to humans

Preference depends on nature of NP1:

NP1 / human: NP1 preference

NP1 / non-human: NP2 preferenceSentence completion studies with corpus sentences confirmed this pattern → correlation!→ If corpus data are controlled carefully to match the experimental sentences, the discrepancy between sentence comprehension and production disappears

Page 7: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

7

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Aims

Relationship between two types of linguistic evidence with respect to coordinate structures in English:

Parsing preferences

Corpus frequencies

Focus on two processing effects that have been found in reading-time studies:

Parallel-structure effect

Disambiguation preference in noun phrase vs. sentence coordination

Comparison to corpus data in English Verbmobil treebank (TüBa-E, Hinrichs et al. (2000))

Page 8: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

8

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

English Verbmobil Treebank (TüBa-E)

Spoken dialogs in the domain of business appointments

Annotated (manually) at the levels of

Morpho-syntax (parts-of-speech categories)

Syntactic phrase structure

Function-argument structure

Unedited naturally occurring dialogs: data reflect mechanisms of sentence production, and not factors that are due to editing processes

Page 9: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

9

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Parallel-structure effect

In a coordinate structure, the second conjunct is read faster when it is structurally similar to the first one. (Frazier, Munn & Clifton, 2000)

(3) Terry wrote [NP: a long novel] and [NP: a short poem].

(4) Terry wrote [NP: a novel] and [NP: a short poem].

→ [NP: a short poem] faster in (3) than in (4)

Preference for structural similarity in coordination also present in corpora?

Page 10: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

10

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Coordination dataset

Analysis of ca. 3000 sentences (CD13) of TüBa-E

Coordination within complete sentences, conjunction: and → 274 occurrences of coordinate structures

For each coordinate structure extraction ofSyntactic category of mother and daughters

Grammatical function of mother

Number of conjuncts

Length of conjuncts (number of words)

Degree of redundancy in conjuncts (0 - 100%)

Syntactic annotation in the treebank used as basis for the extraction process

Page 11: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

11

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Corpus analysis I: Degree of redundancy

Manual inspection of degree of redundancy for each coordinate structure (with 2 conjuncts)

100%: both conjuncts have exactly the same structure including PoS-tags

0%: not a single node is redundant

Page 12: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

12

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Example sentence with 36% redundancy

Page 13: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

13

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Example sentence with 36% redundancy

Page 14: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

14

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Example sentence with 100% redundancy

Page 15: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

15

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Example sentence with 100% redundancy

Page 16: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

16

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Redundancy in coordination dataset

Distribution of redundancy from 0% to 100%

36% of all coordinate structures contained conjuncts with identical structure (100%) How often does structural similarity occur randomly?

010

2030

4050

607080

90100

<10% <30% <50% <70% <90% =100% Degree of redundancy

No.

of o

ccur

renc

es

Page 17: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

17

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Redundancy in coordination dataset

Distribution of redundancy from 0% to 100%

36% of all coordinate structures contained conjuncts with identical structure (100%) How often does structural similarity occur randomly?

010

2030

4050

607080

90100

<10% <30% <50% <70% <90% =100% Degree of redundancy

No.

of o

ccur

renc

es

Page 18: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

18

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Random dataset

For each first conjunct in coordination dataset: extraction of random second 'conjunct' (randomly chosen from the corpus, independent of coordination)

(5) [NP: Twenty second] and [NP: twenty fourth] are pretty bad.“ Randomly chosen phrase: [NP: the first]

Random phrase matches original second conjunct in syntactic category (here: NP)

grammatical function (here: SBJ) length (here: 2 words)

How many occurrences in random dataset are structurally identical?

Page 19: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

19

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Results: Parallel-structure effect

Structural similarity:

13% in random dataset (pairs of original first and random

second conjunct)

36% in coordination dataset

Difference is highly significant (χ²(1) = 72.1; p < 0.001)

→ Structural similarity within coordination is significantly more frequent in corpus data than structural similarity of two phrases independent of coordination.

→ Corpus data match preference for structural similarity during parsing

Page 20: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

20

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Distribution of parallel occurrences

Length effect: the shorter the conjuncts, the more frequently structural similarity occurs (F(1,3) = 6.63; p < 0.05for length 1 vs. 2)

0

10

20

30

40

50

60

70

80

1 2 3 4 5 9

CoordinationRandom

Length of conjuncts

% P

aral

lel c

onju

ncts

Page 21: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

21

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Disambiguation preference: NP vs. S coordination

Reading-time experiments (Frazier, 1979):

(6) a. Peter kissed [NP: Mary] and [NP: her sister] too

b. [S: Peter kissed Mary] and [S: her sister laughed]

Local ambiguity that cannot be resolved prior to the last word

Garden-path effect at “laughed“ in (6b) compared to “too“ in (6a)

Preference to interpret [her sister] as part of a conjoined NP, and not as the beginning of a new sentence

Is the preference for NP coordination (compared to S coordination) also present in corpora?

Page 22: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

22

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Corpus analysis II: NP vs. S coordination

Analysis of CD6 (2554 sentences) and CD13 (2906 sentences) of TüBa-E

Extraction of all occurrences with the form:

“NPSubj …Verb…NPCompl and…“

Continuation with NP and S coordination should both be semantically possible

(7) a. and I will bring [NP: the doughnuts] and [NP: coffee] I guess

b. and [S: Friday I have a nine to ten meeting] and [S: I also have a meeting in the early afternoon]

*c. I hate to mix [NP: business] and [NP: weekends]

Page 23: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

23

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Results: NP vs. S coordination

“NPSubj …Verb…NPCompl and…“

Coordination with NP: 69% (n = 27)

Coordination with S: 31% (n = 12)

[ t(38) = 2.57; p < 0.05 ]

→ Corpus frequencies match preference for NP coordination (compared to S coordination) during parsing

Page 24: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

24

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Discussion 1

Correlation between parsing preferences and corpus frequencies for

Parallel-structure effect

Disambiguation preference NP vs. S coordination

Possible explanations:

“Tuning hypothesis“: statistical pattern in natural language causes development of parsing preferences

Common source for language production and comprehension

Page 25: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

25

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Discussion 2

If there is a common source, structure-based accounts would explain parsing preferences

Parallel-structure effect: Recycling mechanism (Steiner, 2005)

Preference for NP coordination: “Minimal Attachment Principle“ (Frazier, 1979), “Recency Preference“ (Gibson et al., 1996)

Preferred construction is more economical, and is easier and faster to process

Constructions that are easier to understand would also be easier to produce, and are produced more often

→ Common mechanism for language production and comprehension leads to faster reading times during parsing and to higher frequencies during production

Page 26: Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies

26

EB

ER

HA

RD

-KA

RL

S-U

NIV

ER

SIT

ÄT

BIN

GE

N

SF

B 4

41

Discussion 3

With the present study we are not able to differentiate between “tuning hypothesis“ and a possible common source.

We showed that production and comprehension are closer to each other than expected.

If correlation between parsing preferences and corpus frequencies holds in general:

→ corpus data can be used to evaluate (at least qualitatively) models of sentence processing