Homing in on the Text-Initial Cluster



Page 1: Homing in on the Text-Initial Cluster

Homing in on the Text-Initial Cluster

Mike Scott
School of English
University of Liverpool

Aston Corpus Symposium
Friday May 4th 2007

This presentation is at www.lexically.net/downloads/corpus_linguistics

Page 2: Homing in on the Text-Initial Cluster

Starting Questions

1. Are clusters like “Once upon a time” and “lived happily ever after” oddities in marking text position?

2. Or do many n-grams characterise the beginnings, middles or ends of certain kinds of text?

3. If so, are there any common patterns in text-initial clusters?

Page 3: Homing in on the Text-Initial Cluster

Context

Textual Priming Project, University of Liverpool

Michael Hoey
Michaela Mahlberg
Matthew O’Donnell
Mike Scott

Page 4: Homing in on the Text-Initial Cluster

Textual Priming Project: Aims

to investigate how many (and what types of) lexical items are primed to appear in text-initial or paragraph-initial position

to identify lexico-grammatical patterns and see how these patterns can be functionally interpreted in the textual contexts

to relate these lexical and corpus-driven facts to current textual descriptions of (hard) news stories that might provide explanations for the positive primings of relevant lexis

(from O’Donnell et al. 2007)

Page 5: Homing in on the Text-Initial Cluster

Hard News Corpus

“Home News” sections of the Guardian and Observer, 1998 to 2004

115,654 articles, divided thus:

headline & lead
1st sentence of 1st paragraph (TISC)
all other sentences

TISC contains 3.2 million tokens
The rest: 51.2 million tokens
About 470 words per article
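The per-article figure can be checked from the token counts above. This is a minimal sketch that assumes "the rest" covers all text outside the TISC sections, which the slide does not state explicitly.

```python
# Figures quoted on the slide.
ARTICLES = 115_654
TISC_TOKENS = 3_200_000
REST_TOKENS = 51_200_000  # assumption: everything outside the TISC sections


def words_per_article(total_tokens: int, articles: int) -> float:
    """Mean number of tokens per article."""
    return total_tokens / articles


mean_length = words_per_article(TISC_TOKENS + REST_TOKENS, ARTICLES)
print(round(mean_length))  # about 470
```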

Page 6: Homing in on the Text-Initial Cluster

Research Questions

Using the hard news corpus,

1. How many 3-5 word clusters are found to be key in TISC sections?

2. How many are positively and how many are negatively key?

3. What recurrent patterns can be found in the two types of key cluster?

Page 7: Homing in on the Text-Initial Cluster

Methods (1)

1. Format the corpus in XML and separate out all TISC sections (done by Matt O’Donnell)

2. Use WordSmith’s WordList tool to compute wordlist indexes of

   1. all the text

   2. all the TISC sections

3. Using WordList, compute 3-5 word clusters for each index, save as .lst
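Step 3 can be illustrated outside WordSmith with a plain n-gram count. The sketch below is not WordList's implementation: the tokenizer, the function names and the two sample sentences are all invented for illustration, and it assumes that clusters do not cross sentence boundaries.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Upper-case the text and split it into word tokens."""
    return re.findall(r"[A-Z0-9']+", text.upper())


def count_clusters(sentences, min_n: int = 3, max_n: int = 5) -> Counter:
    """Count every contiguous run of min_n..max_n words within each sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(min_n, max_n + 1):
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    return counts


# Invented sample sentences, not taken from the corpus.
sample = [
    "A court heard yesterday that the man had fled.",
    "A court heard yesterday how the fire began.",
]
clusters = count_clusters(sample)
print(clusters["A COURT HEARD YESTERDAY"])
```

Running the same function once over all the text and once over the TISC sections would give the two frequency lists that the later keyness comparison needs.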

Page 8: Homing in on the Text-Initial Cluster

Top clusters, all sections

GUARDIAN CO UK
ONE OF THE
A HREF HTTP, WWW GUARDIAN CO and similar web links
THE PRIME MINISTER
THE END OF
AS WELL AS
THE NUMBER OF
THERE IS A
SOME OF THE
THERE IS NO

Page 9: Homing in on the Text-Initial Cluster

Top clusters, TISC

ONE OF THE
ACCORDING TO A
LAST NIGHT AFTER
FOR THE FIRST
THE FIRST TIME
IS TO BE
FOR THE FIRST TIME
THE MURDER OF
ARE TO BE
THE DEATH OF

OF THE MOST
THE HOME SECRETARY
WAS LAST NIGHT
IT EMERGED YESTERDAY
AS PART OF
AN ATTEMPT TO
THE UNITED STATES
THE NUMBER OF
ONE OF THE MOST
ACCORDING TO THE

Page 10: Homing in on the Text-Initial Cluster

Methods (2)

4. Use KeyWords tool to compute KWs for the TISC 3-5 word clusters using all the text as a reference corpus

5. Identify patterns in the KW clusters
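The keyness comparison in step 4 can be sketched with a log likelihood score. The code below uses one common two-term formulation of the statistic; WordSmith's KeyWords tool may compute it differently, and the function names and example frequencies are hypothetical.

```python
import math


def log_likelihood(freq_study: int, size_study: int,
                   freq_ref: int, size_ref: int) -> float:
    """Two-term log likelihood for one cluster in a study and a reference corpus."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the cluster were spread evenly over both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    score = 0.0
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * score


def signed_keyness(freq_study: int, size_study: int,
                   freq_ref: int, size_ref: int) -> float:
    """Positive when the cluster is relatively more frequent in the study corpus."""
    score = log_likelihood(freq_study, size_study, freq_ref, size_ref)
    over_used = freq_study / size_study >= freq_ref / size_ref
    return score if over_used else -score


# Hypothetical frequencies in corpora of 1,000 and 10,000 tokens.
print(signed_keyness(50, 1_000, 100, 10_000) > 0)  # over-represented
print(signed_keyness(1, 1_000, 100, 10_000) < 0)   # under-represented
```

The sign is a convenience added here so that positively and negatively key clusters can be told apart from a single number.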

Page 11: Homing in on the Text-Initial Cluster

TISC key clusters

ACCORDING TO A
LAST NIGHT AFTER
IT EMERGED YESTERDAY
WAS LAST NIGHT
ARE TO BE
THE MURDER OF
LAST NIGHT WHEN
THE GOVERNMENT YESTERDAY
LAST NIGHT AS
IS TO BE

WERE LAST NIGHT
YESTERDAY AFTER A
TONY BLAIR YESTERDAY
COURT HEARD YESTERDAY
WAS TOLD YESTERDAY
WAS JAILED FOR
THE DEATH OF
YEAR OLD BOY
YESTERDAY WHEN THE
WITH THE MURDER OF

Page 12: Homing in on the Text-Initial Cluster

Numbers of Key Clusters

Page 13: Homing in on the Text-Initial Cluster

RQs 1 & 2: Numbers of KW clusters

using a p value of 0.0000001, a minimum frequency of 3 and the log likelihood statistic,

8,132 key clusters altogether (in 3.2 million words of text)

of which 7,631 were positively key and 501 negatively key

though there is repetition as these are 3-5 word n-grams
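The thresholds above can be applied as a simple filter. This sketch assumes the log likelihood score is referred to a chi-square distribution with one degree of freedom, and that the minimum frequency applies to the cluster's frequency in the TISC sections; neither detail is stated on the slide. Scores are signed here (positive for over-use in TISC), which is also an assumption of the sketch.

```python
import math

P_THRESHOLD = 0.0000001   # p value quoted on the slide
MIN_FREQUENCY = 3         # minimum frequency quoted on the slide


def p_value(score: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(score / 2.0))


def classify(freq_in_tisc: int, signed_score: float) -> str:
    """Label a cluster from its TISC frequency and signed log likelihood score."""
    if freq_in_tisc < MIN_FREQUENCY:
        return "not key"
    if p_value(abs(signed_score)) > P_THRESHOLD:
        return "not key"
    return "positively key" if signed_score > 0 else "negatively key"


# Under these assumptions a score must exceed roughly 28.4 to pass the threshold.
print(classify(10, 40.0))
print(classify(10, -40.0))
print(classify(10, 10.0))
```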

Research Question 2

Page 14: Homing in on the Text-Initial Cluster

Repetition

YESTERDAY FOUND GUILTY
YESTERDAY FOUND GUILTY OF
YESTERDAY FROM A
YESTERDAY FROM THE
YESTERDAY GAVE A
YESTERDAY GAVE HIS
YESTERDAY GAVE THE
YESTERDAY GIVEN A
YESTERDAY GIVEN THE
YESTERDAY GIVEN THE GO
YESTERDAY GIVEN THE GO AHEAD
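The repetition arises because a 4- or 5-word cluster also contributes its shorter sub-clusters to the list. One way to gauge its extent, offered here only as an illustration and not as a step the study reports, is to keep just the clusters that are not contained in a longer one. The function names are hypothetical.

```python
def contains(longer: list[str], shorter: list[str]) -> bool:
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def maximal_clusters(clusters: list[str]) -> list[str]:
    """Keep only clusters that are not part of a longer cluster in the list."""
    split = [cluster.split() for cluster in clusters]
    kept = []
    for words in split:
        nested = any(
            len(other) > len(words) and contains(other, words) for other in split
        )
        if not nested:
            kept.append(" ".join(words))
    return kept


key_clusters = [
    "YESTERDAY FOUND GUILTY",
    "YESTERDAY FOUND GUILTY OF",
    "YESTERDAY GIVEN A",
    "YESTERDAY GIVEN THE",
    "YESTERDAY GIVEN THE GO",
    "YESTERDAY GIVEN THE GO AHEAD",
]
print(maximal_clusters(key_clusters))
```

Collapsing in this way discards information, since a shorter cluster can be key in its own right with a different frequency from the longer cluster that contains it.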

Page 15: Homing in on the Text-Initial Cluster

Negatively key:

A LOT OF
A SPOKESMAN FOR
THERE IS NO
HE SAID THE
SAID IT WAS
THERE IS A
THIS IS A
THE FACT THAT
AS WELL AS
IT WOULD BE

SPOKESMAN FOR THE
PER CENT OF
WE HAVE TO
SAID THAT THE
BUT IT IS
AT A TIME
A SPOKESMAN FOR THE
SAID HE WAS
IT IS NOT
THERE WAS NO

Page 16: Homing in on the Text-Initial Cluster

RQ 1: Numbers of KW clusters

Is 8 thousand a large number of distinct key text-initial clusters?

In the same amount of text there are 84 thousand 3-5 word clusters of frequency at least 5 altogether…

about one in 10 is associated with text initial position at the .0000001 level of significance

Page 17: Homing in on the Text-Initial Cluster

RQ 1, continued

… is 1 in 10 a large number to be key?

In the case of SISC (sentences from paragraphs with only one sentence in), we get 507 thousand clusters, of which 2,192 are key (1,747 positively and 445 negatively), which is about 1 in 230
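The two proportions can be checked directly from the figures quoted on the slides. Note that the 84 thousand total is given for a minimum frequency of 5 while the key clusters were computed with a minimum frequency of 3, so the TISC ratio is indicative only.

```python
def one_in(total_clusters: int, key_clusters: int) -> float:
    """Return x such that about 1 cluster in x is key."""
    return total_clusters / key_clusters


tisc = one_in(84_000, 8_132)    # TISC: all clusters vs key clusters
sisc = one_in(507_000, 2_192)   # SISC: all clusters vs key clusters

print(f"TISC: about 1 in {tisc:.0f}")
print(f"SISC: about 1 in {sisc:.0f}")
```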

Page 18: Homing in on the Text-Initial Cluster

PATTERNS

Page 19: Homing in on the Text-Initial Cluster

RQ 3: patterns

recency: in the top 200, seventy express time, generally using yesterday or last night

Page 20: Homing in on the Text-Initial Cluster

Recency clusters

COURT HEARD YESTERDAY
TONY BLAIR YESTERDAY
YESTERDAY AFTER A
WERE LAST NIGHT
LAST NIGHT AS
THE GOVERNMENT YESTERDAY
LAST NIGHT WHEN
WAS LAST NIGHT
IT EMERGED YESTERDAY
LAST NIGHT AFTER

YESTERDAY IN A
IT EMERGED LAST NIGHT
A COURT HEARD YESTERDAY
YESTERDAY WHEN A
YESTERDAY AFTER THE
EMERGED LAST NIGHT
LAST NIGHT TO
YESTERDAY AS THE
YESTERDAY WHEN THE
WAS TOLD YESTERDAY
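A count such as "seventy of the top 200" could be approximated by flagging clusters that contain a recency marker. The sketch below checks only for YESTERDAY and LAST NIGHT, the two markers named on the slide; since the slide says time is "generally" expressed this way, the real classification presumably covered other expressions too. The function names are hypothetical.

```python
import re

# Only the two markers named on the slide; other time expressions are not covered.
RECENCY = re.compile(r"\b(YESTERDAY|LAST NIGHT)\b")


def expresses_recency(cluster: str) -> bool:
    """True if the cluster contains YESTERDAY or LAST NIGHT as whole words."""
    return RECENCY.search(cluster.upper()) is not None


def recency_share(clusters) -> tuple[int, int]:
    """Return (number of recency clusters, number of clusters examined)."""
    clusters = list(clusters)
    hits = sum(1 for cluster in clusters if expresses_recency(cluster))
    return hits, len(clusters)


sample = [
    "COURT HEARD YESTERDAY",
    "IT EMERGED LAST NIGHT",
    "THE MURDER OF",
    "ONE OF THE MOST",
]
print(recency_share(sample))
```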

Page 21: Homing in on the Text-Initial Cluster

Superlatives

ONE OF BRITAIN'S MOST

ONE OF THE MOST

OF THE WORLD'S

THE FIRST TIME

OF BRITAIN'S MOST

FOR THE FIRST

FOR THE FIRST TIME

Page 22: Homing in on the Text-Initial Cluster

Research, Report etc.

ACCORDING TO A REPORT
A COURT HEARD (YESTERDAY)
ACCORDING TO RESEARCH
TO A SURVEY
IT EMERGED LAST NIGHT
IT WAS ANNOUNCED YESTERDAY
IT WAS REVEALED YESTERDAY
A REPORT PUBLISHED
ACCORDING TO A STUDY
TO RESEARCH PUBLISHED

Page 23: Homing in on the Text-Initial Cluster

Attention-grabbers

IT EMERGED THAT

OBSERVER CAN REVEAL

THE OBSERVER CAN REVEAL

Page 24: Homing in on the Text-Initial Cluster

Indefinite articles positively key…

A BABY GIRL
A BAN ON
A BEACH IN
A BID TO
A BITTER ROW
A BLACK MAN
A BLISTERING ATTACK ON
A JURY WAS TOLD YESTERDAY

A LABOUR MP
A LANDMARK RULING
A LAST DITCH ATTEMPT TO
A LAST MINUTE
A LEADING BRITISH
A LEADING SCIENTIST
A LEGAL BATTLE
A LEGAL CHALLENGE

Page 25: Homing in on the Text-Initial Cluster

Indefinite articles negatively key


A COUPLE OF

A GREAT DEAL

A KIND OF

A LOT MORE

Page 26: Homing in on the Text-Initial Cluster

IT + reporting verb – positively key

IT WAS ANNOUNCED LAST NIGHT

IT WAS CLAIMED LAST NIGHT

IT WAS CONFIRMED LAST NIGHT

IT IS REVEALED TODAY

Page 27: Homing in on the Text-Initial Cluster

IT otherwise negatively key:

IT IS A

IT IS ABOUT

IT IS EXPECTED

IT IS GOING

IT IS ONLY

IT IS POSSIBLE

IT SEEMS TO

Page 28: Homing in on the Text-Initial Cluster

SAID YESTERDAY – positively key

SAID YESTERDAY AFTER

SAID YESTERDAY THAT HE

SAID YESTERDAY THEY HAD

Page 29: Homing in on the Text-Initial Cluster

SAID without time – negatively key

SAID AT THE

SAID HE HAD

SAID HE WOULD

SAID THE GOVERNMENT

SAID THERE WAS NO

Page 30: Homing in on the Text-Initial Cluster

Conclusions

The “once upon a time” syndrome seems to be much more common than might be thought.

In text-initial sections of 115 thousand hard news stories (3.2 m. words), out of 84 thousand 3-5 word clusters, about 8 thousand (roughly 1 in 10) had text-initial significance

whereas in non text-initial sections only 1 in 230 was key

Page 31: Homing in on the Text-Initial Cluster

Other patterns

recency
superlatives
research, report
attention-grabbers
indefinite articles
IT + reporting verb; SAID + time

Page 32: Homing in on the Text-Initial Cluster

References

O’Donnell, Matthew, Mike Scott, Michaela Mahlberg & Michael Hoey (forthcoming) ‘When the text counts’: Exploring the implications of ‘text’ as unit in corpus linguistics. Paper presented at PALC, Łódź, April 2007.