Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus...

30
Historical Newspapers A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis on text structure and cohesion in news items from the 17th and 18th centuries Katrin Goldschmidt, Universit¨ at Bonn 10 March 2017 Katrin Goldschmidt DGfS 2017 1/21

Transcript of Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus...

Page 1: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Development and annotation of a newspapercorpus as part of a doctoral thesis on text

structure and cohesion in news items from the17th and 18th centuries

Katrin Goldschmidt, Universitat Bonn

10 March 2017

Katrin Goldschmidt DGfS 2017 1/21

Page 2: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Content

1. Historical NewspapersCharacteristics of historical newspapersObjectives of doctoral thesis

2. A Historical Newspaper CorpusCorpus DevelopmentANNIS

3. SegmentationTypographical SegmentationFunctional Segmentation

Katrin Goldschmidt DGfS 2017 2/21

Page 3: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Characteristics of historical newspapersObjectives of doctoral thesis

Historical Newspapers in the 17th and 18th century

Textual structure

collections of correspondences published by an editor

single correspondences inform about different news stories

3 levels: issue > correspondence > news item(s)

But news stories were written as continuous text...

Katrin Goldschmidt DGfS 2017 3/21

Page 4: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Characteristics of historical newspapersObjectives of doctoral thesis

Historical Newspapers in the 17th and 18th century

Textual structure

collections of correspondences published by an editor

single correspondences inform about different news stories

3 levels: issue > correspondence > news item(s)

But news stories were written as continuous text...

Katrin Goldschmidt DGfS 2017 3/21

Page 5: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Characteristics of historical newspapersObjectives of doctoral thesis

Historical news items in the 17th and 18th centuries

Nordischer Mercurius 1664 Wienerisches Diarium 1767Katrin Goldschmidt DGfS 2017 4/21

Page 6: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Characteristics of historical newspapersObjectives of doctoral thesis

Objectives

Textual Structure

typographical segmentation of text items

super- and macrostructural elements as title, headlines,comments, citations

News Structure

publicistic features as persons, locations, time references

event-related vs. source-related entities

Coherence Structure

entities as phrases vs. parts of phrases

direct and indirect relations within and between news items

Katrin Goldschmidt DGfS 2017 5/21

Page 7: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Characteristics of historical newspapersObjectives of doctoral thesis

Objectives

Textual Structure

typographical segmentation of text items

super- and macrostructural elements as title, headlines,comments, citations

News Structure

publicistic features as persons, locations, time references

event-related vs. source-related entities

Coherence Structure

entities as phrases vs. parts of phrases

direct and indirect relations within and between news items

Katrin Goldschmidt DGfS 2017 5/21

Page 8: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Characteristics of historical newspapersObjectives of doctoral thesis

Objectives

Textual Structure

typographical segmentation of text items

super- and macrostructural elements as title, headlines,comments, citations

News Structure

publicistic features as persons, locations, time references

event-related vs. source-related entities

Coherence Structure

entities as phrases vs. parts of phrases

direct and indirect relations within and between news items

Katrin Goldschmidt DGfS 2017 5/21

Page 9: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Corpus DevelopmentANNIS

Content

1. Historical NewspapersCharacteristics of historical newspapersObjectives of doctoral thesis

2. A Historical Newspaper CorpusCorpus DevelopmentANNIS

3. SegmentationTypographical SegmentationFunctional Segmentation

Katrin Goldschmidt DGfS 2017 6/21

Page 10: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Corpus DevelopmentANNIS

From transcription to annotation

Texts

7 german newspaper issues (1609-1767)

from Berlin, Hamburg (2x), Salzburg, Straßburg, Vienna (2x)

segmentation of correspondences into news items by 3 raters=> main corpus

additional corpus: all January issues of Nordischer Mercurius1667 with syntactical annotation (Mercurius corpus)

Corpus size

token: ca. 30.000 [main corpus: 17.784]

sentences: ca. 1.700 [main corpus: 670]

news items: ca. 460 [main corpus: 245]

entities: ca. 5700 [main corpus: 3.384]

Katrin Goldschmidt DGfS 2017 7/21

Page 11: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Corpus DevelopmentANNIS

From transcription to annotation

Texts

7 german newspaper issues (1609-1767)

from Berlin, Hamburg (2x), Salzburg, Straßburg, Vienna (2x)

segmentation of correspondences into news items by 3 raters=> main corpus

additional corpus: all January issues of Nordischer Mercurius1667 with syntactical annotation (Mercurius corpus)

Corpus size

token: ca. 30.000 [main corpus: 17.784]

sentences: ca. 1.700 [main corpus: 670]

news items: ca. 460 [main corpus: 245]

entities: ca. 5700 [main corpus: 3.384]

Katrin Goldschmidt DGfS 2017 7/21

Page 12: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Corpus DevelopmentANNIS

From transcription to annotation

Texts

7 german newspaper issues (1609-1767)

from Berlin, Hamburg (2x), Salzburg, Straßburg, Vienna (2x)

segmentation of correspondences into news items by 3 raters=> main corpus

additional corpus: all January issues of Nordischer Mercurius1667 with syntactical annotation (Mercurius corpus)

Corpus size

token: ca. 30.000 [main corpus: 17.784]

sentences: ca. 1.700 [main corpus: 670]

news items: ca. 460 [main corpus: 245]

entities: ca. 5700 [main corpus: 3.384]

Katrin Goldschmidt DGfS 2017 7/21

Page 13: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Corpus DevelopmentANNIS

Facsimile and Transcription

Berlinische Nachrichten 1741Katrin Goldschmidt DGfS 2017 8/21

Page 14: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Corpus DevelopmentANNIS

Annotation

Katrin Goldschmidt DGfS 2017 9/21

Page 15: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Corpus DevelopmentANNIS

Annotation levels

Textual Structure

token

pos

text

makrostr

typosegm

issue

correspondence

origin

meta

News Structure

entity p (phrase)

entity pp (part ofphrase)

how why

episode

indref reftype(func, part, poss)

time reftype(before, during,after)

Coherence Relations

coref target p

coref target pp

indref target

lexref target

ref episode target

ref howwhy target

Katrin Goldschmidt DGfS 2017 10/21

Page 16: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Corpus DevelopmentANNIS

HistZeitKorp in local ANNIS

entity reference view

Katrin Goldschmidt DGfS 2017 11/21

Page 17: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Corpus DevelopmentANNIS

HistZeitKorp in local ANNIS

discourse view

Katrin Goldschmidt DGfS 2017 12/21

Page 18: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Typographical SegmentationFunctional Segmentation

Content

1. Historical NewspapersCharacteristics of historical newspapersObjectives of doctoral thesis

2. A Historical Newspaper CorpusCorpus DevelopmentANNIS

3. SegmentationTypographical SegmentationFunctional Segmentation

Katrin Goldschmidt DGfS 2017 13/21

Page 19: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Typographical SegmentationFunctional Segmentation

Segmentation and the role of typographical means

following Lefevre (2013) shifts between paragraphes aremarked by the representation type [.+space(+UpperCase)]

but: no quantitative analysis

Relation 1609 [boundaries: Lefevre (2013); HistNewsKorp]

Katrin Goldschmidt DGfS 2017 14/21

Page 20: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Typographical SegmentationFunctional Segmentation

Typographical segmentation is not that easy...

Katrin Goldschmidt DGfS 2017 15/21

Page 21: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Typographical SegmentationFunctional Segmentation

Segmentation and the role of functional means

following Schroder (1995) shifts between news items aremarked by change of theme, person, location and time[Relation 1609, Aviso 1609]

over 75% of the news items are introduced by entities thatrespond to journalistic wh-questions(like who, where, when ...? )

A first approach: ”How many news items begin with entities?”

ANNIS search for entities as phrases at the direct beginning ofa news item: text=/bn b/ & entity p & #2 l #1

167 matches

Katrin Goldschmidt DGfS 2017 16/21

Page 22: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Typographical SegmentationFunctional Segmentation

Segmentation and the role of functional means

following Schroder (1995) shifts between news items aremarked by change of theme, person, location and time[Relation 1609, Aviso 1609]

over 75% of the news items are introduced by entities thatrespond to journalistic wh-questions(like who, where, when ...? )

A first approach: ”How many news items begin with entities?”

ANNIS search for entities as phrases at the direct beginning ofa news item: text=/bn b/ & entity p & #2 l #1

167 matches

Katrin Goldschmidt DGfS 2017 16/21

Page 23: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Typographical SegmentationFunctional Segmentation

Katrin Goldschmidt DGfS 2017 17/21

Page 24: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Typographical SegmentationFunctional Segmentation

Also functional segmentation is not that easy...

we see that 64% of the news items begin with an entity thatresponds to the forecited journalistic wh-questions:

A second approach: ”How many news items begin withentities that refer to entities within former news items?”

ANNIS search: 11 matches (11 ÷ 158 ≈ 7.0%)

Katrin Goldschmidt DGfS 2017 18/21

Page 25: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Typographical SegmentationFunctional Segmentation

Also functional segmentation is not that easy...

we see that 64% of the news items begin with an entity thatresponds to the forecited journalistic wh-questions:

A second approach: ”How many news items begin withentities that refer to entities within former news items?”

ANNIS search: 11 matches (11 ÷ 158 ≈ 7.0%)

Katrin Goldschmidt DGfS 2017 18/21

Page 26: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Typographical SegmentationFunctional Segmentation

Conclusions

typographical means alone are not an adequate means of newsitem segmentation (maybe in 60% of the correspondencesthey are helpful)

at least 7% of phrasal entities at the beginning of news itemsrefer to entities in the preceding news item(s)=> implies no thematic change

but: considering typographical AND functional means couldlead to successive segmentation methods

With help of the Historical Newspaper Corpus we can geta better understanding of the textual and publicistic structure of

these interesting old newspapers.

Katrin Goldschmidt DGfS 2017 19/21

Page 27: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Typographical SegmentationFunctional Segmentation

Conclusions

typographical means alone are not an adequate means of newsitem segmentation (maybe in 60% of the correspondencesthey are helpful)

at least 7% of phrasal entities at the beginning of news itemsrefer to entities in the preceding news item(s)=> implies no thematic change

but: considering typographical AND functional means couldlead to successive segmentation methods

With help of the Historical Newspaper Corpus we can geta better understanding of the textual and publicistic structure of

these interesting old newspapers.

Katrin Goldschmidt DGfS 2017 19/21

Page 28: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

Nordischer Mercurius 1664

”Everything [serves] to the [news] lovers [as] a message and a

pleasure / but to the editor [it serves as] his honest entertainment,which he therefore will seek with God / until his END.“

Katrin Goldschmidt DGfS 2017 20/21

Page 29: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Historical NewspapersA Historical Newspaper Corpus

Segmentation

References

Fritz, G. & E. Straßner (eds.) (1996): Die Sprache der ersten deutschenWochenzeitungen im 17. Jahrhundert. Tubingen.Demske, U. (2007): Das Mercurius-Projekt: eine Baumbank fur dasFruhneuhochdeutsche. In: G. Zifonun & W. Kallmeyer (eds.): Sprachkorpora -Datenmengen und Erkenntnisfortschritt (= Jahrbuch des Instituts fur DeutscheSprache 2006). Berlin, 91-104.Krause, T. & A. Zeldes (2014): ANNIS3: A New Architecture for GenericCorpus Query and Visualization. Literary and Linguistic Computing.Lefevre, M. (2013): Textgestaltung, Außerungsstruktur und Syntax in deutschenZeitungen des 17. Jahrhunderts. Zwischen barocker Polyphonie undsolistischem Journalismus. Berlin.Schroder, T. (1995): Die ersten Zeitungen. Textgestaltung undNachrichtenauswahl. Tubingen.

Special thanks to all who supported me with the newspaperrecords (Osterreiche Nationalbibliothek, Staatsbibliothek zu Berlin,Staats- und Universitatsbibliothek Bremen, Institut fur DeutscheSprache Mannheim) and Prof. Ulrike Demske for providing theMercurius Corpus.

Katrin Goldschmidt DGfS 2017 21/21

Page 30: Development and annotation of a newspaper corpus as part ... · A Historical Newspaper Corpus Segmentation Development and annotation of a newspaper corpus as part of a doctoral thesis

Additional corpus :: NM 1667-01 synt

all January issues of Nordischer Mercurius 1667

integrated syntactical annotation from the Mercurius corpus

Katrin Goldschmidt DGfS 2017 1/1