Post on 27-Feb-2021

Digitizing the Canadian Parliamentary Debates

Kaspar Beelen, Christopher Cochrane, Graeme Hirst, Nona Naderi, Ludovic Rheault, Tanya Whyte

University of Toronto
Department of Computer Science
Department of Political Science

Dilipad Project Background: Digging into Linked Parliamentary Data (Dilipad)

- Tri-National Project: University of Toronto, University of Amsterdam, Institute of Historical Research

- Funded by the Digging into Data Challenge

Dilipad Project Objectives - Data Creation: Collect, Digitize and Enrich Parliamentary Proceedings in a uniform format

- The Netherlands: Tweede Kamer and Senaat (1815-present)

- United Kingdom: House of Commons (1803-present)

- Canada: House of Commons (1900-present)

Dilipad Project Objectives - Outreach: Lipad.ca

- Indexed and searchable version of the corpus (see the following presentation by Tanya Whyte)

Digitization: Overview

Digitization: Source Material

Enrichment: Why Semantic Annotation?

[Figure: labels Conservative, Liberal]

Corpus Structure

Dilipad Scheme

- Proceedings

- Members

- Parties

Project Workflow

1. OCR Conversion

2. Structuring Text

3. Linking Data

Step 1: OCR Conversion

From PDF to Plain Text

DEFENCE EXPENDITURE
APPOINTMENT OF SPECIAL COMMITTEE
The House Resumed [...]
Mr. G. C. Nowlan (Annapolis-Kings):
Mr. Speaker, I intend to intervene but briefly in his debate. [...]
Mr. Jean Francois Pouliot (Temiscouata):
Mr. Speaker, I do not intend to speak today only as a member of parliament or as a member of the Liberal party. [...]

Step 2: Structuring Text

Identifying Patterns


Matching Patterns with Regular Expressions. E.g. wildcards: Canad* = Canada, Canadian, ...
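The wildcard idea can be sketched in Python's `re` module. This is only an illustration: the sample sentence is made up, and translating `Canad*` into `\bCanad\w*` is one possible reading of the wildcard, not necessarily the notation the project used.

```python
import re

# "Canad*" read as: the literal prefix "Canad" followed by any word characters.
# The \b anchor keeps the match at the start of a word.
wildcard = re.compile(r"\bCanad\w*")

# Invented sample sentence, for illustration only.
text = "Canada and the Canadian people"

matches = wildcard.findall(text)
print(matches)
```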


(\n[A-Z]{3,}\n)
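A minimal sketch of how the upper-case pattern above behaves, assuming the OCR output keeps one heading per line. As written, the pattern only accepts a single all-caps word between two newlines, so the second pattern below (`TOPIC_LINE`, an assumed variant that is not from the slides) also allows spaces. The heading `SUPPLY` is an invented test value.

```python
import re

# Pattern as shown on the slide: newline, 3+ capitals, newline.
SLIDE_PATTERN = re.compile(r"(\n[A-Z]{3,}\n)")

# Assumed variant: a whole line of capitals and spaces (multi-word headings).
TOPIC_LINE = re.compile(r"^[A-Z][A-Z ]{2,}$", re.MULTILINE)

excerpt = (
    "\nDEFENCE EXPENDITURE\n"
    "APPOINTMENT OF SPECIAL COMMITTEE\n"
    "The House Resumed\n"
)

single_word = SLIDE_PATTERN.findall("\nSUPPLY\nThe House Resumed\n")
slide_hits = SLIDE_PATTERN.findall(excerpt)
variant_hits = TOPIC_LINE.findall(excerpt)

print(single_word)   # single-word heading is found
print(slide_hits)    # multi-word headings contain spaces, so nothing is found
print(variant_hits)  # the variant picks up both heading lines
```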


(\nMr\.\s[A-Za-z]+\s(.+?):)
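The speaker pattern above can be tried against speaker lines quoted in this deck. The sketch below only checks whether the pattern finds a match. It suggests why a single simple expression is not enough: initials and speakers without a constituency are missed.

```python
import re

# Speaker pattern as shown on the slide.
SPEAKER = re.compile(r"(\nMr\.\s[A-Za-z]+\s(.+?):)")

# Speaker lines taken from the examples in this deck.
samples = [
    "\nMr. Jean Francois Pouliot (Temiscouata):",
    "\nMr. G. C. Nowlan (Annapolis-Kings):",
    "\nMr. Rompkey:",
]

# Map each line (without the leading newline) to whether the pattern matched.
found = {s.strip(): bool(SPEAKER.search(s)) for s in samples}
for line, ok in found.items():
    print(ok, line)
```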

Issues: Variation

Hon. Jean J. Charest (Minister of State (Youth) and Minister of State (Fitness and Amateur Sport and Deputy Leader of the government in the House of Commons)):

Mr. Rompkey:

^((?:Sir|M\.|Mr\.|Mr\,|Hon\.|The\sHon\.|Right\sHon\.|The\sRight\sHon\.|Miss|Mrs\.|Ms\.)\s(?:[A-Zdv][\-\w\.\']{1,25}\s{0,1}){1,4}\s{0,1}(?:\(.+?\)){0,1}\s{0,1}(?:(?:moved:)|(?:moved)|:|;))

Etcetera...
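As a rough check, the extended expression can be run line by line over the speaker strings quoted in this deck. The sketch assumes each candidate line is tested separately with `re.match`. The helper name `is_speaker_line` is hypothetical.

```python
import re

# Extended speaker pattern, copied from the slide and split for readability.
SPEAKER_LINE = re.compile(
    r"^((?:Sir|M\.|Mr\.|Mr\,|Hon\.|The\sHon\.|Right\sHon\.|The\sRight\sHon\.|Miss|Mrs\.|Ms\.)\s"
    r"(?:[A-Zdv][\-\w\.\']{1,25}\s{0,1}){1,4}\s{0,1}"
    r"(?:\(.+?\)){0,1}\s{0,1}"
    r"(?:(?:moved:)|(?:moved)|:|;))"
)

def is_speaker_line(line: str) -> bool:
    """Hypothetical helper: True if the line starts with a speaker label."""
    return SPEAKER_LINE.match(line) is not None

charest = (
    "Hon. Jean J. Charest (Minister of State (Youth) and Minister of State "
    "(Fitness and Amateur Sport and Deputy Leader of the government in the "
    "House of Commons)):"
)

checks = {
    "charest": is_speaker_line(charest),
    "rompkey": is_speaker_line("Mr. Rompkey:"),
    "nowlan": is_speaker_line("Mr. G. C. Nowlan (Annapolis-Kings):"),
    "pouliot": is_speaker_line("Mr. Jean Francois Pouliot (Temiscouata):"),
    # Ordinary speech text that merely begins with "Mr. Speaker,".
    "speech": is_speaker_line(
        "Mr. Speaker, I intend to intervene but briefly in his debate."
    ),
}
print(checks)
```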

Issues: Changes over time

[Figure: examples from 1888 and 1955]

Issues: OCR Errors

BAILLANiTYNE
BALLAINTYNE
BALLANT1NE
BALLAiNTYNE
iBALiLANTYNE

= BALLANTYNE
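One way to collapse such OCR variants onto a single name is approximate string matching against a list of known surnames. The sketch below uses Python's standard `difflib`. The roster, the cutoff of 0.8, the function name `normalize_surname` and the way the fused variant string is split are all assumptions made for illustration, not the project's documented method.

```python
import difflib

# Hypothetical roster of known surnames (names taken from this deck).
KNOWN_SURNAMES = ["BALLANTYNE", "NOWLAN", "POULIOT"]

def normalize_surname(ocr_token, known=KNOWN_SURNAMES, cutoff=0.8):
    """Return the closest known surname, or None if nothing is similar enough."""
    hits = difflib.get_close_matches(ocr_token.upper(), known, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# OCR variants from the slide (split points are assumed).
variants = ["BAILLANiTYNE", "BALLAINTYNE", "BALLANT1NE", "BALLAiNTYNE", "iBALiLANTYNE"]

normalized = [normalize_surname(v) for v in variants]
print(normalized)
print(normalize_surname("SPEAKER"))  # not a surname in the roster
```

A higher cutoff reduces false merges between similar surnames and rejects more damaged tokens.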


Identifying Patterns

for line in document:
    if preceding line == topic title:
        if next line == speech:
            code line as procedural text
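The rule above can be sketched in runnable form. This sketch assumes that "speech" means the line introducing a speaker, and it uses two simplified patterns (`TOPIC`, `SPEAKER`) that stand in for the project's real expressions. The function name `label_lines` and the label strings are hypothetical.

```python
import re

# Simplified stand-ins for the real patterns (assumptions).
TOPIC = re.compile(r"^[A-Z][A-Z\s]{2,}$")   # all-caps line = topic title
SPEAKER = re.compile(r"^Mr\.\s.+?:$")       # "Mr. ... :" = start of a speech

def label_lines(lines):
    """Label each line; text between a topic title and a speech is procedural."""
    labels = []
    for line in lines:
        if TOPIC.match(line):
            labels.append("topic")
        elif SPEAKER.match(line):
            labels.append("speaker")
        else:
            labels.append("text")
    # Apply the rule: preceding line is a topic title, next line opens a speech.
    for i in range(1, len(lines) - 1):
        if labels[i] == "text" and labels[i - 1] == "topic" and labels[i + 1] == "speaker":
            labels[i] = "procedural"
    return labels

document = [
    "DEFENCE EXPENDITURE",
    "APPOINTMENT OF SPECIAL COMMITTEE",
    "The House Resumed",
    "Mr. G. C. Nowlan (Annapolis-Kings):",
    "Mr. Speaker, I intend to intervene but briefly in his debate.",
]

labels = label_lines(document)
for label, line in zip(labels, document):
    print(f"{label:<11}{line}")
```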

Annotating the Proceedings


Step 3: Linking Data

Disambiguating Entities

Mr. Jean Francois Pouliot (Temiscouata)

- Title: Mr.
- First Name: Jean Francois
- Last Name: Pouliot
- Constituency: Temiscouata
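Splitting a speaker string into title, first name, last name and constituency can be sketched with named groups. The pattern below is an assumption: it covers only a few titles and treats the last token before the parenthesis as the surname, which would fail for multi-word surnames. The function name `parse_speaker` is hypothetical.

```python
import re

# Assumed field pattern: title, given names, one-token surname, (constituency).
SPEAKER_FIELDS = re.compile(
    r"^(?P<title>Mr\.|Mrs\.|Ms\.|Miss|Hon\.)\s+"
    r"(?P<first_name>.+)\s+"
    r"(?P<last_name>\S+)\s+"
    r"\((?P<constituency>.+)\)$"
)

def parse_speaker(label):
    """Return the four fields as a dict, or None if the label does not fit."""
    match = SPEAKER_FIELDS.match(label)
    return match.groupdict() if match else None

fields = parse_speaker("Mr. Jean Francois Pouliot (Temiscouata)")
print(fields)
```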

Mr. Jean Francois Pouliot (Temiscouata) = [image not recovered]

Adding Information

Mr. Jean Francois Pouliot (Temiscouata) = [image not recovered]
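Linking and enrichment can be pictured as a lookup: the parsed surname and constituency select one member record, and that record's attributes are attached to the speech. The sketch below is illustrative only. The `Member` class, the identifiers and the two "Example" records are invented placeholders, and matching on surname plus constituency is an assumed strategy. Only Pouliot's Liberal affiliation comes from the excerpt quoted in this deck.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Member:
    member_id: str       # hypothetical identifier
    last_name: str
    constituency: str
    party: str

# Placeholder records for illustration; not real data.
MEMBERS = [
    Member("m-001", "Pouliot", "Temiscouata", "Liberal"),
    Member("m-002", "Pouliot", "Example Riding", "Example Party"),
    Member("m-003", "Nowlan", "Annapolis-Kings", "Example Party"),
]

def link_speaker(last_name, constituency, members=MEMBERS):
    """Return the one member matching both keys, or None if zero or several match."""
    candidates = [
        m for m in members
        if m.last_name.lower() == last_name.lower()
        and m.constituency.lower() == constituency.lower()
    ]
    return candidates[0] if len(candidates) == 1 else None

def enrich(speech_text, member):
    """Attach member attributes to a speech (adding information)."""
    return {"text": speech_text, "member_id": member.member_id, "party": member.party}

member = link_speaker("Pouliot", "Temiscouata")
speech = enrich("Mr. Speaker, I do not intend to speak today only as ...", member)
print(speech["member_id"], speech["party"])
```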

Project Summary

Project Output:

- Structured and enriched parliamentary corpus. Includes all House of Commons debates from 1900 to present.

- Linked to ParlInfo and other knowledge sources such as Wikipedia.

- Easy-to-use and flexible search engine (Lipad.ca).

Dilipad Team

Canada

- Team: Kaspar Beelen, Chris Cochrane, Graeme Hirst, Nona Naderi, Ludovic Rheault, Tanya Whyte

- Interns: Tim Alberdingk-Thijm, Mike Kimmins, Roman Polyanovsky

Netherlands

- Team: Jaap Kamps, Maarten Marx

- Other Contributors: Hosein Azarbonyad, Mostafa Dehghani, Alex Olieman

- Interns: Kees Halvemaan, Sander Lijbrink

United Kingdom

- Team: Jonathan Blaney, Luke Blaxill, Richard Gartner, Paul Seaward, Martin Steer, Jane Winters

Funding Agencies

- Social Sciences and Humanities Research Council (CAN)

- Natural Sciences and Engineering Research Council (CAN)

- Canada Foundation for Innovation (CAN)

- Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NED)

- Arts and Humanities Research Council (UK)

- Economic and Social Research Council (UK)

- National Endowment for the Humanities (USA)

- National Science Foundation (USA)

- Institute of Museum and Library Services (USA)

- Joint Information Systems Committee (UK)

Questions?