An Automated Record Linkage System for the Canadian Census, 1871-1881. L. Antonie (University of Guelph), P. Baskerville (Universities of Alberta and Victoria), K. Inwood (University of Guelph), J. A. Ross (University of Guelph). Record Linkage Workshop, May 24th-25th, 2010, University of Guelph

Page 1:

An Automated Record Linkage System for the Canadian Census, 1871-1881

L. Antonie (University of Guelph)

P. Baskerville (Universities of Alberta and Victoria)

K. Inwood (University of Guelph)

J. A. Ross (University of Guelph)

Record Linkage Workshop, May 24th-25th, 2010, University of Guelph

Page 2:

What we are working towards

‘Unbiased’ links connecting individuals/households over several census years

A comprehensive infrastructure of longitudinal data

[Diagram: censuses of 1851, 1871, 1881, 1891, 1901, 1906, 1911 and 1916, with US 1880 and US 1900]

Page 3:

Current Work

[Diagram: 100% of 1871 Census and 100% of 1881 Census, with the label Automatic Linking; record counts shown: 4,277,807 and 3,601,663]

Partners and collaborators: FamilySearch, Church of Latter Day Saints, Minnesota Population Center, Université de Montréal, University of Alberta

Page 4:

Existing (True) Links

• Ontario Industrial Proprietors – 8429 links

• Logan Township – 1760 links

• St. James Church, Toronto – 232 links

• Quebec City Boys – 1403 links

• Bias

– family-context

– others?

[Map labels: Logan Twp, Guelph]

Page 5:

Attributes for Automatic Linking

• Last Name - string

• First Name - string

• Gender – binary

• Age - number

• Birthplace - number

• Marital status – single, married, divorced, widowed, unknown

Page 6:

Automatic Linkage

• The challenges:

1) Identify the same person

2) Deal with attribute characteristics

3) Manage computational expense

• The system:

Page 7:

Data Cleaning and Standardization

• Cleaning

– Names – remove non-alphanumeric characters; remove titles

– Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)

– All attributes – deal with English/French notations (e.g. days/jours, married/mariée)

• Standardization

– Birthplace codes and granularity

– Marital status
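The cleaning and standardization steps above could be sketched roughly as follows. This is an illustration only, not the authors' code: the title list, the unit table, the French marital terms and the function names (`fold`, `clean_name`, `parse_age`, `standardize_marital`) are all assumptions made for the example. The marital codes 1, 5 and 6 are the ones shown in the evaluation table on Page 14.

```python
import re
import unicodedata

# Hypothetical title list; the slides do not say which titles were removed.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir"}

# Hypothetical unit table covering English and French notations.
UNIT_IN_YEARS = {
    "day": 1 / 365, "days": 1 / 365, "jour": 1 / 365, "jours": 1 / 365,
    "month": 1 / 12, "months": 1 / 12, "mois": 1 / 12,
    "year": 1.0, "years": 1.0, "an": 1.0, "ans": 1.0,
}

# Codes taken from the Page 14 table; the term list itself is an assumption.
MARITAL_CODES = {
    "married": 1, "marie": 1, "mariee": 1,
    "widowed": 5, "veuf": 5, "veuve": 5,
    "single": 6, "celibataire": 6,
}

def fold(text):
    """Lower-case and strip accents so 'Mariée' and 'mariee' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean_name(raw):
    """Drop titles and non-alphanumeric characters, then upper-case."""
    tokens = re.split(r"\s+", raw.strip())
    kept = [t for t in tokens if t.strip(".,").lower() not in TITLES]
    cleaned = ["".join(ch for ch in t if ch.isalnum()) for t in kept]
    return " ".join(t.upper() for t in cleaned if t)

def parse_age(raw):
    """Return the age in years as a float, or None if it cannot be parsed."""
    match = re.fullmatch(r"(\d+)\s*([a-z]*)\.?", fold(raw))
    if not match:
        return None
    number, unit = int(match.group(1)), match.group(2)
    if unit == "":
        return float(number)          # a bare number is taken to be years
    if unit not in UNIT_IN_YEARS:
        return None
    return number * UNIT_IN_YEARS[unit]

def standardize_marital(raw):
    """Map an English or French marital status to a numeric code, or None."""
    return MARITAL_CODES.get(fold(raw))
```

With these definitions, `parse_age("3 months")` and `parse_age("3 mois")` give the same value, and `standardize_marital("Mariée")` and `standardize_marital("married")` give the same code.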

Page 8:

Computational Expense

• Very expensive to compare all the possible pairs of records

• Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)

• Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) = 40.5 days.
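The run-time estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the rounded figures from the slide; the function name `days_to_compare` is made up for the example. Evaluated as written, with 2 attributes per pair, the expression comes to about 81 days. The quoted 40.5 days is what the same expression gives for one attribute per pair (or, equivalently, for twice the comparison rate).

```python
SECONDS_PER_DAY = 60 * 60 * 24

def days_to_compare(n_1871, n_1881, n_attributes, comparisons_per_second):
    """Days needed to compare every 1871 record with every 1881 record."""
    total_comparisons = n_1871 * n_1881 * n_attributes
    return total_comparisons / comparisons_per_second / SECONDS_PER_DAY

# Figures from the slide: 3.5M and 4M records, 4M comparisons per second.
two_attrs = days_to_compare(3.5e6, 4e6, 2, 4e6)
one_attr = days_to_compare(3.5e6, 4e6, 1, 4e6)

print(f"2 attributes per pair: {two_attrs:.1f} days")
print(f"1 attribute per pair:  {one_attr:.1f} days")
```

Either way the estimate is weeks to months on a single processor.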

Page 9:

Managing Computational Expense

• Blocking

– By first letter of last name

– By birthplace

• Using HPC

– Running the system on multiple processors
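Blocking can be sketched as grouping records by a key and comparing only records that share it. The example below assumes the two criteria are combined into a single key (the slide does not say whether they were combined or applied separately). The record layout and the names `block_key` and `candidate_pairs` are hypothetical. The birthplace codes are the ones shown on Page 14.

```python
from collections import defaultdict

def block_key(record):
    """Blocking key: first letter of the last name plus the birthplace code."""
    return (record["last_name"][:1].upper(), record["birthplace"])

def candidate_pairs(census_a, census_b):
    """Yield only the record pairs that share a blocking key."""
    blocks = defaultdict(list)
    for rec in census_b:
        blocks[block_key(rec)].append(rec)
    for rec_a in census_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b

# Invented toy records, for illustration only.
census_1871 = [
    {"id": 1, "last_name": "Smith", "birthplace": 15030},
    {"id": 2, "last_name": "Tremblay", "birthplace": 15081},
    {"id": 3, "last_name": "Scott", "birthplace": 41400},
]
census_1881 = [
    {"id": 10, "last_name": "Smyth", "birthplace": 15030},
    {"id": 11, "last_name": "Tremblay", "birthplace": 15081},
    {"id": 12, "last_name": "Stewart", "birthplace": 15030},
    {"id": 13, "last_name": "Brown", "birthplace": 15030},
]

pairs = list(candidate_pairs(census_1871, census_1881))
print(len(pairs), "candidate pairs instead of", len(census_1871) * len(census_1881))
```

Each block is independent of the others, so blocks can be handed to separate processors, which is one way the HPC point could be realised.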

Page 10:

Record Comparison

• Comparing Strings

– Jaro-Winkler

– Edit Distance

– Double Metaphone

• Age

– +/- 2 years

• Exact matches

– Gender

– Birthplace
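A comparison step along these lines might turn each candidate pair into a vector of scores. The sketch below implements only edit distance for the string fields; Jaro-Winkler and Double Metaphone are left out to keep it short. It also assumes the +/- 2 years tolerance is applied around the ten years that separate the two censuses, which the slide does not state. The record layout and function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def name_similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def compare(rec_1871, rec_1881, census_gap=10, age_tolerance=2):
    """Build a comparison vector for one candidate pair."""
    expected_age = rec_1871["age"] + census_gap  # assumed: ages are ten years apart
    return [
        name_similarity(rec_1871["last_name"], rec_1881["last_name"]),
        name_similarity(rec_1871["first_name"], rec_1881["first_name"]),
        1.0 if abs(rec_1881["age"] - expected_age) <= age_tolerance else 0.0,
        1.0 if rec_1871["gender"] == rec_1881["gender"] else 0.0,
        1.0 if rec_1871["birthplace"] == rec_1881["birthplace"] else 0.0,
    ]

# Invented example pair.
a = {"last_name": "SMITH", "first_name": "JOHN", "age": 25, "gender": "M", "birthplace": 15030}
b = {"last_name": "SMYTH", "first_name": "JOHN", "age": 36, "gender": "M", "birthplace": 15030}
vector = compare(a, b)
```

A vector of this kind is the sort of input a match/non-match classifier would take.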

Page 11:

Classification

• Classifier

– Support Vector Machines

– 5-fold cross validation

• Training Data

– True links found by experts

– Ontario proprietors

• Classes

– Match

– Non-match

Page 12:

Linkage Results

Province        Linkage Rate (%)
New Brunswick   24.45
Nova Scotia     21.50
Ontario         18.36
Quebec          17.45

Page 13:

Linkage Results - Evaluation

True Links Set   Total   TP (%)   FP (%)
Ontario_Props    1647    21.59    9.28
Logan            1760    21.64    8.85
St_James         232     24.72    7.12
Les_Boys         1403    17.99    11.41

Province        TP   FP   Possible   Unsure
New Brunswick   66   27   6          1
Nova Scotia     70   22   5          -
Ontario         53   40   5          2
Quebec          42   52   6          -

Page 14:

Linkage Results - Evaluation

Attribute       ON71    QC71    CAN81   ON_Props   Linked(ON)   Linked(QC)

Gender Distribution
Female          47.46   49.83   49.35   48.63      45.26        43.50
Male            49.69   50.00   50.64   51.33      54.74        56.50

Age
0-15            42.20   41.84   38.68   60.28      40.96        43.24
15-25           20.12   20.72   21.22   9.44       20.70        22.56
25-50           26.42   25.78   27.68   31.35      26.95        23.07
>50             11.26   11.66   12.42   8.93       11.39        11.13

Birthplace
ON (15030)      67.29   0.57    34.04   73.24      66.30        0.48
QC (15081)      2.45    91.71   30.70   2.40       2.57         92.08
ENG (41000)     7.44    1.11    4.02    6.74       10.00        1.37
IRE (41100)     5.48    0.98    2.75    5.84       5.40         0.94
SCO (41400)     9.35    3.17    4.45    7.33       8.57         2.83
GER (45300)     1.23    0.06    0.56    1.12       2.10         0.07
USA (9900)      2.59    1.23    1.77    2.19       3.96         1.72

Marital Status
Married (1)     30.36   30.22   31.78   39.75      29.11        23.13
Widowed (5)     3.21    3.02    3.66    0.86       4.07         3.64
Single (6)      66.43   66.75   64.52   59.39      66.82        73.24

Page 15:

Directions to Improve

• Common patterns in incorrect links

– Big age difference

– Change in marital status for females

– First name change

• Probability estimate score of the classifier
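Both directions amount to filtering proposed links after classification. The sketch below shows one possible form, under several assumptions: a link is dropped if its classifier score is below a cut-off (0.8, 0.85 and 0.9 are the values shown on Page 18) or if it shows any of the three patterns. The age tolerance, the "F" gender code, the record layout and the function names are hypothetical. The marital codes 1 and 6 follow the Page 14 table.

```python
def suspicious_patterns(rec_1871, rec_1881, census_gap=10, max_age_error=2):
    """List the common incorrect-link patterns that a proposed link shows."""
    reasons = []
    if abs(rec_1881["age"] - (rec_1871["age"] + census_gap)) > max_age_error:
        reasons.append("big age difference")
    if rec_1871["gender"] == "F" and rec_1871["marital"] != rec_1881["marital"]:
        reasons.append("marital status change (female)")
    if rec_1871["first_name"] != rec_1881["first_name"]:
        reasons.append("first name change")
    return reasons

def filter_links(links, min_score=0.9):
    """Keep links that reach the score cut-off and show no suspicious pattern."""
    return [
        link for link in links
        if link["score"] >= min_score
        and not suspicious_patterns(link["rec_1871"], link["rec_1881"])
    ]

def person(first_name, age, gender, marital):
    return {"first_name": first_name, "age": age, "gender": gender, "marital": marital}

# Invented links, for illustration only.
links = [
    {"id": "ok", "score": 0.95,
     "rec_1871": person("MARY", 20, "F", 6), "rec_1881": person("MARY", 30, "F", 6)},
    {"id": "low_score", "score": 0.70,
     "rec_1871": person("MARY", 20, "F", 6), "rec_1881": person("MARY", 30, "F", 6)},
    {"id": "age_jump", "score": 0.95,
     "rec_1871": person("JOHN", 20, "M", 6), "rec_1881": person("JOHN", 45, "M", 6)},
    {"id": "married_f", "score": 0.95,
     "rec_1871": person("ANNE", 20, "F", 6), "rec_1881": person("ANNE", 30, "F", 1)},
    {"id": "married_m", "score": 0.95,
     "rec_1871": person("PAUL", 20, "M", 6), "rec_1881": person("PAUL", 30, "M", 1)},
    {"id": "renamed", "score": 0.95,
     "rec_1871": person("JOHN", 20, "M", 6), "rec_1881": person("JAMES", 30, "M", 6)},
]

kept = [link["id"] for link in filter_links(links)]
```

A filter like this lowers the linkage rate as well as the false positives, which is the trade-off the Before/After tables report.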

Page 16:

Results – Common Patterns

Before

Province        Linkage Rate (%)
New Brunswick   24.45
Nova Scotia     21.50
Ontario         18.36
Quebec          17.45

After

Province   Linkage Rate (%)   Diff.
NB         22.24              -2.21
NS         18.72              -2.78
ON         15.68              -2.68
QC         14.82              -2.63

Page 17:

Results – Common Patterns

Before

True Links Set   Total   TP (%)   FP (%)
Ontario_Props    1647    21.59    9.28
Logan            1760    21.64    8.85
St_James         232     24.72    7.12
Les_Boys         1403    17.99    11.41

After

Set    TP (%)   TP Diff.   FP (%)   FP Diff.
O_P    20.48    -1.11      7.32     -1.96
L      20.36    -1.28      7.25     -1.6
St_J   23       -1.72      5.92     -1.2
L_B    16.66    -1.33      10.36    -1.05

Page 18:

Results – Classification Scores

0.8

True Links Set   Total   TP (%)   FP (%)
Logan            1760    19.37    4.86
St_James         232     22.06    3.43
Les_Boys         1403    15.25    5.94

0.85

True Links Set   Total   TP (%)   FP (%)
Logan            1760    18.97    4.61
St_James         232     22.06    3
Les_Boys         1403    14.64    5.31

0.9

True Links Set   Total   TP (%)   FP (%)
Logan            1760    18.125   3.78
St_James         232     21.63    2.4
Les_Boys         1403    13.94    3.97

Page 19:

Conclusions

• Linking people across 1871-1881 Canadian censuses

• Preliminary automated linkage system

• More evaluation and experimentation is needed

Page 20:

Acknowledgements

• University of Guelph

• Ontario Ministry of Research and Innovation

• SHARCNET

• FamilySearch, Church of Latter Day Saints

• Minnesota Population Center

• University of Alberta

• Université de Montréal/PRDH

• Université Laval/CIEQ