
Citizen Information Project

Annex 2: Stakeholder processes, systems and data

2D: Data trial comparative results

Citizen Information Project Final Report: Annex 2D: Data trial comparative results

Version Control

Date of Issue 14th June 2005

Version Number 1.0

Version  Date      Issued by   Status
1.0      14/06/05  PJ Maycock  Final report

Metadata

Coverage          UK
Creator           Office for National Statistics, General Register Office, Citizen Information Project Team
Date Issued       13/6/05
Language          English
Publisher         Office for National Statistics, 1 Drummond Gate, London, SW1V 2QQ
Status            Approved by Project Manager
Subject           Data quality, sharing and processing
Subject.category
Title             Citizen Information Project: Annex 2D: Final report: Data trial comparative results




Contents

1. Preface
2. Related documents
3. Data trial objectives
4. Scope and methodology
5. Coverage profiles
6. Critical quality characteristics
   6.1 Overview
   6.2 Scope and size of datasets
   6.3 Duplicate records
   6.4 Name and address verification
7. Address quality / validity
   7.1 Summary
   7.2 PAF compliance
   7.3 Address matches across postcode demographic
   7.4 Language impacts
   7.5 Foreign addresses
8. Address cleansing
9. Linking to NLPG
   9.1 Overview of NLPG
10. Dataset matching - methodology
11. Matching by date of birth, name and address details
   11.1 Match results
   11.2 Interpretation of results
   11.3 Matches against each stakeholder dataset
   11.4 Identification of duplicate records
   11.5 Family composition
   11.6 Matching results by demographic
   11.7 Influence of address cleansing on overall matching statistics
   11.8 Influence of other datasets on matching
12. Matching by date of birth and name elements
13. Analysis of address changes
14. False matches
   14.1 Family composition - matching by date of birth and names


1. Preface

1.1.1 The Citizen Information Project Final Report recommends the creation of an adult population register that will deliver benefits by sharing basic contact information (name, address, date of birth etc) across the public sector. The report recommends that the development of a population register is implemented as part of the ID Cards Scheme by utilising the National Identity Register (NIR) and that in the interim a range of short term data sharing initiatives are explored further.

2. Related documents

2.1.1 Annex 2: Stakeholder processes, systems and data comprises the following documents:

• Annex 2A: Overview

• Annex 2B: Data quality framework

• Annex 2C: Stakeholder profiles

• Annex 2D: Data trial comparative results: This document

• Annex 2E: Data trial comparative results: Appendices

• Annex 2F: Current data sharing across government

• Annex 2G: Other data quality initiatives

2.1.2 This document provides:

• A summary of the objectives, scope and methodology of the data trial.

• A summary of the comparative coverage, demographics, quality indexes and matching of the sample datasets.

Detailed results of the comparative analysis are given in Annex 2E: Data trial: Appendices. The analysis of each specific dataset is detailed in Annex 2C: Stakeholder profiles and its accompanying appendices.

3. Data trial objectives

3.1.1 The overall objective was to assess the relative and combined quality of basic contact data held within stakeholders’ operational systems. This incorporated looking at the cost effectiveness of cleaning, matching and quality scoring techniques by using samples of stakeholder data; and assessing the implications of applying these techniques to the complete datasets.

3.1.2 To achieve this overall objective, the trial aimed to:




• Provide further understanding of the characteristics and anomalies of identity (e.g. names, date of birth) and contact details held in stakeholders’ operational systems.

• Identify fitness for purpose of records and fields in the individual and merged datasets by determining appropriate quality level indicator(s).

• Obtain a statistical assessment of the matching records between stakeholders’ datasets. This includes the percentage of records that can be automatically matched and those that have a reasonable probability of being matched and may justify manual inspection.

• Develop best practice guidance on data quality and matching.

4. Scope and methodology

4.1.1 Nine demographics were identified as sample areas, each selected by one of three criteria: name, address or date of birth. The total estimated population across the selected demographics was between 20,000 and 90,000, depending on the dataset, and the composition of these demographics was reviewed with the ONS Methodology group.

4.1.2 The sample sets were chosen to ensure that the following demographics would be covered by the trial:

Demographic  Description
s1           Typical dataset by name
s2           Typical dataset by name
s3           Typical suburban dataset by geographic area (postcode and area name)
s4           Covers name issues and address issues on houses that have been converted into flats (postcode)
s5           Covers a rural area in Scotland (postcode)
s6           Covers issues around Welsh names and addresses (postcode and area name)
s7           Covers issues related to high density urban areas and high rise flat blocks
s8           Dataset by specific date of birth
s9           Covers issues around nominated date of birth being 1st January

4.1.3 The Electoral Roll (2003) contains 83% of the 18+ UK population and, apart from datasets maintained in the private sector, is the closest and most representative available dataset to a comprehensive population register against which other datasets can be compared. Demographics based on the same criteria as the other datasets have been applied to the Electoral Roll and the current population for each demographic determined. The relative size of the Electoral Roll vs Census 2001 was used to correlate the date of birth profiles of the sample datasets with the Census 2001 date of birth profile.

4.1.4 A data sharing protocol was produced and reviewed with the Information Commissioner to provide a robust framework for the legal, secure and confidential sharing of personal information for the trial. A fundamental principle was that the trial outputs will be anonymous and mainly statistical and that the data will be destroyed at the end of the trial. The contractor’s data security protocols were audited, inspected and approved by ONS.

4.1.5 The following key stakeholders were identified to participate in the trial by providing sample contact data based on the demographics described above:

• Department for Work and Pensions

• Driver and Vehicle Licensing Agency

• General Register Office

• General Register Office for Scotland

• HM Revenue and Customs

• National Health Service Information Authority

• United Kingdom Passport Service

4.1.6 Legal vires for data sharing with the stakeholders were agreed, with the exception of the DWP and the NHSIA, both of whom were subsequently unable to provide sample data to participate in the trial.

4.1.7 A procurement exercise identified Siemens Business Systems as a specialist contractor with extensive skills in the areas required to perform the technical aspects of the trial in the most economically effective way.

4.1.8 The participating stakeholders were given the same data-extract specification and provided sample data-extracts to the specialist contractor. This covered basic contact details, such as current name, address and date of birth for the population within the selected demographics and, in the case of HM Revenue and Customs, historical names and addresses.

4.1.9 The contractor reported on:

• Detailed analysis of all input data; for addresses this included comparison with electoral register dataset and external address datasets (PAF and NLPG)

• Assessment of address data cleansing possible

• Analysis of data matching between the sample datasets

• Development and application of a data quality index methodology

4.1.10 Subsequently the Atkins Technical Team carried out further analysis of the results and correlation with other information, e.g. database sizes, demographics and census profiles. An Excel model of all de-personalised data and results was created and used to determine and generate:

• Appropriate weightings for each dataset and demographic (s1-s7)

• Comparative profiles for all datasets

• Comparative profiles of demographics within each dataset

• Matched profiles for selected demographics and different match criteria




5. Coverage profiles

5.1.1 Analysis of the sample datasets by demographic and correlation of these results against the same demographics from the Electoral Roll enabled coverage profiles against date of birth to be generated. These highlighted the following issues:

• DVLA dataset demographics s5 Scotland and s7 Birmingham were unrepresentative (4% and 6% of expected population compared with other demographics 79-87%). The most likely explanation is that the extract of the data for demographics s3-7 was substantially based on postcodes and that s5 and s7 comprised postcodes containing a padding character, which invalidated the extract. As a result of this s5 and s7 were excluded from the coverage profiles.

• HMRC data extract included historical names and addresses and the nature of the data structure resulted in additional citizen records being returned for those no longer living within the geographical criteria or currently meeting the name criteria for the demographic. The data provided by HMRC contained sufficient information to identify these additional identities and exclude them from the analysis.

• Where possible the datasets were modified to exclude all citizens known to be deceased to provide comparable results to the Electoral Roll / Census 2001. However, this information was not available for DVLA, UKPS and may not be fully current for HMRC datasets.

• There are significant variations of profile between the demographics of the sample datasets, e.g. the age profile varies significantly between s4 London and s6 Wales. This was expected as demographics s4-s7 were deliberately chosen to reflect atypical situations. Weightings were applied at dataset and demographic level and a sensitivity analysis carried out to ensure the most acceptable correlation between the sample datasets and other information, e.g. census 2001 profile, the same demographics extracted from the Electoral Roll (representing approx 83% of 18+ population), database sizes relative to the current population.

5.1.2 The coverage profiles are based on the following parameters:

• 50% records based on typical name demographics s1 and s2

• 20% records based on typical geographical demographic, s3 Bournemouth

• 20% records based on demographic s4 London

• 10% records based on demographic s6 Wales

• dataset weightings to correlate results with actual database sizes obtained from data suppliers (data quality questionnaire):

    • DVLA: 94%

    • GRO/S: 84%

    • HMRC: 108%

    • UKPS: 96%

    • Electoral Roll / Census 2001: 110%
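
The parameters above can be combined in more than one way, and the report does not describe the exact formula used in the Excel model. The following is a minimal sketch of one possible reading: each demographic group's year-of-birth distribution is normalised, mixed using the 50/20/20/10 shares, and then scaled by the dataset weighting. The function name, the input layout and the treatment of s1 and s2 as one combined group are assumptions made for illustration only.

```python
from collections import defaultdict

# Shares of records per demographic group, taken from 5.1.2.
# ASSUMPTION: s1 and s2 are pooled into a single 50% group.
DEMOGRAPHIC_SHARES = {"s1+s2": 0.50, "s3": 0.20, "s4": 0.20, "s6": 0.10}

# Dataset weightings from 5.1.2, expressed as factors.
DATASET_WEIGHTS = {
    "DVLA": 0.94,
    "GRO/S": 0.84,
    "HMRC": 1.08,
    "UKPS": 0.96,
    "Electoral Roll / Census 2001": 1.10,
}


def weighted_profile(counts, dataset):
    """Return a weighted year-of-birth profile for one dataset.

    counts maps a demographic group to {year of birth: record count}.
    ASSUMPTION: groups are normalised before mixing; the trial's own
    model may have combined them differently.
    """
    profile = defaultdict(float)
    for group, share in DEMOGRAPHIC_SHARES.items():
        by_year = counts.get(group, {})
        total = sum(by_year.values())
        if total == 0:
            continue  # a group with no records contributes nothing
        for year, n in by_year.items():
            profile[year] += share * n / total
    weight = DATASET_WEIGHTS[dataset]
    return {year: weight * value for year, value in sorted(profile.items())}


# Hypothetical sample counts, for illustration only.
sample = {
    "s1+s2": {1950: 30, 1960: 70},
    "s3": {1950: 50, 1960: 50},
    "s4": {1960: 100},
    "s6": {1950: 100},
}
print(weighted_profile(sample, "DVLA"))
```

Normalising each group first means that a demographic with many records cannot dominate the profile beyond its stated share, which seems to be the intent of fixing the shares in advance.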



Figure: Coverage profile by date of birth

Horizontal axis: Year of birth (1900 to 2000)

Series: Census 2001; DVLA (Drivers); GRO + GROS (Births); HMRC (NIRS2); UKPS (PASS)

Annotations:

• Census is the most accurate estimate of the whole population

• HMRC and DVLA include emigrants and exclude children; they do not include children and are greater than Census due to emigrants

• GRO + GRO(S) includes everyone born in Scotland from 1974 and in England and Wales from 1993

• UKPS (PASS) includes new and renewed UK passports since 1998 (60% of total UK passports)

• DVLA only includes those with a driving licence




6. Critical quality characteristics

6.1 Overview

6.1.1 CIP completed a review of the existing data held in key public service systems through a data trial supplemented by a detailed questionnaire. The results are summarised below (detailed results are given within each stakeholder profile).

Dataset                        Citizen records    Estimated duplicates  Name verification  Address verification  Up to date address  Address validity
DfES Student Loans             5m                 < 2%                  High               Initially high > low  Low                 High
DVLA (Drivers)                 40m                0.17%                 High               Nil                   ~ 62%               High
DVLA (Vehicles)                18m #1             -                     Medium             Nil                   90 - 95%            High
DWP (DCI)                      84m                ~ 0.07 as per NIRS2   Medium             Low                   Medium              High
GRO / GROS (Births)            10m                0.66% (GRO)           Not applicable     Nil                   Not updated         Low
HMRC (CID)                     60m                -                     Low                Nil                   Medium              High
HMRC (NIRS2)                   72m                0.07%                 Low                Low                   Medium              High
UKPS (Main)                    70m #2             Passport renewals     High               Low                   ~ 56%               High
UKPS (PASS)                    24m #2             Passport renewals     High               Low                   70% > 56%           High
Identity Cards (Requirements)  40 / 48m (adults)  0%                    High               Low                   90 – 95%            High

Key: Data trial results; Quality questionnaire response; Target

Notes:

#1 Of the 30 million records only 18 million vehicles have individual citizens as owner.

#2 Database is passport centric rather than person centric.




6.2 Scope and size of datasets

Dataset                        Citizen records    Comments
DfES Student Loans             5m
DVLA (Drivers)                 40m                Active drivers, but includes emigrants and some deceased
DVLA (Vehicles)                18m #1             30 million records, approx. 18m individuals’ names (remainder registered with organisations)
DWP (DCI)                      84.5m              47 million live adult records in UK; 1 million live social security benefit recipients living abroad; 15 million deceased records (date of death verified); 1.5 million deceased records (date of death not verified); 5.5 million, abroad not in receipt of benefit; 2 million, inactive but not categorised; and 12.5 million child records.
GRO / GROS (Births)            10m                GRO only available electronically since 1993, GRO(S) since 1974
HMRC (CID)                     60m
HMRC (NIRS2)                   72m                Similar to DCI, 6.5m emigrants, 2m inactive and 15m deceased. No children.
UKPS (Main)                    70m #2             Records relate to passports (duplicate records on renewal)
UKPS (PASS)                    24m #2             As above, only populated since 1998 (60% of all passport holders)
Identity Cards (Requirements)  40 / 48m (adults)  Target of 40m without compulsion, 48m with compulsion

6.3 Duplicate records

Dataset                        Estimated duplicates         Comments
DfES Student Loans             < 2%
DVLA (Drivers)                 0.17%                        Likely to be mainly associated with paper licences
DVLA (Vehicles)                -
DWP (DCI)                      ~ 0.07 as per NIRS2          Based on close similarities with NIRS2
GRO / GROS (Births)            0.66% (GRO)
HMRC (CID)                     No details available to CIP  6.7% of citizen records are duplicated for a limited period with both a temporary and a permanent NINO. This is part of the business process and the use of these temporary NINOs is being phased out.
HMRC (NIRS2)                   0.07%
UKPS (Main)                    Passport renewals            Legitimate duplicates due to renewals
UKPS (PASS)                    Passport renewals
Identity Cards (Requirements)  0%

6.4 Name and address verification

Dataset                        Name verification  Address verification
DfES Student Loans             High               Initially high > low
DVLA (Drivers)                 High               Nil
DVLA (Vehicles)                Medium             Nil
DWP (DCI)                      Medium             Low
GRO / GROS (Births)            Not applicable     Nil
HMRC (CID)                     Low                Nil
HMRC (NIRS2)                   Medium             Low
UKPS (Main)                    High               Low
UKPS (PASS)                    High               Low
Identity Cards (Requirements)  High               Low

Verification – supporting documents or processes that confirm the information, e.g. name verified by presentation of passport.

Validation – checking that value is within range or exists, e.g. checking address against Postcode Address File (PAF).

Name verification

• Critical to many processes

• Striving for Gold standard

Address verification

• Low quality

• Less onerous, fewer critical processes – niche requirement

• Difficult to e-enable




7. Address quality / validity

7.1 Summary

Dataset                        Address validity
DfES Student Loans             High
DVLA (Drivers)                 High
DVLA (Vehicles)                High
DWP (DCI)                      High
GRO / GROS (Births)            Low
HMRC (CID)                     High
HMRC (NIRS2)                   High
UKPS (Main)                    High
UKPS (PASS)                    High
Identity Cards (Requirements)  High

• Generally high quality addresses - effectively 90% (assessed using QAS)

• Automatic address cleansing is limited to marginally improving existing good quality addresses

• Tentative matches – significant numbers can be resolved rapidly by visual inspection

• Application of Unique Property Reference Number Verification (UPRN-NLPG, now NSAI National Spatial Address Infrastructure)

• As NLPG is validated by more LAs and becomes integral with other systems, so data quality will improve

7.2 PAF compliance

7.2.1 QAS was used to assess the percentage of addresses that are compliant with PAF, the results of which are shown below:

Stakeholder    DVLA   GRO    GROS   HMRC   UKPS
% compliant    94.1   57.6   68     89.6   95.1

Percentage of PAF Compliant Addresses by Stakeholder

7.2.2 DVLA (94.1%) and UKPS (95.1%) both achieved approximately 95% compliance with PAF, which is above the 90% matching level at which the Post Office will start offering mailing discounts. However, for the DVLA results only 6% were actually matched as “Verified Correct” as the DVLA generally omits the town name from its address format, which resulted in QAS making an automatic adjustment to the address format and classifying those records as only a “Good Match”.

7.2.3 In the HMRC dataset, 89.6% of addresses complied with PAF, but this was the only dataset to include all historical addresses and the overall score for this dataset suffered from the obsolescent nature of some of its addresses.

7.2.4 GROS data, taken as a whole for births and deaths, reached a compliance percentage of 68%. This lower figure is caused largely by the relatively high number of both tenement addresses in Scottish towns and cities and the number of rural addresses outside of cities.

7.2.5 GRO produced the poorest results, having fewer than 58% of raw data addresses complying automatically with PAF. The GRO result can be attributed to the concatenated address data in its sample, which QAS had difficulty automatically matching to PAF.

7.3 Address matches across postcode demographic

7.3.1 The postcode demographics achieving the highest match rates were s3 (Bournemouth) and s4 (London). The results for s3 were not unexpected, given that this was a typical suburban area with limited scope for problem addresses. The scores for s4 were expected to be lower than those actually recorded due to the number of flat conversions in this area. However, it would appear that the format of flat addresses in s4 did not have a significant impact on QAS’ ability to match addresses.

7.3.2 The s7 (Birmingham) demographic achieved results slightly lower than s3 and s4 and this lower result was primarily caused by the concentration of high rise tower block accommodation in this area, the formatting of which did result in QAS recording lower levels of “Verified Correct” and “Good Full” matches.

7.3.3 The lowest match rates were in s5 (Scotland) and s6 (Wales). The s5 demographic was particularly adversely impacted by the combination of poor rural address formats and the high number of obsolete addresses resulting from a housing estate redevelopment, whilst the poor results for s6 were primarily due to problems with rural address formats only. In fact, the good match percentages for s6 were higher than those for s5 due to the lower predominance of rural addresses.

7.4 Language impacts

7.4.1 Demographic s6 was of a Welsh postcode area which was partly rural. Possible issues with the use of Welsh language names had been predicted but, apart from a few records where Welsh names had been spelt incorrectly, the use of Welsh in address raw data was not a major factor hindering the overall matching process.

7.5 Foreign addresses

7.5.1 The level of “Foreign Address” matches was low with figures at around or below 0.1% for all but the HMRC dataset. “Foreign Address” matches for HMRC were actually reduced because many were given a match type of “Unmatched” with particular issues around addresses having the country name of Ireland, which was not recognised, instead of Eire or the Irish Republic.

7.5.2 Overall, foreign addresses only accounted for 0.3% of total addresses and did not have a material impact on address match rates.




8. Address cleansing

8.1.1 The automatic improvement to addresses can only be confidently applied to matches that already qualify as good or better (i.e. “Verified Correct” and “Good Full” matches). To maximise the quality of address data and increase the overall figures for PAF compliance, manual matching will be necessary. The match type groupings produced by QAS Batch confer confidence levels on the matches it provides and separate analysis has shown that addresses with match types of “Tentative” and “Partial” offer considerable potential for increasing the overall number of address matches through a separate exercise of manual matching. Whilst some of this manual matching can be accomplished quite easily (less than one minute per record), it has not been possible to accurately assess the total effort required to undertake a complete manual review of all records which QAS has not classified as a “Verified Correct” and “Good Full” match.
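
The routing described in 8.1.1 can be sketched as a simple triage on the match type label. This is only an illustration: the match type strings are those quoted in this report, but the function names and the three route labels are hypothetical, and no QAS Batch API is used or implied.

```python
from collections import Counter

# Match types the report treats as safe for automatic improvement.
AUTO_CLEANSE = {"Verified Correct", "Good Full"}

# Match types the report identifies as promising for manual matching.
MANUAL_CANDIDATES = {"Tentative", "Partial"}


def route_address(match_type):
    """Map a match type label to a (hypothetical) processing route."""
    if match_type in AUTO_CLEANSE:
        return "auto_cleanse"
    if match_type in MANUAL_CANDIDATES:
        return "manual_review"
    # Everything else (e.g. "Unmatched", "Foreign Address") is left as is.
    return "no_action"


def summarise_routes(match_types):
    """Count how many records fall into each route."""
    return Counter(route_address(m) for m in match_types)


# Hypothetical batch of match type labels, for illustration only.
batch = ["Verified Correct", "Good Full", "Tentative", "Partial", "Unmatched"]
print(summarise_routes(batch))
```

Counting the "manual_review" route gives the size of the manual exercise, although, as the paragraph notes, the effort per record could not be reliably estimated.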

9. Linking to NLPG

9.1 Overview of NLPG

9.1.1 The National Land and Property Gazetteer (NLPG) is a single, comprehensive list of addresses that was initially generated from Valuation Office records. The validation and maintenance of these addresses has been devolved to each Local Authority, who maintain a Local Land and Property Gazetteer, which is synchronised with the NLPG. All the data is held in a common format and each property is assigned a unique property reference number (UPRN) and geographical grid references. These co-ordinates allow individual properties to be accurately identified within ad hoc boundaries (e.g. Primary Care Trust catchment areas, and areas defined for Neighbourhood Statistics) using geographical information systems and enable dwellings in remote areas to be accurately located where one postcode might cover a very wide area.

Difficulties in obtaining NLPG data

9.1.2 Obtaining access to the NLPG dataset for use on the CIP trial proved extremely problematic. This was primarily due to the difficulties Siemens encountered in obtaining the necessary approvals for the release of this data: licensing issues meant that it was not possible to obtain the complete national NLPG dataset, and the local authorities whose demographic area was covered in the trial were reluctant to release such data to the CIP trial. As a result, further delays were encountered and the local authority datasets that were eventually delivered to Siemens and could be used on the trial were restricted to the following:

• Wandsworth

• Bournemouth

• Poole

• Pembrokeshire




9.1.3 A major learning point to be carried forward for any similar exercises requiring access to NLPG data in the future is that careful consideration may have to be given to how best to gain access to such data. A separate lobbying process may be required to win the support and cooperation of local authorities and other relevant Government agencies to facilitate the willing release of data by these bodies in a timely manner. It is hoped that the launch of the NSAI (National Spatial Address Infrastructure), which seeks to integrate NLPG, Royal Mail and OS address data, will provide impetus to LAs validating and using a single address register and the adoption of the UPRN.

Objectives

9.1.4 The sample datasets were matched against the NLPG data using i/Lytics to identify the level of address matching possible to enable the allocation of Unique Property Reference Numbers (UPRN) and compared with similar matching using QAS to establish if NLPG data might be used to improve the quality of addresses (completeness, consistency, format and validity).

Results

9.1.5 Due to the limited number of available datasets the NLPG data used in the trial only covered the s3, s4 and s6 samples, and results were limited to these demographics. Consequently, the results did not include any matches with the GROS dataset.

9.1.6 The actual matching levels obtained, as a percentage of s3, s4 and s6 data, are shown below.


DVLA     GRO      HMRC     UKPS
68.76%   33.38%   69.69%   67.62%

Percentage of address records in s3, s4 and s6

9.1.7 Relative to QAS matching levels, NLPG matched between 70% and 80% as many addresses in demographics s3, s4 and s6. This could be partly due to NLPG data not having identical boundaries to postcode areas and some of the demographics falling outside the NLPG area.


DVLA     GRO      HMRC     UKPS
72.56%   78.53%   74.79%   70.95%

NLPG matches as % QAS matches for s3, s4 and s6

9.1.8 Stakeholder addresses matched NLPG data in broadly the same proportions as they were matched by QAS, with the single address field format of the GRO data achieving considerably fewer matches.

9.1.9 The conclusion from the partial NLPG matching is that QAS gives levels of address matching approximately 25% higher. However, these figures should be treated with some caution due to the limited scope of the NLPG analysis resulting from the limited amount of NLPG data made available to the trial.
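
As a worked check on the relationship between the two sets of figures in 9.1.6 and 9.1.7, the QAS match rate implied for each column can be recovered by dividing the NLPG match rate by the "NLPG as % of QAS" ratio. The sketch below assumes that the four unlabelled columns are, in order, DVLA, GRO, HMRC and UKPS (GROS being excluded by 9.1.5); that ordering is an inference, not something the tables state, and the function names are illustrative.

```python
# NLPG match rates (% of address records), from 9.1.6.
# ASSUMPTION: the column order is DVLA, GRO, HMRC, UKPS.
NLPG_MATCH_PCT = {"DVLA": 68.76, "GRO": 33.38, "HMRC": 69.69, "UKPS": 67.62}

# NLPG matches as a percentage of QAS matches, from 9.1.7 (same order).
NLPG_AS_PCT_OF_QAS = {"DVLA": 72.56, "GRO": 78.53, "HMRC": 74.79, "UKPS": 70.95}


def implied_qas_pct(nlpg_pct, nlpg_as_pct_of_qas):
    """QAS match rate implied by an NLPG rate and the NLPG/QAS ratio."""
    return 100.0 * nlpg_pct / nlpg_as_pct_of_qas


def qas_advantage_points(stakeholder):
    """Percentage-point gap between the implied QAS rate and the NLPG rate."""
    nlpg = NLPG_MATCH_PCT[stakeholder]
    qas = implied_qas_pct(nlpg, NLPG_AS_PCT_OF_QAS[stakeholder])
    return qas - nlpg


for name in NLPG_MATCH_PCT:
    qas = implied_qas_pct(NLPG_MATCH_PCT[name], NLPG_AS_PCT_OF_QAS[name])
    print(f"{name}: implied QAS {qas:.1f}%, gap {qas_advantage_points(name):.1f} points")
```

Under this column assumption the gap is roughly 23 to 28 percentage points for three of the columns and about 9 points for the second, which is broadly in line with the "approximately 25% higher" conclusion if that figure is read as percentage points.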




9.1.10 We recommend that the use of NLPG data (or the subsequent National Address Infrastructure) and the allocation of a UPRN to all citizen addresses should be pursued, as this will yield significant benefit when sharing data and will limit the manual matching effort to the initial allocation.

9.1.11 Currently 81% of Local Authorities in England, Scotland and Wales have validated their LLPG data and 55% of LAs are actively maintaining this data. Assuming that this initiative continues across all LAs and that LAs, as they adopt CRM solutions, will use their LLPG data across all their applications, then the quality of this data will significantly improve and achieve a level similar to PAF.

10. Dataset matching - methodology

10.1.1 The raw datasets (175,000 records) were rationalised into a common format and where alternative or historical names and addresses existed these were converted into 145,000 additional records (i.e. a record was created for each combination of name and address in the original record).
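
The expansion described in 10.1.1, one record per combination of name and address, can be sketched with a Cartesian product. The record layout and field names below are hypothetical; the report does not specify the common format.

```python
from itertools import product


def expand_record(record):
    """Create one flat record per (name, address) combination.

    The original unique id is carried on every generated record so that
    the permutations can be re-combined after matching.
    """
    return [
        {"id": record["id"], "dob": record["dob"], "name": name, "address": address}
        for name, address in product(record["names"], record["addresses"])
    ]


# Hypothetical raw record with a historical name and two past addresses.
raw = {
    "id": "X1",
    "dob": "1960-05-17",
    "names": ["Ann Jones", "Ann Smith"],
    "addresses": ["1 High Street", "2 Low Road", "3 Mill Lane"],
}

expanded = expand_record(raw)
print(len(expanded))      # total records for this person
print(len(expanded) - 1)  # additional records beyond the original
```

A record with a single name and a single address expands to itself, so only records carrying alternative or historical details add to the total.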

10.1.2 All datasets were then matched using the i/Lytics tool using the ranking criteria described in Appendix 3.10 which utilised all the primary data items (including date of birth, names, and addresses).

10.1.3 The i/Lytics system sorts and compares all the records using exact and fuzzy matching and utilises heuristic rules related to abbreviations and permutations of name and address elements. Groups with similar records, called “families”, are created and the record with the most complete information is identified as “parent” and all the other records in the family termed “members”. Each member is compared against the parent and the type of similarity between the parent and each record is termed the “rank” of the match and is a complex combination of matching rules associated with each data item. For more details refer to Appendix 3.10. “Automatic” ranks are those where the similarity between two records is high enough that the records can be considered duplicates without any further analysis or inspection. An initial automatic match rate of 25% was achieved with exact matching and subsequently enhanced to 49% with the inclusion of fuzzy matching and optimisation of the ranks, yielding satisfactory results and very low probabilities of false matching.
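
To make the parent, member and rank vocabulary concrete, here is a minimal sketch using Python's standard `difflib`. It is not the i/Lytics algorithm: the actual ranking criteria are in Appendix 3.10 and are not reproduced here. The normalisation rule, the 0.85 threshold, the rank labels and the field names are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

FIELDS = ("dob", "name", "address")


def normalise(text):
    """Upper-case, drop commas and collapse whitespace."""
    return " ".join(text.upper().replace(",", " ").split())


def similarity(a, b):
    """Similarity between two strings in the range 0.0 to 1.0."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def choose_parent(family):
    """Pick the record with the most populated fields as the parent."""
    return max(family, key=lambda r: sum(1 for f in FIELDS if r.get(f)))


def rank(parent, member, fuzzy_threshold=0.85):
    """Classify a member against the parent (hypothetical rank labels)."""
    if parent.get("dob") != member.get("dob"):
        return "no match"
    scores = [
        similarity(parent.get(f, ""), member.get(f, ""))
        for f in ("name", "address")
    ]
    if all(s == 1.0 for s in scores):
        return "exact"
    if all(s >= fuzzy_threshold for s in scores):
        return "fuzzy"
    return "manual review"


# Hypothetical records, for illustration only.
a = {"dob": "1960-05-17", "name": "John Smith", "address": "1 High Street, Leeds"}
b = {"dob": "1960-05-17", "name": "JOHN SMITH", "address": "1 High Street Leeds"}
c = {"dob": "1960-05-17", "name": "Jon Smith", "address": "1 High Street, Leeds"}
print(rank(a, b), rank(a, c))
```

In this sketch "exact" and "fuzzy" would both count as automatic ranks; where the real tool draws that line depends on the optimised rank definitions referred to above.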

10.1.4 Each family group is then de-duplicated using the unique id allocated to the raw records, i.e. this re-combines permutations of name and address, but ensures that matching has been achieved utilising all these permutations. Family groups may then be classified as:

• Parent records with no children: Original records do not match any others

• Parent records with children from different datasets: legitimate matches

• Parent records with more than one child from the same dataset: Potential duplicate records, i.e. the same date of birth, name and address but with different stakeholder id (NINO, licence, etc).
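
The de-duplication and classification steps in 10.1.4 can be sketched as follows. Family members are represented here as (dataset, stakeholder id) pairs, with the parent included; that representation and the label strings are assumptions for illustration. Where a family contains both cross-dataset matches and a repeated dataset, this sketch reports it as a potential duplicate.

```python
def classify_family(members):
    """Classify a family from its (dataset, stakeholder_id) pairs.

    Permutations of the same raw record share a unique id, so building
    a set re-combines them before classification.
    """
    unique = set(members)
    if len(unique) == 1:
        return "unmatched"
    datasets = [dataset for dataset, _ in unique]
    if len(set(datasets)) < len(datasets):
        # Two different ids from the same dataset share one identity.
        return "potential duplicate"
    return "cross-dataset match"


# Hypothetical families, for illustration only.
print(classify_family([("DVLA", "D1"), ("DVLA", "D1")]))
print(classify_family([("DVLA", "D1"), ("UKPS", "P1")]))
print(classify_family([("DVLA", "D1"), ("DVLA", "D2"), ("HMRC", "N1")]))
```

For a passport-centric dataset such as UKPS, a "potential duplicate" may simply reflect a renewal, so the label is a prompt for inspection rather than a finding.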




10.1.5 The members within each family are then analysed and a “family composition” report generated identifying the combinations of matching. These results are aggregated to give match rates for all combinations of datasets.

10.1.6 In addition to matching on all primary fields, the process has been repeated using more relaxed matching criteria:

• date of birth + names

• date of birth + surname

10.1.7 From these additional matches the following can be derived:

• extent of identities (i.e. matching on date of birth and names) with different addresses

• extent of missed identity matches – by broadening the criteria to date of birth and surname

• some indication of false matches by inspecting the occurrences within each match group (family composition)

• ‘no match’ records – those that will never match, e.g. citizen with only a driving licence and no passport. There are a limited number of scenarios not considered, e.g. change of name due to marriage / divorce, but these are not likely to be significant (i.e. the number of marriages / divorces in a year is small relative to the total population).
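
The three levels of matching criteria (all primary fields, date of birth plus names, date of birth plus surname) can be illustrated by finding the strictest level at which two records agree. The sketch below uses exact comparison of normalised keys only, with no fuzzy matching; the level names and field names are hypothetical.

```python
# Match levels from strictest to most relaxed.
MATCH_LEVELS = (
    ("full", ("dob", "forename", "surname", "address")),
    ("dob+names", ("dob", "forename", "surname")),
    ("dob+surname", ("dob", "surname")),
)


def match_key(record, fields):
    """Build a comparison key from the chosen fields."""
    return tuple(str(record[f]).strip().upper() for f in fields)


def strictest_level(a, b):
    """Return the strictest level at which two records agree, or None."""
    for level, fields in MATCH_LEVELS:
        if match_key(a, fields) == match_key(b, fields):
            return level
    return None


# Hypothetical records, for illustration only.
r1 = {"dob": "1970-01-01", "forename": "John", "surname": "Smith", "address": "1 High St"}
r2 = {"dob": "1970-01-01", "forename": "John", "surname": "Smith", "address": "9 Park Rd"}
r3 = {"dob": "1970-01-01", "forename": "Jon", "surname": "Smith", "address": "9 Park Rd"}
r4 = {"dob": "1980-02-02", "forename": "John", "surname": "Smith", "address": "1 High St"}

print(strictest_level(r1, r2))  # same identity, different address
print(strictest_level(r1, r3))  # agrees only on date of birth and surname
print(strictest_level(r1, r4))  # no agreement at any level
```

Pairs that agree only at "dob+names" are candidates for address changes, while pairs that agree only at "dob+surname" mix missed identity matches with likely false matches, which is why they need inspection.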

11. Matching by date of birth, name and address details

11.1 Match results11.1.1 Number of stakeholder records matched as a percentage of all stakeholder

records (considering all datasets and demographics, without any weightings)

Stakeholder         All        DVLA     GRO      GROS     HMRC     UKPS     Births       Deaths
                    datasets            (B+D)    (B+D)                      (GRO+GROS)   (GRO+GROS)
All records         175,268    39,004   12,969   5,187    93,580   24,528   11,428       6,728
Matched records     84,646     27,123   4,371    902      34,152   18,098   2,226        3,047
                    (48%)      (69%)    (33%)    (17%)    (36%)    (73%)    (19%)        (45%)
DVLA                           13       981      74       25550    9114     38           1017
GRO (B+D)                               57       0        2223     1867
GROS (B+D)                                       27       695      194
HMRC                                                      34       14358    210          2708
UKPS                                                               98       1952         109
Births (GRO+GROS)                                                           60


11.1.2 In the above table a record refers to a person with a unique id (e.g. NINO, licence no., etc), except in the case of UKPS where it refers to a passport no., which changes on renewal.
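As a worked example of how the pairwise percentages relate to the table, the sketch below divides a cross-match count by the total records of each dataset. It assumes that the DVLA row of the table reads DVLA/HMRC = 25,550 and DVLA/UKPS = 9,114, and that the HMRC row reads HMRC/UKPS = 14,358; this is an interpretation of the table layout, and the function name is hypothetical.

```python
def overlap_percentages(matched, total_a, total_b):
    """Share of dataset A and of dataset B found in the other, in whole percent."""
    return round(100 * matched / total_a), round(100 * matched / total_b)


TOTALS = {"DVLA": 39004, "HMRC": 93580, "UKPS": 24528}

# Cross-match counts as read from the table (see the assumption above).
dvla_ukps = overlap_percentages(9114, TOTALS["DVLA"], TOTALS["UKPS"])
dvla_hmrc = overlap_percentages(25550, TOTALS["DVLA"], TOTALS["HMRC"])
ukps_hmrc = overlap_percentages(14358, TOTALS["UKPS"], TOTALS["HMRC"])

print("DVLA / UKPS:", dvla_ukps)  # (23, 37)
print("DVLA / HMRC:", dvla_hmrc)  # (66, 27)
print("UKPS / HMRC:", ukps_hmrc)  # (59, 15)
```

These values agree with the 23% of DVLA records and 37% of UKPS (PASS) records reported as automatically matching each other.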

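[Figure: pairwise overlap diagrams of matches between stakeholder datasets.
HMRC and UKPS: 59% of UKPS records match HMRC (41% do not); 15% of HMRC records match UKPS (85% do not).
HMRC and DVLA: 66% of DVLA records match HMRC (34% do not); 27% of HMRC records match DVLA (73% do not).
UKPS and DVLA: 23% of DVLA records match UKPS (77% do not); 37% of UKPS records match DVLA (63% do not).]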

11.2 Interpretation of results

11.2.1 It is important to recognise that the match percentages reported are more heavily influenced by the nature of the datasets than by the efficacy of the matching process. For example, when UKPS is matched against DVLA, children in the UKPS dataset can never be matched to DVLA data, which applies only to over 16s.

11.2.2 The following match profiles were based on the same dataset and demographic weightings used to analyse comparative coverage. The match percentages between all unweighted datasets and demographics are not significantly different from the following.

[Figure: Matching between DVLA (Drivers) and UKPS (PASS) datasets. Automatic matches: number of records in the sample dataset (0 to 350) by year of birth (1900 to 2000), with series for Census 2001, grey match (dob, surname) and full match (dob, name, address). Annotations: 37% of UKPS (PASS) records automatically match DVLA records; 23% of DVLA records automatically match UKPS (PASS) records; no matches, i.e. passport holders (in PASS) without a driver's licence; no matches, i.e. drivers without records in the PASS database.]

11.3 Matches against each stakeholder dataset

[Chart: Unique IDs in Input and Unique IDs in Match Families (0 to 100,000) for each stakeholder: DVLA, GRO (BIRTHS+DEATHS), GROS (BIRTHS+DEATHS), IR, UKPS.]

11.3.1 The above graph shows the numbers of input records and match records per stakeholder and gives an indication of the percentage match rate of each stakeholder against all records. These figures are discussed below.

11.3.2 As can be seen, the matching levels within the merged dataset revealed a sizeable disparity between stakeholders, with far higher percentage match rates from DVLA and UKPS of 69.54% and 73.39% respectively. Matching levels for birth and death records were substantially lower, whilst HMRC records, although having more records matched than any other stakeholder, only matched 36.49% of distinct records.




11.3.3 The disparity of matching levels between stakeholders can be attributed to a number of identifiable factors specific to one or more datasets or demographics, as listed below.

Dataset: DVLA
Factors with a positive influence on matching rates:
• High level of PAF compliant addresses
• Current and updated data
Factors with a negative influence on matching rates:
• Not all citizens have a driving licence
• Not applicable to under 16s

Dataset: UKPS
Factors with a positive influence on matching rates:
• High level of PAF compliant addresses
• No data over six years old, i.e. prior to 1998
Factors with a negative influence on matching rates:
• Not all citizens have a passport
• Only 60% of passport holders on this database

Dataset: HMRC
Factors with a positive influence on matching rates:
• Large coverage
Factors with a negative influence on matching rates:
• Not applicable to under 16s
• Temporary residents working in the UK
• Older data now obsolete, e.g. deaths predated other stakeholder data
• Older data now outside of sampled demographics, e.g. person living in s3 and moving before creation of other stakeholders’ datasets

Dataset: GRO
Factors with a negative influence on matching rates:
• Persons born before 1993 not in dataset
• Poor PAF compliance due to concatenation of address elements
• Birth data on children too young to appear in other data

Dataset: GROS
Factors with a negative influence on matching rates:
• Persons born before 1973 not in dataset
• Low PAF compliance due to more complex nature of Scottish addresses
• Low numbers of people in the Scottish postcode s5 demographic in other stakeholders
• Birth data on children too young to appear in other data

11.4 Identification of duplicate records

11.4.1 The number of matched family records per dataset is shown in the chart below, with a count showing the number of matches within a dataset. For example, there are 65 matches of identity within the GRO dataset and 1 example of four UKPS records in the same match family.

11.4.2 From inspection, all these records are duplicate records, i.e. a person having more than one unique id within a dataset. The exception is UKPS, where a record relates to a passport rather than a citizen and multiple records indicate passport renewals.
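The chart figures for 11.4.1 can be cross-checked against the matched record totals in 11.1.1. The sketch below assumes each chart value is the number of match families containing 1, 2, 3 or 4 records from the dataset in question; that reading is an interpretation of the chart, and the function name is hypothetical.

```python
def records_in_families(families_by_multiplicity):
    """Total records, given {records per family: number of such families}."""
    return sum(k * n for k, n in families_by_multiplicity.items())


# Chart values, read as {records from the dataset in one family: families}.
ukps = {1: 17072, 2: 493, 3: 12, 4: 1}
dvla = {1: 26929, 2: 97}
gro = {1: 4241, 2: 65}

print(records_in_families(ukps))  # 18098, the UKPS matched records total
print(records_in_families(dvla))  # 27123, the DVLA matched records total
print(records_in_families(gro))   # 4371, the GRO (B+D) matched records total
```

Under this reading the 65 families with two GRO records correspond to the 65 matches of identity quoted above, and the single family with four UKPS records is the example of repeated passport renewals.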

[Chart: number of match families per dataset (0 to 35,000), by the count (1 to 4) of records from that dataset within the family. Chart labels: Prevalence, Count.]

Dataset                  1        2      3     4
DVLA                     26,929   97
IR                       34,012   70
UKPS                     17,072   493    12    1
GRO (BIRTHS+DEATHS)      4,241    65
GROS (BIRTHS+DEATHS)     846      28
BIRTHS (GRO+GROS)        2,160    33
DEATHS (GRO+GROS)        3,047

11.5 Family composition

11.5.1 The following graph identifies the matches between different datasets





[Chart: The composition of families by stakeholders. Number of families (0 to 18,000), split by family size (2 to 6), for each combination of datasets: DVLA Only; GRO Only; GROS Only; IR Only; UKPS Only; DVLA, GRO; DVLA, GROS; DVLA, IR; DVLA, UKPS; GRO, IR; GRO, UKPS; GROS, IR; GROS, UKPS; IR, UKPS; DVLA, GRO, IR; DVLA, GRO, UKPS; DVLA, GROS, IR; DVLA, IR, UKPS; GRO, IR, UKPS; GROS, IR, UKPS; DVLA, GROS, IR, UKPS; DVLA, GRO, IR, UKPS; DVLA, BIRTHS; DVLA, DEATHS; BIRTHS, DEATHS; BIRTHS, IR; BIRTHS, UKPS; DEATHS, IR; DVLA, BIRTHS, IR; DVLA, DEATHS, IR; DVLA, DEATHS, UKPS; BIRTHS, DEATHS, IR; BIRTHS, IR, UKPS; DEATHS, IR, UKPS; DVLA, DEATHS, IR, UKPS; DVLA, BIRTHS, IR, UKPS.]

11.5.2 The above confirms that when matching on all primary fields, the occurrence of ‘false matches’ is negligible and due solely to duplicate identities.


11.6 Matching results by demographic

11.6.1 The matching results obtained by demographic split are shown below:

[Chart: Unique IDs in Input and Unique IDs in Match Families (0 to 60,000) for each demographic, s1 to s9.]




11.6.2 Across demographics the match level percentages for unique id records are shown in the table below:

Demographic                       Total      Matched   Match %
All datasets and demographics     175,268    84,646    48%
s1 Surname beginning “XXX”        12,162     6,959     57%
s2 Surname beginning “YYY”        9,579      5,989     63%
s3 Bournemouth                    32,087     18,220    57%
s4 London                         50,482     23,076    46%
s5 Scotland                       14,589     3,816     26%
s6 Wales                          12,692     8,699     69%
s7 Birmingham                     26,908     7,450     28%
s8 DOB - Random                   6,471      4,200     65%
s9 DOB - 1/1 from mid 70s         10,298     6,237     61%

11.6.3 The results for s5 and s7 reflect the very low numbers of records retrieved from the DVLA drivers database for those demographics and should be disregarded.

11.6.4 The consistency of data over time is also reflected in matching levels. The date of birth demographics s8 and s9, based on data that should never change, show a greater matching percentage, with s8 levels higher than s9. This is possibly due to dates of birth given as the first of January not being consistently used elsewhere. The s1 and s2 demographics are based on fairly consistent name data, but changes in surname will reduce the number of matches. Address data for an individual can change often, which reduces matching levels. Areas such as s3 and s6, which could be expected to have a more static population, show much better matching levels.

11.6.5 For example, date of birth demographics s8 and s9 show a higher matching percentage than any other demographic type. This may be due to the date of birth being static through a person’s lifetime, whereas address data, and even name data, can be prone to change. There is a higher percentage of s8 records matched than s9, which may indicate that birth dates of 1st of January are often guessed or approximated and are not used consistently by people.

11.7 Influence of address cleansing on overall matching statistics

11.7.1 QAS address cleansing had minimal effect on increasing matches. Removing address data from the matching criteria increased matches by just over 5 percentage points, from 48.29% to 53.41%, indicating that address quality was not hugely significant in securing matches, owing to the overall good quality of addresses in the DVLA and UKPS datasets.

11.8 Influence of other datasets on matching

11.8.1 The matching of all the datasets by date of birth, names and addresses was repeated with CACI Enhanced Electoral Roll data included. This resulted in an increased match rate of 7% for the HMRC dataset, 3% for DVLA, 1.5% for UKPS and nominal effect on GRO / GROS.

12. Matching by date of birth and name elements

12.1.1 Relaxation of the matching criteria to exclude address details results in almost 10% more matches than previously. However, some measure of the false matches occurring may be derived from the family composition diagram, where there is a small increase in the occurrences of families with more members than should be expected. For example, where matching occurs between DVLA, UKPS and IR (three datasets, so an expected family size of 3), 6 occurrences of a family of size 4 indicate 6 x (4-3) = 6 false records.
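The arithmetic in this example generalises to any dataset combination: every member beyond one per dataset is counted as a false record. The sketch below is a minimal illustration under that assumption; the function name and input shape are hypothetical.

```python
def excess_records(n_datasets, families_by_size):
    """Count members beyond one per dataset.

    `families_by_size` maps family size to the number of families of that
    size observed for a combination of `n_datasets` datasets.
    """
    return sum(
        count * (size - n_datasets)
        for size, count in families_by_size.items()
        if size > n_datasets
    )


# DVLA, UKPS and IR: three datasets, with 6 families of size 4.
print(excess_records(3, {3: 100, 4: 6}))  # 6 x (4 - 3) = 6
```

The 100 families of size 3 in the usage line are a made-up figure included only to show that families of the expected size contribute nothing to the count.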

12.1.2 By further relaxing the criteria to just date of birth and surname, the increase in matches will include any missed matches in the previous analyses, but there will be more false matches. This gives an indication of the grey area of matching for this sample size, i.e. the difference between the records that conclusively match (based on extensive criteria) and people that are unlikely to ever match (dob and surname are unique), e.g. they only have a driving licence and no passport.

12.1.3 Matching by date of birth and names and matching by date of birth and just surname showed only a small difference in results. This is likely to be due to the small size of the data samples.

12.1.4 This result cannot be directly extrapolated to a large dataset: there may well be only one Smith born on a specific day in a dataset of 100 members, but there will be a number of Smiths born on that day in a dataset of 10 million. However, from analysis of surname and date of birth statistics it is known that within the UK population 90% of people have a unique combination of date of birth and surname. Thus the extrapolated ‘no match’ result cannot fall below 90% of the extrapolated value.
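The effect of scale on uniqueness can be illustrated with a simple probability sketch. It is not the analysis behind the 90% figure: it assumes dates of birth are spread uniformly over a fixed number of possible dates, and the numbers of people and dates used are hypothetical.

```python
def prob_unique_dob(n_same_surname, n_birth_dates):
    """Probability that one person shares a date of birth with nobody else
    of the same surname, assuming uniformly distributed dates of birth."""
    return (1 - 1 / n_birth_dates) ** (n_same_surname - 1)


# Hypothetical: about 80 years of possible dates of birth.
BIRTH_DATES = 365 * 80

for n in (2, 100, 10_000, 500_000):
    p = prob_unique_dob(n, BIRTH_DATES)
    print(f"{n:>7} people with the surname: P(unique dob) = {p:.3f}")
```

With only a few bearers of a surname in the sample the combination is almost certainly unique, whereas for a very common surname in a population-sized dataset it is almost certainly not, which is why sample match rates on date of birth and surname do not scale directly.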

12.1.5 This enables a matching percentage to be derived which is related only to the efficacy of the match and is not skewed by members who will never match.

13. Analysis of address changes

The following results were obtained for the limited and weighted demographics / datasets used in the coverage profiling:




                                  DVLA              IR                UKPS
                                  No       %        No       %        No       %        Adjusted
All records                       11,941            16,036            6,560
Dob + Name + Address
  DVLA                                              8,266    51.5%    3,164    48.2%    80.4%
  IR                              8,266    69.2%             0.0%     3,864    58.9%    98.2%
  UKPS                            3,164    26.5%    3,864    24.1%
  GRO
Dob + Surname
  DVLA                                              9,443    58.9%    3,464    52.8%    88.0%
  IR                              9,443    79.1%                      4,244    64.7%    107.8%
  UKPS                            3,464    29.0%    4,244    26.5%
  GRO
People with different addresses
  DVLA                                              1,178    7.3%     300      4.6%     7.6%
  IR                              1,178    9.9%                       380      5.8%     9.7%
  UKPS                            300      2.5%     380      2.4%
As a % of matched addresses
  DVLA                                                       12.5%             8.7%     14.4%
  IR                                       12.5%                               9.0%     14.9%
  UKPS                                     8.7%              9.0%
  UKPS adjusted                            14.4%             14.9%
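The derived rows of the table can be reproduced from the match counts. In the sketch below the number of people with different addresses is taken as the difference between the relaxed and full match counts. The Adjusted column is not defined in this section; its values are consistent with dividing the UKPS percentage by 0.6, which would correspond to the note in 11.3.3 that only 60% of passport holders are on the database. That coverage factor is therefore an assumption, as are the function and parameter names.

```python
def address_change(full_matches, relaxed_matches, coverage=1.0):
    """Return (records with different addresses, share of relaxed matches,
    share adjusted for partial coverage of the dataset)."""
    different = relaxed_matches - full_matches
    share = different / relaxed_matches
    return different, share, share / coverage


# DVLA against UKPS: 3,164 full matches, 3,464 relaxed matches.
different, share, adjusted = address_change(3164, 3464, coverage=0.6)
print(different, f"{share:.1%}", f"{adjusted:.1%}")  # 300 8.7% 14.4%

# IR against UKPS: 3,864 full matches, 4,244 relaxed matches.
ir_different, ir_share, ir_adjusted = address_change(3864, 4244, coverage=0.6)
print(ir_different, f"{ir_share:.1%}", f"{ir_adjusted:.1%}")  # 380 9.0% 14.9%
```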

Results give the number of different addresses as between 9% and 15% of matched records; these represent the records shown in the diagram below. The unknown remains the number of records where both databases hold out of date addresses.

[Diagram: overlap of the DVLA (Drivers) and UKPS datasets by address currency, shown against year of birth (1900 to 2000) for each dataset. Regions: drivers with a current address; drivers with old address; passport holders with a current address; passport holders with old address; drivers and passport holders with current address in both databases; drivers and passport holders with old address in both databases; drivers and passport holders with old and current address, i.e. old address (DVLA) and current address (UKPS), or old address (UKPS) and current address (DVLA).]




14. False matches

Inspection of the family composition results shows an increased number of false matches, as expected.

                                 Match by DoB +   Matches by DoB +   Matches by DoB +
                                 surname          name (fuzzy)       surname
DVLA   Matches + false matches   46%              43%                38%
       No matches                54%              57%                62%
UKPS   Matches + false matches   29%              27%                24%
       No matches                71%              73%                76%

The combination of false and missed matches as the match criteria is relaxed is illustrated in the following diagram:




[Diagram: worked example of correct, missed and false matches as the match criteria are relaxed. Each example record shows name, address, date of birth, ID no. and stakeholder:

• James Doe, 20 High St, 01/01/1950, ID 1000, UKPS
• James Doe, 20 High St, 01/01/1950, ID 1000, DVLA
• John Doe, 12 Bridge St, 01/01/1950, ID 1001, DVLA
• John Doe, 20 High St, 01/01/1950, ID 1001, UKPS
• Sue Doe, 12 Bridge St, 01/01/1950, ID 1002, DVLA
• Susan Doe, 12 Bridge St, 01/01/1950, ID 1002, UKPS
• Ann Doe, 10 Kings Rd, 01/01/1950, ID 1003, UKPS
• John Jones, 80 Main St, 01/01/1950, ID 1004, DVLA

Matching by dob + name + address: automatic match 1; no matches 1-6. Outcomes per record: 4 correct, 4 missed.
Matching by dob + name: automatic matches 1 and 2; no matches 1-4. Outcomes per record: 6 correct, 2 missed.
Matching by dob + surname: automatic match 1; no match 1. Outcomes per record: 3 correct, 5 false matches.

Annotations:
• % of population with full match and who hold a passport and drivers licence, i.e. match criteria is so strict that no false matches exist
• Proportion remains unchanged as sample is scaled up
• Grey matches: ratio of missed / false / correct matches varies as sample is scaled up
• Proportion reduces as sample is scaled up but can never go below the minimum of % population with unique combination of dob and surname and which hold either a passport or a drivers licence (i.e. ratio of passports:drivers)]




14.1 Family composition - matching by date of birth and names

14.1.1 This analysis identifies the increased level of matching and the occurrence of a small number of false matches as a result of relaxing the matching criteria to exclude address.

[Chart: composition of families by stakeholders when matching by date of birth and names. Number of families (0 to 20,000), split by family size (legend values 2, 5 and 8), for each combination of datasets from DVLA Only, GRO Only, GROS Only, IR Only and UKPS Only through to DVLA, BIRTHS, IR, UKPS.]
