Annex 2D Data trial comparative results - ex …chris/tmp/20060422/cip-pdfs/Annex 2D...2006/04/22...
Citizen Information Project
Annex 2: Stakeholder processes, systems and data
2D: Data trial comparative results
Citizen Information Project Final Report: Annex 2D:
Data trial comparative results
Version Control
Date of Issue: 14th June 2005
Version Number: 1.0

Version  Date      Issued by   Status
1.0      14/06/05  PJ Maycock  Final report
Metadata
Coverage: UK
Creator: Office for National Statistics, General Register Office, Citizen Information Project Team
Date Issued: 13/6/05
Language: English
Publisher: Office for National Statistics, 1 Drummond Gate, London, SW1V 2QQ
Status: Approved by Project Manager
Subject: Data quality, sharing and processing
Subject.category:
Title: Citizen Information Project: Annex 2D: Final report: Data trial comparative results
Contents
1. Preface
2. Related documents
3. Data trial objectives
4. Scope and methodology
5. Coverage profiles
6. Critical quality characteristics
   6.1 Overview
   6.2 Scope and size of datasets
   6.3 Duplicate records
   6.4 Name and address verification
7. Address quality / validity
   7.1 Summary
   7.2 PAF compliance
   7.3 Address matches across postcode demographic
   7.4 Language impacts
   7.5 Foreign addresses
8. Address cleansing
9. Linking to NLPG
   9.1 Overview of NLPG
10. Dataset matching - methodology
11. Matching by date of birth, name and address details
   11.1 Match results
   11.2 Interpretation of results
   11.3 Matches against each stakeholder dataset
   11.4 Identification of duplicate records
   11.5 Family composition
   11.6 Matching results by demographic
   11.7 Influence of address cleansing on overall matching statistics
   11.8 Influence of other datasets on matching
12. Matching by date of birth and name elements
13. Analysis of address changes
14. False matches
   14.1 Family composition - matching by date of birth and names
CIP: Data quality, sharing and processing
Citizen Information Project: Data trial comparative results
1. Preface
1.1.1 The Citizen Information Project Final Report recommends the creation of an adult population register that will deliver benefits by sharing basic contact information (name, address, date of birth etc) across the public sector. The report recommends that the development of a population register is implemented as part of the ID Cards Scheme by utilising the National Identity Register (NIR) and that in the interim a range of short term data sharing initiatives are explored further.
2. Related documents
2.1.1 Annex 2: Stakeholder processes, systems and data comprises the following documents:
• Annex 2A: Overview
• Annex 2B: Data quality framework
• Annex 2C: Stakeholder profiles
• Annex 2D: Data trial comparative results: This document
• Annex 2E: Data trial comparative results: Appendices
• Annex 2F: Current data sharing across government
• Annex 2G: Other data quality initiatives
2.1.2 This document provides:
• A summary of the objectives, scope and methodology of the data trial.
• A summary of the comparative coverage, demographics, quality indexes and matching of the sample datasets.
• Detailed results of the comparative analysis are given in Annex 2E: Data trial: Appendices.
• The analysis of each specific dataset is detailed in Annex 2C: Stakeholder profiles and accompanying appendices.
3. Data trial objectives
3.1.1 The overall objective was to assess the relative and combined quality of basic contact data held within stakeholders’ operational systems. This incorporated looking at the cost effectiveness of cleaning, matching and quality scoring techniques by using samples of stakeholder data; and assessing the implications of applying these techniques to the complete datasets.
3.1.2 To achieve this overall objective, the trial aimed to:
• Provide further understanding of the characteristics and anomalies of identity (e.g. names, date of birth) and contact details held in stakeholders’ operational systems.
• Identify fitness for purpose of records and fields in the individual and merged datasets by determining appropriate quality level indicator(s).
• Obtain a statistical assessment of the matching records between stakeholders’ datasets. This includes the percentage of records that can be automatically matched and those that have a reasonable probability of being matched and may justify manual inspection.
• Develop best practice guidance on data quality and matching.
4. Scope and methodology
4.1.1 Nine demographics were identified as sample areas, each selected by one of three criteria: name, address or date of birth. The total estimated population across the selected demographics was between 20,000 and 90,000, depending on the dataset. The composition of these demographics was reviewed with the ONS Methodology group.
4.1.2 The sample sets were chosen to ensure that the following demographics would be covered by the trial:
Demographics
s1  Typical dataset by name
s2  Typical dataset by name
s3  Typical suburban dataset by geographic area (postcode and area name)
s4  Covers name issues and address issues on houses that have been converted into flats (postcode)
s5  Covers a rural area in Scotland (postcode)
s6  Covers issues around Welsh names and addresses (postcode and area name)
s7  Covers issues related to high density urban areas and high rise flat blocks
s8  Dataset by specific date of birth
s9  Covers issues around nominated date of birth being 1st January
4.1.3 The Electoral Roll (2003) contains 83% of the 18+ UK population and, apart from datasets maintained in the private sector, is the closest and most representative available dataset to a comprehensive population register against which other datasets can be compared. Demographics based on the same criteria as the other datasets have been applied to the Electoral Roll and the current population for each demographic determined. The relative size of the Electoral Roll vs Census 2001 was used to correlate the date of birth profiles of the sample data sets with the Census 2001 date of birth profile.
4.1.4 A data sharing protocol was produced and reviewed with the Information Commissioner to provide a robust framework for the legal, secure and confidential sharing of personal information for the trial. A fundamental principle was that the trial outputs will be anonymous and mainly statistical and that the data will be destroyed at the end of the trial. The contractor’s data security protocols were audited, inspected and approved by ONS.
4.1.5 The following key stakeholders were identified to participate in the trial by providing sample contact data based on the demographics described above:
• Department for Work and Pensions
• Driver and Vehicle Licensing Agency
• General Register Office
• General Register Office for Scotland
• HM Revenue and Customs
• National Health Service Information Authority
• United Kingdom Passport Service
4.1.6 Legal vires for data sharing with the stakeholders were agreed, with the exception of the DWP and the NHSIA, both of which were subsequently unable to provide sample data to participate in the trial.
4.1.7 A procurement exercise identified Siemens Business Systems as a specialist contractor with extensive skills in the areas required to perform the technical aspects of the trial in the most economically effective way.
4.1.8 The participating stakeholders were given the same data-extract specification and provided sample data-extracts to the specialist contractor. This covered basic contact details, such as current name, address and date of birth for the population within the selected demographics and, in the case of HM Revenue and Customs, historical names and addresses.
4.1.9 The contractor reported on:
• Detailed analysis of all input data; for addresses this included comparison with electoral register dataset and external address datasets (PAF and NLPG)
• Assessment of address data cleansing possible
• Analysis of data matching between the sample datasets
• Development and application of a data quality index methodology
4.1.10 Subsequently the Atkins Technical Team carried out further analysis of the results and correlation with other information, e.g. database sizes, demographics and census profiles. An Excel model of all de-personalised data and results was created and used to determine and generate:
• Appropriate weightings for each dataset and demographic (s1-s7)
• Comparative profiles for all datasets
• Comparative profiles of demographics within each dataset
• Matched profiles for selected demographics and different match criteria
5. Coverage profiles
5.1.1 Analysis of the sample datasets by demographic and correlation of these results against the same demographics from the Electoral Roll enabled coverage profiles against date of birth to be generated. These highlighted the following issues:
• DVLA dataset demographics s5 Scotland and s7 Birmingham were unrepresentative (4% and 6% of expected population compared with other demographics 79-87%). The most likely explanation is that the extract of the data for demographics s3-7 was substantially based on postcodes and that s5 and s7 comprised postcodes containing a padding character, which invalidated the extract. As a result of this s5 and s7 were excluded from the coverage profiles.
• HMRC data extract included historical names and addresses and the nature of the data structure resulted in additional citizen records being returned for those no longer living within the geographical criteria or currently meeting the name criteria for the demographic. The data provided by HMRC contained sufficient information to identify these additional identities and exclude them from the analysis.
• Where possible the datasets were modified to exclude all citizens known to be deceased to provide comparable results to the Electoral Roll / Census 2001. However, this information was not available for DVLA or UKPS and may not be fully current for HMRC datasets.
• There are significant variations of profile between the demographics of the sample datasets, e.g. the age profile varies significantly between s4 London and s6 Wales. This was expected as demographics s4-s7 were deliberately chosen to reflect atypical situations. Weightings were applied at dataset and demographic level and a sensitivity analysis carried out to ensure the most acceptable correlation between the sample datasets and other information, e.g. census 2001 profile, the same demographics extracted from the Electoral Roll (representing approx 83% of 18+ population), database sizes relative to the current population.
5.1.2 The coverage profiles are based on the following parameters:
• 50% records based on typical name demographics s1 and s2
• 20% records based on typical geographical demographic, s3 Bournemouth
• 20% records based on demographic s4 London
• 10% records based on demographic s6 Wales
• Dataset weightings to correlate results with actual database sizes obtained from data suppliers (data quality questionnaire):
  • DVLA: 94%
  • GRO/S: 84%
  • HMRC: 108%
  • UKPS: 96%
  • Electoral Roll / Census 2001: 110%
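How the weightings were combined is not spelled out in this section. The following is a minimal sketch only, assuming each demographic contributes a normalised year-of-birth profile that is blended using the percentages above and then scaled by the dataset weighting. The function name, the data layout and the treatment of the dataset weighting as a simple multiplier are illustrative assumptions, not a description of the Excel model used in the trial.

```python
# Demographic shares and dataset weightings as listed in 5.1.2.
DEMOGRAPHIC_SHARES = {"s1+s2": 0.50, "s3": 0.20, "s4": 0.20, "s6": 0.10}
DATASET_WEIGHTINGS = {
    "DVLA": 0.94,
    "GRO/S": 0.84,
    "HMRC": 1.08,
    "UKPS": 0.96,
    "Electoral Roll / Census 2001": 1.10,
}


def weighted_profile(profiles, dataset):
    """Blend per-demographic profiles into one coverage profile.

    profiles maps a demographic label to {year_of_birth: share_of_records}.
    Assumption: the dataset weighting acts as a simple multiplier.
    """
    blended = {}
    for demographic, share in DEMOGRAPHIC_SHARES.items():
        for year, value in profiles[demographic].items():
            blended[year] = blended.get(year, 0.0) + share * value
    factor = DATASET_WEIGHTINGS[dataset]
    return {year: value * factor for year, value in blended.items()}
```

Because the demographic shares sum to 100%, a dataset whose demographics all have the same profile keeps that profile, scaled only by its dataset weighting.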
Figure: Coverage profile by date of birth (x-axis: year of birth, 1900 to 2000; series: Census 2001, DVLA (Drivers), GRO + GROS (Births), HMRC (NIRS2), UKPS (PASS)).
Notes on the figure:
• Census is the most accurate estimate of the whole population.
• HMRC and DVLA include emigrants and exclude children; they are greater than Census due to emigrants.
• GRO + GRO(S) includes everyone born in Scotland from 1974 and in England and Wales from 1993.
• UKPS (PASS) includes new and renewed UK passports since 1998 (60% of total UK passports).
• DVLA only includes those with a driving licence.
6. Critical quality characteristics
6.1 Overview
6.1.1 CIP completed a review of the existing data held in key public service systems through a data trial supplemented by a detailed questionnaire. The results are summarised below (detailed results are given within each stakeholder profile).
Dataset                        Citizen records    Estimated duplicates    Name verification  Address verification  Up to date address  Address validity
DfES Student Loans             5m                 < 2%                    High               Initially high > low  Low                 High
DVLA (Drivers)                 40m                0.17%                   High               Nil                   ~ 62%               High
DVLA (Vehicles)                18m #1             -                       Medium             Nil                   90 - 95%            High
DWP (DCI)                      84m                ~ 0.07 as per NIRS2     Medium             Low                   Medium              High
GRO / GROS (Births)            10m                0.66% (GRO)             Not applicable     Nil                   Not updated         Low
HMRC (CID)                     60m                -                       Low                Nil                   Medium              High
HMRC (NIRS2)                   72m                0.07%                   Low                Low                   Medium              High
UKPS (Main)                    70m #2             Passport renewals       High               Low                   ~ 56%               High
UKPS (PASS)                    24m #2             Passport renewals       High               Low                   70% > 56%           High
Identity Cards (Requirements)  40 / 48m (adults)  0%                      High               Low                   90 – 95%            High

Key (cell shading in the original): Data trial results; Quality questionnaire response; Target

Notes:
#1 Of the 30 million records only 18 million vehicles have individual citizens as owner.
#2 Database is passport centric rather than person centric.
6.2 Scope and size of datasets
Dataset                        Citizen records    Comments
DfES Student Loans             5m
DVLA (Drivers)                 40m                Active drivers, but includes emigrants and some deceased
DVLA (Vehicles)                18m #1             30 million records, approx. 18m individuals’ names (remainder registered with organisations)
DWP (DCI)                      84.5m              47 million live adult records in UK; 1 million live social security benefit recipients living abroad; 15 million deceased records (date of death verified); 1.5 million deceased records (date of death not verified); 5.5 million abroad not in receipt of benefit; 2 million inactive but not categorised; and 12.5 million child records.
GRO / GROS (Births)            10m                GRO only available electronically since 1993, GRO(S) since 1974
HMRC (CID)                     60m
HMRC (NIRS2)                   72m                Similar to DCI, 6.5m emigrants, 2m inactive and 15m deceased. No children.
UKPS (Main)                    70m #2             Records relate to passports (duplicate records on renewal)
UKPS (PASS)                    24m #2             As above, only populated since 1998 (60% of all passport holders)
Identity Cards (Requirements)  40 / 48m (adults)  Target of 40m without compulsion, 48m with compulsion
6.3 Duplicate records
Dataset                        Estimated duplicates         Comments
DfES Student Loans             < 2%
DVLA (Drivers)                 0.17%                        Likely to be mainly associated with paper licences
DVLA (Vehicles)                -
DWP (DCI)                      ~ 0.07 as per NIRS2          Based on close similarities with NIRS2
GRO / GROS (Births)            0.66% (GRO)
HMRC (CID)                     No details available to CIP  There are 6.7% of citizens’ records which for a limited period are duplicated with a temporary and permanent NINO. This is part of the business process and the use of these temporary NINOs is being phased out.
HMRC (NIRS2)                   0.07%
UKPS (Main)                    Passport renewals            Legitimate duplicates due to renewals
UKPS (PASS)                    Passport renewals
Identity Cards (Requirements)  0%
6.4 Name and address verification
Dataset                        Name verification  Address verification
DfES Student Loans             High               Initially high > low
DVLA (Drivers)                 High               Nil
DVLA (Vehicles)                Medium             Nil
DWP (DCI)                      Medium             Low
GRO / GROS (Births)            Not applicable     Nil
HMRC (CID)                     Low                Nil
HMRC (NIRS2)                   Medium             Low
UKPS (Main)                    High               Low
UKPS (PASS)                    High               Low
Identity Cards (Requirements)  High               Low
Verification – supporting documents or processes that confirm the information, e.g. name verified by presentation of passport.
Validation – checking that value is within range or exists, e.g. checking address against Postcode Address File (PAF).
Name verification
• Critical to many processes
• Striving for Gold standard
Address verification
• Low quality
• Less onerous, fewer critical processes – niche requirement
• Difficult to e-enable
7. Address quality / validity
7.1 Summary
7.2 PAF compliance
7.2.1 QAS was used to assess the percentages of addresses which are compliant with PAF, the results of which are shown below:
Stakeholder        DVLA   GRO    GROS   HMRC   UKPS
PAF compliant (%)  94.1   57.6   68     89.6   95.1
Percentage of PAF Compliant Addresses by Stakeholder
7.2.2 DVLA and UKPS both achieved approximately 95% compliance with PAF (94.1% and 95.1% respectively), which is above the 90% matching level at which the Post Office will start offering mailing discounts. However, for the DVLA results only 6% were actually matched as “Verified Correct” as the DVLA generally omits the town name from its address format, which resulted in QAS making an automatic adjustment to the address format and classifying those records as only a “Good Match”.
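The comparison against the 90% level can be sketched as follows, using the compliance percentages tabulated under 7.2.1. The function name is illustrative, and treating the level as inclusive is an assumption.

```python
# PAF compliance percentages by stakeholder, from the table under 7.2.1.
PAF_COMPLIANCE = {"DVLA": 94.1, "GRO": 57.6, "GROS": 68.0, "HMRC": 89.6, "UKPS": 95.1}
DISCOUNT_LEVEL = 90.0  # matching level quoted in 7.2.2


def above_discount_level(compliance, level=DISCOUNT_LEVEL):
    """Return the stakeholders whose PAF compliance reaches the given level."""
    return sorted(name for name, pct in compliance.items() if pct >= level)


print(above_discount_level(PAF_COMPLIANCE))  # ['DVLA', 'UKPS']
```

On these figures HMRC, at 89.6%, falls just short of the level, while GRO and GROS are well below it.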
7.2.3 In the HMRC dataset, 89% of addresses complied with PAF but this was the only data set to include all historical addresses and the overall score for this dataset suffered from the obsolescent nature of some of its addresses.
7.2.4 GROS data, taken as a whole for births and deaths, reached a compliance percentage of 68%. This lower figure is caused largely by the relatively high number of both tenement addresses in Scottish towns and cities and the number of rural addresses outside of cities.
Dataset                        Address validity
DfES Student Loans             High
DVLA (Drivers)                 High
DVLA (Vehicles)                High
DWP (DCI)                      High
GRO / GROS (Births)            Low
HMRC (CID)                     High
HMRC (NIRS2)                   High
UKPS (Main)                    High
UKPS (PASS)                    High
Identity Cards (Requirements)  High

• Generally high quality addresses - effectively 90% (assessed using QAS)
• Automatic address cleansing is limited to marginally improving existing good quality addresses
• Tentative matches – significant numbers can be resolved rapidly by visual inspection
• Application of Unique Property Reference Number Verification (UPRN-NLPG, now NSAI National Spatial Address Infrastructure)
• As NLPG is validated by more LAs and becomes integral with other systems, so data quality will improve
7.2.5 GRO produced the poorest results having fewer than 58% of raw data addresses complying automatically with PAF. The GRO result can be attributed to the concatenated address data in its sample, which QAS had difficulty automatically matching to PAF.
7.3 Address matches across postcode demographic
7.3.1 The postcode demographics achieving the highest match rates were s3 (Bournemouth) and s4 (London). The results for s3 were not unexpected, given that this was a typical suburban area with limited scope for problem addresses. The scores for s4 were expected to be lower than those actually recorded due to the number of flat conversions in this area. However, it would appear that the format of flat addresses in s4 did not have a significant impact on QAS’ ability to match addresses.
7.3.2 The s7 (Birmingham) demographic achieved results slightly lower than s3 and s4 and this lower result was primarily caused by the concentration of high rise tower block accommodation in this area, the formatting of which did result in QAS recording lower levels of “Verified Correct” and “Good Full” matches.
7.3.3 The lowest match rates were in s5 (Scotland) and s6 (Wales). The s5 demographic was particularly adversely impacted by the combination of poor rural address formats and the high number of obsolete addresses resulting from a housing estate redevelopment whilst the poor results for s6 were primarily due to problems with rural address formats only. In fact, the good match percentages for s6 were higher than those for s5 due to the lower predominance of rural addresses.
7.4 Language impacts
7.4.1 Demographic s6 was of a Welsh postcode area which was partly rural. Possible issues with the use of Welsh language names had been predicted but, apart from a few records where Welsh names had been spelt incorrectly, the use of Welsh in address raw data was not a major factor hindering the overall matching process.
7.5 Foreign addresses
7.5.1 The level of “Foreign Address” matches was low with figures at around or below 0.1% for all but the HMRC dataset. “Foreign Address” matches for HMRC were actually reduced because many were given a match type of “Unmatched” with particular issues around addresses having the country name of Ireland, which was not recognised, instead of Eire or the Irish Republic.
7.5.2 Overall, foreign addresses only accounted for 0.3% of total addresses and did not have a material impact on address match rates.
8. Address cleansing
8.1.1 The automatic improvement to addresses can only be confidently applied to matches that already qualify as good or better (i.e. “Verified Correct” and “Good Full” matches). To maximise the quality of address data and increase the overall figures for PAF compliance, manual matching will be necessary. The match type groupings produced by QAS Batch confer confidence levels on the matches it provides and separate analysis has shown that addresses with match types of “Tentative” and “Partial” offer considerable potential for increasing the overall number of address matches through a separate exercise of manual matching. Whilst some of this manual matching can be accomplished quite easily (less than one minute per record), it has not been possible to accurately assess the total effort required to undertake a complete manual review of all records which QAS has not classified as a “Verified Correct” or “Good Full” match.
9. Linking to NLPG
9.1 Overview of NLPG
9.1.1 The National Land and Property Gazetteer (NLPG) is a single, comprehensive list of addresses that was initially generated from Valuation Office records. The validation and maintenance of these addresses has been devolved to each Local Authority, which maintains a Local Land and Property Gazetteer (LLPG) that is synchronised with the NLPG. All the data is held in a common format and each property is assigned a unique property reference number (UPRN) and geographical grid references. These co-ordinates allow individual properties to be accurately identified within ad hoc boundaries (e.g. Primary Care Trust catchment areas, and areas defined for Neighbourhood Statistics) using geographical information systems and enable dwellings in remote areas to be accurately located where one postcode might cover a very wide area.
Difficulties in obtaining NLPG data
9.1.2 Obtaining access to the NLPG dataset for use on the CIP trial proved extremely problematic. Siemens encountered difficulties in obtaining the necessary approvals for the release of this data: licensing issues meant that it was not possible to obtain the complete national NLPG dataset, and the local authorities whose areas were covered by the trial demographics were reluctant to release such data to the CIP trial. As a result, further delays were encountered and the local authority datasets that were eventually delivered to Siemens and could be used on the trial were restricted to the following:
• Wandsworth
• Bournemouth
• Poole
• Pembrokeshire
9.1.3 A major learning point to be carried forward for any similar exercises requiring access to NLPG data in the future is that careful consideration may have to be given to how best to gain access to such data. A separate lobbying process may be required to win the support and cooperation of local authorities and other relevant Government agencies to facilitate the willing release of data by these bodies in a timely manner. It is hoped that the launch of the NSAI (National Spatial Address Infrastructure), which seeks to integrate NLPG, Royal Mail and OS address data, will provide impetus to LAs validating and using a single address register and the adoption of the UPRN.
Objectives
9.1.4 The sample datasets were matched against the NLPG data using i/Lytics to identify the level of address matching possible to enable the allocation of Unique Property Reference Numbers (UPRN). The results were compared with similar matching using QAS to establish if NLPG data might be used to improve the quality of addresses (completeness, consistency, format and validity).
Results
9.1.5 Due to the limited number of available datasets the NLPG data used in the trial only covered the s3, s4 and s6 samples, and results were limited to these demographics. Consequently, the results did not include any matches with the GROS dataset.
9.1.6 The actual matching levels obtained, as a percentage of s3, s4 and s6 data, are shown below.
DVLA     GRO      GROS   HMRC     UKPS
68.76%   33.38%   -      69.69%   67.62%
Percentage of address records in s3, s4
9.1.7 Compared with QAS matching levels, NLPG matched between 70% and 80% as many addresses in demographics s3, s4 and s6. This could be partly due to NLPG data not having identical boundaries to postcode areas and some of the demographics falling outside the NLPG area.
DVLA     GRO      GROS   HMRC     UKPS
72.56%   78.53%   -      74.79%   70.95%
NLPG matches as % QAS matches for s3, s4
9.1.8 Stakeholder addresses matched NLPG data in broadly the same proportions as they were matched by QAS, with the single address field format of the GRO data achieving considerably fewer matches.
9.1.9 The conclusion from the partial NLPG matching is that QAS gives levels of address matching approximately 25% higher. However, these figures should be treated with some caution due to the limited scope of the NLPG analysis resulting from the limited amount of NLPG data made available to the trial.
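As a rough check on the figure quoted in 9.1.9, the four ratios tabulated under 9.1.7 can be averaged. This is a sketch only: the unweighted mean is an assumption, since the report does not state how the approximate 25% was derived.

```python
# NLPG matches as a percentage of QAS matches, as tabulated under 9.1.7.
nlpg_as_pct_of_qas = [72.56, 78.53, 74.79, 70.95]

# Assumption: a simple unweighted mean across the four stakeholders.
mean_ratio = sum(nlpg_as_pct_of_qas) / len(nlpg_as_pct_of_qas)
shortfall = 100.0 - mean_ratio

print(f"NLPG matched about {mean_ratio:.1f}% as many addresses as QAS")
print(f"Shortfall relative to QAS: about {shortfall:.1f}% of the QAS level")
```

On this basis NLPG matched roughly a quarter fewer addresses than QAS, which appears to be consistent with the conclusion above.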
9.1.10 We recommend that the use of NLPG data (or the subsequent National Address Infrastructure) and the allocation of a UPRN to all citizen addresses should be pursued, as this will yield significant benefit when sharing data and will limit the manual matching effort to the initial allocation.
9.1.11 Currently 81% of Local Authorities in England, Scotland and Wales have validated their LLPG data and 55% of LAs are actively maintaining this data. Assuming that this initiative continues across all LAs and that LAs, as they adopt CRM solutions, will use their LLPG data across all their applications, then the quality of this data will significantly improve and achieve a level similar to PAF.
10. Dataset matching - methodology
10.1.1 The raw datasets (175,000 records) were rationalised into a common format and where alternative or historical names and addresses existed these were converted into 145,000 additional records (i.e. a record was created for each combination of name and address in the original record).
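The expansion into one record per combination of name and address can be sketched as below. The field names ("id", "dob", "names", "addresses") are hypothetical and do not reflect the common format actually used in the trial.

```python
from itertools import product


def expand_record(record):
    """Create one flat record per combination of name and address.

    The original unique id is carried on every permutation so that the
    permutations can be recombined after matching.
    """
    return [
        {"id": record["id"], "dob": record["dob"], "name": name, "address": address}
        for name, address in product(record["names"], record["addresses"])
    ]


# Illustrative record with two names and two addresses (made-up values).
raw = {
    "id": "A1",
    "dob": "1970-01-01",
    "names": ["Ann Jones", "Ann Smith"],
    "addresses": ["1 Mill Lane", "5 Church Road"],
}
expanded = expand_record(raw)
print(len(expanded))  # 4 rows: the original combination plus 3 additional
```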
10.1.2 All datasets were then matched using the i/Lytics tool using the ranking criteria described in Appendix 3.10 which utilised all the primary data items (including date of birth, names, and addresses).
10.1.3 The i/Lytics system sorts and compares all the records using exact and fuzzy matching and utilises heuristic rules related to abbreviations and permutations of name and address elements. Groups with similar records, called “families”, are created and the record with the most complete information is identified as “parent” and all the other records in the family termed “members”. Each member is compared against the parent and the type of similarity between the parent and each record is termed the “rank” of the match and is a complex combination of matching rules associated with each data item. For more details refer to Appendix 3.10. “Automatic” ranks are those where the similarity between two records is high enough that the records can be considered duplicates without any further analysis or inspection. An initial automatic match rate of 25% was achieved with exact matching and subsequently enhanced to 49% with the inclusion of fuzzy matching and optimisation of the ranks yielding satisfactory results and very low probabilities of false matching.
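The grouping into families can be illustrated with a simplified sketch. This is not the i/Lytics algorithm, whose ranking rules are described in Appendix 3.10. It assumes exact matching on date of birth, a generic string-similarity score from Python's difflib for name and address, and an arbitrary threshold of 0.85. The field names are hypothetical, and every record is assumed to carry all three fields.

```python
from difflib import SequenceMatcher

FIELDS = ("dob", "name", "address")


def similarity(a, b):
    """Return a 0..1 similarity score for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def completeness(record):
    """Count how many of the primary fields are populated."""
    return sum(1 for field in FIELDS if record.get(field))


def is_match(parent, record, threshold=0.85):
    """Exact match on date of birth, fuzzy match on name and address."""
    return (
        parent["dob"] == record["dob"]
        and similarity(parent["name"], record["name"]) >= threshold
        and similarity(parent["address"], record["address"]) >= threshold
    )


def build_families(records, threshold=0.85):
    """Group similar records into families of one parent plus members."""
    families = []
    # Visit the most complete records first, so that each family's parent
    # is at least as complete as any of its members.
    for record in sorted(records, key=completeness, reverse=True):
        for family in families:
            if is_match(family["parent"], record, threshold):
                family["members"].append(record)
                break
        else:
            families.append({"parent": record, "members": []})
    return families
```

A real matching engine would also apply abbreviation and permutation rules and would assign a rank to each match rather than a single pass or fail.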
10.1.4 Each family group is then de-duplicated using the unique id allocated to the raw records, i.e. this re-combines permutations of name and address, but ensures that matching has been achieved utilising all these permutations. Family groups may then be classified as:
• Parent records with no children: Original records do not match any others
• Parent records with children from different datasets: legitimate matches
• Parent records with more than one child from the same dataset: Potential duplicate records, i.e. the same date of birth, name and address but with different stakeholder id (NINO, licence, etc).
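The de-duplication and classification described in 10.1.4 can be sketched as follows. The keys "source" and "id" and the returned labels are hypothetical, and the reading that a potential duplicate means two or more distinct raw records from the same dataset within one family is an interpretation of the bullets above.

```python
from collections import Counter


def classify_family(records):
    """Classify a family given its records (parent first, then members).

    Each record is a dict with hypothetical keys "source" (dataset name)
    and "id" (the unique id allocated to the raw record).
    """
    # Recombine name/address permutations that came from the same raw record.
    unique = {(record["source"], record["id"]) for record in records}
    per_source = Counter(source for source, _ in unique)
    if len(unique) == 1:
        return "no match"
    if any(count > 1 for count in per_source.values()):
        return "potential duplicate"
    return "legitimate match"
```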
10.1.5 The members within each family are then analysed and a ‘family composition’ report generated identifying the combinations of matching. These results are aggregated to give match rates for all combinations of datasets.
10.1.6 In addition to matching on all primary fields, the process has been repeated using more relaxed matching criteria:
• date of birth + names
• date of birth + surname
10.1.7 From these additional matches the following can be derived:
• extent of identities (i.e. matching on date of birth and names) with different addresses
• extent of missed identity matches – by broadening the criteria to date of birth and surname
• some indication of false matches by inspecting the occurrences within each match group (family composition)
• ‘no match’ records – those that will never match, e.g. a citizen with only a driving licence and no passport. There are a limited number of scenarios not considered, e.g. change of name due to marriage / divorce, but these are not likely to be significant (i.e. the number of marriages / divorces in a year is small relative to the total population).
11. Matching by date of birth, name and address details
11.1 Match results
11.1.1 Number of stakeholder records matched as a percentage of all stakeholder records (considering all datasets and demographics, without any weightings)
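| Stakeholder | All datasets | DVLA | GRO (B+D) | GROS (B+D) | HMRC | UKPS | Births (GRO+GROS) | Deaths (GRO+GROS) |
|---|---|---|---|---|---|---|---|---|
| All records | 175,268 | 39,004 | 12,969 | 5,187 | 93,580 | 24,528 | 11,428 | 6,728 |
| Matched records | 84,646 (48%) | 27,123 (69%) | 4,371 (33%) | 902 (17%) | 34,152 (36%) | 18,098 (73%) | 2,226 (19%) | 3,047 (45%) |
| DVLA | | 13 | 981 | 74 | 25550 | 9114 | 38 | 1017 |
| GRO (B+D) | | | 57 | 0 | 2223 | 1867 | | |
| GROS (B+D) | | | | 27 | 695 | 194 | | |
| HMRC | | | | | 34 | 14358 | 210 | 2708 |
| UKPS | | | | | | 98 | 1952 | 109 |
| Births (GRO+GROS) | | | | | | | 60 | |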
11.1.2 In the above table a record refers to a person with a unique id (e.g. NINO, licence no., etc), except in the case of UKPS, where it refers to a passport no., which changes on renewal.
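[Figure: three pairwise overlap diagrams.
• HMRC and UKPS: 15% of HMRC records matched (85% unmatched); 59% of UKPS records matched (41% unmatched)
• HMRC and DVLA: 27% of HMRC records matched (73% unmatched); 66% of DVLA records matched (34% unmatched)
• UKPS and DVLA: 37% of UKPS records matched (63% unmatched); 23% of DVLA records matched (77% unmatched)]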
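As a worked check, the match percentages in section 11.1 can be recomputed from the counts in the table. The sketch below uses only figures from the table; the function name is illustrative. The results agree with the reported whole-number percentages to within one percentage point (the table appears to round some values down).

```python
all_records = {"DVLA": 39004, "HMRC": 93580, "UKPS": 24528}
matched_records = {"DVLA": 27123, "HMRC": 34152, "UKPS": 18098}
pair_matches = {
    ("DVLA", "HMRC"): 25550,
    ("DVLA", "UKPS"): 9114,
    ("HMRC", "UKPS"): 14358,
}


def match_rate(matched, total):
    """Matched records as a percentage of all records in a dataset."""
    return 100 * matched / total


for name, total in all_records.items():
    print(f"{name}: {match_rate(matched_records[name], total):.2f}% matched against all datasets")

for (a, b), count in pair_matches.items():
    print(f"{a} vs {b}: {match_rate(count, all_records[a]):.1f}% of {a}, "
          f"{match_rate(count, all_records[b]):.1f}% of {b}")
```

The same pairwise count gives two different percentages because each dataset has a different number of records.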
11.2 Interpretation of results
11.2.1 It is important to recognise that the match percentages reported are more heavily influenced by the nature of the datasets than by the efficacy of the matching process. For example, when UKPS is matched against DVLA, children in the UKPS dataset can never be matched to DVLA data, which applies only to over 16s.
11.2.2 The following match profiles were based on the same dataset and demographic weightings used to analyse comparative coverage. The match percentages between all unweighted datasets and demographics are not significantly different from the following.
[Figure: UKPS (PASS) automatic matches. Number of records in sample dataset (0 to 350) by year of birth (1900 to 2000). Series: Census 2001; Grey match (dob, surname); Full match (dob, name, address); DVLA (Drivers); UKPS (PASS).]
[Figure: Matching between DVLA (Drivers) and UKPS (PASS) datasets. Overlap diagram with the following regions:
• No matches, i.e. passport holders (in PASS) without a driver's licence
• 37% of UKPS (PASS) records automatically match DVLA records
• No matches, i.e. drivers without records in the PASS database
• 23% of DVLA records automatically match UKPS (PASS) records]
11.3 Matches against each stakeholder dataset
[Figure: Unique IDs in input and unique IDs in match families (0 to 100,000) for each stakeholder: DVLA, GRO (births + deaths), GROS (births + deaths), IR, UKPS.]
11.3.1 The above graph shows the numbers of input records and match records per stakeholder and gives an indication of the percentage match rate of each stakeholder against all records. These figures are discussed below.
11.3.2 As can be seen, the matching levels within the merged dataset revealed a sizeable disparity between stakeholders, with far higher percentage match rates from DVLA and UKPS of 69.54% and 73.39% respectively. Matching levels for birth and death records were substantially lower, whilst HMRC records, although having more records matched than any other stakeholder, only matched 36.49% of distinct records.
11.3.3 The disparity of matching levels between stakeholders can be attributed to a number of identifiable factors specific to one or more demographics, as listed below.
DVLA
Factors with a positive influence on matching rates:
• High level of PAF compliant addresses
• Current and updated data
Factors with a negative influence on matching rates:
• Not all citizens have a driving licence
• Not applicable to under 16s

UKPS
Factors with a positive influence on matching rates:
• High level of PAF compliant addresses
• No data over six years old, i.e. prior to 1998
Factors with a negative influence on matching rates:
• Not all citizens have a passport
• Only 60% of passport holders on this database

HMRC
Factors with a positive influence on matching rates:
• Large coverage
Factors with a negative influence on matching rates:
• Not applicable to under 16s
• Temporary residents working in the UK
• Older data now obsolete, e.g. deaths predated other stakeholder data
• Older data now outside of sampled demographics, e.g. person living in s3 and moving before creation of other stakeholders’ datasets

GRO
Factors with a negative influence on matching rates:
• Persons born before 1993 not in dataset
• Poor PAF compliance due to concatenation of address elements
• Birth data on children too young to appear in other data

GROS
Factors with a negative influence on matching rates:
• Persons born before 1973 not in dataset
• Low PAF compliance due to more complex nature of Scottish addresses
• Low numbers of people in the Scottish postcode s5 demographic in other stakeholders
• Birth data on children too young to appear in other data
11.4 Identification of duplicate records
11.4.1 The number of matched family records per dataset is shown in the chart below, with a count showing the number of matches within a dataset. For example, there are 65 matches of identity within the GRO dataset and 1 example of four UKPS records in the same match family.
11.4.2 From inspection, all these records are duplicate records, i.e. a person having more than one unique id within a dataset. The exception is UKPS, where a record relates to a passport rather than a citizen, so multiple records in a family indicate passport renewals.
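[Figure: number of match families per dataset (0 to 35,000), by prevalence count, i.e. the number of records from that dataset within a family (1 to 4). Data labels read from the chart:]

| Dataset | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| DVLA | 26,929 | 97 | | |
| IR | 34,012 | 70 | | |
| UKPS | 17,072 | 493 | 12 | 1 |
| GRO (births + deaths) | 4,241 | 65 | | |
| GROS (births + deaths) | 846 | 28 | | |
| Births (GRO + GROS) | 2,160 | 33 | | |
| Deaths (GRO + GROS) | 3,047 | | | |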
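The chart's data labels can be reconciled with the matched-record totals in section 11.1. The sketch below assumes that the labels belong to the datasets as assigned here (the extracted chart does not state this explicitly) and that a family with prevalence count k contributes k records from that dataset.

```python
# Families per dataset, keyed by prevalence count (assumed assignment of chart labels).
prevalence = {
    "DVLA": {1: 26929, 2: 97},
    "IR": {1: 34012, 2: 70},
    "UKPS": {1: 17072, 2: 493, 3: 12, 4: 1},
    "GRO": {1: 4241, 2: 65},
    "GROS": {1: 846, 2: 28},
    "Births": {1: 2160, 2: 33},
    "Deaths": {1: 3047},
}

# Matched records per dataset, from the table in 11.1.1 (IR is HMRC).
matched_records = {
    "DVLA": 27123, "IR": 34152, "UKPS": 18098,
    "GRO": 4371, "GROS": 902, "Births": 2226, "Deaths": 3047,
}


def records_in_families(counts):
    """Total records implied by the prevalence counts."""
    return sum(k * families for k, families in counts.items())


for dataset, counts in prevalence.items():
    total = records_in_families(counts)
    print(dataset, total, "agrees" if total == matched_records[dataset] else "differs")
```

Under this reading every dataset reconciles exactly, e.g. 26,929 + 2 x 97 = 27,123 for DVLA.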
11.5 Family composition
11.5.1 The following graph identifies the matches between different datasets.
[Figure: family composition when matching on all primary fields. Number of families (0 to 18,000) for each combination of stakeholder datasets, from single-dataset families (DVLA only, GRO only, GROS only, IR only, UKPS only) through two-, three- and four-dataset combinations, with births and deaths also shown separately, broken down by family size (2 to 6).]
The composition of families by stakeholders.
11.5.2 The above confirms that when matching on all primary fields, the occurrence of ‘false matches’ is negligible and due solely to duplicate identities.
11.6 Matching results by demographic
11.6.1 The matching results obtained by demographic split are shown below:
[Figure: Unique IDs in input and unique IDs in match families (0 to 60,000) for each demographic, s1 to s9.]
11.6.2 Across demographics the match level percentages for unique id records are shown in the table below:
| Demographic | Total | Matched | Match % |
|---|---|---|---|
| All datasets and demographics | 175,268 | 84,646 | 48% |
| s1 Surname beginning “XXX” | 12,162 | 6,959 | 57% |
| s2 Surname beginning “YYY” | 9,579 | 5,989 | 63% |
| s3 Bournemouth | 32,087 | 18,220 | 57% |
| s4 London | 50,482 | 23,076 | 46% |
| s5 Scotland | 14,589 | 3,816 | 26% |
| s6 Wales | 12,692 | 8,699 | 69% |
| s7 Birmingham | 26,908 | 7,450 | 28% |
| s8 DOB - Random | 6,471 | 4,200 | 65% |
| s9 DOB – 1/1 from mid 70s | 10,298 | 6,237 | 61% |
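The demographic table is internally consistent, which a few lines of arithmetic confirm. The sketch below uses only the table's own figures; the variable names are illustrative.

```python
# (total, matched) per demographic, from the table above.
demographics = {
    "s1": (12162, 6959), "s2": (9579, 5989), "s3": (32087, 18220),
    "s4": (50482, 23076), "s5": (14589, 3816), "s6": (12692, 8699),
    "s7": (26908, 7450), "s8": (6471, 4200), "s9": (10298, 6237),
}

rates = {name: round(100 * matched / total)
         for name, (total, matched) in demographics.items()}

grand_total = sum(total for total, _ in demographics.values())
grand_matched = sum(matched for _, matched in demographics.values())

for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(name, f"{rate}%")
print("all:", grand_total, grand_matched, f"{round(100 * grand_matched / grand_total)}%")
```

Sorting by rate puts s5 and s7 well below the other demographics, which is the anomaly addressed in 11.6.3.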
11.6.3 The results for s5 and s7 reflect the very low numbers of records retrieved from the DVLA drivers database for those demographics and should be disregarded.
11.6.4 The consistency of data over time is also reflected in matching levels. The date of birth demographics s8 and s9, based on data that should never change, show a greater matching percentage, with s8 levels higher than s9. This is possibly due to dates of birth given as the first of January not being consistently used elsewhere. The s1 and s2 demographics are based on fairly consistent name data, but changes in surname will reduce the number of matches. Address data for an individual can change often, which reduces matching levels. Areas such as s3 and s6, which could be expected to have a more static population, show much better matching levels.
11.6.5 For example, date of birth demographics s8 and s9 show a higher matching percentage than any other demographic type. This may be due to the date of birth being static through a person’s lifetime, whereas address data and even name data can be prone to change. There is a higher percentage of s8 records matched than s9, which may indicate that birth dates of 1st of January are often guessed or approximated and are not used consistently by people.
11.7 Influence of address cleansing on overall matching statistics
11.7.1 QAS address cleansing had minimal effect on increasing matches. Removing address data from the matching criteria increased matches by just over 5 percentage points, from 48.29% to 53.41%, indicating that address quality was not hugely significant in securing matches, owing to the overall good quality of addresses in the DVLA and UKPS datasets.
11.8 Influence of other datasets on matching
11.8.1 The matching of all the datasets by date of birth, names and addresses was repeated with CACI Enhanced Electoral Roll data included. This resulted in an increased match rate of 7% for the HMRC dataset, 3% for DVLA, 1.5% for UKPS and nominal effect on GRO / GROS.
12. Matching by date of birth and name elements
12.1.1 Relaxation of the matching criteria to exclude address details results in almost 10% more matches than previously. However, some measure of the false matches occurring may be derived from the family composition diagram, where there is a small increase in the occurrences of families with more members than should be expected. For example, where matching occurs between DVLA, UKPS and IR, a family should have 3 members; a count of 6 for families of size 4 indicates that there are 6 x (4-3) = 6 false records.
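The estimate in 12.1.1 generalises to any composition: each family contributes one false record for every member beyond the number of datasets in the combination. A minimal sketch, assuming one legitimate record per dataset per family (the function and parameter names are illustrative):

```python
def excess_records(expected_size, families_by_size):
    """Records beyond the expected one-per-dataset family size."""
    return sum(
        count * (size - expected_size)
        for size, count in families_by_size.items()
        if size > expected_size
    )


# DVLA, UKPS and IR: three datasets, so a clean family has three members.
# The report's example: a count of 6 at family size 4.
false_records = excess_records(expected_size=3, families_by_size={4: 6})
print(false_records)
```

For UKPS the assumption is generous, since passport renewals can legitimately add records to a family.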
12.1.2 By further relaxing the criteria to just date of birth and surname, the additional matches will include any matches missed in the previous analyses, but there will be more false matches. This gives an indication of the grey area of matching for this sample size, i.e. the difference between the records that conclusively match (based on extensive criteria) and people that are unlikely to ever match (dob and surname are unique), e.g. they only have a driving licence and no passport.
12.1.3 Matching by date of birth and names vs date of birth and just surname showed only a small difference. This is likely to be due to the small size of the data samples.
12.1.4 This result cannot be directly extrapolated to a large dataset: there may well be only one Smith born on a specific day in a dataset of 100 members, but there will be a number of Smiths born on that day in a dataset of 10 million. However, from analysis of surname and date of birth statistics it is known that within the UK population 90% of people have a unique combination of date of birth and surname. Thus the extrapolated ‘no match’ result cannot fall below 90% of the extrapolated value.
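The scaling effect in 12.1.4 can be illustrated with a simple probability model. All parameters here are assumptions for illustration, not trial data: a common surname held by 1% of people (a made-up share), dates of birth spread uniformly over 80 years, and surname independent of date of birth.

```python
SURNAME_SHARE = 0.01       # assumed share of people with the common surname
BIRTH_DATES = 365 * 80     # assumed number of equally likely dates of birth


def prob_unique(dataset_size):
    """Chance that nobody else in the dataset shares one person's surname and dob."""
    p_collision = SURNAME_SHARE / BIRTH_DATES
    return (1 - p_collision) ** (dataset_size - 1)


for size in (100, 100_000, 10_000_000):
    print(f"{size:>10,} records: {prob_unique(size):.4f}")
```

Under these assumptions a person with a common surname is almost certainly unique on date of birth and surname in a sample of 100, but rarely so in 10 million, which is why relaxed-criteria results from a small sample overstate how well the criteria discriminate.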
12.1.5 This enables a matching percentage to be derived which is related only to the efficacy of the match and is not skewed by members who will never match.
13. Analysis of address changes
The following results were obtained for the limited and weighted demographics / datasets used in the coverage profiling:
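| | DVLA No | DVLA % | IR No | IR % | UKPS No | UKPS % | UKPS Adjusted |
|---|---|---|---|---|---|---|---|
| All records | 11,941 | | 16,036 | | 6,560 | | |
| Dob + Name + Address | | | | | | | |
| DVLA | | | 8,266 | 51.5% | 3,164 | 48.2% | 80.4% |
| IR | 8,266 | 69.2% | | 0.0% | 3,864 | 58.9% | 98.2% |
| UKPS | 3,164 | 26.5% | 3,864 | 24.1% | | | |
| GRO | | | | | | | |
| Dob + Surname | | | | | | | |
| DVLA | | | 9,443 | 58.9% | 3,464 | 52.8% | 88.0% |
| IR | 9,443 | 79.1% | | | 4,244 | 64.7% | 107.8% |
| UKPS | 3,464 | 29.0% | 4,244 | 26.5% | | | |
| GRO | | | | | | | |
| People with different addresses | | | | | | | |
| DVLA | | | 1,178 | 7.3% | 300 | 4.6% | 7.6% |
| IR | 1,178 | 9.9% | | | 380 | 5.8% | 9.7% |
| UKPS | 300 | 2.5% | 380 | 2.4% | | | |
| As a % of matched addresses | | | | | | | |
| DVLA | | | | 12.5% | | 8.7% | 14.4% |
| IR | | 12.5% | | | | 9.0% | 14.9% |
| UKPS | | 8.7% | | 9.0% | | | |
| UKPS adjusted | | 14.4% | | 14.9% | | | |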
Results give the number of different addresses as between 9-15% of matched records. These represent the records shown in the diagram below:
The unknown remains the number of records where both databases hold out of date addresses.
[Figure: overlap of the DVLA (Drivers) and UKPS datasets by address currency, each plotted against year of birth (1900 to 2000). Regions shown:
• Drivers with a current address
• Drivers with old address
• Passport holders with a current address
• Passport holders with old address
• Drivers and passport holders with current address in both databases
• Drivers and passport holders with old address (DVLA) and current address (UKPS)
• Drivers and passport holders with old address (UKPS) and current address (DVLA)
• Drivers and passport holders with old address in both databases
• Drivers and passport holders with old and current address]
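The derived rows of the address-change table in this section can be reproduced from its counts. The report does not state the derivation, so the sketch below rests on two assumptions: that "people with different addresses" is the dob + surname match count minus the dob + name + address match count, and that the "Adjusted" column divides by the 60% of passport holders held on the UKPS database (see 11.3.3).

```python
UKPS_COVERAGE = 0.60  # assumption: "Only 60% of passport holders on this database"

full_match = {("DVLA", "IR"): 8266, ("DVLA", "UKPS"): 3164, ("IR", "UKPS"): 3864}
dob_surname = {("DVLA", "IR"): 9443, ("DVLA", "UKPS"): 3464, ("IR", "UKPS"): 4244}


def address_change(pair):
    """Return (count, % of dob + surname matches, UKPS-adjusted % or None)."""
    different = dob_surname[pair] - full_match[pair]
    share = 100 * different / dob_surname[pair]
    adjusted = share / UKPS_COVERAGE if "UKPS" in pair else None
    return different, share, adjusted


for pair in full_match:
    different, share, adjusted = address_change(pair)
    extra = f", adjusted {adjusted:.1f}%" if adjusted is not None else ""
    print(f"{pair[0]} vs {pair[1]}: {different} records, {share:.1f}%{extra}")
```

Under these assumptions the shares come out at 12.5%, 8.7% and 9.0%, with adjusted values of 14.4% and 14.9%, in line with the table; the DVLA vs IR count differs from the reported 1,178 by one record.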
14. False matches
Inspection of the family composition results shows an increased number of false matches, as expected.
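| Dataset | Result | Match by DoB + surname | Matches by DoB + name (fuzzy) | Matches by DoB + surname |
|---|---|---|---|---|
| DVLA | Matches + false matches | 46% | 43% | 38% |
| DVLA | No matches | 54% | 57% | 62% |
| UKPS | Matches + false matches | 29% | 27% | 24% |
| UKPS | No matches | 71% | 73% | 76% |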
The combination of false and missed matches as the match criteria are relaxed is illustrated in the following diagram:
[Figure: worked example of correct, missed and false matches under the three matching criteria. Each sample record shows name, address, date of birth, ID no and stakeholder; all share the date of birth 01/01/1950:
• James Doe, 20 High St, ID 1000, UKPS
• James Doe, 20 High St, ID 1000, DVLA
• John Doe, 12 Bridge St, ID 1001, DVLA
• John Doe, 20 High St, ID 1001, UKPS
• Sue Doe, 12 Bridge St, ID 1002, DVLA
• Susan Doe, 12 Bridge St, ID 1002, UKPS
• Ann Doe, 10 Kings Rd, ID 1003, UKPS
• John Jones, 80 Main St, ID 1004, DVLA
Results shown for each criterion:
• Matching by dob + name + address: automatic match 1; no matches 1-6; records are marked correct or missed
• Matching by dob + name: automatic matches 1 and 2; no matches 1-4; records are marked correct or missed
• Matching by dob + surname: automatic match 1; no match 1; records are marked correct or false match
Annotations:
• Full matches: % of population with full match and who hold a passport and driver's licence, i.e. match criteria are so strict that no false matches exist. Proportion remains unchanged as sample is scaled up.
• Grey matches: ratio of missed / false / correct matches varies as sample is scaled up.
• No matches: proportion reduces as sample is scaled up but can never go below the minimum of the % of population with a unique combination of dob and surname and which hold either a passport or a driver's licence (i.e. ratio of passports:drivers).]
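The sample records in the diagram above can be run through the three criteria directly. The sketch below uses exact comparison only (no fuzzy matching) and treats the ID no shown on each record as the person's true identity, which is an assumption made for illustration; a family containing more than one ID is counted as holding a false match.

```python
from collections import defaultdict

DOB = "01/01/1950"  # shared by every sample record in the diagram

# (name, address, id, stakeholder) as shown in the diagram.
RECORDS = [
    ("James Doe", "20 High St", "1000", "UKPS"),
    ("James Doe", "20 High St", "1000", "DVLA"),
    ("John Doe", "12 Bridge St", "1001", "DVLA"),
    ("John Doe", "20 High St", "1001", "UKPS"),
    ("Sue Doe", "12 Bridge St", "1002", "DVLA"),
    ("Susan Doe", "12 Bridge St", "1002", "UKPS"),
    ("Ann Doe", "10 Kings Rd", "1003", "UKPS"),
    ("John Jones", "80 Main St", "1004", "DVLA"),
]

CRITERIA = {
    "dob + name + address": lambda name, address: (DOB, name, address),
    "dob + name": lambda name, address: (DOB, name),
    "dob + surname": lambda name, address: (DOB, name.split()[-1]),
}


def group(records, key):
    """Group records into families that share the same match key."""
    families = defaultdict(list)
    for name, address, person_id, source in records:
        families[key(name, address)].append((person_id, source))
    return list(families.values())


def summarise(families):
    """Return (match families, unmatched records, families with a false match)."""
    matches = [f for f in families if len(f) > 1]
    unmatched = [f for f in families if len(f) == 1]
    false = [f for f in matches if len({pid for pid, _ in f}) > 1]
    return len(matches), len(unmatched), len(false)


results = {label: summarise(group(RECORDS, key)) for label, key in CRITERIA.items()}
for label, (matches, unmatched, false) in results.items():
    print(f"{label}: {matches} match families, {unmatched} unmatched, {false} with false matches")
```

The counts agree with the diagram: one match and six unmatched records on all fields, two matches and four unmatched on dob + name, and a single family of all seven Doe records on dob + surname, leaving only John Jones unmatched.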
[Figure: family composition when matching by date of birth and names. Number of families (0 to 20,000) for each combination of stakeholder datasets, from single-dataset families (DVLA only, GRO only, GROS only, IR only, UKPS only) through two-, three- and four-dataset combinations, with births and deaths also shown separately, broken down by family size.]
14.1 Family composition - matching by date of birth and names
14.1.1 This analysis identifies the increased level of matching and the occurrence of a small number of false matches as a result of relaxing the matching criteria to exclude address.