INDEPTH Data Quality Workshop
Program and Curriculum
11-13 May 2010, Accra, Ghana
Course Facilitator : Dr Kobus Herbst
1 Workshop Objectives
1. Create a common understanding of data quality in the context of health and demographic surveillance
2. Learn from the experience regarding data quality in the iShare initiative
3. Gain practical experience in measuring data quality in HDSS databases
4. Derive and agree on minimum data quality metrics for INDEPTH sites
5. Apply a minimum set of common data quality metrics to own HDSS database
6. Discuss the form and content of site data quality improvement projects and INDEPTH’s role in promoting such
2 Outcomes
1. Minimum set of INDEPTH Data Quality Metrics defined
2. Site data quality baselines established
3. Common outline and criteria for site data quality projects agreed to
4. Recommendation made for an INDEPTH Data Quality Assurance Program.
3 Program
Time | Topic | Presenter
Day 1 : 11 May 2010
8:00-9:00 | Registration | INDEPTH Secretariat
9:00-9:30 | Welcome and Introduction to Workshop Objectives | INDEPTH Executive, Course Facilitator
9:30-10:30 | What is Data Quality? | Course Facilitator
10:30-11:00 | Tea Break |
11:00-11:30 | Impact of Data Quality on Demographic Measures | Ayaga Bawah
11:30-12:00 | Extent and Implications of Poor Quality Data – iShare Experience | iShare representative
12:00-12:30 | Causes of Poor Quality Data | Course Facilitator
12:30-13:30 | Lunch Break |
13:30-14:30 | Measuring Data Quality : Theory | Course Facilitator
14:30-15:30 | Measuring Data Quality : iShare Experience | iShare representative
15:30-16:00 | Tea Break |
16:00-17:00 | Measuring Data Quality : Practical - Attribute domain constraints | Course Facilitator
Day 2 : 12 May 2010
8:30-9:30 | Measuring Data Quality : Practical – Relational integrity constraints | Course Facilitator
9:30-10:30 | Measuring Data Quality : Practical – Historical Data & State-Dependent Objects | Course Facilitator
10:30-11:00 | Tea Break |
11:00-11:30 | Measuring Data Quality : Practical – General Attribute Dependencies | Course Facilitator
11:30-13:00 | Discussion : Agreeing on a minimum set of data quality metrics for INDEPTH | All Participants
13:00-14:00 | Lunch Break |
14:00-17:00 | Applying agreed set of data quality metrics to own database | All Participants
Day 3 : 13 May 2010
8:30-10:00 | Comparison & Standardisation of Minimum Data Quality Metrics | Course Facilitator
10:00-10:30 | Tea Break |
10:30-11:00 | Total Data Quality Management : Theory | Course Facilitator
11:00-12:30 | Discussion : Data Quality Assurance in INDEPTH : The Way Forward | All Participants
12:30-13:00 | Publication : Workshop Proceedings | Course Facilitator
13:00-14:00 | Lunch |
14:00-16:00 | INDEPTH Minimum Dataset | INDEPTH Secretariat
4 Curriculum
4.1 What is Data Quality?
4.1.1 Learning Objectives
1. Explain the different roles that can be identified in the information production system
2. Understand the concept of an information product, and relate that to the HDSS research context
3. Understand and explain the different concepts of data quality
4. Identify the dimensions of data quality most relevant to HDSS
4.1.2 Content
1. Information System Roles
2. Information Products
3. Concepts & Dimensions of Data Quality
4.1.3 Pre-reading and Reference Material
1. Carlo Batini, Monica Scannapieco. Data Quality. Concepts, Methodologies and Techniques. 2006. Springer Berlin. Pp 1-49.
2. Jack E. Olson. Data Quality. The Accuracy Dimension. 2003. Morgan Kaufmann. San Francisco. Pp 3-64.
3. Census Bureau Methodology & Standards Council. Census Bureau Principle: Definition of Data Quality. 2006. US Census Bureau.
4. Danette McGilvray. Executing Data Quality Projects. Ten Steps to Quality Data and Trusted Information. 2008. Morgan Kaufmann Burlington. Pp 30-33.
5. Tim Holt, Tim Jones. Quality work and conflicting quality objectives. 1998. 84th DGINS conference, Stockholm 28-29 May 1998. Office for National Statistics, UK.
4.2 Impact of Data Quality on Demographic Measures
4.2.1 Learning Objectives
To be provided
4.2.2 Content
To be provided
4.2.3 Pre-reading and Reference Material
To be provided.
4.3 Extent and Implications of Poor Quality Data – iShare Experience
4.3.1 Learning Objectives
To be provided
4.3.2 Content
To be provided
4.3.3 Pre-reading and Reference Material
To be provided.
4.4 Causes of Poor Quality Data
4.4.1 Learning Objectives
1. Able to classify and describe the causes of poor data quality
4.4.2 Content
1. Research Design
   a. Research Question
   b. Research Methodology
   c. Data System Design
2. Population Factors
   a. Education
   b. Cultural
3. Data Collection
   a. Field workers
   b. Data collection instruments
   c. Data Entry
4. Data Analysis
   a. Data Conversion
   b. Data Extraction
   c. Data Cleaning
4.4.3 Pre-reading and Reference Material
1. Van den Broeck, J., S.A. Cunningham, R. Eeckels, and K. Herbst, Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Med, 2005. 2(10): p. e267.
4.5 Measuring Data Quality
4.5.1 Learning Objectives
1. Classify, list and explain the different rules that can be applied to measure data quality
4.5.2 Content
1. Data Quality Rules
   a. Attribute domain constraints
   b. Relational integrity constraints
   c. Rules for historical data
   d. Rules for state-dependent objects
   e. General attribute dependency rules
4.5.3 Pre-reading and Reference Material
1. Leo L. Pipino, Yang W. Lee, and Richard Y. Wang. Data Quality Assessment. Communications of the ACM. April 2002/Vol. 45, No. 4ve. p211.
2. Arkady Maydanchik. Data Quality Assessment. 2007. Technics Publications.
4.6 Measuring Data Quality : Practical
4.6.1 Learning Objectives
1. Apply Data Quality Rules to DSS Reference Data Model to derive data quality indicators
4.6.2 Content
The examples are all based on a sample database that follows the INDEPTH Reference Data Model (see Appendix A). The SQL used to derive the data quality indicators is contained in Appendix B. The SQL dialect is SQL Server 2008 T-SQL.
1. Attribute domain constraints
a. Optionality Constraints
These constraints prevent attributes from taking Null, or missing, values. Default values are often entered to circumvent the Not-Null constraints, i.e., the attribute is populated with a default value when the actual value is not available.
Example: Cause of Death codes
Cause        n
Unassigned   520
Null         745
Assigned     8225
Total        9490
Indicator    86.7%
$$\text{Indicator} = 1 - \frac{\text{Unassigned} + \text{Null}}{\text{Total}}$$
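The calculation behind this table can be sketched in a few lines of Python (the workshop scripts themselves are T-SQL, see Appendix B). The classifier below follows the three categories of the table; the rule that default codes sort before 'A' is taken from the Appendix B query, and the short list of raw codes is hypothetical and only there to exercise the function.

```python
from collections import Counter

def classify_cause(code):
    """Assign a raw cause-of-death code to one of the table's categories."""
    if code is None:
        return "Null"
    if code < "A":  # assumption: default/placeholder codes sort before 'A'
        return "Unassigned"
    return "Assigned"

def optionality_indicator(counts):
    """Indicator = 1 - (Unassigned + Null) / Total."""
    total = sum(counts.values())
    return 1 - (counts["Unassigned"] + counts["Null"]) / total

# Hypothetical raw values, not taken from the sample database.
sample = ["A09", None, "999", "J18", None]
sample_counts = Counter(classify_cause(c) for c in sample)

# Counts from the example table.
table_counts = {"Unassigned": 520, "Null": 745, "Assigned": 8225}
indicator = optionality_indicator(table_counts)
print(sample_counts, f"{indicator:.1%}")
```

Counting default values together with Nulls is the point of the example: a Not-Null constraint alone would report this attribute as complete.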
[Figure: Data Quality Cause of Death; indicator (0%-100%) by year, 2000-2010]
b. Format Constraints
These constraints define the expected form in which the attribute values are stored in the database field. Format constraints are most important when dealing with “legacy” databases. However, even modern databases are full of surprises. From time to time, numeric and date/time attributes are still stored in text fields.
Example : Surname field containing invalid characters.
Use wildcard characters or regular expressions to detect format violations. The exact function depends on the database in use. In SQL Server 2008 T-SQL, the PATINDEX function can be used to find any LastName that contains a character outside the set of upper-case and lower-case letters, the space and the single quote (').
    SELECT COUNT(*)
    FROM dbo.Individuals
    WHERE PATINDEX('%[^a-zA-Z '']%', LastName) > 0
LastName     n
Valid        126275
Invalid      137
Total        126412
Indicator    99.9%
$$\text{Indicator} = 1 - \frac{\text{Invalid}}{\text{Total}}$$
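For sites that do not use SQL Server, the same check can be sketched with a regular expression. The Python below is an illustration only; the character class mirrors the PATINDEX pattern above, and the list of names is hypothetical.

```python
import re

# Same character class as the PATINDEX pattern: any character other than
# an ASCII letter, a space or a single quote is treated as invalid.
INVALID_CHAR = re.compile(r"[^a-zA-Z ']")

def has_invalid_char(last_name):
    return INVALID_CHAR.search(last_name) is not None

def format_indicator(names):
    """Indicator = 1 - Invalid / Total."""
    invalid = sum(1 for name in names if has_invalid_char(name))
    return 1 - invalid / len(names)

# Hypothetical surnames, not taken from the sample database.
names = ["Nesbit", "O'Brien", "Van Wyk", "Smith2", "Myers-"]
flags = [has_invalid_char(name) for name in names]
print(flags, format_indicator(names))
```

Note that this pattern flags accented letters (for example the í in García) as invalid. In SQL Server the outcome for such characters may differ, because character ranges in PATINDEX follow the collation of the column.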
c. Valid Value Constraints
These constraints limit the permitted attribute values to a prescribed list or range. Unfortunately, valid value lists are often unavailable, incomplete, or incorrect. To identify valid values, we first need to collect counts of all actual values. These counts can then be analyzed, and actual values can be cross-referenced against the valid value list, if available. Values that are found in many records are probably valid, even if they are missing from the data dictionary. Such circumstances typically arise when new values are added after the original database design, but are not added to the documentation. Values that have low frequency are suspect.
Example : Residency episode initiating event type.
A residency episode should only be started by DSS start, birth or in-migration.
Start Type   n
Valid        168544
Invalid      2109
Total        170653
Indicator    98.8%
$$\text{Indicator} = 1 - \frac{\text{Invalid}}{\text{Total}}$$
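The profiling step described above (count every actual value, then cross-reference against the valid list) might look as follows in Python. The set of valid codes is an assumption made for this sketch: it treats every code except HMS as valid, which is what the Appendix B query does. The episode list is hypothetical.

```python
from collections import Counter

# Assumed valid list for this sketch; adapt it to the site's data dictionary.
VALID_START_TYPES = {"DSS", "DLV", "INM", "INT"}

def profile_values(values, valid):
    """Return (value, count, is_valid) rows by descending count, plus the indicator."""
    counts = Counter(values)
    report = [(value, n, value in valid) for value, n in counts.most_common()]
    invalid = sum(n for _, n, ok in report if not ok)
    return report, 1 - invalid / len(values)

# Hypothetical initiating event types for ten residency episodes.
episodes = ["DSS"] * 6 + ["INM"] * 2 + ["DLV"] + ["HMS"]
report, indicator = profile_values(episodes, VALID_START_TYPES)
for row in report:
    print(row)
print(indicator)
```

Listing the counts before applying the valid list makes it easier to spot frequent values that are missing from the documentation rather than wrong.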
[Figure: Data Quality : Residency Start; indicator (94%-100%) by year, 2000-2010]
Example : Birth Weights
[Figure: Birth Weight; number of births (0-500) by birth weight, axis from 0 to 5200]
d. Precision and Granularity Constraints
These constraints require all values of an attribute to have the same precision, granularity, and unit of measurement. Precision constraints can apply to both numeric and date/time attributes. For numeric values, they define the desired number of decimals. For date/time attributes, precision can be defined as calendar month, day, hour, minute, or second. Data profiling can be used to calculate distribution of values for each precision level.
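Example : Date of Birth Precision
Date Precision   Precision   n       Score
Day              1           15765   141885
Week             2           66      528
Fortnight        3           2       14
Month            4           519     3114
Quarter          5           11      55
Semester         6           67      268
Year             7           0       0
Decade           8           0       0
Unknown          9           0       0
Total                        16430   145864   (ScoreMax = 147870)
Indicator                            98.6%
$$\text{Score}_{\text{Precision}} = (10 - \text{Precision}) \times \text{Frequency}_{\text{Precision}}$$
$$\text{ScoreMax} = \sum_{\text{Precision}} \text{Frequency}_{\text{Precision}} \times 9$$
$$\text{ScoreTotal} = \sum_{\text{Precision}} \text{Score}_{\text{Precision}}$$
$$\text{Indicator} = 1 - \frac{\text{ScoreMax} - \text{ScoreTotal}}{\text{ScoreMax}}$$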
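As a check on the arithmetic, the score formulas can be applied directly to the frequencies in the table. The Python below is only a worked example of those formulas; the function and variable names are chosen for this sketch.

```python
# Frequency of each precision level (1 = Day ... 9 = Unknown), from the table.
frequencies = {1: 15765, 2: 66, 3: 2, 4: 519, 5: 11, 6: 67, 7: 0, 8: 0, 9: 0}

def precision_indicator(freq):
    """Apply Score = (10 - Precision) * Frequency and derive the indicator."""
    score_total = sum((10 - level) * n for level, n in freq.items())
    score_max = 9 * sum(freq.values())  # every record at Day precision
    indicator = 1 - (score_max - score_total) / score_max
    return score_total, score_max, indicator

score_total, score_max, indicator = precision_indicator(frequencies)
print(score_total, score_max, f"{indicator:.1%}")
```

The weighting means a date known only to the year costs far more than one known to the week, so the indicator reflects how imprecise the dates are and not just how many are imprecise.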
Example : Migration Date Precision
Indicator Value
External In-Migration Date 77.7%
External Out-Migration Date 77.4%
Internal In-Migration Date 79.0%
Internal Out-Migration Date 78.1%
[Figure: Migration Date Precision; indicator (0%-100%) by year, 2000-2009]
2. Relational integrity constraints
a. Identity rules
An identity rule validates that every record in a database table corresponds to one and only one real world entity and that no two records reference the same entity.
Example : Potential Individual duplications
Similarity measure = Levenshtein distance [1] (Firstname_a, Firstname_b)
  + Levenshtein distance (Lastname_a, Lastname_b)
  + (Sex_a = Sex_b ? 0 : 1)
  + ABS(YEAR(DoB_a) - YEAR(DoB_b))
  + ABS(MONTH(DoB_a) - MONTH(DoB_b))
  + ABS(DAY(DoB_a) - DAY(DoB_b))
[Figure: Individual Identifier Similarity Measure Distribution; number of pairs (0-1,000,000) by similarity measure, 0-54]
Similarity   n
0            42
1            238
2            442
3            820
4            1699
5            3832
6            8349
7            16849
8            31836
9            59679
10           110038
$$\text{Indicator} = 1 - \frac{\text{Individuals} - \text{UniqueIndividuals}}{\text{Individuals}} = 99.4\%$$
1 The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
Similarity = 1
Ind A | Ind B | Name A | Name B | Sex A | Sex B | DoB A | DoB B
4614316 9 (the two identifiers are run together in the source) | | Nesbit, Nqobile | Nesbit, Nqobile | FEM | FEM | 1992/09/08 | 1993/09/08
1001 | 1005 | Nguyen, Simbongiwe | Nguyen, Sibongiwe | FEM | FEM | 1995/03/25 | 1995/03/25
Similarity = 2
Ind A | Ind B | Name A | Name B | Sex A | Sex B | DoB A | DoB B
1388 | 18938 | Mitchell, Hlengiwe | Mitchell, Hlengiwe | FEM | FEM | 1983/05/16 | 1983/04/15
3378 | 3380 | Myers, Sandile | Myers, Zandile | MAL | FEM | 1987/11/25 | 1987/11/25
Similarity = 3
Ind A | Ind B | Name A | Name B | Sex A | Sex B | DoB A | DoB B
84 | 85 | Johnson, Ntando | Johnson, Nontando | MAL | FEM | 1983/12/03 | 1983/12/03
255 | 260 | Sosibo, Thandiwe | Sosibo, Thandeka | FEM | FEM | 1976/05/07 | 1976/05/07
569 | 12191 | Smith, Bongani | Smith, Lindani | MAL | MAL | 1994/08/14 | 1994/08/14
585 | 35418 | García, Sanele | García, Zanele | MAL | FEM | 1997/12/06 | 1996/12/06
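The similarity measure can be reproduced outside the database. The Python below is a minimal sketch, not the udfSeekDuplicates function used in Appendix B (whose body is not shown in this document): it implements the Levenshtein distance with the standard dynamic-programming recurrence and adds the sex and date-of-birth terms as defined above. The dictionary layout of a record is an assumption of this sketch.

```python
from datetime import date

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Sum of name distances, sex mismatch and date-of-birth component differences."""
    return (levenshtein(a["first"], b["first"])
            + levenshtein(a["last"], b["last"])
            + (0 if a["sex"] == b["sex"] else 1)
            + abs(a["dob"].year - b["dob"].year)
            + abs(a["dob"].month - b["dob"].month)
            + abs(a["dob"].day - b["dob"].day))

# Two pairs from the tables above.
myers_a = {"first": "Sandile", "last": "Myers", "sex": "MAL", "dob": date(1987, 11, 25)}
myers_b = {"first": "Zandile", "last": "Myers", "sex": "FEM", "dob": date(1987, 11, 25)}
garcia_a = {"first": "Sanele", "last": "García", "sex": "MAL", "dob": date(1997, 12, 6)}
garcia_b = {"first": "Zanele", "last": "García", "sex": "FEM", "dob": date(1996, 12, 6)}
print(similarity(myers_a, myers_b), similarity(garcia_a, garcia_b))
```

A low score does not prove duplication: the Myers pair, for example, differs in sex and could just as well be twins, so low-scoring pairs still need manual review.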
b. Reference rules
A reference rule ensures that every reference made from one entity occurrence to another entity occurrence can be successfully resolved. Each reference rule is represented in relational data models by a foreign key that ties an attribute or a collection of attributes of one entity with the primary key of another entity. Foreign keys guarantee that navigation of a reference across entities does not result in a “dead end.”
Example : Child to Parent references.
Status        Mother    Father
Known         74,043    32,257
Missing       2,855     9,708
Unknown       49,514    84,447
Total         126,412   126,412
Indicator A   97.7%     92.3%
Indicator B   58.6%     25.5%
$$\text{Indicator A} = 1 - \frac{\text{Missing}}{\text{Total}}$$
$$\text{Indicator B} = 1 - \frac{\text{Missing} + \text{Unknown}}{\text{Total}}$$
c. Cardinal rules
A cardinal rule defines the constraints on relationship cardinality. Cardinal rules are not to be confused with reference rules. Whereas reference rules are concerned with the identity of the occurrences in referenced entities, cardinal rules define the allowed number of such occurrences.
Residency   Wrong    Correct
Exists      170653   124657
None        1755     1755
Total       172408   126412
Indicator   99.0%    98.6%
$$\text{Indicator} = 1 - \frac{\text{None}}{\text{Total}}$$
[Figure: Residency Cardinality; number of individuals by number of residency episodes per individual]
Episodes per individual   0      1       2       3      4      5     6     7    8   9
Individuals               1755   90779   24790   6822   1686   436   115   19   9   1
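The difference between the "Wrong" and "Correct" columns is a matter of what is counted in the denominator: residency episodes or individuals. The Python below reworks both indicators from the cardinality distribution; it is only a worked example of the arithmetic, with names chosen for this sketch.

```python
# Number of individuals by number of residency episodes, from the figure data.
distribution = {0: 1755, 1: 90779, 2: 24790, 3: 6822, 4: 1686,
                5: 436, 6: 115, 7: 19, 8: 9, 9: 1}

individuals = sum(distribution.values())
episodes = sum(k * n for k, n in distribution.items())
without_episode = distribution[0]

# "Wrong": individuals without an episode are compared with the episode count.
wrong = 1 - without_episode / (episodes + without_episode)
# "Correct": individuals without an episode are compared with all individuals.
correct = 1 - without_episode / individuals

print(individuals, episodes, f"{wrong:.1%}", f"{correct:.1%}")
```

Counting episodes inflates the denominator because individuals with several episodes are counted more than once, which makes the indicator look better than it is.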
d. Inheritance rules
An inheritance rule expresses integrity constraints on entities that are associated through generalization and specialization, or more technically through sub-typing.
Example : Not available.
3. Rules for historical data
a. Currency Rule
A currency rule enforces the desired “freshness” of the historical data. Currency rules are usually expressed in the form of constraints on the effective date of the most recent record in the history. For example if the status of an individual under surveillance is 'Current', then the last visit date should be no earlier than the start of the previous surveillance round.
Example 1 : The last observation for current residency episodes must fall in the previous census round or later.
Currency      Residency Episodes
Current       62621
Not Current   2384
Total         65005
Indicator     96.3%
$$\text{Indicator} = 1 - \frac{\text{NotCurrent}}{\text{Total}}$$
Example 2 : At year end, the last status observation should not be earlier than 1 July of that year (that is, not older than 183 days).
[Figure: Currency of Status Observations; share of observations (0%-100%) classed as Undefined, NotCurrent or Current, by year, 2000-2008]
b. Retention Rule
A retention rule enforces the desired depth of the historical data. Retention rules are usually expressed in the form of constraints on the overall duration or the number of records in the history.
c. Granularity rule
A granularity rule requires all measurement periods in an accumulator history to have the same size.
For example, if the surveillance design implies a six-monthly visit to each homestead, is that in fact the case?
[Figure: Inter-round Visit Gap, interquartile ranges with 5-95 percentile extremes; days (0-350) by round pair n-1:n, rounds 2-21]
[Figure: Round Granularity; percentage (0%-100%) by round, 2-21]
d. Continuity rule
A continuity rule prohibits gaps and overlaps in accumulator histories. Continuity rules require that the beginning date of each measurement period immediately follows the end date of the previous period.
For example for internal migrations, the next residency episode must follow directly on the previous.
Example : Internal migrations
Continuity      n
Continuity      18 657
Discontinuity   6 430
Total           25 087
Indicator       74.4%
$$\text{Indicator} = 1 - \frac{\text{Discontinuity}}{\text{Total}}$$
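A continuity check compares the end of each residency episode with the start of the next one for the same individual. The Python below is a sketch under one assumption that is not stated in this document: an internal move is recorded with the next episode starting on the same day the previous one ends (a tolerance parameter allows other conventions). The episode dates are hypothetical.

```python
from datetime import date

def classify_links(episodes, tolerance_days=0):
    """Classify each pair of consecutive (start, end) episodes of one individual.

    Assumes episodes are sorted by start date and that a continuous history
    has the next start within tolerance_days of the previous end.
    """
    result = []
    for (_, prev_end), (next_start, _) in zip(episodes, episodes[1:]):
        gap = (next_start - prev_end).days
        if gap > tolerance_days:
            result.append("gap")
        elif gap < 0:
            result.append("overlap")
        else:
            result.append("continuous")
    return result

# Hypothetical residency history for one individual.
history = [
    (date(2001, 1, 1), date(2003, 5, 10)),
    (date(2003, 5, 10), date(2005, 2, 1)),
    (date(2005, 3, 15), date(2006, 1, 1)),
    (date(2005, 12, 1), date(2007, 1, 1)),
]
links = classify_links(history)
indicator = 1 - sum(link != "continuous" for link in links) / len(links)
print(links, round(indicator, 3))
```

Reporting gaps and overlaps separately helps with diagnosis: gaps usually point to a missing or late in-migration record, overlaps to a missing out-migration.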
[Figure: Continuity; indicator (0%-100%) by year, 2000-2009]
e. Timestamp pattern rule
A timestamp pattern rule requires all timestamps to fall into a certain prescribed date interval, such as every March or every other Wednesday or between the first and fifth of each month. Occasionally the pattern takes the form of minimum or maximum length of time between measurements. For example, participants in a medical study may be required to take blood pressure readings at least once a week. While the length of time between particular measurements will differ, it has to be no longer than seven days.
Example : As with the granularity rule, each homestead has to be visited at least once every six months.
Semester Visits   Locations
Visited           194,238
Not Visited       19,488
Total             213,726
Indicator         90.9%
$$\text{Indicator} = 1 - \frac{\text{NotVisited}}{\text{Total}}$$
Note : Care should be taken with the type of observations used to derive this measure. If, for example, only observations tied to residency and status observations are considered, locations that were visited but where no observation was recorded because the occupants could not be contacted will not be counted as visited in this indicator.
[Figure: Proportion of Locations With At Least One Visit per Semester During Period of Occupancy; percentage (0%-100%) by year/semester, 2000-2009]
f. Value Pattern Rule
Value histories for time-dependent attributes usually also follow systematic patterns. A value pattern rule utilizes these patterns to predict reasonable ranges of values for each measurement and identify likely outliers. Value pattern rules can restrict direction, magnitude, or volatility of change in data values.
i. Direction of Change
The simplest value pattern rules restrict the direction of value changes from measurement to measurement. A person's height is unlikely to decrease over repeated measurements, and the same holds for educational attainment.
Example: Educational attainment cannot decline.
Direction   Measures
Invalid     21,178
Valid       128,613
Total       149,791
Indicator   85.9%
$$\text{Indicator} = 1 - \frac{\text{Invalid}}{\text{Total}}$$
ii. Magnitude of Change
A magnitude-of-change rule is usually expressed as a maximum (and occasionally minimum) allowed change per unit of time.
Example : Educational attainment cannot increase by more than the difference in years between two observation dates.
Direction           Measures
Valid               117,626
Invalid Direction   21,181
Invalid Magnitude   10,984
Total               149,791
Indicator           78.5%
$$\text{Indicator} = 1 - \frac{\text{InvalidDirection} + \text{InvalidMagnitude}}{\text{Total}}$$
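Both value pattern rules can be applied to consecutive pairs of status observations. The Python below is a sketch with two assumptions that are not in the source: educational attainment is recorded as completed years of education, and the elapsed time between observations is rounded up to whole years before comparison. The observation history is hypothetical.

```python
import math
from datetime import date

def classify_change(prev, curr):
    """Classify a pair of (observation_date, years_of_education) tuples."""
    (d0, e0), (d1, e1) = prev, curr
    if e1 < e0:
        return "invalid direction"   # attainment cannot decline
    # Assumption: elapsed time is rounded up to whole years.
    elapsed_years = math.ceil((d1 - d0).days / 365.25)
    if e1 - e0 > elapsed_years:
        return "invalid magnitude"   # gained more years than have elapsed
    return "valid"

# Hypothetical observation history for one individual.
observations = [
    (date(2000, 6, 1), 5),
    (date(2001, 6, 1), 6),
    (date(2002, 6, 1), 4),
    (date(2003, 6, 1), 8),
]
results = [classify_change(a, b) for a, b in zip(observations, observations[1:])]
print(results)
```

An invalid pair says only that one of the two observations is wrong, not which one. In the history above, the value 4 is the likely error and it triggers both violations.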
g. Event History rules
i. Event Dependencies
Various events often affect the same objects and therefore may be interdependent. Data quality rules can use these dependencies to validate the event histories. For example, an out-migration event cannot be recorded for an individual without a prior birth or in-migration event.
Example : Outmigration events cannot be preceded by ‘Death’, ‘Visit’ or ‘Outmigration’ events.
Dependency   n
Correct      56,817
Incorrect    3
Total        56,820
Indicator    99.99%
$$\text{Indicator} = 1 - \frac{\text{Incorrect}}{\text{Total}}$$
ii. Event Conditions
Events of many kinds do not occur at random but rather only happen under certain unique circumstances. Event conditions verify these circumstances.
Example: Birth spacing. The time between two successive pregnancies with a live birth outcome should not be less than 9 months (280 days).
Birth Spacing   Pregnancies
Too Short       542
Valid           40,935
Total           41,477
Indicator       98.7%
$$\text{Indicator} = 1 - \frac{\text{TooShort}}{\text{Total}}$$
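The birth spacing condition reduces to comparing the dates of successive live-birth outcomes for the same mother. The Python below is a sketch that assumes one outcome date per pregnancy, so twins are recorded as a single pregnancy; the dates are hypothetical.

```python
from datetime import date

MIN_SPACING_DAYS = 280  # threshold used in the example above

def short_intervals(outcome_dates):
    """Return consecutive pairs of outcome dates that are closer than the minimum."""
    ordered = sorted(outcome_dates)
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if (b - a).days < MIN_SPACING_DAYS]

# Hypothetical live-birth outcome dates for one mother.
outcomes = [date(2004, 6, 30), date(2001, 3, 1), date(2001, 11, 15)]
flagged = short_intervals(outcomes)
print(flagged)
```

A flagged interval often points to a duplicated pregnancy record or to a child linked to the wrong mother, so both records in the pair should be inspected.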
iii. Event-specific Attribute Constraints
Events themselves are often complex entities, each with numerous attributes.
Example: A pregnancy outcome event requires the mother to be of child bearing age.
Mother's Age   Pregnancies
Valid          70,601
Invalid        1,285
Total          71,886
Indicator      98.2%
4. Rules for state-dependent objects
These rules place constraints on the lifecycle of objects described by so-called state-transition models.
State-dependent objects go through a sequence of states in the course of their life cycle as a result of various events. Data for the state-dependent objects is very common in real world databases and is also most error-prone. Various data quality rules can be implemented to validate such data. Some of these rules are rather simple, while others can be quite complex and vary significantly depending on the data structure. In all cases, data quality rules for state-dependent objects are key to successful data quality assessment, since data for such objects is typically very important and yet contains numerous "hidden" errors.
[Figure: state-transition diagram. States: Not under surveillance; Under surveillance (known location); Under surveillance (unknown location); Dead. Actions: Census, Inmigration, Birth, Outmigration, Death, Internal Outmigration, Internal Inmigration, Visit.]
a. State domain constraint
A state domain constraint limits the set of allowed states to only those shown in the state-transition model. Invalid states are usually typos inside otherwise valid records. The true state can often be deduced based on the action value.
b. Action domain constraint
An action domain constraint limits the set of allowed actions to only those shown in the state-transition model. Invalid actions are usually typos inside otherwise valid records. The true action can often be deduced based on the state value.
c. Terminator domain constraint
A terminator domain constraint limits the set of allowed terminators, specifically states in which an object can start and end its life cycle. Invalid terminators often are a symptom of missing records at the beginning of the life cycle.
Example : Invalid states at first transition
To State    Action   n
INV         HMS      1,838
INV         INM      51
INV         INT      9,205
SLK         DLV      16,430
SLK         DSS      62,633
SLK         INM      34,500
Total                124,657
Indicator            91.1%
d. State-transition constraints
These constraints limit state changes to those allowed by the state-transition model. For example, a person who is already out-migrated cannot be out-migrated again without being in-migrated in between. Invalid state-transitions often signify a missing action.
Example : Residency state transitions
Final State   Individuals
Invalid       16,409
Valid         108,248
Total         124,657
Indicator     86.8%
$$\text{Indicator} = 1 - \frac{\text{InvalidEndState}}{\text{Individuals}}$$
State Transitions
Invalid       16,409
Valid         296,501
Total         312,910
Indicator     94.8%
$$\text{Indicator} = 1 - \frac{\text{InvalidTransition}}{\text{Transitions}}$$
Invalid Transition Causes
Invalid Reason | Action | n | %
Action disallowed if not under surveillance | INT | 11,329 | 69.0%
Invalid action | HDS | 1,161 | 19.8% (HDS and HMS combined)
Invalid action | HMS | 2,088 |
Action cannot start a residency if at unknown location | INM | 1,620 | 9.9%
Temporal integrity violated | HDS | 3 | 0.9% (all four actions combined)
Temporal integrity violated | HMS | 1 |
Temporal integrity violated | INT | 78 |
Temporal integrity violated | OTM | 64 |
Action condition violated | INM | 55 | 0.3%
Action cannot start a residency if already at known location | INM | 4 | 0.1% (INM and INT combined)
Action cannot start a residency if already at known location | INT | 6 |
Total | | 16,409 |
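State-transition checks of this kind are typically implemented by replaying each individual's actions against a transition table. The Python below is an illustrative sketch only: the state and action names are taken from the labels of the state-transition diagram, but the set of allowed transitions is an assumption made for this example and should be replaced by the site's own model.

```python
# Assumed transition model: (current state, action) -> next state.
TRANSITIONS = {
    ("NotUnderSurveillance", "Census"): "KnownLocation",
    ("NotUnderSurveillance", "Birth"): "KnownLocation",
    ("NotUnderSurveillance", "Inmigration"): "KnownLocation",
    ("KnownLocation", "Visit"): "KnownLocation",
    ("KnownLocation", "Outmigration"): "NotUnderSurveillance",
    ("KnownLocation", "Death"): "Dead",
    ("KnownLocation", "InternalOutmigration"): "UnknownLocation",
    ("UnknownLocation", "InternalInmigration"): "KnownLocation",
}

def first_invalid(actions, state="NotUnderSurveillance"):
    """Replay actions in order; return (index, state, action) of the first
    transition not allowed by the model, or None if the history is valid."""
    for index, action in enumerate(actions):
        next_state = TRANSITIONS.get((state, action))
        if next_state is None:
            return index, state, action
        state = next_state
    return None

# Hypothetical action histories.
valid_history = ["Birth", "Visit", "Outmigration", "Inmigration", "Death"]
invalid_history = ["Census", "Outmigration", "Outmigration"]
print(first_invalid(valid_history), first_invalid(invalid_history))
```

Returning the state in which the invalid action occurred gives the kind of breakdown shown in the table above, for example "action disallowed if not under surveillance".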
e. State-action constraints
Require that each action is consistent with the change in the object state. For example, after an out-migration, the state of an individual must be non-resident.
f. Continuity rules
Prohibit gaps and overlaps in state-transition history. In other words, they require that the effective date of each state record must immediately follow the end date of the previous state record.
Example : See 3.d. Historical data, continuity rule
g. Duration rules
Put a constraint on the maximum and/or minimum length of time an object can stay in any specific state. The simplest form of the duration rule is the zero-length rule, which requires the length of time spent in each state to be greater than zero.
Example : Residency episode duration cannot be negative (end before start) or zero.
Duration    Episodes
Valid       170,372
Invalid     281
Total       170,653
Indicator   99.8%
$$\text{Indicator} = 1 - \frac{\text{InvalidEpisodes}}{\text{Episodes}}$$
[Figure: Resident Episode Durations; indicator (98.6%-100%) by year, 2000-2010]
h. Action pre-conditions
The conditions that must be satisfied before an action can take place. E.g. Mother must be resident for a child to start residency with birth
Example : Mother’s state at child residency start, if child starts residency with delivery.
Mothers               Children
Mother resident       14,696
Mother non-resident   1,619
Mother unknown        131
Total                 16,446
Indicator             89.4%
$$\text{Indicator} = 1 - \frac{\text{MotherNonResident} + \text{MotherUnknown}}{\text{Children}}$$
i. Action post-conditions
These are the conditions that must be satisfied after the action is successfully completed.
5. General attribute dependency rules
Rules that describe complex attribute relationships, including constraints on redundant, derived, partially dependent, and correlated attributes.
a. Redundant attributes
Redundant attributes are data elements that represent the same attribute of a real world object. While attribute redundancy goes against basic data modelling principles, it is common in practice for several reasons. First, redundancy is widespread in “legacy” databases and certain systems that were converted from the “legacy” databases. Secondly, redundancy is often used even in modern relational databases to improve efficiency of data access, information presentation, and transaction processing. Finally, some data across different systems are invariably redundant. Comparison of redundant attributes is a sure way to identify (and eventually correct) numerous data problems.
Example: The link between mother and child is explicit via MotherID and implicit via births and pregnancies; the two should be consistent.
Link         Pairs
Linked       15746
Not linked   1
Total        15747
Indicator    99.99%
$$\text{Indicator} = 1 - \frac{\text{NotLinked}}{\text{Pairs}}$$
Of the cases where the residency start is Birth and it is linked to a Pregnancy, in one case this link between child and mother was not reflected in the MotherID of the child.
The converse is slightly more complex. Of the children born to the mother while she was resident, are all such children recorded as resident and the residency start marked as Birth? Whether this test is absolute will depend on the eligibility rules of the HDSS.
Link                 Pairs   Description
Birth not linked     1750    Child resident by birth is not linked to a resident mother via a pregnancy.
Birth not resident   1102    Resident mother gave birth to a child that is not resident from birth.
Consistent           14696
Total                17548
Indicator            83.7%
$$\text{Indicator} = 1 - \frac{\text{BirthsNotLinked} + \text{BirthsNotResident}}{\text{Pairs}}$$
b. Derived Attributes
Values of derived attributes are calculated based on the values of some other attributes. This approach is very common when the calculation is rather complex and involves data stored in multiple records of possibly multiple entities. Performing the calculation on the fly is then very inefficient. One of the most common special cases of derived attribute constraints is a balancing rule, which requires an aggregate attribute to equal the total of atomic level attribute values.
Example : Data should satisfy the demographic equation:
$$\text{Population}_{t+1} = \text{Population}_t + (\text{Births}_t - \text{Deaths}_t) + (\text{Immigration}_t - \text{Emigration}_t)$$
Component Y2000 Y2001 Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
Population 0 66035 68027 67277 65785 64981 64916 65376 64289 63494
Start observation 62633 0 0 0 0 0 0 0 0 0
Births 1675 1723 1719 1641 1743 1748 1749 1692 1579 1101
Immigration 3887 5689 7033 6032 5111 5348 5511 5242 5279 4992
Deaths 886 1077 1098 1129 983 979 886 913 796 697
Emigration 1923 4857 8194 7229 6204 6038 5760 6171 5702 5427
Population t+1 65386 67513 67487 66592 65452 65060 65530 65226 64649 63463
Balance -649 -514 210 807 471 144 154 937 1155
Indicator 99.0% 99.2% 99.7% 98.8% 99.3% 99.8% 99.8% 98.6% 98.2%
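The Balance and Indicator rows can be reproduced from the component rows. The Python below reworks the first three years of the table. Two points are inferred from the tabulated values and are not stated in the text: the balance is the computed Population t+1 minus the population recorded at the start of the following year, and the indicator divides the absolute balance by the computed Population t+1.

```python
# Component values for 2000-2002 (population also for 2003), from the table.
population = {2000: 0, 2001: 66035, 2002: 68027, 2003: 67277}
start_obs = {2000: 62633, 2001: 0, 2002: 0}
births = {2000: 1675, 2001: 1723, 2002: 1719}
immigration = {2000: 3887, 2001: 5689, 2002: 7033}
deaths = {2000: 886, 2001: 1077, 2002: 1098}
emigration = {2000: 1923, 2001: 4857, 2002: 8194}

def balance_check(year):
    """Return (computed population t+1, balance, indicator) for one year."""
    expected = (population[year] + start_obs[year]
                + births[year] - deaths[year]
                + immigration[year] - emigration[year])
    balance = expected - population[year + 1]
    # Assumption: indicator = 1 - |balance| / computed population t+1.
    return expected, balance, 1 - abs(balance) / expected

results = {year: balance_check(year) for year in (2000, 2001, 2002)}
for year, (expected, balance, indicator) in results.items():
    print(year, expected, balance, f"{indicator:.1%}")
```

A non-zero balance means that some individuals entered or left the population without a recorded event, which is what the second table tries to account for with boundary changes and loss to follow-up.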
Provision made for contextual factors such as change in HDSS boundary and loss to follow-up:
Component Y2000 Y2001 Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
Population 0 66033 68025 67266 65654 64263 63436 65376 64289 63494
Start observation 62633 0 0 0 0 0 2239 0 0 0
Births 1675 1723 1719 1640 1729 1696 1685 1692 1579 1101
Immigration 3885 5689 7027 5974 4940 5069 5172 5242 5279 4992
Deaths 886 1077 1098 1129 983 979 886 913 796 697
Emigration 1923 4857 8194 7229 6204 6038 5760 6171 5702 5427
Loss to Follow-up 56 50 96 78 93 67 204 538 635 36233
Population t+1 65328 67461 67383 66444 65043 63944 65682 64688 64014 27230
Balance -705 -564 117 790 780 508 306 399 520
Indicator 98.9% 99.2% 99.8% 98.8% 98.8% 99.2% 99.5% 99.4% 99.2%
c. Partially Dependent Attributes
The values of redundant and derived attributes are prescribed exactly by the dependency. Oftentimes, the relationships between attributes are not so exact. The value of one attribute may restrict possible values of another attribute to a smaller subset, but not to a single value.
Example : Certain causes of death are only possible for women and/or men, e.g. cancer of the cervix or causes related to maternal death.
Causes of death that ought to be associated with women:
Sex         n
FEM         120
MAL         1
Total       121
Indicator   99.2%
$$\text{Indicator} = 1 - \frac{\text{MaleDeaths}}{\text{DeathsFemaleCauses}}$$
d. Conditional Optionality
Conditional optionality represents situations where values of one attribute determine whether or not the other attribute must take Null or not-Null value (i.e., is the value to be prevented or required). Technically speaking, attributes with conditional optionality are a special case of partially dependent attributes discussed above.
e. Correlated Attributes
Values of one attribute can change the likelihood of values of another one, though not firmly restricting any possibilities. An example is the correlation between gender and first name. The majority of names are distinctly male or female. Thus there is a definite relationship between these attributes; however, the relationship is not exact in nature.
4.7 Total Data Quality Management : Theory
4.7.1 Learning Objectives
1. Able to identify the role players in data quality and their respective roles
2. Able to describe the basic principles of Total Data Quality Management
3. Able to list and describe the steps in the Ten Step Approach to Data Quality Improvement
4.7.2 Content
1. Role Players
   a. Data Collectors
   b. Data Custodians
   c. Data Consumers
2. Total Data Quality Management Cycle
   a. Define
   b. Measure
   c. Analyse
   d. Improve
4.7.3 Pre-reading and Reference Material
1. Carlo Batini, Monica Scannapieco. Data Quality. Concepts, Methodologies and Techniques. 2006. Springer Berlin. Pp 161-188.
2. Danette McGilvray. Executing Data Quality Projects. Ten Steps to Quality Data and Trusted Information. 2008. Morgan Kaufmann Burlington. Pp 54-58.
Appendix A : Sample Database
Tables and columns:
Births: ResidentEpisode, Pregnancy, Birthweight
CensusRounds: CensusRound, StartDate, EndDate
Deaths: ResidentEpisode, DeathCause, DeathLocation
Individuals: Individual, LastName, FirstName, Sex, DoB, EndDate, MotherID, FatherID
InMigrations: ResidentEpisode, OriginLocation, OriginPlace, Reason
Locations: Location, Latitude, Longitude, StartDate, EndDate
Observations: Observation, Location, CensusRound, ObservationDate, Observer, ObservationType
OutMigrations: ResidentEpisode, DestinationLocation, DestinationPlace, Reason
Pregnancies: Pregnancy, Individual, StartDate, FirstObservation, EndDate, TerminatingEventType, LastObservation, StillBorn, LiveBorn, BirthAttendant, BirthLocation
ResidentEpisodes: ResidentEpisode, Individual, Location, StartDate, StartPrecision, InitiatingEventType, FirstObservation, EndDate, EndPrecision, TerminatingEventType, LastObservation
StatusObservations: StatusObservationID, Individual, Observation, MaritalStatus, EducationLevel
Appendix B : SQL Scripts----region Attribute Domain Constraints----region Optionality Constraints----region Cause of Death example--SELECT DeathCause, COUNT(*) nFROM dbo.Deaths DGROUP BY DeathCauseORDER BY DeathCause --COUNT(*) DESC--SELECT DeathCause, MAX(C.Description) Description, COUNT(*) nFROM dbo.Deaths D LEFT JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)GROUP BY DeathCauseORDER BY DeathCause---- Final formulation--SELECT CASE WHEN DeathCause IS NULL THEN 'Null' WHEN DeathCause<'A' THEN 'Unassigned' ELSE 'Assigned' END Cause, COUNT(*) nFROM dbo.Deaths D LEFT JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)GROUP BY CASE WHEN DeathCause IS NULL THEN 'Null' WHEN DeathCause<'A' THEN 'Unassigned' ELSE 'Assigned' END---- Data Quality Trend--SELECT YEAR(E.EndDate) Year, CASE WHEN DeathCause IS NULL THEN 'Null' WHEN DeathCause<'A' THEN 'Unassigned' ELSE 'Assigned' END Cause, COUNT(*) nFROM dbo.Deaths D JOIN dbo.ResidentEpisodes E ON D.ResidentEpisode=E.ResidentEpisode LEFT JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)GROUP BY YEAR(E.EndDate), CASE WHEN DeathCause IS NULL THEN 'Null' WHEN DeathCause<'A' THEN 'Unassigned' ELSE 'Assigned' ENDORDER BY YEAR(E.EndDate),Cause--endregion--region Complex example - Internal migration destination-- Destination location for internal migrationsSELECT DestinationLocation, COUNT(*) nFROM dbo.OutMigrations OM
26
JOIN dbo.ResidentEpisodes RE ON OM.ResidentEpisode=RE.ResidentEpisodeWHERE TerminatingEventType='INT'GROUP BY DestinationLocationORDER BY COUNT(*) DESC---- Grouped Destination--SELECT CASE WHEN DestinationLocation=999998 THEN 'Unknown' WHEN DestinationLocation IS NULL THEN 'Null' ELSE 'Known' END Destination, COUNT(*) nFROM dbo.OutMigrations OM JOIN dbo.ResidentEpisodes RE ON OM.ResidentEpisode=RE.ResidentEpisodeWHERE TerminatingEventType='INT'GROUP BY CASE WHEN DestinationLocation=999998 THEN 'Unknown' WHEN DestinationLocation IS NULL THEN 'Null' ELSE 'Known' END---- Further Investigation--SELECT CASE WHEN DestinationLocation=999998 THEN 'Unknown' WHEN DestinationLocation IS NULL THEN 'Null' WHEN L.Location IS NULL THEN 'Location wrong' ELSE 'Known' END Destination, COUNT(*) nFROM dbo.OutMigrations OM JOIN dbo.ResidentEpisodes RE ON OM.ResidentEpisode=RE.ResidentEpisode LEFT JOIN dbo.Locations L ON OM.DestinationLocation=L.LocationWHERE TerminatingEventType='INT'GROUP BY CASE WHEN DestinationLocation=999998 THEN 'Unknown' WHEN DestinationLocation IS NULL THEN 'Null' WHEN L.Location IS NULL THEN 'Location wrong' ELSE 'Known' END--endregion--endregion--region Format ConstraintsSELECT COUNT(*) Total, SUM(CASE WHEN PATINDEX('%[^a-zA-Z '']%',LastName)>0 THEN 1 ELSE 0 END) InvalidFROM dbo.Individuals--endregion--region Valid Value ConstraitsSELECT InitiatingEventType, COUNT(*) nFROM dbo.ResidentEpisodesGROUP BY InitiatingEventType-- SELECT YEAR(StartDate) Yr, CASE WHEN InitiatingEventType='HMS' THEN 'Invalid' ELSE 'Valid' END Validity, COUNT(*) nFROM dbo.ResidentEpisodesGROUP BY YEAR(StartDate), CASE
    WHEN InitiatingEventType='HMS' THEN 'Invalid'
    ELSE 'Valid'
  END
ORDER BY Yr,Validity
--
-- Birth Weight
--
SELECT Birthweight/100 W100q,
  COUNT(*) n
FROM dbo.Births B
  JOIN dbo.ResidentEpisodes R ON (B.ResidentEpisode=R.ResidentEpisode)
WHERE StartDate BETWEEN '20000101' AND '20101231'
GROUP BY Birthweight/100
ORDER BY Birthweight/100
--endregion
--region Precision and Granularity Constraints
--region Date of Birth
-- Birth Date
SELECT StartPrecision,
  COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DLV'
GROUP BY StartPrecision
ORDER BY StartPrecision
--endregion
--region Complex example Migration Date Precision
--
-- InMigration
SELECT StartPrecision,
  COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='INM'
GROUP BY StartPrecision
ORDER BY StartPrecision
--
-- Internal InMigration
SELECT StartPrecision,
  COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='INT'
GROUP BY StartPrecision
ORDER BY StartPrecision
--
-- OutMigration
SELECT EndPrecision,
  COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='OTM'
GROUP BY EndPrecision
ORDER BY EndPrecision
--
-- Internal OutMigration
SELECT EndPrecision,
  COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='INT'
GROUP BY EndPrecision
ORDER BY EndPrecision
--
-- Migration Precision by Time
--
WITH InPrecision AS (
  SELECT YEAR(StartDate) Yr,
    StartPrecision [Precision],
    COUNT(*) n
  FROM dbo.ResidentEpisodes
  WHERE InitiatingEventType IN ('INM','INT')
  GROUP BY YEAR(StartDate),StartPrecision
),
OutPrecision AS (
  SELECT YEAR(EndDate) Yr,
    EndPrecision [Precision],
    COUNT(*) n
  FROM dbo.ResidentEpisodes
  WHERE TerminatingEventType IN ('INT','OTM')
  GROUP BY YEAR(EndDate),EndPrecision
),
InScore AS (
  SELECT Yr,
    SUM((10-[Precision])*n) Score,
    SUM(9*n) MaxScore
  FROM InPrecision
  GROUP BY Yr
),
OutScore AS (
  SELECT Yr,
    SUM((10-[Precision])*n) Score,
    SUM(9*n) MaxScore
  FROM OutPrecision
  GROUP BY Yr
)
SELECT I.Yr,
  SUM(ISNULL(I.Score,0)+ISNULL(O.Score,0)),
  SUM(ISNULL(I.MaxScore,0)+ISNULL(O.MaxScore,0))
FROM InScore I
  JOIN OutScore O ON (I.Yr=O.Yr)
GROUP BY I.Yr
ORDER BY I.Yr
--endregion
--endregion
--endregion
--
--region Relational Integrity Constraints
--region Identity Rules
-- Duplicate Individuals
SELECT *
INTO IndividualComparison
FROM dbo.udfSeekDuplicates()
--
SELECT Similarity,
  COUNT(*) n
FROM dbo.IndividualComparison
GROUP BY Similarity
ORDER BY Similarity
--
SELECT C.IndA,C.IndB,
  I1.FirstName FirstNameA, I2.FirstName FirstNameB,
  I1.LastName LastNameA, I2.LastName LastNameB,
  I1.Sex SexA, I2.Sex SexB,
  I1.DoB DoBA, I2.DoB DoBB
FROM dbo.IndividualComparison C
  JOIN dbo.Individuals I1 ON (C.IndA=I1.Individual)
  JOIN dbo.Individuals I2 ON (C.IndB=I2.Individual)
WHERE C.Similarity=0
ORDER BY C.IndA,C.IndB
--region AC Specific
SELECT C.IndA,C.IndB,
  I1.Name NameA, I2.Name NameB,
  I1.Sex SexA, I2.Sex SexB,
  I1.DoB DoBA, I2.DoB DoBB
FROM dbo.IndividualComparison C
  JOIN ACDIS.dbo.vacNamedIndividuals I1 ON (C.IndA=I1.IIntID)
  JOIN ACDIS.dbo.vacNamedIndividuals I2 ON (C.IndB=I2.IIntID)
WHERE C.Similarity=0
ORDER BY C.IndA,C.IndB
--endregion
SELECT COUNT(*)
FROM dbo.Individuals
--endregion
--region Reference Rules
--
-- Child to Parent linkages
-- MotherId on Child
SELECT
  CASE
    WHEN C.MotherID IS NULL THEN 'Unknown'
    WHEN M.Individual IS NULL THEN 'Missing'
    ELSE 'Known'
  END Mother,
  COUNT(*) n
FROM dbo.Individuals C
  LEFT JOIN dbo.Individuals M ON (C.MotherID=M.Individual)
GROUP BY
  CASE
    WHEN C.MotherID IS NULL THEN 'Unknown'
    WHEN M.Individual IS NULL THEN 'Missing'
    ELSE 'Known'
  END
-- FatherId on Child
SELECT
  CASE
    WHEN C.FatherID IS NULL THEN 'Unknown'
    WHEN F.Individual IS NULL THEN 'Missing'
    ELSE 'Known'
  END Father,
  COUNT(*) n
FROM dbo.Individuals C
  LEFT JOIN dbo.Individuals F ON (C.FatherID=F.Individual)
GROUP BY
  CASE
    WHEN C.FatherID IS NULL THEN 'Unknown'
    WHEN F.Individual IS NULL THEN 'Missing'
    ELSE 'Known'
  END
--endregion
--region Cardinal Rules
-- Incorrect formulation
SELECT
  CASE
    WHEN R.Individual IS NULL THEN 'None'
    ELSE 'Exists'
  END Residency,
  COUNT(*) n
FROM dbo.Individuals I
  LEFT JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
GROUP BY
  CASE
    WHEN R.Individual IS NULL THEN 'None'
    ELSE 'Exists'
  END
--region Correct formulation
WITH UniqueResidencies AS (
  SELECT DISTINCT Individual
  FROM dbo.ResidentEpisodes
)
SELECT
  CASE
    WHEN R.Individual IS NULL THEN 'None'
    ELSE 'Exists'
  END Residency,
  COUNT(*) n
FROM dbo.Individuals I
  LEFT JOIN UniqueResidencies R ON (I.Individual=R.Individual)
GROUP BY
  CASE
    WHEN R.Individual IS NULL THEN 'None'
    ELSE 'Exists'
  END
--
-- Residency Cardinality
--
WITH ResidencyCount AS (
  SELECT I.Individual,
    COUNT(*) n
  FROM dbo.Individuals I
    JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
  GROUP BY I.Individual
  UNION
  SELECT I.Individual,
    0 n
  FROM dbo.Individuals I
    LEFT JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
  WHERE R.Individual IS NULL
)
SELECT n ResidencyCardinality,
  COUNT(*) Cnt
FROM ResidencyCount
GROUP BY n
ORDER BY n
--endregion
--endregion
--endregion
--
--region Rules for Historical Data
--region Currency Rule
--
-- Last visit of current residency episodes
SELECT CensusRound,
  MIN(ObservationDate) MinDate,
  MAX(ObservationDate) MaxDate
FROM dbo.Observations
GROUP BY CensusRound
ORDER BY CensusRound
--
-- Start of previous round 13 Jul 2009
SELECT
  CASE
    WHEN EndDate>'20090712' THEN 'Current'
    ELSE 'Not Current'
  END Currency,
  COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='VIS'
GROUP BY
  CASE
    WHEN EndDate>'20090712' THEN 'Current'
    ELSE 'Not Current'
  END
--
-- Currency of status observation, e.g. MaritalStatus
--
WITH YearEnds AS (
  SELECT CAST('20001231' AS datetime) YearEnd
  UNION SELECT CAST('20011231' AS datetime) YearEnd
  UNION SELECT CAST('20021231' AS datetime) YearEnd
  UNION SELECT CAST('20031231' AS datetime) YearEnd
  UNION SELECT CAST('20041231' AS datetime) YearEnd
  UNION SELECT CAST('20051231' AS datetime) YearEnd
  UNION SELECT CAST('20061231' AS datetime) YearEnd
  UNION SELECT CAST('20071231' AS datetime) YearEnd
  UNION SELECT CAST('20081231' AS datetime) YearEnd
  UNION SELECT CAST('20091231' AS datetime) YearEnd
),
YearEndIndividuals AS (
  SELECT DISTINCT Individual,YearEnd
  FROM dbo.ResidentEpisodes R
    CROSS JOIN YearEnds
  WHERE R.EndDate>=YearEnd
    AND R.StartDate<YearEnd
),
SOCurrency AS (
  SELECT S.Individual,YearEnd,
    MIN(DateDiff(day,O.ObservationDate,YearEnd)) Currency
  FROM dbo.StatusObservations S
    JOIN dbo.Observations O ON (S.Observation=O.Observation)
    JOIN YearEndIndividuals I ON (S.Individual=I.Individual)
      AND (O.ObservationDate<=I.YearEnd)
  GROUP BY S.Individual,YearEnd
)
SELECT I.YearEnd,
  CASE
    WHEN C.Currency IS NULL THEN 'Undefined'
    WHEN C.Currency>183 THEN 'NotCurrent'
    ELSE 'Current'
  END Currency,
  COUNT(*) n
FROM YearEndIndividuals I
  LEFT JOIN SOCurrency C ON (I.Individual=C.Individual)
    AND (I.YearEnd=C.YearEnd)
GROUP BY I.YearEnd,
  CASE
    WHEN C.Currency IS NULL THEN 'Undefined'
    WHEN C.Currency>183 THEN 'NotCurrent'
    ELSE 'Current'
  END
ORDER BY I.YearEnd,
  CASE
    WHEN C.Currency IS NULL THEN 'Undefined'
    WHEN C.Currency>183 THEN 'NotCurrent'
    ELSE 'Current'
  END
--endregion
--
--
--region Granularity Rule
WITH NumberedVisits AS (
  SELECT Location, CensusRound, ObservationDate,
    ROW_NUMBER() OVER(PARTITION BY Location, CensusRound ORDER BY ObservationDate) RowNum,
    COUNT(*) OVER(PARTITION BY Location, CensusRound) AS Cnt
  FROM dbo.Observations
  WHERE CensusRound BETWEEN 1 AND 21
    AND ObservationDate BETWEEN '20000101' AND '20091231'
),
MidVisits AS (
  SELECT Location, CensusRound,
    CAST(ObservationDate AS float) fDate,
    RowNum, Cnt
  FROM NumberedVisits
  WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
),
MedianVisitDate AS (
  SELECT Location, CensusRound,
    AVG(fDate) mDate
  FROM MidVisits
  GROUP BY Location, CensusRound
),
MedianVisits AS (
  SELECT Location, CensusRound,
    CONVERT(datetime,mDate) MedianDate
  FROM MedianVisitDate
),
VisitGaps AS (
  SELECT R1.Location,
    R1.CensusRound Rn,
    R2.CensusRound Rnn,
    DATEDIFF(day,R1.MedianDate,R2.MedianDate) Granularity
  FROM MedianVisits R1
    JOIN MedianVisits R2 ON (R1.Location=R2.Location)
      AND (R1.CensusRound=R2.CensusRound-1)
)
SELECT Rn,Rnn,Granularity,COUNT(*) n
FROM VisitGaps
GROUP BY Rn,Rnn,Granularity
ORDER BY Rn,Rnn,Granularity
--
-- Quality indicator based on granularity
-- Gap should be within +/-15 days of 183 (twice-yearly rounds)
--
SELECT Rnn CensusRound,
  CASE
    WHEN Rnn IN (4,6,7,8) AND Granularity BETWEEN 107 AND 137 THEN 'InRange'
    WHEN Granularity BETWEEN 168 AND 198 THEN 'InRange'
    ELSE 'Outside'
  END Indicator,
  COUNT(*) n
FROM dbo.vLocationVisitGaps
GROUP BY Rnn,
  CASE
    WHEN Rnn IN (4,6,7,8) AND Granularity BETWEEN 107 AND 137 THEN 'InRange'
    WHEN Granularity BETWEEN 168 AND 198 THEN 'InRange'
    ELSE 'Outside'
  END
ORDER BY Rnn, Indicator
--endregion
--region Continuity Rule
WITH NumberedEpisodes AS (
  SELECT Individual,
    StartDate, InitiatingEventType,
    EndDate, TerminatingEventType,
    ROW_NUMBER() OVER(PARTITION BY Individual ORDER BY StartDate) RowNum
  FROM dbo.ResidentEpisodes
)
SELECT YEAR(E2.StartDate) Yr,
  CASE
    WHEN E2.InitiatingEventType<>'INT' THEN 'InvalidNext'
    WHEN ABS(DATEDIFF(day,E1.EndDate,E2.StartDate))>1 THEN 'Discontinuity'
    ELSE 'Continuity'
  END Continuity,
  COUNT(*) n
FROM NumberedEpisodes E1
  JOIN NumberedEpisodes E2 ON (E1.Individual=E2.Individual)
    AND (E1.RowNum=E2.RowNum-1)
WHERE E1.TerminatingEventType='INT'
GROUP BY YEAR(E2.StartDate),
  CASE
    WHEN E2.InitiatingEventType<>'INT' THEN 'InvalidNext'
    WHEN ABS(DATEDIFF(day,E1.EndDate,E2.StartDate))>1 THEN 'Discontinuity'
    ELSE 'Continuity'
  END
ORDER BY Yr, Continuity
--endregion
--region Timestamp pattern rule
WITH NumberedVisits AS (
  SELECT Location, CensusRound, ObservationDate,
    ROW_NUMBER() OVER(PARTITION BY Location, CensusRound ORDER BY ObservationDate) RowNum,
    COUNT(*) OVER(PARTITION BY Location, CensusRound) AS Cnt
  FROM dbo.Observations
  WHERE CensusRound BETWEEN 1 AND 21
    AND ObservationDate BETWEEN '20000101' AND '20091231'
),
MidVisits AS (
  SELECT Location, CensusRound,
    CAST(ObservationDate AS float) fDate,
    RowNum, Cnt
  FROM NumberedVisits
  WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
),
MedianVisitDate AS (
  SELECT Location, CensusRound,
    AVG(fDate) mDate
  FROM MidVisits
  GROUP BY Location, CensusRound
),
MedianVisits AS (
  SELECT Location, CensusRound,
    CONVERT(datetime,mDate) MedianDate
  FROM MedianVisitDate
),
Semesters AS (
  SELECT 1 AS Semester,
    CAST('20000101' AS datetime) SemStart,
    DATEADD(day,-1,DATEADD(quarter,2,'20000101')) SemEnd
  UNION ALL
  SELECT Semester+1 Semester,
    DATEADD(day,1,SemEnd) SemStart,
    DATEADD(day,-1,DATEADD(quarter,2,DATEADD(day,1,SemEnd))) SemEnd
  FROM Semesters
  WHERE SemStart<'20090701'
),
SemesterVisits AS (
  SELECT Location,Semester,COUNT(*) n
  FROM MedianVisits V
    JOIN Semesters ON (MedianDate>=SemStart) AND (MedianDate<=SemEnd)
  GROUP BY Location,Semester
)
SELECT *
FROM SemesterVisits
ORDER BY Location,Semester
--endregion
--endregion
--region Value Pattern Rule
--region Direction of Change
--
-- Example : Educational Attainment
--
WITH EducationStatus AS (
  SELECT Individual,ObservationDate,Years,
    ROW_NUMBER() OVER(PARTITION BY Individual ORDER BY ObservationDate) RowNum
  FROM dbo.StatusObservations SO
    JOIN dbo.Observations O ON (SO.Observation=O.Observation)
    JOIN dbo.EducationLevels E ON (SO.EducationLevel=E.EducationLevel)
  WHERE NOT E.Years IS NULL
)
SELECT
  CASE
    WHEN E2.Years>=E1.Years THEN 'Valid'
    ELSE 'Invalid'
  END Direction,
  COUNT(*) Measures
FROM EducationStatus E1
  JOIN EducationStatus E2 ON (E1.Individual=E2.Individual)
    AND (E1.RowNum=E2.RowNum-1)
GROUP BY
  CASE
    WHEN E2.Years>=E1.Years THEN 'Valid'
    ELSE 'Invalid'
  END
--
--endregion
--region Magnitude of Change
WITH EducationStatus AS (
  SELECT Individual,ObservationDate,Years,
    ROW_NUMBER() OVER(PARTITION BY Individual ORDER BY ObservationDate) RowNum
  FROM dbo.StatusObservations SO
    JOIN dbo.Observations O ON (SO.Observation=O.Observation)
    JOIN dbo.EducationLevels E ON (SO.EducationLevel=E.EducationLevel)
  WHERE NOT E.Years IS NULL
)
SELECT
  CASE
    WHEN E2.Years<E1.Years THEN 'Invalid Direction'
    WHEN (E2.Years-E1.Years)>DATEDIFF(year,E1.ObservationDate,E2.ObservationDate) THEN 'Invalid Magnitude'
    ELSE 'Valid'
  END Direction,
  COUNT(*) Measures
FROM EducationStatus E1
  JOIN EducationStatus E2 ON (E1.Individual=E2.Individual)
    AND (E1.RowNum=E2.RowNum-1)
GROUP BY
  CASE
    WHEN E2.Years<E1.Years THEN 'Invalid Direction'
    WHEN (E2.Years-E1.Years)>DATEDIFF(year,E1.ObservationDate,E2.ObservationDate) THEN 'Invalid Magnitude'
    ELSE 'Valid'
  END
--endregion
--endregion
--region Event History Rule
--region Event Dependencies
--
-- Out-migration must not be immediately preceded by Death, Visit or Out-migration
--
WITH Events AS (
  SELECT Individual,
    InitiatingEventType Event,
    StartDate EventDate,
    ResidentEpisode
  FROM dbo.ResidentEpisodes
  WHERE StartDate<>EndDate
  UNION ALL
  SELECT Individual,
    TerminatingEventType Event,
    EndDate EventDate,
    ResidentEpisode
  FROM dbo.ResidentEpisodes
  WHERE StartDate<>EndDate
),
NumberedEvents AS (
  SELECT Individual, Event,EventDate,
    ROW_NUMBER() OVER(PARTITION BY Individual ORDER BY EventDate, ResidentEpisode) RowNum
  FROM Events
)
SELECT
  CASE
    WHEN E1.Event IN ('DTH','OTM','VIS') THEN 'Incorrect'
    ELSE 'Correct'
  END Dependency,
  --E1.Event,
  COUNT(*) n
FROM NumberedEvents E1
  JOIN NumberedEvents E2 ON (E1.Individual=E2.Individual)
    AND (E1.RowNum=E2.RowNum-1)
WHERE E2.Event='OTM'
GROUP BY
  --E1.Event
  CASE
    WHEN E1.Event IN ('DTH','OTM','VIS') THEN 'Incorrect'
    ELSE 'Correct'
  END
--endregion
--region Event Conditions
--
-- Pregnancies with live births should be spaced by at least 9 months (280 days)
--
WITH NumberedPregnancies AS (
  SELECT Individual,EndDate DeliveryDate,
    ROW_NUMBER()
      OVER(PARTITION BY Individual ORDER BY EndDate) RowNum
  FROM dbo.Pregnancies
  WHERE LiveBorn>0
)
SELECT
  CASE
    WHEN DATEDIFF(day,P1.DeliveryDate,P2.DeliveryDate)<280 THEN 'TooShort'
    ELSE 'Valid'
  END BirthSpacing,
  COUNT(*) Pregnancies
FROM NumberedPregnancies P1
  JOIN NumberedPregnancies P2 ON (P1.Individual=P2.Individual)
    AND (P1.RowNum=P2.RowNum-1)
GROUP BY
  CASE
    WHEN DATEDIFF(day,P1.DeliveryDate,P2.DeliveryDate)<280 THEN 'TooShort'
    ELSE 'Valid'
  END
--endregion
--region Event-specific attribute constraints
-- Mother's age at the end of pregnancy should be between 15 and 49 years
SELECT
  CASE
    WHEN dbo.fnacAgeYears(I.DoB,P.EndDate) BETWEEN 15 AND 49 THEN 'Valid'
    ELSE 'Invalid'
  END MaternalAge,
  COUNT(*) Pregnancies
FROM dbo.Pregnancies P
  JOIN dbo.Individuals I ON (P.Individual=I.Individual)
GROUP BY
  CASE
    WHEN dbo.fnacAgeYears(I.DoB,P.EndDate) BETWEEN 15 AND 49 THEN 'Valid'
    ELSE 'Invalid'
  END
--endregion
--endregion
--endregion
--
--region Rules for state-dependent objects
--region State domain constraint
--endregion
--region Action domain constraint
--endregion
--region Terminator domain constraint
SELECT ToState,Action,COUNT(*) n
FROM dbo.udfStateTransitions('20000101')
WHERE Transition=1
GROUP BY ToState,Action
ORDER BY ToState,Action
--endregion
--region State-transition constraints
--
-- Individuals with invalid end states
--
WITH LastTransition AS (
  SELECT Individual,MAX(Transition) LastTransition
  FROM dbo.udfStateTransitions('20000101')
  GROUP BY Individual
)
SELECT
  CASE
    WHEN ToState='INV' THEN 'Invalid'
    ELSE 'Valid'
  END Quality,
  COUNT(*) Individuals
FROM dbo.udfStateTransitions('20000101') T
  JOIN LastTransition LT ON (T.Individual=LT.Individual)
    AND (T.Transition=LT.LastTransition)
GROUP BY
  CASE
    WHEN ToState='INV' THEN 'Invalid'
    ELSE 'Valid'
  END
--
-- Invalid transitions
--
SELECT
  CASE
    WHEN ToState='INV' THEN 'Invalid'
    ELSE 'Valid'
  END Quality,
  COUNT(*) Transitions
FROM dbo.udfStateTransitions('20000101') T
GROUP BY
  CASE
    WHEN ToState='INV' THEN 'Invalid'
    ELSE 'Valid'
  END
--
-- Breakdown of invalid transitions
--
SELECT InvalidReason,Action,
  COUNT(*) n
FROM dbo.udfStateTransitions('20000101') T
WHERE ToState='INV'
GROUP BY InvalidReason, Action
ORDER BY InvalidReason, Action
--
-- Breakdown by surveillance round
--
SELECT O.CensusRound,
  SUM(CASE WHEN ToState='INV' THEN 1 ELSE 0 END) Invalid,
  SUM(CASE WHEN ToState='INV' THEN 0 ELSE 1 END) Valid,
  COUNT(*) Transitions
FROM dbo.udfStateTransitions('20000101') T
  JOIN dbo.Observations O ON (T.Observation=O.Observation)
GROUP BY O.CensusRound
ORDER BY O.CensusRound
--endregion
--region State-action constraints
--endregion
--region Continuity rules
--endregion
--region Duration rules
--
-- Residency episode cannot be of zero or negative duration
--
SELECT YEAR(StartDate) Yr,
  CASE
    WHEN DATEDIFF(day,StartDate,EndDate)>0 THEN 'Valid'
    ELSE 'Invalid'
  END Duration,
  COUNT(*) Episodes
FROM dbo.ResidentEpisodes
GROUP BY YEAR(StartDate),
  CASE
    WHEN DATEDIFF(day,StartDate,EndDate)>0 THEN 'Valid'
    ELSE 'Invalid'
  END
ORDER BY Yr,Duration
--endregion
--region Action pre-conditions
WITH ResidentBabies AS (
  SELECT
    Individual Baby,StartDate
  FROM dbo.ResidentEpisodes
  WHERE InitiatingEventType='DLV'
),
ResidentBabyMothers AS (
  SELECT DISTINCT B.Baby,
    I.MotherID Mother
  FROM dbo.Individuals I
    JOIN ResidentBabies B ON (I.Individual=B.Baby)
  WHERE NOT MotherID IS NULL
)
SELECT
  CASE
    WHEN BM.Baby IS NULL THEN 'Mother unknown'
    WHEN RE.ResidentEpisode IS NULL THEN 'Mother non-resident'
    ELSE 'Mother resident'
  END Mothers,
  COUNT(*) Babies
FROM ResidentBabies B
  LEFT JOIN ResidentBabyMothers BM ON (B.Baby=BM.Baby)
  LEFT JOIN dbo.ResidentEpisodes RE ON (BM.Mother=RE.Individual)
    AND (RE.StartDate<=B.StartDate)
    AND (RE.EndDate>=B.StartDate)
GROUP BY
  CASE
    WHEN BM.Baby IS NULL THEN 'Mother unknown'
    WHEN RE.ResidentEpisode IS NULL THEN 'Mother non-resident'
    ELSE 'Mother resident'
  END
--endregion
--region Action post-conditions
--endregion
--endregion
--
--region General attribute dependency rules
--region Redundant attributes
--
-- Are all cases where residency is started by birth
-- which is linked to a pregnancy and then to the mother,
-- also reflected in the MotherID link of the child?
--
WITH DirectMCLink AS ( --76898 pairs
  SELECT MotherID,
    Individual ChildID
  FROM dbo.Individuals
  WHERE NOT MotherID IS NULL
),
IndirectMCLink AS ( --15747
  SELECT DISTINCT P.Individual MotherID,
    R.Individual ChildID
  FROM dbo.Pregnancies P
    JOIN dbo.Births B ON (P.Pregnancy=B.Pregnancy)
    JOIN dbo.ResidentEpisodes R ON (B.ResidentEpisode=R.ResidentEpisode)
)
SELECT
  CASE
    WHEN D.MotherID IS NULL THEN 'Not linked'
    ELSE 'Linked'
  END Link,
  COUNT(*) Pairs
FROM IndirectMCLink I
  LEFT JOIN DirectMCLink D ON (I.MotherID=D.MotherID)
    AND (I.ChildID=D.ChildID)
GROUP BY
  CASE
    WHEN D.MotherID IS NULL THEN 'Not linked'
    ELSE 'Linked'
  END
--
-- Of the children born to the mother while she was resident,
-- are all such children recorded as resident
-- and the residency start marked as Birth?
WITH MotherBirths AS ( --21907
  SELECT MotherID,
    Individual ChildID,
    DoB
  FROM dbo.Individuals
  WHERE NOT MotherID IS NULL
    AND DoB>='20000101' -- After start of DSS
),
BirthsDuringResidency AS ( --15798
  SELECT B.*
  FROM MotherBirths B
    JOIN dbo.ResidentEpisodes R ON (R.Individual=B.MotherID)
      AND (B.DoB>=R.StartDate)
      AND (B.DoB<=R.EndDate)
),
ResidenciesFromBirth AS ( --16430
  SELECT Individual ChildID
  FROM dbo.Births B
    JOIN dbo.ResidentEpisodes R ON (B.ResidentEpisode=R.ResidentEpisode)
)
SELECT
  CASE
    WHEN A.ChildID IS NULL THEN 'Birth not linked'
    WHEN B.ChildID IS NULL THEN 'Birth not resident'
    ELSE 'Consistent'
  END Link,
  COUNT(*) Pairs
FROM BirthsDuringResidency A
  FULL JOIN ResidenciesFromBirth B ON (A.ChildID=B.ChildID)
GROUP BY
  CASE
    WHEN A.ChildID IS NULL THEN 'Birth not linked'
    WHEN B.ChildID IS NULL THEN 'Birth not resident'
    ELSE 'Consistent'
  END
ORDER BY Link
--endregion
--region Derived Attributes
--
-- Data should satisfy the demographic equation
--
-- Resident Population at start of year
SELECT 'Population' AS Component,
  SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN StartDate<'20020101' AND EndDate>='20020101' THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN StartDate<'20030101' AND EndDate>='20030101' THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN StartDate<'20040101' AND EndDate>='20040101' THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN StartDate<'20050101' AND EndDate>='20050101' THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN StartDate<'20060101' AND EndDate>='20060101' THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN StartDate<'20070101' AND EndDate>='20070101' THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN StartDate<'20080101' AND EndDate>='20080101' THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN StartDate<'20090101' AND EndDate>='20090101' THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
UNION
SELECT 'Start observation' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DSS'
UNION
SELECT 'Births' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DLV'
UNION
SELECT 'Immigration' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='INM' OR InitiatingEventType='HMS'
UNION
SELECT 'Deaths' AS Component,
  SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='DTH'
UNION
SELECT 'Emigration' AS Component,
  SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='OTM' OR TerminatingEventType='HDS'
--
-- Taking into account contextual factors, such as change in DSS boundary
--
WITH CensoredEpisodes AS (
  SELECT
    CASE
      WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN 'DSS'
      ELSE R.InitiatingEventType
    END InitiatingEventType,
    CASE
      WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN '20061001'
      ELSE R.StartDate
    END StartDate,
    R.EndDate,
    R.TerminatingEventType
  FROM dbo.ResidentEpisodes R
    JOIN dbo.Locations L ON (R.Location=L.Location)
)
SELECT 'Population' AS Component,
  SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN StartDate<'20020101' AND EndDate>='20020101' THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN StartDate<'20030101' AND EndDate>='20030101' THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN StartDate<'20040101' AND EndDate>='20040101' THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN StartDate<'20050101' AND EndDate>='20050101' THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN StartDate<'20060101' AND EndDate>='20060101' THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN StartDate<'20070101' AND EndDate>='20070101' THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN StartDate<'20080101' AND EndDate>='20080101' THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN StartDate<'20090101' AND EndDate>='20090101' THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
UNION
SELECT 'Start observation' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DSS'
UNION
SELECT 'Births' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DLV'
UNION
SELECT 'Immigration' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='INM' OR InitiatingEventType='HMS'
UNION
SELECT 'Deaths' AS Component,
  SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='DTH'
UNION
SELECT 'Emigration' AS Component,
  SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='OTM' OR TerminatingEventType='HDS'
--
-- Taking into account contextual factors and loss to follow-up
--
WITH CensoredEpisodes AS (
  SELECT
    CASE
      WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN 'DSS'
      ELSE R.InitiatingEventType
    END InitiatingEventType,
    CASE
      WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN '20061001'
      ELSE R.StartDate
    END StartDate,
    R.EndDate,
    R.TerminatingEventType
  FROM dbo.ResidentEpisodes R
    JOIN dbo.Locations L ON (R.Location=L.Location)
)
SELECT 'Population' AS Component,
  SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN StartDate<'20020101' AND EndDate>='20020101' THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN StartDate<'20030101' AND EndDate>='20030101' THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN StartDate<'20040101' AND EndDate>='20040101' THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN StartDate<'20050101' AND EndDate>='20050101' THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN StartDate<'20060101' AND EndDate>='20060101' THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN StartDate<'20070101' AND EndDate>='20070101' THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN StartDate<'20080101' AND EndDate>='20080101' THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN StartDate<'20090101' AND EndDate>='20090101' THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
UNION
SELECT 'Start observation' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DSS'
UNION
SELECT 'Births' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DLV'
UNION
SELECT 'Immigration' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='INM' OR InitiatingEventType='HMS'
UNION
SELECT 'Deaths' AS Component,
  SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
    SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
    SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
    SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
    SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='DTH'
UNION
SELECT 'Emigration' AS Component,
    SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
    SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
    SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
    SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
    SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
    SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
    SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
    SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
    SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
    SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='OTM' OR TerminatingEventType='HDS'
UNION
SELECT 'Loss to Follow-up' AS Component,
    SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
    SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
    SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
    SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
    SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
    SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
    SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
    SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
    SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
    SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='VIS'
--
-- Find 705 people present in 2001 in excess of expectations
--
WITH PresentIn2001 AS (
    SELECT DISTINCT Individual
    FROM dbo.ResidentEpisodes
    WHERE StartDate<'20010101' AND EndDate>='20010101'
),
CameIn2000 AS (
    SELECT DISTINCT Individual
    FROM dbo.ResidentEpisodes
    WHERE YEAR(StartDate)=2000
        AND InitiatingEventType IN ('DSS','DLV','INM')
),
LeftIn2000 AS (
    SELECT DISTINCT Individual
    FROM dbo.ResidentEpisodes
    WHERE YEAR(EndDate)=2000
        AND TerminatingEventType IN ('DTH','VIS','OTM')
)
SELECT A.Individual
FROM PresentIn2001 A
    JOIN LeftIn2000 B ON (A.Individual=B.Individual)
--
SELECT *
FROM dbo.ResidentEpisodes
WHERE Individual=56179
--endregion
--region Partially Dependant Attributes
SELECT D.DeathCause, C.Description, COUNT(*) n
FROM dbo.Deaths D
    JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY D.DeathCause,C.Description
ORDER BY n DESC
--
SELECT I.Sex, COUNT(*) n
FROM dbo.Individuals I
    JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
    JOIN dbo.Deaths D ON (R.ResidentEpisode=D.ResidentEpisode)
WHERE DeathCause IN ('C53','C50','C55','O72','O85','O15','O14','O75','C56','C57')
GROUP BY I.Sex
--
SELECT I.Sex,C.Description
FROM dbo.Individuals I
    JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
    JOIN dbo.Deaths D ON (R.ResidentEpisode=D.ResidentEpisode)
    JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
WHERE DeathCause IN ('C53','C50','C55','O72','O85','O15','O14','O75','C56','C57')
    AND I.Sex='MAL'
--endregion
--endregion
Procedures, Views and User-defined Functions

CREATE FUNCTION dbo.udfStateTransitions(@DSSStart datetime)
RETURNS @Transitions TABLE (
    [RecID] int IDENTITY (1, 1) NOT NULL PRIMARY KEY NONCLUSTERED,
    Individual int NOT NULL,
    Transition int NOT NULL,
    FromState char(3) NOT NULL,
        --NUS (Not under surveillance)
        --SLK (under surveillance location known)
        --SLU (under surveillance location unknown)
        --DTH (Death)
        --INV (Invalid state)
    ToState char(3) NOT NULL,
    Action char(3) NOT NULL,
        --DSS (Surveillance Start),
        --INM (Inmigration),
        --DLV (Delivery),
        --INT (Internal migration),
        --DTH (Death),
        --OTM (Outmigration),
        --VIS (Visit),
        --INV (Invalid action)
    TransitionDate datetime NOT NULL,
    Observation int NOT NULL,
    InvalidReason varchar(80) NULL)
AS
BEGIN
    DECLARE @Individual int
    DECLARE @DoB datetime
    DECLARE @InitiatingEventType char(3)
    DECLARE @StartDate datetime
    DECLARE @TerminatingEventType char(3)
    DECLARE @EndDate datetime
    DECLARE @Transition int
    DECLARE @NextState char(3)
    DECLARE @CurrentState char(3)
    DECLARE @LastEvent char(3)
    DECLARE @LastIndividual int
    DECLARE @LastDate datetime
    DECLARE @FirstObservation int
    DECLARE @LastObservation int
    DECLARE C CURSOR LOCAL FAST_FORWARD FOR
        SELECT I.Individual,DoB,
            InitiatingEventType,StartDate,FirstObservation,
            TerminatingEventType,R.EndDate,LastObservation
        FROM dbo.Individuals I
            JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
        ORDER BY Individual,StartDate,ResidentEpisode;
    OPEN C;
    SET @LastIndividual=-1;
    FETCH C INTO @Individual, @DoB, @InitiatingEventType, @StartDate, @FirstObservation,
        @TerminatingEventType, @EndDate, @LastObservation
    WHILE (@@FETCH_STATUS=0)
    BEGIN
        IF (@LastIndividual<>@Individual)
        BEGIN --next individual
            SET @CurrentState='NUS';
            SET @LastDate=@DoB;
            SET @Transition=0;
            SET @LastIndividual=@Individual
        END;
        -- Do start event transition
        SET @Transition = @Transition+1;
        IF (@CurrentState='NUS')
        BEGIN
            IF (@InitiatingEventType='DSS' AND @StartDate=@DSSStart AND @Transition=1 AND @LastDate<=@StartDate)
            BEGIN
                SET @NextState='SLK';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate, @FirstObservation);
            END
            ELSE IF (@InitiatingEventType='INM' AND @StartDate>@DSSStart AND @LastDate<=@StartDate)
            BEGIN
                SET @NextState='SLK';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate, @FirstObservation);
            END
            ELSE IF (@InitiatingEventType='DLV' AND @StartDate=@DoB AND @Transition=1 AND @LastDate<=@StartDate)
            BEGIN
                SET @NextState='SLK';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate, @FirstObservation);
            END
            ELSE IF (@InitiatingEventType IN ('INT','DTH','OTM','VIS'))
            BEGIN
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Action disallowed if not under surveillance', @FirstObservation);
            END
            ELSE IF (@InitiatingEventType IN ('DSS','INM','DLV'))
            BEGIN --Invalid action condition
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Action condition violated', @FirstObservation);
            END
            ELSE IF (@LastDate>@StartDate)
            BEGIN
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Temporal integrity violated', @FirstObservation);
            END
            ELSE
            BEGIN --Invalid event
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Invalid action', @FirstObservation);
            END;
        END;
        IF (@CurrentState='SLK')
        BEGIN
            IF (@InitiatingEventType IN ('VIS','OTM','DTH'))
            BEGIN
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Action cannot start a residency', @FirstObservation);
            END
            ELSE IF (@InitiatingEventType IN ('INT','INM','DLV','DSS'))
            BEGIN
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Action cannot start a residency if already at known location', @FirstObservation);
            END
            ELSE
            BEGIN
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Invalid action', @FirstObservation);
            END
        END;
        IF (@CurrentState='SLU')
        BEGIN
            IF (@InitiatingEventType='INT' AND @LastDate<=@StartDate)
            BEGIN
                SET @NextState='SLK';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate, @FirstObservation);
            END
            ELSE IF (@InitiatingEventType IN ('VIS','OTM','DTH'))
            BEGIN
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Action cannot start a residency', @FirstObservation);
            END
            ELSE IF (@InitiatingEventType IN ('INM','DLV','DSS'))
            BEGIN
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Action cannot start a residency if at unknown location', @FirstObservation);
            END
            ELSE IF (@LastDate>@StartDate)
            BEGIN
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Temporal integrity violated', @FirstObservation);
            END
            ELSE
            BEGIN
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Invalid action', @FirstObservation);
            END
        END;
        IF (@CurrentState='DTH')
        BEGIN
            SET @NextState='INV';
            INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
            VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'No transitions after terminating state', @FirstObservation);
        END;
        SET @LastDate=@StartDate;
        SET @CurrentState=@NextState;
        SET @Transition=@Transition+1;
        -- Do end event transition
        IF (@CurrentState='NUS')
        BEGIN
            SET @NextState='INV';
            INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
            VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,'Cannot be not under surveillance before residency end', @LastObservation);
        END;
        IF (@CurrentState='SLK')
        BEGIN
            IF (@TerminatingEventType='INT' AND @LastDate<@EndDate)
            BEGIN
                SET @NextState='SLU';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate, @LastObservation);
            END
            ELSE IF (@TerminatingEventType='OTM' AND @LastDate<@EndDate)
            BEGIN
                SET @NextState='NUS';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate, @LastObservation);
            END
            ELSE IF (@TerminatingEventType='VIS' AND @LastDate<=@EndDate)
            BEGIN
                SET @NextState='SLK';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate, @LastObservation);
            END
            ELSE IF (@TerminatingEventType='DTH' AND @LastDate<=@EndDate)
            BEGIN
                SET @NextState='DTH';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate, @LastObservation);
            END
            ELSE IF (@TerminatingEventType IN ('INM','DSS','DLV'))
            BEGIN
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,'Action cannot end a residency', @LastObservation);
            END
            ELSE IF (@LastDate>=@EndDate)
            BEGIN
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,'Temporal integrity violated', @LastObservation);
            END
            ELSE
            BEGIN
                SET @NextState='INV';
                INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
                VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,'Invalid action', @LastObservation);
            END
        END;
        IF (@CurrentState='SLU')
        BEGIN
            SET @NextState='INV';
            INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
            VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,'Cannot be at unknown location before residency end', @LastObservation);
        END;
        IF (@CurrentState='DTH')
        BEGIN
            SET @NextState='INV';
            INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason, Observation)
            VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,'Cannot be dead before residency end', @LastObservation);
        END;
        SET @LastDate=@EndDate;
        SET @CurrentState=@NextState;
        FETCH C INTO @Individual, @DoB, @InitiatingEventType, @StartDate, @FirstObservation,
            @TerminatingEventType, @EndDate, @LastObservation
    END;
    CLOSE C;
    DEALLOCATE C;
    RETURN
END
CREATE FUNCTION dbo.udfSeekDuplicates ()
RETURNS @Duplicates TABLE (
    IndA int,
    IndB int,
    Similarity int)
AS
BEGIN
    DECLARE @Individual int
    DECLARE @LastName varchar(50)
    DECLARE @FirstName varchar(50)
    DECLARE @Sex char(3)
    DECLARE @DoB datetime
    DECLARE C CURSOR LOCAL FAST_FORWARD FOR
        SELECT Individual,LastName,FirstName,Sex,DoB
        FROM dbo.Individuals
        ORDER BY Individual
    OPEN C;
    FETCH C INTO @Individual,@LastName,@FirstName,@Sex,@DoB;
    WHILE (@@FETCH_STATUS=0)
    BEGIN
        INSERT INTO @Duplicates
        SELECT @Individual,Individual,
            dbo.fnacLevenshtein(@LastName,LastName)+
            dbo.fnacLevenshtein(@FirstName,FirstName)+
            CASE WHEN @Sex=Sex THEN 0 ELSE 1 END +
            ABS(YEAR(@DoB)-YEAR(DoB)) +
            ABS(MONTH(@DoB)-MONTH(DoB)) +
            ABS(DAY(@DoB)-DAY(DoB))
        FROM dbo.Individuals
        WHERE @Individual<Individual -- Do not re-evaluate inverse
            AND ABS(DATEDIFF(day,@DoB,DoB))<366
            AND dbo.fnacLevenshtein(@LastName,LastName)<10
            AND dbo.fnacLevenshtein(@FirstName,FirstName)<5
        FETCH C INTO @Individual,@LastName,@FirstName,@Sex,@DoB;
    END;
    CLOSE C;
    DEALLOCATE C;
    RETURN
END
CREATE VIEW dbo.vLocationVisitGaps
AS
WITH NumberedVisits AS (
    SELECT Location, CensusRound, ObservationDate,
        ROW_NUMBER() OVER(PARTITION BY Location, CensusRound ORDER BY ObservationDate) RowNum,
        COUNT(*) OVER(PARTITION BY Location, CensusRound) AS Cnt
    FROM dbo.Observations
    WHERE CensusRound BETWEEN 1 AND 21
        AND ObservationDate BETWEEN '20000101' AND '20091231'
),
MidVisits AS (
    SELECT Location, CensusRound,
        CAST(ObservationDate AS float) fDate,
        RowNum, Cnt
    FROM NumberedVisits
    WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
),
MedianVisitDate AS (
    SELECT Location, CensusRound, AVG(fDate) mDate
    FROM MidVisits
    GROUP BY Location, CensusRound
),
MedianVisits AS (
    SELECT Location, CensusRound,
        CONVERT(datetime,mDate) MedianDate
    FROM MedianVisitDate
)
SELECT R1.Location,
    R1.CensusRound Rn,
    R2.CensusRound Rnn,
    DATEDIFF(day,R1.MedianDate,R2.MedianDate) Granularity
FROM MedianVisits R1
    JOIN MedianVisits R2 ON (R1.Location=R2.Location)
        AND (R1.CensusRound=R2.CensusRound-1)