IPUMS-International Integration Process
DATA
METADATA
Data files
Data dictionary
Enumeration forms
Enum. instructions
Sample information
Batch samples
Reformat data
Donation
Draw sample
Confidentiality A
Translate to English
Images to editable files
Ipums data dictionary
Code clean-up
Verify data
Confidentiality B
Tag enumeration text
Document unharmonized variables
Harmonize codes
Variable programming
Constructed variables
Variable descriptions
Sample design
1. Input material
2. Pre-processing
3. Standardization
4. Integration
Batch Samples
In spring we identify the samples to integrate the following year.
Samples are processed as a group, one batch per year. The entire batch is processed through each stage before we proceed to the next stage.
There is little flexibility in the work process. If a sample is not available for processing during the earliest stages of integration, it cannot be included in the data release for that year.
Original Input Data
Some examples of differing file formats:
• SPSS and SAS system files
• Redatam-format
• IMPS format
• Records that combine household and person characteristics
• Separate files for persons, households (and dwellings, buildings)
• Different types of records (mortality or migration)
• Separate files for different administrative units
Reformatting: Rectangular Sample
(Brazil 1980)
(Person records only; household data duplicated on person records)
Original Data File:
  geography housing person (head)
  geography housing person (child)
  geography housing person (child)
  geography housing person (head)
  geography housing person (spouse)
  geography housing person (child)
  geography housing person (child)
Data File after Reformatting:
  geography housing
    person (head)
    person (child)
    person (child)
  geography housing
    person (head)
    person (spouse)
    person (child)
    person (child)
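The rectangular-to-hierarchical step can be sketched as follows. This is a minimal illustration, not the IPUMS implementation: the field names and the household identifier "hh" are hypothetical, and the sketch assumes the file is already sorted so that members of a household are adjacent.

```python
from itertools import groupby

def to_hierarchical(records):
    """Split rectangular person records into household + person records.

    Each input record is a dict with a hypothetical household id ("hh"),
    the duplicated household fields ("geography", "housing"), and "person".
    """
    out = []
    # groupby only merges adjacent records, so input must be sorted by household
    for hh, group in groupby(records, key=lambda r: r["hh"]):
        members = list(group)
        first = members[0]
        # Household data are written once, taken from the first person record
        out.append(("H", hh, first["geography"], first["housing"]))
        out.extend(("P", hh, r["person"]) for r in members)
    return out

rectangular = [
    {"hh": 1, "geography": "g1", "housing": "h1", "person": "head"},
    {"hh": 1, "geography": "g1", "housing": "h1", "person": "child"},
    {"hh": 2, "geography": "g2", "housing": "h2", "person": "head"},
    {"hh": 2, "geography": "g2", "housing": "h2", "person": "spouse"},
]
hierarchical = to_hierarchical(rectangular)
```

Each household now appears once, followed by its person records.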
Reformatting: Dwelling-Household-Person Sample
(Chile 1992)
(Separate dwelling and household records)
Original Data File:
  dwelling
    household
      person (head)
      person (spouse)
      person (child)
    household
      person (head)
      person (child)
  dwelling
    household
      person (head)
      person (spouse)
Data File after Reformatting:
  dwelling household
    person (head)
    person (spouse)
    person (child)
  dwelling household
    person (head)
    person (child)
  dwelling household
    person (head)
    person (spouse)
Reformatting: Merge Household and Person Files
(Brazil 2000)
Household File:
  serial 001 geog & housing
  serial 002 geog & housing
  serial 003 geog & housing
Person File:
  serial 001 head
  serial 001 spouse
  serial 002 head
  serial 002 child
  serial 003 head
Data File after Reformatting:
  serial 001 household
  serial 001 head
  serial 001 spouse
  serial 002 household
  serial 002 head
  serial 002 child
  serial 003 household
  serial 003 head
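A merge of this kind can be sketched as below. The record layout (dicts with "serial", "data", "role") is an assumption made for illustration. Reporting unmatched records rather than writing them out is a design choice of this sketch, not a documented IPUMS rule.

```python
def merge_files(households, persons):
    """Interleave household and person records that share a serial number."""
    by_serial = {}
    for p in persons:
        by_serial.setdefault(p["serial"], []).append(p)

    merged, unmatched_households = [], []
    for h in sorted(households, key=lambda h: h["serial"]):
        members = by_serial.pop(h["serial"], [])
        if not members:
            # Household with no person records: report it instead of writing it
            unmatched_households.append(h["serial"])
            continue
        merged.append(("H", h["serial"], h["data"]))
        merged.extend(("P", p["serial"], p["role"]) for p in members)

    # Anything left in by_serial had no household record
    unmatched_persons = sorted(by_serial)
    return merged, unmatched_households, unmatched_persons

households = [{"serial": s, "data": "geog & housing"} for s in ("001", "002", "003")]
persons = [
    {"serial": "001", "role": "head"}, {"serial": "001", "role": "spouse"},
    {"serial": "002", "role": "head"}, {"serial": "002", "role": "child"},
    {"serial": "003", "role": "head"},
]
merged, no_persons, no_household = merge_files(households, persons)
```

The two "unmatched" lists correspond to the structural problems (unmatched person and household records) that are tested for in the next stage.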
Reformatting: Persons not Organized in Households
(Mexico 1960)
(Individuals only; not organized in households)
Original Data File: five records, each laid out as "geog person housing geog person".
Data File after Reformatting: five household records and five person records.
Donation and Error Correction
Data are tested for errors that affect structural integrity, such as merged households, unmatched person and household records, corrupted records, etc. Such errors often do not affect tabulations, but create inconsistencies across records within households that affect sophisticated analyses.
• Some problems can be resolved with custom programming.
• Other problems are resolved by donating (substituting) a donor household for the corrupted one.
Households are divided into strata based on predictor variables. Donors are drawn from the same stratum as the corrupted household, ensuring they share key characteristics.
If a sample is drawn from the full census, a substitute donor record is used; if we are already starting with a sample, the donor record is duplicated. A flag indicates that a record was duplicated.
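Donor selection within a stratum can be sketched as follows. The predictor variables ("region", "size"), the field names, and the random choice among candidates are assumptions for illustration. The sketch covers the case where the input is already a sample, so the donor is duplicated and flagged.

```python
import random

def pick_donor(corrupted, households, strata_keys, rng):
    """Return a flagged copy of a donor from the corrupted household's stratum."""
    stratum = tuple(corrupted[k] for k in strata_keys)
    candidates = [
        h for h in households
        if h["id"] != corrupted["id"]
        and not h["corrupted"]
        and tuple(h[k] for k in strata_keys) == stratum
    ]
    if not candidates:
        return None  # no donor available in this stratum
    donor = dict(rng.choice(candidates))  # copy, so the original stays in place
    donor["donated"] = True               # flag marking the duplicated record
    return donor

households = [
    {"id": 1, "region": "north", "size": 4, "corrupted": True},
    {"id": 2, "region": "north", "size": 4, "corrupted": False},
    {"id": 3, "region": "south", "size": 4, "corrupted": False},
    {"id": 4, "region": "north", "size": 2, "corrupted": False},
]
donor = pick_donor(households[0], households, ("region", "size"), random.Random(0))
```

Only household 2 shares both predictor values with the corrupted household, so it is the only possible donor here.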
Drawing a Sample
About one-third of IPUMS samples are drawn from full-count data.
After reformatting, we draw a systematic sample of every Nth dwelling to yield the desired sample density – typically 10%.
If the input data are not full-count (for example, they include only the long-form records), the sample design might have to account for differing sample densities between areas.
Very large dwelling units (over 30 persons) are sampled at the individual level – not as intact units – in order to reduce sampling error. Every Nth individual is taken.
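The sampling rules above can be sketched as follows, assuming a 1-in-10 interval and a random start. The dwelling layout (a dict with "id" and "persons") is hypothetical.

```python
import random

def systematic_sample(items, n, start):
    """Take every n-th item, beginning at offset `start` (0 <= start < n)."""
    return items[start::n]

def draw_sample(dwellings, n=10, large=30, rng=None):
    """Sample ordinary dwellings as intact units and large ones by individual."""
    rng = rng or random.Random(0)
    small = [d for d in dwellings if len(d["persons"]) <= large]
    # Persons in very large dwellings (over `large` persons) are pooled
    big_persons = [
        (d["id"], p)
        for d in dwellings if len(d["persons"]) > large
        for p in d["persons"]
    ]
    sampled_dwellings = systematic_sample(small, n, rng.randrange(n))
    sampled_persons = systematic_sample(big_persons, n, rng.randrange(n))
    return sampled_dwellings, sampled_persons

# 100 ordinary dwellings of 3 persons each, plus one dwelling with 40 persons
dwellings = [{"id": i, "persons": ["p"] * 3} for i in range(100)]
dwellings.append({"id": 100, "persons": [f"p{j}" for j in range(40)]})
sampled_dwellings, sampled_persons = draw_sample(dwellings)
```

With a 1-in-10 interval this yields 10 intact dwellings and 4 individuals from the large dwelling, whatever the random start.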
Confidentiality Measures: A
Swap a small percentage of cases between geographic areas.
Reorder households within geographic areas.
Suppress low-level geographic variables.
Suppress any variable deemed too sensitive by the National Statistical Office.
Encrypt all versions of the data that predate the imposition of these confidentiality measures.
Code Clean-Up: Recoding Unharmonized Variables
Marital status (P 75)
CR840018  label             cos1984
0         NIU               B=Under age 10
1         Consensual union  1=Consensual union
2         Married           2=Married
3         Separated         3=Separated
4         Divorced          4=Divorced
5         Widowed           5=Widowed
6         Single            6=Single
9         Missing           0=[undocumented]
9         "                 8=[undocumented]
• Recode the input variables to conform to some basic standards for treatment of missing values, etc.
• Recode stray values into a consolidated missing category as appropriate.
• Convert non-numeric characters to numeric.
Most recoding is performed using a data translation matrix like the one shown for Marital Status in 1984 Costa Rica. If the recoding requires more complex logic, custom programming is used.
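A translation matrix of this kind behaves like a lookup table. The sketch below encodes the Costa Rica 1984 marital status matrix as a Python dict. Treating "B" as a literal input character and sending any unlisted value to the missing code are assumptions of the sketch.

```python
# Input value -> recoded value, following the CR840018 matrix
CR84_MARST = {
    "B": 0,  # NIU (under age 10); non-numeric input converted to numeric
    "1": 1,  # Consensual union
    "2": 2,  # Married
    "3": 3,  # Separated
    "4": 4,  # Divorced
    "5": 5,  # Widowed
    "6": 6,  # Single
    "0": 9,  # undocumented value -> Missing
    "8": 9,  # undocumented value -> Missing
}
MISSING = 9

def recode(raw, matrix=CR84_MARST, missing=MISSING):
    """Look up a raw input value; stray values fall into the missing category."""
    return matrix.get(raw, missing)

recoded = [recode(v) for v in ["B", "2", "8", "7"]]
```

Here "7" is not listed in the matrix, so it is consolidated into the missing category along with the undocumented values.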
Verify Data: Unharmonized Variables
Examine the marginal frequencies of every input variable.
Analyze the data universe for each variable – the population at risk of having a response. Determine the theoretical universe from enumeration materials or other documentation, then empirically determine any discrepancies from that universe.
Document the universe for each variable and any other observations.
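A universe check can be sketched as a comparison between the theoretical universe and the observed values. The example assumes marital status is asked of persons age 10 and over, with 0 as the not-in-universe (NIU) code, as in the Costa Rica 1984 matrix. The record layout is hypothetical.

```python
from collections import Counter

def check_universe(records, var, in_universe, niu):
    """Return marginal frequencies plus records that contradict the universe."""
    freqs = Counter(r[var] for r in records)
    # In the theoretical universe but coded NIU
    missing_in = [r for r in records if in_universe(r) and r[var] == niu]
    # Outside the theoretical universe but carrying a response
    valid_out = [r for r in records if not in_universe(r) and r[var] != niu]
    return freqs, missing_in, valid_out

records = [
    {"age": 46, "marst": 2},
    {"age": 8, "marst": 0},
    {"age": 7, "marst": 6},   # child with a marital status
    {"age": 30, "marst": 0},  # adult coded NIU
]
freqs, missing_in, valid_out = check_universe(
    records, "marst", in_universe=lambda r: r["age"] >= 10, niu=0
)
```

The two lists of discrepancies are what would be examined and documented for the variable.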
Confidentiality Measures: B
Recode geographic units to ensure small localities cannot be identified (typically those with fewer than 20,000 persons).
For recent censuses:
Identify cells that represent very small numbers of persons in the population. Code them to a residual category or combine them.
Top- or bottom-code continuous variables that have a long tail that could identify small subpopulations.
Suppress specific categories of variables as requested by the National Statistical Office.
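Two of these measures, top-coding and collapsing small cells, can be sketched as below. The cap, the minimum cell count, and the residual label are illustrative values, not thresholds used by IPUMS.

```python
from collections import Counter

def top_code(values, cap):
    """Replace every value above `cap` with `cap`."""
    return [min(v, cap) for v in values]

def collapse_small_cells(codes, min_count, residual):
    """Move categories with fewer than `min_count` cases into a residual code."""
    counts = Counter(codes)
    return [c if counts[c] >= min_count else residual for c in codes]

incomes = top_code([10, 250, 99], cap=99)
occupations = collapse_small_cells(["a", "a", "a", "b"], min_count=2, residual="other")
```

Bottom-coding works the same way with `max` and a floor value.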
Harmonize Codes: Translation Matrix for Marital Status
MARST Marital Status
                               China 1982       Colombia 1973        Kenya 1989     Mexico 1970            U.S.A. 1990
code  label                    CN82A403         CO73A411             KN89A413       MX70A402               US90A425
100   SINGLE/NEVER MARRIED     1=never married  4=single             1=single       9=single               6=never married
200   MARRIED/IN UNION
210   Married (not specified)  2=married        2=married            3=monogamous                          1=married
211   Civil                                                                         3=only civil
212   Religious                                                                     4=only religious
213   Civil and religious                                                           2=civil and religious
214   Polygamous                                                     3=polygamous
220   Consensual union                          1=free union                        5=free union
300   SEPARATED/DIVORCED                        3=sep. or divorced
310   Separated                                                      6=separated    8=separated            3=separated
321   Legally separated
322   De facto separated
330   Divorced                 4=divorced                            5=divorced     7=divorced             4=divorced
400   WIDOWED                  3=widowed        5=widowed            4=widowed      6=widowed              5=widowed
999   UNKNOWN/MISSING          0=missing        6=unknown            B=blank        1=unknown
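Applying the harmonization matrix can be sketched with one lookup table per source variable. The two columns below (China 1982 and Mexico 1970) are read from the matrix for Marital Status. Deriving a general code from the first digit of the three-digit code is an assumption based on how the capitalized headings group the detailed rows.

```python
# Source variable -> {input value: harmonized MARST code}
MARST = {
    "CN82A403": {"1": 100, "2": 210, "3": 400, "4": 330, "0": 999},
    "MX70A402": {"9": 100, "3": 211, "4": 212, "2": 213, "5": 220,
                 "8": 310, "7": 330, "6": 400, "1": 999},
}
UNKNOWN = 999

def harmonize(source_var, value):
    """Translate one sample's input value into the harmonized detailed code."""
    return MARST[source_var].get(value, UNKNOWN)

def general(code):
    """Collapse a detailed code to its first digit (assumed general category)."""
    return code // 100

china_married = harmonize("CN82A403", "2")     # Married (not specified)
mexico_married = harmonize("MX70A402", "2")    # Civil and religious
```

The two samples keep their own level of detail (210 versus 213) but agree on the general category, which is what makes them comparable.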
Variable Programming
Some variable manipulations are too complex to be handled using the translation matrix tables. Typically these involve continuous variables or recoding logic that refers to multiple variables. This programming is written in C++.
Constructed “Pointer” Variables
(Colombia 1985)
(Simple household)
                                              Spouse's  Mother's  Father's
Pernum  Relate  Age  Sex     Marst    Chborn  Location  Location  Location
1       head    46   male    married  n/a     2         0         0
2       spouse  44   female  married  3       1         0         0
3       aunt    77   female  widow    7       0         0         0
4       child   15   female  single   0       0         2         1
5       child   13   female  single   n/a     0         2         1
6       child   11   male    single   n/a     0         2         1
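Pointer construction for a simple household can be sketched as follows. The rules here are deliberately simplified assumptions (head and spouse point to each other; children of the head point to the head and spouse as parents, by sex). They are not the actual IPUMS algorithm, which must also handle complex households. The field names are hypothetical.

```python
def build_pointers(household):
    """Return {pernum: {"spouse_loc", "mother_loc", "father_loc"}}; 0 means none."""
    head = next((p for p in household if p["relate"] == "head"), None)
    spouse = next((p for p in household if p["relate"] == "spouse"), None)
    pointers = {
        p["pernum"]: {"spouse_loc": 0, "mother_loc": 0, "father_loc": 0}
        for p in household
    }
    if head and spouse:
        pointers[head["pernum"]]["spouse_loc"] = spouse["pernum"]
        pointers[spouse["pernum"]]["spouse_loc"] = head["pernum"]
    parents = [p for p in (head, spouse) if p is not None]
    for p in household:
        if p["relate"] != "child":
            continue  # simplified: only children of the head are linked
        for parent in parents:
            key = "mother_loc" if parent["sex"] == "female" else "father_loc"
            pointers[p["pernum"]][key] = parent["pernum"]
    return pointers

# Simple household from the Colombia 1985 example
household = [
    {"pernum": 1, "relate": "head", "sex": "male"},
    {"pernum": 2, "relate": "spouse", "sex": "female"},
    {"pernum": 3, "relate": "aunt", "sex": "female"},
    {"pernum": 4, "relate": "child", "sex": "female"},
    {"pernum": 5, "relate": "child", "sex": "female"},
    {"pernum": 6, "relate": "child", "sex": "male"},
]
pointers = build_pointers(household)
```

Each pointer holds the person number of the relative within the same household, so a user can attach a spouse's or parent's characteristics to a person's own record.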
Constructed “Pointer” Variables
(Colombia 1985)
(Complex household)
                                                        Spouse's  Mother's  Father's
Pernum  Relationship  Age  Sex     Marst      Chborn  Location  Location  Location
1       head          53   female  separated  6       0         0         0
2       child         28   male    single     n/a     0         1         0
3       child         22   male    single     n/a     0         1         0
4       child         21   male    single     n/a     0         1         0
5       child         25   female  married    2       6         1         0
6       child-in-law  28   male    married    n/a     5         0         0
7       grandchild    3    male    single     n/a     0         5         6
8       grandchild    1    male    single     n/a     0         5         6
9       non-relative  32   female  separated  2       0         0         0
10      non-relative  10   male    single     n/a     0         9         0
11      non-relative  5    female  single     n/a     0         9         0
Original Data Dictionary – Kenya 1989
. C006-EA-TYPE        N     13        1          RURAL          1
                                                 URBAN          2
. C007-HHOLDNUM       N     14-16     3          HHOLD-CODE     001:999
. (record type)       A     17        1
Page 2
Data Dictionary: REAL1                IMPS Version 3.1
Created: 31/10/95 11:57:21
Record Name: POP-RECORD               Record Type: 2
-------------------------------------------------------------------------------
Item (occurs)         Data  Item
. Subitem (occurs)    Type  Position  Len. Dec.  Value Name     Values
-------------------------------------------------------------------------------
POP1                  A     18-67     50
. P00-LINENUMBER      N     18-19     2    0     LINE-NUMBER    01:49
. P10-RELATIONSHI     N     20        1    0     HEAD           1
                                                 SPOUSE         2
                                                 SON-OFHEAD     3
                                                 DAU-OFHEAD     4
                                                 FATHER         5
                                                 MOTHER         6
                                                 OTHERRELATIVE  7
                                                 NOTRELATED     8
                                                 NR             9
. P11-SEX             N     21        1    0     MALE           1
                                                 FEMALE         2
                                                 NR             9
. P12-AGE             N     22-23     2    0     UNDERONE       00
                                                 YEARGIVEN      01:96
                                                 OVR97          97
                                                 NR             99
Original Data Dictionary – Romania 1992
Line  Item  Data type and  Signification and values
No.         Item Len.
1.    MAPA  N 6            010001-47XXXX number of the file, where:
                           - 01-47 is the code of the county
                           - 0001-XXXX is the code of the census sector within the county
2.    CLAD  N 3            The order number of the building in the file
3.    LOC   N 3            The order number of the dwelling within the building
4.    RT    N 1            Record type value: 4
5.    P00   N 1            The order number of the household in the dwelling
6.    PNR   N 2            The order number of the person in the household
7.    P01   N 2            Relationship with the household head:
                           . household head 1
                           . husband / wife 2
                           . son / daughter 3
                           . son in law / daughter in law 4
                           . grandson / granddaughter 5
                           . father / mother 6
                           . grandfather / grandmother 7
                           . brother / sister 8
                           . brother in law / sister in law 9
                           . father in law / mother in law 10
                           . other relative 11
                           . non-related person 20
8.    P05   N 1            Situation at the census moment:
                           . present 1
                           . temporally absent from the household:
                             - left in other place of the country 2
                             - left abroad 3
                           . absent for a long time:
                             - for working 4
                             - for studies 5
                             - other reason 6
Original Data Dictionary – China 1982
=======================================
year: 1982, sample: 1%, record: individual, variable: age
Length: 3  Start: 7
Age in years
  0..99
=======================================
year: 1982, sample: 1%, record: individual, variable: race
Length: 2  Start: 10
Ethnicity
  01: Han       21: Va         41: Tajik
  02: Mongol    22: She        42: Nu
  03: Hui       23: Gaoshan    43: Uzbek
  04: Tibetan   24: Lahu       44: Russian
  05: Uygur     25: Sui        45: Ewenkei
  06: Miao      26: Dongxiang  46: Benglong
  07: Yi        27: Naxi       47: Baoan
  08: Zhuang    28: Jingpo     48: Yugur
  09: Bouyi     29: Kirgiz     49: Gin
  10: Korean    30: Tu         50: Tatar
  11: Man       31: Daur       51: Derung
  12: Dong      32: Mulam      52: Orogen
  13: Yao       33: Qiang      53: Hezhen
  14: Bai       34: Bulang     54: monba
  15: Tujia     35: Salar      55: Lhoba
  16: Hani      36: Maonan     56: Jino
  17: Kazak     37: Gelao      97: Other Unidentified
  18: Dai       38: Xibe       98: Naturalized Foreigners
  19: Li        39: Achang
  20: Lisu      40: Pumi
=======================================
year: 1982, sample: 1%, record: individual, variable: regstats
Length: 1  Start: 12
Registration Status
  1: Residing and registered here
  2: Residing here over 1 year, but registered elsewhere.
  3: Residing here less than 1 year, absent from the registration place 1 year or more.
  4: Living here with registration unsettled
  5: Used to reside here; is now abroad with no local registration
=======================================
Original Data Dictionary – Mexico 1990
25  CLAVE DE PARENTESCO
    CATALOGO DE PARENTESCO (CATPAREN.TXT)
    PRIMER DIGITO IGUAL A:
    1 JEFE(A)
    2 ESPOSA(O) O COMPAÑERA(O)
    3 HIJO(A)
    4 SIRVIENTE
    5 SIN PARENTESCO
    6 OTRO PARENTESCO
    7 PERSONA SOLA
    9 PARENTESCO NO ESPECIFICADO
26  SEXO
    1 HOMBRE
    2 MUJER
27  EDAD
    AÑOS CUMPLIDOS
    999 EDAD NO ESPECIFICADA
28  LUGAR DE NACIMIENTO
    CATALOGO DE PAISES (CATPAISE.TXT)
    001..032 ENTIDADES DEL PAIS
    033..099 ENTIDAD INSUFICIENTEMENTE ESPECIFICADO
    100..998 OTRO PAIS
    999 NO ESPECIFICO LUGAR DE NACIMIENTO
29  LUGAR DE RESIDENCIA ANTERIOR
    CATALOGO DE PAISES (CATPAISE.TXT)
    001..032 ENTIDADES DEL PAIS
    033..099 ENTIDAD INSUFICIENTEMENTE ESPECIFICADO
    100..998 OTRO PAIS
    999 NO ESPECIFICO LUGAR DE RESIDENCIA ANTERIOR
Enumeration Form: Original File
Enumeration Instructions: Original File (Mexico 1990)
Sample Information – from Statistical Office
Sample information is difficult for the IPUMS project to collect. Often only limited information can be gleaned from available documentation. It is extremely helpful when countries collate the information themselves, as was done below by the Netherlands:
Translate Documents to English
Many countries provide their census documentation in English. For those that do not, the IPUMS project hires translators from around the world. Often these are persons currently or formerly associated with National Statistical Offices. Some common languages are translated by staff in Minnesota.
Editable Enumeration Form – In English
5. Number of Rooms
How many rooms are used for sleeping without counting hallways?
_____ Write the number
Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count the kitchen
_____ Write the number
6. Access to water
Read all of the options until you get an affirmative answer. Circle only one answer
  1 Running water inside the dwelling
  2 Running water outside the dwelling but on the land
  3 Running water from a public faucet or hydrant
  4 Running water that is carried from another dwelling
  5 Tanked in by truck
  6 Water from a well, river, lake, stream or other
Answers 3, 4, 5, 6 continue with number 8
7. Water supply
How many days of the week is water available? Circle only one answer
  1 Daily
  2 Every third day
  3 Twice a week
  4 Once a week
  5 Occasionally
IPUMS Data Dictionary
Rec  Var      Col  Wid  Value  Value_Label                         Value_Label_Original           Freq       Svar
P    relate   36   2           Relationship to household head      P01-Parentesco con el jefe(a)             CR00A400
                        1      Head (male or female)               Jefe o jefa                    960,098
                        2      Spouse or partner                   Esposo(a)/compañera            680,217
                        3      Child or stepchild                  Hijo(a)/hijastro               1,763,230
                        4      Son-in-law or daughter-in-law       Yerno o nuera                  23,644
                        5      Grandchild                          Nieto(a)                       140,300
                        6      Parent or parent in-law             Padres o suegros               44,393
                        7      Other relative                      Otro familiar                  117,223
                        8      Domestic servant or relative        Serv.Domestico o su familiar   11,884
                        9      Other non-relative                  Otro no familiar               69,190
P    sex      38   1           Sex                                 P02-Sexo                                  CR00A401
                        1      Male                                Masculino                      1,902,614
                        2      Female                              Femenino                       1,907,565
P    bpl      39   1           Place of birth                      P04-Lugar de Nacimiento                   CR00A403
                        1      In this same canton                 Mismo canton                   2,303,784
                        2      In another canton                   Otro canton                    1,209,934
                        3      In another country                  Otro pais                      296,461
P    ethnic   40   2           Ethnic group                        P06-Etnia                                 CR00A408
                        1      Indigenous                          Indigena                       63,876
                        2      Black or Afrocostarican             Negra o Afrocostarricense      72,784
                        3      Asian                               China                          7,873
                        4      None of the above                   Ninguna anterior               3,568,471
                        9      Unknown                             Ignorado                       97,175
P    indigsp  42   2           Speaks Indigenous language          P06b-Habla lengua indigena                CR00A410
                        1      Yes, speaks Indigenous lang         Si habla lengua indígena       15,806
                        2      No, does not speak Indigenous lang  No habla lengua indígena       13,768
                        9      Unknown                             Ignorado                       3,554
                        10     [no label]                                                         3,777,051
XML-Tagged Enumeration Form
5. Number of Rooms
<svar v="MX00A016" a="all">
How many rooms are used for sleeping without counting hallways?
<i1> _____ Write the number </i1>
</svar>
<svar v="MX00A017" a="all">
Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count the kitchen
<i1> _____Write the number </i1>
</svar>
<svar v="MX00A018" a="all">
6. Access to water
Read all of the options until you get an affirmative answer. Circle only one answer
<i1>
1 Running water inside the dwelling
2 Running water outside the dwelling but on the land
3 Running water from a public faucet or hydrant
4 Running water that is carried from another dwelling
5 Tanked in by truck
6 Water from a well, river, lake, stream or other
</i1>
Answers 3, 4, 5, 6 continue with number 8
</svar>
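The tags make it possible to pull out the enumeration text that belongs to each source variable. The sketch below uses Python's standard `xml.etree.ElementTree`. Wrapping the fragment in a `<form>` root element is an assumption needed to make it parseable, and the fragment is a shortened copy of the tagged form.

```python
import xml.etree.ElementTree as ET

tagged = (
    '5. Number of Rooms '
    '<svar v="MX00A016" a="all"> How many rooms are used for sleeping '
    'without counting hallways? <i1> _____ Write the number </i1> </svar> '
    '<svar v="MX00A017" a="all"> Without counting the hallways or bathrooms '
    'how many total rooms are in this dwelling? Count the kitchen '
    '<i1> _____Write the number </i1> </svar>'
)

def text_by_variable(fragment):
    """Map each svar's variable name to the form text enclosed by its tag."""
    # Hypothetical wrapper element so the fragment has a single root
    root = ET.fromstring(f"<form>{fragment}</form>")
    result = {}
    for svar in root.iter("svar"):
        # Join all nested text and collapse runs of whitespace
        result[svar.get("v")] = " ".join("".join(svar.itertext()).split())
    return result

form_text = text_by_variable(tagged)
```

A lookup like this is one way the form text for a given variable could be retrieved for display alongside that variable.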
Document Unharmonized Variables
The enumeration form and instruction text provide most of the documentation for the unharmonized input variables.
Other documentation is written as needed to clarify the interpretation of the variable for users.
We also empirically determine the universe of persons or households with valid values for each variable.
Variable Description (Literacy)
<vardesc>
<var> LIT </var>
<desc> LIT indicates whether or not the respondent could read and write in any language. A person is typically considered literate if they can both read and write. All other persons are illiterate, including those who can either read or write but cannot do both. </desc>
<comp> Some samples provided more specific criteria than others with respect to the level of ability that should constitute literacy. Typically, the instructions appear to be aimed at distinguishing persons who have memorized how to write their signature or recognize certain words from those who can truly write and comprehend text they read. In 1999 Vietnam, all persons with 5 or more years of schooling are automatically considered literate. </comp>
<comp.bra> All Brazilian censuses consistently stipulated that to be considered literate a person must be able to read and write a simple note in any language. Persons are not literate if they can only write their name or if they once learned to read and write but have since forgotten. </comp.bra>
<comp.chn> The Chinese census instructions supplied explicit criteria for defining illiterate and semi-literate persons, who are combined in the data as "illiterate." The instructions stated that illiterate and semi-literate persons were those who knew fewer than 1500 words and could not read "simple language books and newspapers or write a simple message." </comp.chn>
Sample Design