Census 1996, 2001 & Community Survey (CS)
United Nations Regional Workshop on Census Data Processing: Contemporary Technology for Census Data Capturing and Editing
A perspective of the South African data processing system
A presentation by the South African Data Processing Team
Dar-es-Salaam, Tanzania, 9-13 June 2008

The presentation layout
• Introduction
• Data processing goal
• Planning phase
• Design of the data processing system
• System development & testing
• Implementation & operations
  – Process flow
  – Document Management System
  – Progress reporting
  – Tool of scanning
  – Exceptions
  – Quality Assurance (QA)
• Accounting or balancing process
• Data validation & editing
• Tabulation and output products

Introduction
• Data processing is considered part of the survey operations value chain (with a properly defined accountability structure);
• There are defined inter-dependency links with other Census sections (i.e. questionnaire design, data collection, ...);
• It is heavily dependent on the information technology support available around the country (management of the system is outsourced);
• Tight project management principles apply, checking timelines, resources and detailed production lines;
• It is obliged to adapt to ever-changing technology (1996 KFP, 2001 Census scanning, 2007 scanning with the old scanners, 2011 Census scanning with upgraded scanners).

Goal of Data Processing
To accurately process or convert the statistical information from different collection tools, such as the questionnaire, into comprehensive electronic data that is clean, accurate, consistent and reliable.

Planning phase
• Going through the lessons learned from previous censuses and surveys (1996 Census, 2001 Census, 2007 Community Survey)
  – Preparation of processing site
    • In the 1996 Census, distributed data processing centres in 9 provinces
    • In the 2001 Census and 2007 CS, a centralized data processing centre
  – Mode of data capturing
    • In the 1996 Census, manual capturing (key from paper) running on a SQL database with an interface developed in Visual Basic
    • In the 2001 Census and 2007 CS, proprietary scanning technology linked to an Oracle database
  – Census budget
    • The 1996 Census budget was estimated at 500 million Rand
    • The 2001 Census budget was estimated at 1.2 billion Rand
    • The 2007 CS budget was estimated at 600 million Rand
  – Human resources
    • The 1996 Census had more staff, for key from paper (an option considered for job creation across the country)
    • The 2001 Census and 2007 CS had a reduced number of staff supporting the scanning technology, working in shifts
  – Duration
    • The 1996 Census data capturing was planned for 12 months
    • The 2001 Census was planned for 6 months. However, the period was extended to 18 months because the new technology had not been tested
    • The 2007 CS took only 3 months, as planned
  – Systems design and specifications
    • In the 2001 Census, system specification & development were reviewed during implementation
    • In the 2007 CS, most of the system specification & development was completed and tested before production

Planning phase
• Strategic plan
  – There is a policy on standard procedures for documentation, process flow, metadata and concepts, managed by the DMID (Data Management and Information Delivery) project;
  – A common strategy across the survey programme: scanning technology with control of transactions in a database;
  – Moving toward a centralised corporate data processing centre (store management, ...);
  – Accounting of production transactions by tracking each questionnaire using a barcode;
  – Measurement of quality at each process of production;
  – Having a permanent team of data processors in order to retain experience while building capacity;
  – Acceptance of any system or module into production only after it has gone through a testing phase, to avoid repeating the 2001 Census experience of an untested system.

Planning phase
• Operational plan & budget
  – Since the 2001 Census, there has been a detailed list of activities, sub-activities and tasks with timelines (start and end dates) and responsible persons;
  – Since the 2007 CS, each activity has been linked to the budget, in what is called activity/task-based costing;
  – Since the 2007 CS, an independent and dedicated team has been in charge of project management and monitoring of activities;
  – A list of documents and other deliverables is submitted to the project management team (PMO) to keep track of progress;
  – Performance indicators were developed for the PMO to track, giving the daily production counts per process;
  – With activity-based costing, the budget has never been an issue, except in the 2001 Census, when the project ran beyond the planned period.

Design of Data Processing
• The data processing team gets the user requirements from the questionnaire design team and the data collection team;
• The team, made up of data processors, a system analyst (1 person), programmers, statisticians and data technologists (IT technicians), prepares the overall design specifications;
• The data processing team is supplemented by the data collection team in managing production and staff along the process flow;
• The scanning module of the system is outsourced (to a consortium of companies in 2001, but in the 2007 CS one company was accountable);
• In the 2007 CS, data processing project management was kept in-house to avoid the lack of accountability observed in the 2001 Census, where it was done by an external provider (PROCON);
• Because the workflow kept changing during the 2001 Census, an approved workflow with the operational procedure manual was ready before the start of production in the 2007 CS;
• Functional specifications were done only in the 2007 CS, as part of the overall system specification;
• In the 2001 Census, the technical specifications were written for the as-built system, whereas the 2007 CS specifications were done before any implementation.

System Development & Testing
• In the 1996 Census, system development was done by an in-house team supported by Swedish consultants;
• In the 2001 Census, system development was outsourced to a locally based company that put together a consortium of service providers for project management, system development, scanner specialists/maintenance, and imaging and recognition software;
• In the 2007 CS, system development and project management were done in-house, outsourcing only the scanning software and scanner maintenance;
• In the 1996 Census, only unit tests were conducted, whereas in the 2001 Census most of the tests (unit tests, production load test, ...) were conducted while already in production;
• In the 2007 CS, all tests were done before production:
  – For instance, background colour drop-out was tested beforehand in the 2007 CS, whereas in the 2001 Census the blue background required a blue light in the scanner, which was tested only after months of production;
  – In the 2001 Census, the decision on exception handling (rescan or transcription) was made during production, whereas in the 2007 CS such questionnaires were sent to Key From Paper (KFP) or Key From Image (KFI);
  – In the 2007 CS, false-positive readings were reduced by introducing voting rules between two different recognition engines, whereas in the 2001 Census all false-positive readings were sent to the verification stage (Tiling and Completion/Key correction).
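
The slides do not spell out the voting rules used between the two recognition engines. As a minimal sketch, assuming the simplest possible rule (accept a field only when both engines return the same non-empty value, otherwise route it to verification), the idea could look like this in Python. The field names and the function names `vote` and `vote_fields` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def vote(engine_a: Optional[str], engine_b: Optional[str]) -> Tuple[Optional[str], bool]:
    """Return (accepted_value, needs_review) for one field.

    Assumed rule: the value is accepted only if both engines produced
    the same non-empty reading; anything else goes to review.
    """
    if engine_a is not None and engine_a == engine_b:
        return engine_a, False
    return None, True


def vote_fields(
    read_a: Dict[str, Optional[str]], read_b: Dict[str, Optional[str]]
) -> Tuple[Dict[str, str], List[str]]:
    """Apply the voting rule to every field read by either engine."""
    accepted: Dict[str, str] = {}
    review: List[str] = []
    for field in sorted(set(read_a) | set(read_b)):
        value, needs_review = vote(read_a.get(field), read_b.get(field))
        if needs_review:
            review.append(field)  # would go to Tiling/Completion or KFI
        else:
            accepted[field] = value
    return accepted, review


# Hypothetical readings of one questionnaire by two engines.
engine_1 = {"age": "34", "sex": "1", "hh_size": "5"}
engine_2 = {"age": "34", "sex": "7", "hh_size": None}
accepted, review = vote_fields(engine_1, engine_2)
```

Under this assumed rule only `age` is accepted automatically, while `sex` and `hh_size` are queued for human verification, which is how disagreement between engines keeps a confident but wrong reading out of the database.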

Implementation & Operations
• Operational procedures
  – In the 2001 Census, the operational procedure manual was prepared during production;
  – In the 2007 CS, the operational procedures were in place before training;
  – A production account is produced every day (extracted from the Oracle database).
• Recruitment
  – In the 1996 Census, production staff were selected on keying speed only;
  – In the 2001 Census, production staff were recruited according to the requirements of each process;
  – In the 2007 CS, production staff had versatile skills as data processors and could move between processes as needed, as determined by the flow manager;
  – In the 2001 Census, staff worked 24 hours a day, 7 days a week, in 3 shifts. In the 2007 CS, a single shift was enough to meet the deadline.
• Training
  – In the 2001 Census, training was conducted by the service provider (PROCON), whereas in the 1996 Census and 2007 CS it was given by the senior data processors, system developers and statisticians who were part of the design team.
• Preparation of work environment
  – The 1996 Census used 9 sites, the 2001 Census one warehouse site, and the 2007 CS two sites (one for main storage and the other for production);
  – Site preparation, including partitioning, hardware and networking, was completed one month before the end of Census field operations.

Operations (cont.): High Level Process Flow
[Figure: CS Data Processing Process Flow Ver 3.0. Flowchart spanning the CSAS, DMS, Scanning & Recognition, QA & Validation + Key from Image, Key from Paper and Automated Coding stages. Steps shown include receiving, barcode matching, content verification, primary preparation, guillotine, secondary preparation, scanning, post-scanning box check, image validation and form identification, recognition/interpretation, verify/tiling/completion, normalisation, transfer/export, sampling, Key from Image, Key from Paper (1st and 2nd capture), accuracy check (95%), automated and on-screen coding, manual editing, sign-off (balancing) and the output database. Non-scannable (damaged) questionnaires and validation failures are de-activated from their box and moved to a newly created KFP box; data is transferred only if the accuracy check reaches 95%. Legend: exception path, physical movement path, data movement path.]

Operations (cont.): Document Management System
• Tracks the movement of documents across processes;
• Accounts for all transactions, including production staff logins;
• Database driven (SyBase in 1996, Oracle in 2001 and 2007);
• Progress reporting per user, per function and per process;
• Reporting supports performance management (speed, time, production units, ...).
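
To illustrate what tracking document movement and reporting per user and per process can mean in practice, here is a small Python sketch of a transaction log. It is not the Stats SA schema: the `Transaction` fields, the barcode values and the helper names are all assumptions made for the example, and an in-memory list stands in for the database.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Transaction:
    """One logged movement of a box into a process (hypothetical layout)."""
    box_barcode: str
    process: str
    user: str
    timestamp: datetime


def current_location(log: List[Transaction]) -> Dict[str, str]:
    """Latest process recorded for each box, by timestamp."""
    latest: Dict[str, str] = {}
    for t in sorted(log, key=lambda t: t.timestamp):
        latest[t.box_barcode] = t.process
    return latest


def production_per_user(log: List[Transaction]) -> Counter:
    """Number of transactions per (user, process) pair."""
    counts: Counter = Counter((t.user, t.process) for t in log)
    return counts


# Hypothetical log entries.
log = [
    Transaction("BX001", "Receiving", "user1", datetime(2007, 4, 3, 8, 0)),
    Transaction("BX002", "Receiving", "user1", datetime(2007, 4, 3, 8, 5)),
    Transaction("BX001", "Scanning", "user2", datetime(2007, 4, 12, 9, 0)),
]
locations = current_location(log)
per_user = production_per_user(log)
```

Because every movement is a row keyed by barcode, both the "where is this box now" question and the per-user production counts fall out of the same log.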

Operations (cont.): Progress reporting
CS 2007 Data Processing Progress Report. Date: Monday 02 Jul 2007

Constants
Number of boxes to be processed: 17,387
Number of questionnaires in boxes (Final Result Code 0-9): 284,244
Number of questionnaires in boxes (Final Result Code 1,4), processable: 251,775
Number of questionnaires in boxes, not processable: 32,469

Columns (A to P): Process | Buffer | Work in progress | Completed | Total | Outstanding | Percentage complete | Planned start date | Planned end date | Actual start date | Actual end date | Required target per day | Average production per day | Estimated days for completion based on current average | New end date | (Lead) or delay days in terms of complete

Overall progress: boxes
Check Out DPC store | - | - | 17,387 | 17,387 | - | 100.00% | 03 Apr 07 | 01 Jun 07 | 02 Apr 07 | 11 Jun 07 | 446 | - | - | - | 0
Check In OM store | - | - | 17,387 | 17,387 | - | 100.00% | 03 Apr 07 | 01 Jun 07 | 02 Apr 07 | 11 Jun 07 | 446 | - | - | - | 0
Store audit verification | - | - | 17,387 | 17,387 | - | 100.00% | 04 Apr 07 | 01 Jun 07 | 04 Apr 07 | 11 Jun 07 | 458 | - | - | - | 0
Primary preparation and Guillotine | - | - | 17,304 | 17,304 | - | 100.00% | 11 Apr 07 | 01 Jun 07 | 11 Apr 07 | 11 Jun 07 | 524 | - | - | - | 0
KFP (Manual Coding, First Capture and Second Capture) | - | - | 61 | 83 | 22 | 73.49% | 28 May 07 | 29 Jun 07 | 12 Jun 07 | - | 4 | - | - | - | 0
Secondary preparation | - | - | 17,304 | 17,304 | - | 100.00% | 12 Apr 07 | 08 Jun 07 | 12 Apr 07 | 11 Jun 07 | 468 | - | - | - | 0
Scanning | - | - | 17,015 | 17,015 | - | 100.00% | 12 Apr 07 | 08 Jun 07 | 12 Apr 07 | 11 Jun 07 | 460 | - | - | - | 0
Post Scan Checkout | - | - | 17,015 | 17,015 | - | 100.00% | 12 Apr 07 | 08 Jun 07 | 12 Apr 07 | 11 Jun 07 | 460 | - | - | - | 0

Daily progress: boxes (Fri 22 Jun, Mon 25 Jun, Tue 26 Jun, Wed 27 Jun, Thu 28 Jun, Fri 29 Jun, Mon 02 Jul): no production recorded ("-") for any of the eight box-level processes.

Overall progress: questionnaires
Store Audit Verification | - | - | 284,244 | 284,244 | - | 100.00% | 04 Apr 07 | 01 Jun 07 | 04 Apr 07 | 11 Jun 07 | 7,480 | - | - | - | 0
Primary preparation and Guillotine | - | - | 281,840 | 281,840 | - | 100.00% | 04 Apr 07 | 01 Jun 07 | 11 Apr 07 | 11 Jun 07 | 7,417 | - | - | - | 0
KFP (Manual Coding, First Capture and Second Capture) | - | - | 1,902 | 2,404 | 502 | 79.12% | 28 May 07 | 29 Jun 07 | 12 Jun 07 | - | 120 | 0 | - | - | 0
Secondary Preparation | - | - | 281,840 | 281,840 | - | 100.00% | 12 Apr 07 | 08 Jun 07 | 12 Apr 07 | 11 Jun 07 | 7,617 | - | - | - | 0
Scanning | - | - | 247,746 | 247,746 | - | 100.00% | 12 Apr 07 | 08 Jun 07 | 12 Apr 07 | 11 Jun 07 | 6,696 | - | - | - | 0
Character Recognition | 1,141 | - | 246,605 | 247,746 | 1,141 | 99.54% | 13 Apr 07 | 15 Jun 07 | 24 Apr 07 | - | 6,043 | 1,905 | 1 | 02 Jul 07 | 18
Tiling | 218 | - | 246,387 | 247,746 | 1,359 | 99.45% | 16 Apr 07 | 15 Jun 07 | 24 Apr 07 | - | 6,194 | 2,742 | 0 | 02 Jul 07 | 17
Completion | 1,605 | - | 244,782 | 247,746 | 2,964 | 98.80% | 16 Apr 07 | 15 Jun 07 | 24 Apr 07 | - | 6,194 | 3,080 | 1 | 02 Jul 07 | 18
Data export | -1,740 | - | 246,522 | 247,746 | 1,224 | 99.51% | 07 May 07 | 15 Jun 07 | 07 May 07 | - | 9,910 | 7,524 | 0 | 02 Jul 07 | 17
Sampled for QA | - | - | 246,200 | 247,746 | 1,546 | 99.38% | 14 May 07 | 22 Jun 07 | - | - | 9,910 | 10,605 | 0 | 02 Jul 07 | 10
Quality Assurance | 3,500 | - | 242,700 | 247,746 | 5,046 | 97.96% | 21 May 07 | 22 Jun 07 | - | - | 12,387 | 25,190 | 0 | 02 Jul 07 | 10
Coding | - | - | - | 247,746 | 247,746 | 0.00% | 21 May 07 | 22 Jun 07 | - | - | 12,387 | - | - | - | 0
KFI (Exception resolution) | 586 | - | 49,390 | 49,976 | 586 | 98.83% | 14 May 07 | 29 Jun 07 | - | - | 1,666 | 3,233 | 0 | 02 Jul 07 | 3
Sign-off | - | - | 240,423 | 247,746 | 7,323 | 97.04% | 02 May 07 | 29 Jun 07 | - | - | 6,520 | 3,233 | 2 | 04 Jul 07 | 5
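
Two of the derived columns in the progress report, percentage complete and estimated days for completion, can be reproduced from the counts. The Python sketch below is an assumption about how they were calculated: percentage complete as completed over total, and estimated days as outstanding over average daily production rounded to the nearest whole day. That rounding rule is inferred from the reported figures and is not stated in the slides; the function names are hypothetical.

```python
from typing import Optional


def percent_complete(completed: int, total: int) -> float:
    """Completed as a percentage of total, to two decimals."""
    if total == 0:
        return 0.0
    return round(100.0 * completed / total, 2)


def estimated_days(outstanding: int, avg_per_day: float) -> Optional[int]:
    """Days still needed at the current average, rounded to the nearest day.

    Returns None when there is no average production to extrapolate from.
    """
    if avg_per_day <= 0:
        return None
    return int(outstanding / avg_per_day + 0.5)


# Figures taken from the questionnaire rows of the report.
kfp_pct = percent_complete(1902, 2404)   # KFP row
signoff_days = estimated_days(7323, 3233)  # Sign-off row
tiling_days = estimated_days(1359, 2742)   # Tiling row
```

With these assumptions the KFP row gives 79.12%, Sign-off gives 2 days and Tiling gives 0 days, matching the report. The new end date and lead/delay columns are not reproduced here because the slides do not show how they were derived.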

Operations (cont.): Progress reporting
[Figures: five charts from the CS 2007 progress report. (1) Overall Progress - Boxes and (2) Overall Progress - Questionnaires: percentage bars per process, split into Completed, Work in Progress, Buffer and Outstanding. (3) Daily Progress - Questionnaires: Data export, Tiling and Completion against the target for automated processes, Fri 22 Jun to Mon 02 Jul. (4) Daily Progress - Boxes: KFP (Manual Coding, First Capture and Second Capture) against the target for manual processes, Fri 22 Jun to Mon 02 Jul. (5) Cumulative Progress - Scanning - Planned vs Actual: number of questionnaires (0 to 300,000) by production day, Thu 12 Apr to Fri 08 Jun.]

Operations (cont.): Tool of scanning
• Kodak 9520D
  – Used in the 2001 Census;
  – Used in the 2007 CS;
• Differential scanner feeding (page by page and/or in batches);
• Barcode recognition at scanning time.

Operations (cont.): Exceptions
• Questionnaire transcription:
  – Damaged
  – Unscannable
  – Inconsistent page numbering
  – Unique identifier (barcode)
• Key From Paper (KFP):
  – Poor image quality
  – Faint writing
  – Missing pages
  – Wrong unique identifier (Enumerator Area, Dwelling Unit & Household Number)
• False-positive reading:
  – Poor software recognition
  – Poor image quality
  – Incomplete text (character)
  – Unrecognized mark or character
• Failed quality checks:
  – Quality rate below the threshold (95% accuracy rate)

Operations (cont.): Quality Assurance (QA)
• In the 1996 Census, quality control was implemented as part of double keying, without any measurement attached to it;
• In the 2001 Census, quality was measured at scanning time (image quality check) and after data capturing (Key from Image of sampled batches; the threshold was 97%);
• In the 2007 CS, a sample of captured questionnaires was subjected to a second capture and compared with the first capture to determine an agreement rate (the threshold was reduced to 95% because of the good image quality):
  – For scanned cases: a sample was keyed from image and an agreement rate calculated;
  – For exceptional cases: 100% were double keyed from paper and an agreement rate calculated.
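
The agreement rate between a first and a second capture can be sketched in a few lines of Python. This is an illustration only: it assumes the rate is computed field by field as the share of matching values, and that a batch passes when the rate is at least 95%, as the "Transfer only if >= 95%" label in the process flow suggests. The function names and sample values are hypothetical.

```python
from typing import Sequence


def agreement_rate(first: Sequence[str], second: Sequence[str]) -> float:
    """Share of fields on which two independent captures agree."""
    if len(first) != len(second):
        raise ValueError("both captures must cover the same fields")
    if not first:
        raise ValueError("nothing to compare")
    matches = sum(1 for a, b in zip(first, second) if a == b)
    return matches / len(first)


def batch_passes(first: Sequence[str], second: Sequence[str],
                 threshold: float = 0.95) -> bool:
    """True when the agreement rate meets the threshold (95% assumed)."""
    return agreement_rate(first, second) >= threshold


# Hypothetical batch of 20 captured field values.
first_capture = ["1"] * 20
second_capture = ["1"] * 19 + ["7"]   # one disagreement out of 20
rate = agreement_rate(first_capture, second_capture)
passed = batch_passes(first_capture, second_capture)
```

One disagreement in twenty fields gives a rate of 0.95, which just passes; a second disagreement would drop the batch to 0.90 and send it back for recapture.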

Accounting or Balancing process
• After capturing and before any export, each questionnaire is accounted for: it must be linked to its geographical area (EA) and have the correct data structure (household, persons, ...);
• In the 1996 Census, captured data was exported to SAS/ASCII for post-capture processing (editing and tabulation);
• In the 2001 Census, the balancing process took longer because postal (self-enumeration) questionnaires lacked a reference link to the EA;
• In the 2007 CS, a Census and Administration System (CSAS) assisted in getting a full account of the questionnaires linked to their referenced geography.
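
As a rough illustration of balancing, the Python sketch below reconciles the questionnaires that were captured against a control list of what was expected, keyed by barcode and EA. It assumes two simple mappings from barcode to EA code; the real CSAS data model is not described in the slides, and the names `balance`, `expected` and `captured` are hypothetical.

```python
from typing import Dict, List, Union


def balance(expected: Dict[str, str],
            captured: Dict[str, str]) -> Dict[str, Union[List[str], bool]]:
    """Compare captured questionnaires with the control list.

    Both arguments map a questionnaire barcode to its EA code.
    """
    expected_ids, captured_ids = set(expected), set(captured)
    missing = sorted(expected_ids - captured_ids)      # expected, never captured
    unexpected = sorted(captured_ids - expected_ids)   # captured, not on the list
    ea_mismatch = sorted(
        b for b in expected_ids & captured_ids if expected[b] != captured[b]
    )
    return {
        "missing": missing,
        "unexpected": unexpected,
        "ea_mismatch": ea_mismatch,
        "balanced": not (missing or unexpected or ea_mismatch),
    }


# Hypothetical control list and capture result.
expected = {"Q001": "EA10", "Q002": "EA10", "Q003": "EA11"}
captured = {"Q001": "EA10", "Q003": "EA12", "Q004": "EA11"}
report = balance(expected, captured)
```

Export would go ahead only once `balanced` is true; until then the three lists say which questionnaires still need to be found, explained or re-linked to their EA.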

Data validation & Editing
• In 1996, the adopted strategy was not to impute any derived value; only manual editing was allowed;
• In the 2001 Census, automated editing was implemented using IMPS/CSPro, based on an editing specification developed with the assistance of the US Bureau of the Census. The 2007 CS followed the same approach;
• Editing reports with imputation rates were submitted to an editing committee, which decided on the rules to apply for correction;
• In the 2001 Census and 2007 CS, limited manual editing was implemented;
• One key editing rule is the removal of minimally processable cases caused by poor recognition or false-positive readings;
• Although editing is done in ASCII, the output database is exported in different user-driven formats (ASCII, Oracle, SAS, ...) linked with the metadata.

Tabulation and output products
• Since Stats SA policy is to give data users access, the strategy is to publish Census data in different formats to increase accessibility and promote data use;
• In the 1996 Census, the output database was packaged as a SuperCross database, and a set of aggregated databases was put on CD for users;
• In the 2001 Census, access to the data was widened by adding online tabulation tools (PX-Web) alongside SuperCross, a reduced ASCII file, ...;
• In the 2007 CS, the data is also available in different formats (SuperCross, ASCII file, PX-Web and other map/chart-linked tools);
• The traditional reports are still produced, based on the tabulation plan/output reports.
Benefit of scanning Technology
• Improve the Quality of the Data
• Save Time
• Reduce Costs
THANK YOU!