INSIGHT BEYOND MEASURE: Larger than Life ? Presented by Adrian Hodgson.
How big ?
Jobcentre & client activity data - 11 million people
80 Gbytes of data per month on DLT
Processing 20 tables & 250m records per month
Largest tables have 80 million records, 5 Gbytes
Key extract retrievals take 6 - 20 hours to run
Full SPSS extracts for clients - 5 Gbytes
Twenty five SIR databases - over 100 Gbytes
from 1 Mbyte to 15 Gbytes
Evaluation Database - Overview
Background & project requirements
Setting the ‘environment’
Visual PQL
The ‘Data Dictionary’
Program generators
Fuzzy matching
Use of the data
New developments
Questions
Background
Government program started in January 1998
New Deal for Young People was set up to encourage and assist unemployed groups into full time sustainable employment. It gives unemployed people aged 18-24 the opportunity to develop their potential, gain skills and experience and find work.
Employment Service needed an evaluation database
March 98 - ORC issues first extract with 18,000 clients
June 98 - program expanded to over-25s
June 99 - expanded to cover all Jobcentre clients
October 00 - Source database migrates from six regional Ingres databases to single Oracle database
Contract extension - press notice
The new contract will run to May 2004, with an option to run a further two years to May 2006.
ORC International’s database tools are designed to help the ES evaluate the service it provides to all clients registered on the Government’s JobCentre computer system. They allow the ES to regularly monitor clients with sustainable jobs, the effectiveness of equal opportunities measures, and the relationship between job vacancies and the labour market skills base.
“It is our intention to create a project web site, which will allow multiple level access to different categories of users. This will include project documentation and progress reports, access to tabulations and small data extracts, and customer feedback areas, as well as links to other related sites.”
ORC International is part of Opinion Research Corporation, which was founded in 1938. With offices in the United States, Europe, Asia, Latin America and Africa, the Company provides integrated marketing services to both businesses and governments in more than 100 countries. http://www.orc.co.uk
Project Requirements
LMS (Labour Market System) is a multi-user transaction system used in Jobcentres
Needed new database with evaluative functionality
Linked to additional data-sets including clerical
Flexibility to change structure periodically
Combine cross regional records for the same client
Extracts provided to ES for statistical purposes
Setting the environment
PROGRAM
PQL CONNECT DATABASE 'CLMIDS' PREFIX '<CLI_MIDS>'
END PROGRAM
SET DATABASE CLMIDS
SET PROCFILE '<LMSPROC>' | reset to main procedure file
( main procedure file held in separate .SR4 file )
Midlands database prefix
retval = globals ( 'CLI_MIDS' , '\\urmston\d\50413\sirdb\client\')
and so on for the procedure file & other databases
Visual PQL (1)
execute dbms 'CALL ddict.cprog ($' + tablenam + '$,$' + region + '$,$' + extract + '$,$' + newdata + '$,$' + editdata + '$,$' + dbnum + '$)'
Visual PQL (2)
Linking SIR to other software - Winzip
open inf /dsnvar=fnamein /iostat=ierr1 /write /lrecl=300
ifthen ( ierr1 ne 0)
. write 'File not found '
. write // 'Unzipping file using winzip'
. pql escape "C:\Program Files\WinZip\WINZIP32.EXE -e zipfn fnamein "
. write 'Back to sir !!!!!! '
c Build in a loop to check that input file now unzipped and ready
c wait and repeat if not
else
. write 'File opened OK ' fnamein
endif
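The "wait and repeat" loop mentioned in the comment lines is not shown on the slide. A minimal sketch of one way to do it, written in Python rather than PQL: poll until the file exists and its size has stopped changing between two checks. The function name, timeout and polling interval are all assumptions made for illustration.

```python
import os
import time


def wait_for_file(path, timeout=60.0, interval=0.5):
    """Return True once `path` exists and its size is unchanged between two polls.

    Returns False if that has not happened within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            if size == last_size:
                # Same size on two consecutive polls: assume the unzip has finished.
                return True
            last_size = size
        time.sleep(interval)
    return False
```

A stable size is only a heuristic for "unzipped and ready"; a slow writer could pause for longer than one polling interval.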
The ‘Data Dictionary’
Tablename
Field names
Data types
Start / End Columns
Sort ids
Include / exclude & date flags
Generate Schemas
Edit flags
Automatic schema generation
Source field names, labels and data types read from HTML into data dictionary database
Program to create SIR variable names
strip out underscore characters ‘_’
trims field length to eight
if duplicated replaces eighth character with a number
Program sets column positions based on field types
Data types converted from Ingres / Oracle to SIR
Procedure for Date Integers and true date fields
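The variable-naming rules listed above (strip underscores, trim to eight characters, replace the eighth character with a number on a clash) can be sketched as follows. This is an illustration in Python, not the original SIR program; the function name, the choice of digits 1 to 9 and the handling of names shorter than eight characters are assumptions.

```python
def make_sir_names(field_names, max_len=8):
    """Map source field names to unique names of at most `max_len` characters."""
    used = set()
    result = {}
    for field in field_names:
        base = field.replace("_", "")[:max_len]
        name = base
        digit = 1
        while name in used:
            if digit > 9:
                raise ValueError("more than nine clashes for " + field)
            # Keep the first seven characters and put a digit in eighth place.
            # (Assumption: a shorter base simply gets the digit appended.)
            name = base[:max_len - 1] + str(digit)
            digit += 1
        used.add(name)
        result[field] = name
    return result


names = make_sir_names(["expct_start_date", "expct_start_time", "client_id"])
print(names)
```

Here the two `expct_start_...` fields both trim to `expctsta`, so the second one becomes `expctst1`. Only `expct_start_date` is taken from the slides; the other two field names are made up for the example.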
Load programs - the report files (1)
Client_action record
Read 78224996
\fixed_width_data\2001_05\
north: 19156652
east: 6949927
west: 11515552
south: 11774525
northwest: 11352025
midlands: 12147512
oracle: 5328791
dodgy: 12
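One use of a report like this is to confirm that the per-region counts add up to the total number of records read. A minimal Python sketch, assuming the report is plain text with a "Read <total>" line followed by "name: count" lines (the exact file layout is reconstructed from the slide and may differ):

```python
REPORT = """\
Read 78224996
north: 19156652
east: 6949927
west: 11515552
south: 11774525
northwest: 11352025
midlands: 12147512
oracle: 5328791
dodgy: 12
"""


def parse_report(text):
    """Return (total_read, {destination: count}) from a report in the assumed layout."""
    total = None
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Read "):
            total = int(line.split()[1])
        elif ":" in line:
            name, _, value = line.partition(":")
            counts[name.strip()] = int(value)
    return total, counts


total, counts = parse_report(REPORT)
print("reconciled:", sum(counts.values()) == total)
```

With the figures shown on the slide the counts do reconcile: the eight destinations, including the 12 "dodgy" records, sum to the 78224996 records read.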
Load program generation (2)
Develop standard file naming & directory location
Read pre-processing report files to pick up numbers of records to be loaded
Detect any missing report files using iostat values
Write loading program code to pql files
Read these pql files back into the procedure file
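The generation steps above can be sketched in Python as a rough illustration. Everything specific here is an assumption: the one-line report format, the `.rpt` file extension, and the two-argument form of the generated call (the real `initload.update` calls take more arguments). A missing report file is detected by catching `FileNotFoundError`, which plays the role of a non-zero iostat value.

```python
import os


def read_count(report_path):
    """Return the record count from a report file, or None if the file is missing.

    Assumed (hypothetical) format: a single line such as 'midlands: 12147512'.
    """
    try:
        with open(report_path, encoding="ascii") as handle:
            _, _, count = handle.readline().partition(":")
    except FileNotFoundError:
        return None
    return int(count)


def generate_load_program(report_dir, tables, out_path):
    """Write one load call per table; return the tables whose report is missing."""
    missing = []
    with open(out_path, "w", encoding="ascii") as out:
        for table in tables:
            count = read_count(os.path.join(report_dir, table + ".rpt"))
            if count is None:
                missing.append(table)
                continue
            # Hypothetical, shortened call shape used only for illustration.
            out.write(f"call initload.update ({table}, {count})\n")
    return missing
```

Returning the list of missing tables, rather than stopping at the first one, lets a run report every gap in a month's delivery at once.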
Load generator - finishing touches (3)
open inf6 / dsn = 'Q:\ps\50413\ProgGen\ldmidlands1.txt' / read / lrecl = 250
open ouf6 / dsn = 'Q:\ps\50413\ProgGen\ldmidlands2.txt' / write / lrecl = 250
write ( ouf6 ) 'PROCEDURE INITLOAD.<6>:T' | add the procedure header line
 / 'call initload.dropall' | add call to drop all databases
 // 'call initload.connmids' / | add call to connect midlands
loop
. read (inf6, iostat = ierr6) textline(a250) | copy the rest of the program
. if (ierr6 ne 0) exit loop
. write (ouf6) textline
end loop
write (ouf6) 'call initload.dropmids' | add module call to drop database
 / 'END PROCEDURE' | add the END PROCEDURE
close (inf6) | close the input and output files
close (ouf6)
pread 'Q:\ps\50413\ProgGen\ldmidlands2.txt' | Pread the SIR pql
Load program - Midlands database (4)
call initload.dropall | drop all connected databases
call initload.connmids | connect the midlands database
call initload.update (\\Urmston\d\50413\, sir_input_files\lms\2001_04\Midlands\edited\,
d:\50413\sirdb\client\Midlands\log\2001_04\, Midlands_client2001_04ORCIDsort, edt, 0.75, 1, 1, 181450, 0)
call initload.update (\\Urmston\d\50413\, sir_input_files\lms\2001_04\Midlands\edited\, d:\50413\sirdb\client\Midlands\log\2001_04\, Midlands_client_action2001_04ORCIDsort, edt, 0.75, 2, 1, 236835, 0)
……. | more calls to load data
call initload.dropmids | disconnect the midlands database
Fuzzy matching
Linking data by best combinations of Nino, name, dob
Stripping non significant text
blanks, apostrophes, hyphens
Methods grown organically on case by case basis
Variety of scoring methods, e.g. surname matches
HODGSON 4 points direct matches
HODGESON 3 points * 0.9 for misaligned -> 6.7/8 = 84%
Flexible generic modules applicable to all match types
Reporting of false positive and negative matches
Manual review of near / doubtful matches
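The HODGSON / HODGESON example can be reproduced with a small scoring function. This is one reading of the slide, not the original SIR module: a character in the same position in both names scores 1 point, a remaining character found out of position scores 0.9, and the total is divided by the length of the longer name. The function names and the exact definition of "misaligned" are assumptions.

```python
from collections import Counter


def strip_insignificant(text):
    """Upper-case the text and drop blanks, apostrophes and hyphens."""
    return "".join(ch for ch in text.upper() if ch not in " '-")


def surname_score(a, b, misaligned_weight=0.9):
    """Score two surnames between 0 and 1 using direct and misaligned matches."""
    a, b = strip_insignificant(a), strip_insignificant(b)
    if not a or not b:
        return 0.0
    # Direct matches: same character at the same position.
    direct = sum(1 for x, y in zip(a, b) if x == y)
    # Leftover characters that did not match in place, counted per name.
    rest_a = Counter(x for i, x in enumerate(a) if i >= len(b) or x != b[i])
    rest_b = Counter(y for i, y in enumerate(b) if i >= len(a) or y != a[i])
    # Misaligned matches: leftover characters common to both names.
    misaligned = sum((rest_a & rest_b).values())
    return (direct + misaligned_weight * misaligned) / max(len(a), len(b))


score = surname_score("HODGSON", "HODGESON")
print(round(score * 100))
```

For HODGSON against HODGESON this gives 4 direct matches (H, O, D, G) and 3 misaligned ones (S, O, N), so (4 + 2.7) / 8, which rounds to 84%. A production matcher would also need a threshold below which pairs go to manual review.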
Generic fuzzy matching - issues
Key fields different sizes and names across applications
Some key fields absent or with high % missing values
Quality of key fields varies widely
Matching varies from a handful to millions of records
Bringing Access and SIR together - Visual PQL ?
How to assess ‘false’ positives when no other common fields in data sets being matched
-> set of core procedures with options to bypass ?
General Issues
Continual growth in database & extract size
Data Irregularities
Embedded carriage returns in text fields
Date formats (American /English)
The team
Keeping the routine / repeated processing ‘interesting’
Mushrooming similar code - keeping it generic / ‘black box’
SIR
several large tabfiles required - largest currently 11Gb
Some retrievals crash with sir.exe error 1 time in 5 - why?
P4 / Windows 2000 mixed performance
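The two data irregularities named above, embedded carriage returns and American versus English date order, can be illustrated with a short Python sketch. The slash-separated four-digit-year date layout and both function names are assumptions; the slides do not say how dates arrive in the source files.

```python
from datetime import datetime


def clean_text_field(value):
    """Replace embedded carriage returns and line feeds and collapse whitespace."""
    return " ".join(value.replace("\r", " ").replace("\n", " ").split())


def classify_date(text):
    """Say whether a nn/nn/yyyy string is valid as English, American or both.

    Returns 'english' (dd/mm only), 'american' (mm/dd only), 'ambiguous'
    (both readings valid but different), 'either' (both give the same date)
    or 'invalid'.
    """
    readings = []
    for label, fmt in (("english", "%d/%m/%Y"), ("american", "%m/%d/%Y")):
        try:
            readings.append((label, datetime.strptime(text, fmt).date()))
        except ValueError:
            pass
    if not readings:
        return "invalid"
    if len(readings) == 1:
        return readings[0][0]
    return "either" if readings[0][1] == readings[1][1] else "ambiguous"


print(classify_date("04/05/2001"))
```

A value such as 04/05/2001 parses under both conventions, so it cannot be resolved from the field alone; the share of 'english'-only versus 'american'-only values in a file is one hint as to which convention the file uses.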
New Developments (1) - Project web site
Management tool for display of project statistics
Focus for collecting project documents ( Word )
FAQs
Glossary of abbreviations & acronyms ?
Which variable should I use for this ?
How is leaving date derived ?
Where can I get the latest data dictionary for NDYP ?
What’s the ORC variable name for expct_start_date ?
Small sampling / extraction tools
Links to other related government sites
New Developments (2) - Processing
Full benefits data set arrives on Monday
Increasingly complex extracts in SPSS and SAS
Moving more of the data processing to Linux
Generic fuzzy matching tools
Adding other data, deprivation index, surveys
Sir 2002 - reading an 800 byte record into a single string
Secondary indexing / lookups - replace Case 0
Better linking of the processing: zipping / SPSS / Excel / HTML
Questions
INSIGHT BEYOND MEASURE
Larger than the Evaluation Database ?
Presented By Adrian Hodgson