Page 1

Working with the 2001 Licensed Individual SAR

• Coverage and quality
• SAR data issues
• Analysing SAR data
• Software
• The other datasets…

Page 2

The SARs

• Introduced from ’91 Census as alternative to tabular outputs
  – Improved flexibility
  – Huge sample sizes
  – Only released following demonstration of non-disclosiveness
• Content and access methods of ’01 data much more affected by confidentiality
  – Less detail on many variables in the licensed files
  – Codebook online

Page 3

2001 Files

• Data available for download
  – Individual licensed SAR
• On their way
  – Household licensed SAR (under special licence from the UK Data Archive)
  – Small area microdata file
• If you need more detail
  – Controlled Access Microdata Samples
    – Individual file
    – Household file (version 1)

Page 4

Census coverage

• Major effort to improve coverage in 2001
• One Number Census
• Use of large Census Coverage Survey (CCS, 300K households) to correct census results
  – Design independent of census
  – Used matched census and CCS data to estimate total population in each area
  – Adjusted all results for census non-response using imputation of households and individuals
  – Results in final database for UK adjusted for non-response
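The slide does not name the estimator used with the matched census and CCS data. As a rough sketch only, one common way to combine two independent counts of the same area is a dual-system (capture-recapture) estimate. The function name and all counts below are invented for illustration and are not ONS figures.

```python
def dual_system_estimate(census_count, ccs_count, matched_count):
    """Estimate the true population from two independent counts.

    census_count  : people counted by the census in the sampled area
    ccs_count     : people counted by the coverage survey in the same area
    matched_count : people found in both sources
    """
    if matched_count <= 0:
        raise ValueError("need at least one matched record")
    return census_count * ccs_count / matched_count


# Hypothetical area: these counts are made up for the example.
estimate = dual_system_estimate(census_count=940, ccs_count=900, matched_count=860)

# Implied census coverage in that area (census count over estimated total).
census_coverage = 940 / estimate
print(round(estimate), round(census_coverage, 3))
```

The gap between the census count and the estimate is what the imputation of households and individuals is meant to fill.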

Page 5

Census coverage

• Coverage before imputation:
  – 94% of households returned forms, with another 4% estimated to be in households identified by enumerators
• Response rate lowest for
  – Young people in their early 20s (men aged 20-24: response rate of 87%)
  – Inner London (response rate of 78%)
• Once imputed cases are included, coverage is estimated to be 100%

Page 6

Population base

• One population base: usual residents
  – Differs from 1991, when the user had to choose either the present or the usual resident base
• Students enumerated at term-time address
  – And are included in the data. Use stulaway>1 to exclude those other than usual residents
• Communal establishments are included in the individual file

Page 7

Implications for 2001 SARs

• 1991 SARs selected from 10% sample
  – Did not include imputed households
  – 96% coverage
• 2001 SARs selected from 100% ONC database
  – 94% response; 6% imputed
  – Imputed individuals/households are identified
  – Imputed items are flagged

Page 8

Two kinds of imputation

• Entire individual or household may be imputed as part of ONC
  – Complete records copied from enumerated individuals/households
  – Variable oncperim
• Variables imputed when information missing

Page 9

Edit

• 13.7 million edit procedures undertaken
  – 28% of population had 1+ items imputed
  – Common:
    • Missing prof quals set to none
    • Carer set to no where missing (unless economic activity also missing)
    • Travel to work set to ‘work mainly at/from home’ where workplace was ‘mainly at/from home’
  – Others
    • 14k people multi-ticked ‘sex’ (so imputed)
    • 6k children had marital status changed to single
• Impossible values set to missing then imputed
• Missing values are imputed on the basis of similar local cases
• Does not remove unlikely values

Page 10

Item imputation

For census output database as a whole:

• One or more items imputed for 28% of the population
• Employment variables most affected:
  – Industry ever worked: 18%
  – Occupation ever worked: 14%
  – Workplace size: 9%
• Under-enumerated groups are most imputed, esp. single people

Page 11

Can I tell what/who has been imputed?

• Oncperim records whether an individual has been imputed as part of the ONC
  – Copies entire record from census database
• ‘z’ variables identify whether individual has imputed information on a specific variable
  – Parallel set of variables
  – zethew, zage0

Page 12

Crosstab ethnic group (ethew) by imputation flag (zethew)

Ethnic Group for England and Wales * ethew imputation flag Crosstabulation (count)

Ethnic group                     Not imputed   Imputed     Total
White                                1434719     36098   1470817
Mixed                                  17891      2373     20264
Asian or Asian British                 66556      3638     70194
Black or Black British                 32656      2607     35263
Chinese or Other ethnic group          12323      1517     13840
Total                                1564145     46233   1610378

Page 13

Percentage with ethnicity variable imputed, 2001 SARs

                 Not imputed   Imputed
White                   97.5       2.5
Mixed                   88.3      11.7
Asian                   94.8       5.2
Black                   92.6       7.4
Chinese/Other           89.0      11.0
All                     97.1       2.9
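The percentages above can be reproduced from the counts in the ethew by zethew crosstab on the previous slide. A short check, using only the published counts (the helper name pct_imputed is ours):

```python
# Counts copied from the ethew by zethew crosstab: (not imputed, imputed).
counts = {
    "White": (1434719, 36098),
    "Mixed": (17891, 2373),
    "Asian or Asian British": (66556, 3638),
    "Black or Black British": (32656, 2607),
    "Chinese or Other ethnic group": (12323, 1517),
}


def pct_imputed(not_imputed, imputed):
    """Imputed cases as a percentage of all cases in the row."""
    return 100 * imputed / (not_imputed + imputed)


rows = {group: round(pct_imputed(*c), 1) for group, c in counts.items()}

# The "All" row uses the column totals, not an average of the rows.
total_not = sum(c[0] for c in counts.values())
total_imp = sum(c[1] for c in counts.values())
rows["All"] = round(pct_imputed(total_not, total_imp), 1)

for group, pct in rows.items():
    print(f"{group:32s} {100 - pct:5.1f} {pct:5.1f}")
```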

Page 14

Percentage ONC imputed, 2001 SARs

                 Not ONC imputed   ONC imputed
White                       94.8           5.2
Mixed                       91.5           8.5
Asian                       84.6          15.4
Black                       86.5          13.5
Chinese/Other               85.6          14.4
All                         93.8           6.2

Page 15

Should I use imputed individuals or variables?

• Imputation of individuals is designed to compensate for under-enumeration
  – Using imputed cases will give results comparable with national data
  – Will help overcome bias from non-response
• Imputed variables are generally reported as accurate
  – In general we advise using imputed information

Page 16

Ethnicity

• But doubt over imputed ethnic group
• Simpson and Akinwale used Longitudinal Study to compare 1991 ethnic group with imputed 2001 ethnic group
  – Majority of imputed records are ‘wrong’
  – Recommend not using imputed records for minority groups
  – www.statistics.gov.uk/events/ls_census2001/agenda.asp
• SARs: percentage with ethnic group imputed
  – 2.5% white; 7.4% black; 11.7% mixed

Page 17

PRAMMing

• PRAMMing is perturbation designed to deal with very unusual cases, e.g. widowed 16-year-olds
• Avoids additional broad-banding
• Perturbation is constrained to
  – preserve univariate distributions
  – preserve multivariate distributions on control variables
  – prevent strange results (like 5-year-old widows)
• Affects 15 variables
  – Primary economic activity: 1% of cases
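To give a feel for how a perturbation can change individual values while preserving a univariate distribution, here is a toy sketch of the general post-randomisation idea. It is not the procedure applied to the SARs: the categories, the distribution and the transition matrix are all hypothetical, chosen so that the distribution is unchanged in expectation.

```python
import random

# Hypothetical two-category variable and its distribution in the sample.
categories = ["A", "B"]
distribution = [0.8, 0.2]

# Hypothetical transition matrix: row i gives P(published = j | true = i).
transition = [
    [0.95, 0.05],
    [0.20, 0.80],
]


def expected_distribution(dist, matrix):
    """Distribution of the published variable: dist multiplied by matrix."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def perturb(value, rng):
    """Replace one value by a draw from its row of the transition matrix."""
    row = transition[categories.index(value)]
    return rng.choices(categories, weights=row, k=1)[0]


rng = random.Random(2001)  # fixed seed so the run is repeatable
true_values = ["A"] * 8000 + ["B"] * 2000
published = [perturb(v, rng) for v in true_values]

# The expected published distribution equals the original one,
# even though individual values have been changed.
print(expected_distribution(distribution, transition))
print(published.count("A") / len(published))
```

Individual records change, so analyses that depend on rare combinations of values are the ones most likely to be affected.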

Page 18

The z-variables

• PRAMMed variables are flagged along with imputed variables
  – Cannot distinguish them
• Imputation flags are stored in variables with z prefix
• Two versions of the download file
  – Use the larger *-impflag-*.extension version if interested in imputation/PRAMMing

Page 19

General advice

• If unsure about impact of PRAMMing and imputation
  – Do a sensitivity test
  – Use the z variable to exclude cases with imputed variables and then repeat your analysis
  – Use ONCPERIM to exclude imputed individuals and repeat your analysis
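A sensitivity test of this kind might be sketched as follows. The variable names (ethew, zethew, oncperim) come from the slides, but the records are invented and the flag coding (1 = imputed, 0 = not imputed) is an assumption that should be checked against the codebook.

```python
# Hypothetical records. Flag coding (1 = imputed, 0 = not imputed) is assumed.
records = [
    {"ethew": "White", "zethew": 0, "oncperim": 0},
    {"ethew": "White", "zethew": 0, "oncperim": 1},
    {"ethew": "Mixed", "zethew": 1, "oncperim": 0},
    {"ethew": "Mixed", "zethew": 0, "oncperim": 0},
    {"ethew": "White", "zethew": 1, "oncperim": 0},
    {"ethew": "White", "zethew": 0, "oncperim": 0},
]


def share(rows, variable, value):
    """Proportion of rows where variable equals value."""
    return sum(r[variable] == value for r in rows) / len(rows)


def sensitivity(rows, variable, value, flag):
    """Run the same estimate on all cases and on cases not flagged."""
    kept = [r for r in rows if r[flag] == 0]
    return {
        "all": share(rows, variable, value),
        "excluding": share(kept, variable, value),
        "dropped": len(rows) - len(kept),
    }


# Exclude cases with an imputed (or PRAMMed) ethnic group, then repeat.
item_check = sensitivity(records, "ethew", "Mixed", flag="zethew")
# Exclude individuals imputed as part of the ONC, then repeat.
person_check = sensitivity(records, "ethew", "Mixed", flag="oncperim")
print(item_check)
print(person_check)
```

If the two estimates in each check are close, the conclusion is unlikely to depend on the imputed or perturbed values.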

Page 20

National variation

• There is one file for the whole UK
• Some variables are country specific:
  – Irish language
• Other variables have national variations
  – educational qualifications
  – ethnicity
  – Watch out for the E, W, S and N suffixes!
• Sampling fraction is not quite consistent across countries!
  – Unlikely to result in major bias of proportions
  – Will not gross up to census figures

Page 21

Sampling fraction: by country & sex

Country      Sex       Sampling fraction (%)
England      Male      3.097
England      Female    3.092
Wales        Male      3.089
Wales        Female    3.098
Scotland     Male      3.210 !
Scotland     Female    3.232 !
N Ireland    Male      3.125
N Ireland    Female    3.065
Total                  3.105
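Because the fraction varies by country and sex, a single grossing factor will not reproduce census totals. A minimal sketch of group-specific grossing weights, assuming the figures above are the percentage of each population group that is in the sample (the function names and the count of 1,000 are illustrative):

```python
# Sampling fractions (%) from the table above.
sampling_fraction = {
    ("England", "Male"): 3.097, ("England", "Female"): 3.092,
    ("Wales", "Male"): 3.089, ("Wales", "Female"): 3.098,
    ("Scotland", "Male"): 3.210, ("Scotland", "Female"): 3.232,
    ("N Ireland", "Male"): 3.125, ("N Ireland", "Female"): 3.065,
}


def grossing_weight(country, sex):
    """Number of people in the population represented by one sample case."""
    return 100 / sampling_fraction[(country, sex)]


def gross_up(sample_count, country, sex):
    """Scale a sample count up to an estimated population count."""
    return sample_count * grossing_weight(country, sex)


# 1,000 sampled men in Scotland and in England stand for different totals.
print(round(gross_up(1000, "Scotland", "Male")))
print(round(gross_up(1000, "England", "Male")))
```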

Page 22

How do the SARs compare to the aggregate data?

Tables of comparisons between the licensed individual SAR and the aggregate tables are available online in the user guide. Results are very similar, with occasional deviations from the 95% CI.

• Looked at univariate distribution of economic activity, general health, marital status and ethnicity
• No proportion significantly different from aggregate data at UK level
• By country, 9/107 cells are significantly different (about 8%, more than the 5% expected by chance); will be looking to see if PRAMMing is to blame
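One way to judge whether 9 significant cells out of 107 is surprising is to ask how often that many would be flagged by chance alone. The sketch below treats the 107 cells as independent tests at the 5% level, which is a simplifying assumption: cells within one variable's distribution are related, so the result is only a rough guide.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


cells, flagged, alpha = 107, 9, 0.05

# Number of cells expected to be flagged if only chance were at work.
expected = cells * alpha
tail = prob_at_least(flagged, cells, alpha)

print(f"expected by chance: {expected:.2f} cells")
print(f"P(at least {flagged} of {cells}): {tail:.3f}")
```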

Page 23

Get to know the data

• Use the documentation
• SARs User Guide
  – Use Census schedules to check questions
  – Check univariate frequencies
  – Do exploratory analyses
  – Contact [email protected] if you can’t find the information you need in the online documentation
• Contact [email protected] if you think there is a problem with the data

Page 24

SARs as a LARGE dataset

• 1.8 million cases can cause trouble!
• Use Nesstar to do initial data exploration
• Extract a subset using Nesstar or take a subset from the downloaded file
• For serious analysis, using a syntax (or .do) file to record your commands makes re-running easier
  – Create a single syntax file which starts with the original data
  – Use file naming conventions that will enable you to trace versions
  – Keep a record of work done
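Taking a subset from the downloaded file can itself be scripted, so that the step is recorded and repeatable. The sketch below assumes the data have been exported to delimited text and streams the rows one at a time rather than loading all 1.8 million cases. The column names and codes are hypothetical, and an in-memory string stands in for the file.

```python
import csv
import io


def extract_subset(source, target, keep_columns, keep_row):
    """Stream rows from source to target, keeping chosen columns and rows."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=keep_columns)
    writer.writeheader()
    kept = 0
    for row in reader:
        if keep_row(row):
            writer.writerow({name: row[name] for name in keep_columns})
            kept += 1
    return kept


# Stand-in for the exported file; column names and codes are hypothetical.
download = io.StringIO(
    "country,age0,ethew,econact\n"
    "1,5,1,2\n"
    "2,7,1,1\n"
    "1,9,3,4\n"
)
subset = io.StringIO()

# Keep two variables for cases in one (hypothetically coded) country.
n_kept = extract_subset(
    download, subset, ["age0", "ethew"],
    keep_row=lambda row: row["country"] == "1",
)
print(n_kept)
print(subset.getvalue())
```

With real files, open() handles would replace the two StringIO objects, and the script itself serves as the record of how the subset was made.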

Page 25

SARs as sample data

• Geographically stratified sample
  – Approximates to simple random sample
  – No clustering in Individual file
  – Household file: clustering within households
  – Although the sample is large, you may have small sample sizes when using sub-groups
  – Use standard errors and confidence intervals
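The effect of sub-group size on precision can be seen with a normal-approximation confidence interval for a proportion. This sketch assumes simple random sampling, which the slide says the Individual file approximates; it would understate the uncertainty for clustered data such as the Household file. The counts are hypothetical.

```python
from math import sqrt


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)  # standard error under simple random sampling
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


# Same 10% proportion, very different precision (counts are hypothetical).
for n in (100000, 1000, 100):
    p, low, high = proportion_ci(successes=n // 10, n=n)
    print(f"n={n:6d}  p={p:.3f}  95% CI {low:.3f} to {high:.3f}")
```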

Page 26

Comparisons between 1991 and 2001

• Population base changed
  – Imputation (no imputed values in 1991 SARs)
  – Students: enumerated at term-time address
  – Residents only (choice in 1991)
• Variable continuity
  – Variable names have been changed where the variable is not exactly the same
  – Some variables (e.g. age, LLI) are easy to compare by grouping 1991 values
  – Some variables are harder to compare as the question has changed (e.g. qualifications)
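Grouping 1991 values to match a banded 2001 variable is a simple recode. The sketch below assumes the 1991 variable holds single years of age; the band boundaries are hypothetical, and the real ones should be taken from the 2001 codebook.

```python
from bisect import bisect_right

# Hypothetical band boundaries; take the real ones from the 2001 codebook.
band_starts = [0, 16, 25, 35, 45, 55, 65, 75]
band_labels = ["0-15", "16-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"]


def age_band(single_year_age):
    """Map a single year of age to the band that contains it."""
    if single_year_age < 0:
        raise ValueError("age cannot be negative")
    # bisect_right counts the band starts that are <= the age.
    return band_labels[bisect_right(band_starts, single_year_age) - 1]


ages_1991 = [0, 15, 16, 24, 25, 74, 75, 93]
print([age_band(a) for a in ages_1991])
```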

Page 27

Ethnicity 91/01

• Different questions asked in 1991 and 2001
• No agreed and perfect correspondence
• Simpson and Akinwale use LS to show how 1991 maps on to 2001
  – www.statistics.gov.uk/events/ls_census2001/agenda.asp

Page 28

Software options

• Supported packages
  – Nesstar
  – NSDstat
  – SPSS
  – Stata
• Other options
  – Import or Stat/Transfer to another package
  – Use Nesstar to save to SAS or Statistica
  – Unless you use a very small subsample, the SARs will be too big for most spreadsheets!

Page 29

Looking forward: Moving forward

• Controlled Access Microdata Samples
• Household SARs
• Small Area Microdata sample
• Learning and Teaching

Page 30

CAMS content

• Controlled Access Microdata designed for professional researchers:
• Access in safe setting only
• Specification on SARs website
• Individual file and Household file

Page 31

Content of CAMS files

• Files contain much more detail, e.g.
  – Individual year of age (topcoded at 95)
  – FULL coding on country of birth
  – SOC Unit Group
  – Local authority geography
  – Index of Multiple Deprivation for SOAs
  – Index of Multiple Deprivation for migrants’ last address

Page 32

Controlled Access

• CAMS is managed by ONS
• Data is accessed at London/Titchfield/Newport/Southport in Virtual Laboratory setting on a server
• New bases soon
• Virtual lab looks like a standard Windows interface
• Use SPSS/Stata in usual way
• Output checked for confidentiality before release
• Further information and appropriate forms at http://www.statistics.gov.uk/census2001/sar_cams.asp
• Contact [email protected] for more details

Page 33

CAMS Good practice

• Use the licensed SARs...
  – to exhaust the potential of other datasets
  – to write your syntax files
• Check the disclosure guidelines before writing your file
• Avoid complex tables
  – small cell counts aren’t reliable
  – unique cells will usually be suppressed
• Do use models

Page 34

Household SAR

• 1% sample of households, with all individuals in each
• Allows linkage between individuals in households
• Will be available SOON under special licence
• Similar detail to Individual SAR
• Specification of Household SAR on website

Page 35

The hierarchy of the household file

Household 1: North West, Social rented
  Person 1: HoH, Female, 28, No quals, No LTILL
  Person 2: Son of HoH, Male, 12, N/A, No LTILL

Household 2: Wales, Owner occupier
  Person 1: HoH, Male, 33, Degree, No LTILL
  Person 2: Spouse of HoH, Female, 31, Degree, P/T Employee, No LTILL
  Person 3: Parent of HoH, Female, 72, No quals, Econ Inactive, LTILL
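The two-level structure on this slide can be sketched as households that each hold a list of persons. The class and field names below are our own, not the file's variable names, and only values shown on the slide are used. The point is that linkage within a household lets you derive household-level variables and attach one member's characteristics to another.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    number: int
    relationship: str              # relationship to head of household (HoH)
    sex: str
    age: int
    qualifications: Optional[str]  # None where not applicable (N/A)
    ltill: bool                    # limiting long-term illness


@dataclass
class Household:
    number: int
    region: str
    tenure: str
    persons: List[Person] = field(default_factory=list)


households = [
    Household(1, "North West", "Social rented", [
        Person(1, "HoH", "Female", 28, "No quals", False),
        Person(2, "Son of HoH", "Male", 12, None, False),
    ]),
    Household(2, "Wales", "Owner occupier", [
        Person(1, "HoH", "Male", 33, "Degree", False),
        Person(2, "Spouse of HoH", "Female", 31, "Degree", False),
        Person(3, "Parent of HoH", "Female", 72, "No quals", True),
    ]),
]

# Household-level variable derived from the people in the household.
any_ltill = {h.number: any(p.ltill for p in h.persons) for h in households}

# Person-level variable taken from another member of the same household.
hoh_quals = {
    (h.number, p.number): next(
        q.qualifications for q in h.persons if q.relationship == "HoH"
    )
    for h in households
    for p in h.persons
}

print(any_ltill)
print(hoh_quals[(2, 3)])
```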

Page 36

Small Area Microdata file

• 5% sample of individuals
• Full range of variables
• LA lowest geography
  – Except Isles of Scilly and City of London in E and W; similar exceptions in S and NI
  – Excludes communal establishments
  – Age 11-year bands
  – Ethnicity: 5 groups, or 16 with records swapping between LAs
  – Economic activity: 3 categories
• Delivery at CCSR soon

Page 37

Using the SARs in Learning and Teaching

• SARs provide an easy-to-use dataset
• Fits well with aggregate data
• Supported by learning and teaching materials
  – www.chcc.ac.uk
• Access managed in same way:
  – use Census Registration System
  – need ATHENS (for data and CHCC)

Page 38

User support

• Web pages are regularly updated
• Documentation online
• Resources and links added as we go
• Seminar invitations welcome!
• Regional workshop invites welcome!
• SARs Helpdesk
  – [email protected]
  – (0161) 275 4735
• Join email and newsletter lists
• SARs User Group: July 15th, RSS, London