4. Data Management and Variable Operationalisations
Paul Lambert, 24-25 August 2009
Presented to ‘Data Management for Social Survey Research’, a workshop organised by the ESRC ‘Data Management through e-Social Science’ research Node (www.dames.org.uk).
2
Deriving variables, handling missing data, and cleaning data
..Especially common types of data manipulation..
1) Deriving variables = computing new measures for purposes of analysis
o E.g. recoding complex categorical variables; standardising scores; linking micro- and macro-data
o {Creating composite vars., e.g. selection model hazards, propensity scores, weights}
2) Handling missing data = strategies for item or case non-response
o E.g. imputation approaches; listwise/pairwise deletion
o {deriving ‘missing variables’ via ‘data fusion’}
o Clarifying, stating & documenting assumptions (see www.missingdata.org.uk)
3) Cleaning data = monitoring and adjusting responses across a given set of variables
o E.g. extreme values; erroneous values; re-scaling distributions;
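The three kinds of manipulation listed on this slide can be sketched in a few lines of plain Python. This is only an illustration on made-up records: the variable names (`educ`, `income`), the three-category recode and the cut-off for 'extreme' values are all hypothetical choices, not taken from any particular survey.

```python
import statistics

# Hypothetical respondent records; None marks item non-response.
records = [
    {"id": 1, "educ": "degree",  "income": 32000},
    {"id": 2, "educ": "a-level", "income": 21000},
    {"id": 3, "educ": "gcse",    "income": None},
    {"id": 4, "educ": "degree",  "income": 950000},
    {"id": 5, "educ": "none",    "income": 14000},
]

# 1) Deriving a variable: recode a detailed categorical measure
#    into a simpler (hypothetical) three-category scheme.
EDUC_RECODE = {"degree": "high", "a-level": "mid", "gcse": "mid", "none": "low"}
for r in records:
    r["educ3"] = EDUC_RECODE[r["educ"]]

# 2) Handling missing data: listwise deletion on income
#    (the simplest strategy; it assumes the missingness is ignorable).
complete = [r for r in records if r["income"] is not None]

# 3) Cleaning: flag extreme values rather than silently dropping them.
EXTREME_CUTOFF = 500000  # assumed threshold, for illustration only
for r in complete:
    r["income_extreme"] = r["income"] > EXTREME_CUTOFF

# Deriving again: standardise income (z-score) among non-extreme cases.
usable = [r["income"] for r in complete if not r["income_extreme"]]
centre, spread = statistics.mean(usable), statistics.stdev(usable)
for r in complete:
    r["income_z"] = None if r["income_extreme"] else (r["income"] - centre) / spread
```

Each step leaves the source values in place and adds a new field, so the assumptions made (which cases were deleted, which were flagged) stay visible and can be documented.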
3
Variable operationalisations
Analytical and conceptual issues
1) Harmonisation ≠ naming (!)
2) The value of trying multiple measures and standardisations
3) Multivariate vs univariate context
4) Functional form
Thinking about ‘key variables’
4
1) Why harmonisation ≠ naming
Much attention to variable operationalisations involves proposing optimum / standard measures
o UK – ONS Harmonisation
o EU – Eurostat standards
o Studies of ‘criterion’ and ‘construct’ validity
Standard measures impact other analyses
o They affect available data
o They affect interpretations of data
5
“a method for equating conceptually similar but operationally different variables..” [Harkness et al 2003, p352]
Input harmonisation [esp. Harkness et al 2003]
o ‘harmonising measurement instruments’ [H-Z and Wolf 2003, p394]
o unlikely / impossible in longer-term longitudinal studies
o common in small cross-national and short-term longitudinal studies
Output harmonisation (‘ex-post harmonisation’)
o ‘harmonising measurement products’ [H-Z and Wolf 2003, p394]
Key variables – Harmonisation
(across countries; across time periods)
6
More on harmonisation [esp. HZ and Wolf 2003, p393ff]
Numerous practical resources to help with input and output harmonisation [e.g. ONS www.statistics.gov.uk/about/data/harmonisation ; UN / EU / NSIs; LIS project www.lisproject.org; IPUMS www.ipums.org]
[Cross-national e.g.: HZ & Wolf 2003; Jowell et al. 2007]
Room for more work in justifying/ understanding interpretations after harmonisation
7
“the degree to which survey measures or questions are able to assess identical phenomena across two or more cultures”
[Harkness et al 2003, p351]
Equivalence
Measurement equivalence involves same instruments and equality of measures (e.g. income in pounds)
Functional equivalence involves different instruments, but addresses same concepts (e.g. inflation adjusted income)
8
“Equivalence is the only meaningful criterion if data is to be compared from one context to another. However, equivalence of measures does not necessarily mean that the measurement instruments used in different countries are all the same. Instead it is essential that they measure the same dimension. Thus, functional equivalence is more precisely what is required”
[HZ and Wolf 2003, p389]
More importantly, measurement equivalence rarely achieves what it intends to in the context of longitudinal and/or cross-national comparisons…
9
Measurement equivalence and social class
[Figure (tabplot): Goldthorpe class scheme harmonised over time; percent of year category in each class: Unskilled, Skilled manual, Petty-bourg., Non-manual, Salariat]
Source: Females from LFS/GHS, using data from Li and Heath (2008)
10
Harmonisation & equivalence combined
‘Universality’ or ‘specificity’ in variable constructions
Universality: collect harmonised measures, analyse standardised schemes
Specificity: collect localised measures, analyse functionally equivalent schemes
Most prescriptions aim for universality
But specificity is theoretically better
o Specificity is more easily obtained than is often realised
o Especially for well-known ‘key variables’ [e.g. Lambert et al., 2008]
11
2) The value of trying out different ‘variable constructions’
Meaning?
o Coding frames; re-coding decisions; metric transformations and functional forms; relative effects in multivariate models
o Data collection and data analysis
o Processes by which survey measures are defined and subsequently interpreted by research analysts
Cf. www.longitudinal.stir.ac.uk/variables/
12
β’s - Where’s the action?
If we have lots of variables, lots of cases, yet often quite simple techniques and software, the action is primarily in the variable constructions…
o The example of social mobility research – see Lambert et al. (2007)
i. How we choose between alternative measures
ii. How much data management we try (or bother with)
Plus other issues in how we analyse & interpret the results of analysis
13
Working with variables - speculation
Data manipulation skills and inertia
I would speculate that around 80% of applications using key variables don’t consult literature and evaluate alternative measures, but choose the first convenient and/or accessible variable in the dataset
o Data supply decisions (‘what is on the archive version’) are critical
Much of the explanation lies with lack of confidence in data manipulation / linking data
Too many under-used resources – cf. www.esds.ac.uk
14
Working with variables – further issues
Re-inventing the wheel
o …In survey data analysis, somebody else has already struggled through the variable constructions you are working on right now…
Increasing attention to documentation and replicability [cf Dale 2006; Freese 2007]
Guidance and support
o In the UK, use www.esds.ac.uk
o Most guidance concerns collecting & harmonising data
o Less is directed to analytically exploiting measures
15
3) Multivariate contexts
‘Beta’s in Society’ and ‘Demystifying Coefficients’
Dorling, D., & Simpson, S. (Eds.). (1999). Statistics in Society: The Arithmetic of Politics. London: Arnold.
Irvine, J., Miles, I., & Evans, J. (Eds.). (1979). Demystifying Social Statistics. London: Pluto Press.
Famous works on critical interpretation of social statistics tend to have a univariate / bivariate focus
o Measuring unemployment; averaging income; bivariate significance tests; correlation vs causation
But social survey analysts usually argue that complex multivariate analyses are more appropriate..
o Critical interpretation of joint relative effects
o Attention to effects of ‘key variables’ in multivariate analysis
16
“Socio-economic processes require comprehensive approaches as they are very complex (‘everything depends on everything else’). The data and computing power needed to disentangle the multiple mechanisms at work have only just become available.” [Crouchley and Fligelstone 2004]
17
Endogeneity and variable measures
‘everything depends on everything else’ [Crouchley and Fligelstone 2004]
The effects of concepts are very difficult to isolate
Example of key variables (e.g. education, occupations)
o Key variables often change the main effects of other variables
o Simple decisions about contrast categories can influence interpretations
o Interaction terms are often significant and influential
o Key variables are often endogenous (because they are ‘key’!)
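The point about contrast categories can be illustrated with a small sketch. It uses made-up scores for three hypothetical education categories and relies on a known property of regression on a single categorical predictor: the intercept equals the mean of the reference category, and each dummy coefficient equals that category's mean minus the reference mean. The function names are invented for this illustration.

```python
from statistics import mean

# Made-up outcome scores by (hypothetical) education category.
scores = {
    "low":  [10.0, 12.0, 11.0],
    "mid":  [15.0, 14.0, 16.0],
    "high": [21.0, 19.0, 20.0],
}

def dummy_coefficients(groups, reference):
    """Coefficients of a regression on category dummies.

    With one categorical predictor, the intercept is the mean of the
    reference category and each dummy coefficient is that category's
    mean minus the reference mean.
    """
    ref_mean = mean(groups[reference])
    coefs = {"intercept": ref_mean}
    for name, values in groups.items():
        if name != reference:
            coefs[name] = mean(values) - ref_mean
    return coefs

def fitted(coefs, category):
    """Predicted value for a category under a given dummy coding."""
    return coefs["intercept"] + coefs.get(category, 0.0)

vs_low = dummy_coefficients(scores, reference="low")
vs_high = dummy_coefficients(scores, reference="high")

print(vs_low)   # 'mid' and 'high' are contrasted with 'low'
print(vs_high)  # 'low' and 'mid' are contrasted with 'high'
```

The two codings give identical predictions for every category, but the sign and size of the reported coefficients differ, so a reader who sees only one table of coefficients may draw different conclusions depending on the contrast chosen.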
18
4) ‘Functional form’
The way in which measures are arithmetically incorporated in analysis
• Level of measurement (nominal, ordinal, interval, ratio)
• Alternative models and link functions
• Other variables and interaction effects
19
Levels of measurement and the desire to categorise
Categories are easier to envisage / communicate
o Much harmonisation work ≡ locating into categories
o Appearance of measurement equivalence
o But functional equivalence is seldom achieved
Metrics are better for functional equivalence
o E.g. Standardised income
o How to deal with categorisations?
• The qualitative foundation of quantity [Prandy 2002]
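As a rough illustration of why a standardised metric can travel across contexts better than a fixed category, the sketch below compares a fixed income band with within-year z-scores. The incomes, the years and the 20000 threshold are made up for the example, and the helper name `standardise_within` is hypothetical.

```python
from statistics import mean, pstdev

# Made-up incomes in two (hypothetical) survey years, in current pounds.
incomes = {
    1991: [9000, 12000, 15000, 24000],
    2008: [15000, 20000, 25000, 40000],
}

# A fixed category boundary looks 'equivalent' because the same
# instrument is used, but it means different things in each year.
THRESHOLD = 20000  # assumed boundary, for illustration only
high_share = {
    year: sum(v > THRESHOLD for v in values) / len(values)
    for year, values in incomes.items()
}

def standardise_within(groups):
    """Return z-scores computed separately within each group."""
    out = {}
    for key, values in groups.items():
        centre, spread = mean(values), pstdev(values)
        out[key] = [(v - centre) / spread for v in values]
    return out

# Relative position within each year's own distribution.
z = standardise_within(incomes)
```

In this constructed example the 2008 incomes are the 1991 incomes scaled up by a constant factor, so the standardised scores coincide across years while the share above the fixed band changes.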
20
Example: categorisation and the scandalous use of collapsed EGP/NS-SEC…!
Ignores heterogeneity within occupations
Defines and hinges on arbitrary boundaries
Creates artefactual gender differences
[Figure, two panels: ‘Satisfaction with health’ (models: Null, EGP11, EGP7, EGP5, EGP3, EGP2, SOC) and ‘Income’ (models: Null, EGP11, EGP7, EGP5, EGP3, EGP2, SOC, EGP11+); each panel shows R2 improvement and BIC decline]
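A comparison of finer and coarser categorisations by R2 and BIC can be sketched as follows. This is not the analysis behind the slide: the data are made up, the six class labels and the two-way collapse are hypothetical, and BIC is computed with the usual Gaussian approximation n*ln(RSS/n) + k*ln(n).

```python
import math
from collections import defaultdict
from statistics import mean

# Made-up (category, outcome) pairs; six hypothetical detailed classes.
data = [
    ("c1", 9.0), ("c1", 8.0), ("c2", 7.5), ("c2", 6.5),
    ("c3", 6.0), ("c3", 7.0), ("c4", 4.0), ("c4", 5.0),
    ("c5", 4.5), ("c5", 3.5), ("c6", 2.0), ("c6", 3.0),
]
# Hypothetical collapse of the six classes into two.
COLLAPSE = {"c1": "upper", "c2": "upper", "c3": "upper",
            "c4": "lower", "c5": "lower", "c6": "lower"}

def fit_categorical(pairs):
    """R-squared and BIC for a regression of the outcome on category dummies.

    With only category dummies, fitted values are the category means.
    k counts one parameter per category (intercept plus k-1 dummies).
    """
    groups = defaultdict(list)
    for cat, y in pairs:
        groups[cat].append(y)
    ys = [y for _, y in pairs]
    n, grand = len(ys), mean(ys)
    tss = sum((y - grand) ** 2 for y in ys)
    rss = sum((y - mean(g)) ** 2 for g in groups.values() for y in g)
    k = len(groups)
    return {"r2": 1 - rss / tss,
            "bic": n * math.log(rss / n) + k * math.log(n)}

null = fit_categorical([("all", y) for _, y in data])
collapsed = fit_categorical([(COLLAPSE[c], y) for c, y in data])
detailed = fit_categorical(data)

for label, fit in [("null", null), ("collapsed", collapsed), ("detailed", detailed)]:
    print(label, round(fit["r2"], 3), round(fit["bic"], 2))
```

Because the collapsed scheme is a coarsening of the detailed one, its R2 can never exceed that of the detailed scheme; whether BIC also favours the detailed scheme depends on how much within-category heterogeneity the collapse hides.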
21
The scaling alternative…
Many concepts can be reasonably regarded as metric, cf. simplified / dichotomised categorisations
Comparability / standardisation is easier with scales
Complex / multi-process systems are easier with scales
o Structural Equation Models
o Interaction effects
Growing availability/use of distance score techniques
o Stereotyped ordered logit [‘slogit’ in Stata]
o Correspondence Analysis
o Latent variable models
o …But, scaling seems to be seen by some as a wicked, positivistic activity..!
22
Being creative with functional forms
Treiman (2009: 162): nonlinear specifications of time and age effects
o Year of birth effect on literacy in China: discontinuity at 1955; curve 1955-1967; knot at 1967
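One way such a functional form might be set up is to derive a handful of piecewise terms from year of birth and enter them together in a regression. The sketch below is an assumption about how this could be done, not a reproduction of Treiman's specification: the function name and the particular terms (a step dummy, a capped linear term and its square, and a post-knot slope) are illustrative.

```python
def birth_year_terms(year, step=1955, knot=1967):
    """Build piecewise regressors for a year-of-birth effect.

    Returns a dict of terms that could enter a regression:
      pre       : years before the step (negative or zero), slope before it
      post_step : 1 from the step year onward (allows a discontinuity)
      curve     : years since the step, capped at the knot
      curve_sq  : square of curve (allows curvature between step and knot)
      post_knot : years since the knot (allows a new slope after it)
    """
    since_step = max(0, min(year, knot) - step)
    return {
        "pre": min(year, step) - step,
        "post_step": int(year >= step),
        "curve": since_step,
        "curve_sq": since_step ** 2,
        "post_knot": max(0, year - knot),
    }

for y in (1950, 1960, 1970):
    print(y, birth_year_terms(y))
```

Keeping the construction in a single documented function makes it easy to try alternative break points and compare the resulting models.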
23
Practical suggestions on functional form
It’s rare not to have a few alternative measures of the same concepts at different levels of measurement
Good practice would be to
o try alternative measures and see what difference they make
o consider treatment of missing values in relation to measurement instrument choice
o engage as much as possible with other studies
24
Variable operationalisations
Analytical and conceptual issues
1) Harmonisation ≠ naming (!)
2) The value of trying multiple measures and standardisations
3) Multivariate vs univariate context
4) Functional form
Thinking about ‘key variables’
25
Key variables and social science measurement
Defining ‘key variables’
- Commonly used concepts with numerous previous examples
- Methodological research on best practice / best measurement [cf. Stacey 1969; Burgess 1986]
ONS harmonisation ‘primary standards’
http://www.statistics.gov.uk/about/data/harmonisation/primary_standards.asp
26
Key variables: concepts and measures
Variable | Concept | Something useful
Occupation | Class; stratification; unemployment | www.geode.stir.ac.uk
Education | Credentials; Ability; Merit | www.equalsoc.org/8 ; [Schneider 2008]
Ethnic group | Ethnicity; race; religion; national origins | [Bosveld et al 2006]
Age | Age; life course stage; cohort | [Abbott 2006]
Gender | Gender; household / family context | www.genet.ac.uk
Income | Income; wealth; poverty | www.data-archive.ac.uk [SN 3909]
27
An example: Occupations
In the social sciences, occupation is seen as one of the most important things to know about a person
o Direct indicator of economic circumstances
o Proxy indicator of ‘social class’ or ‘stratification’
Projects at Stirling (www.dames.org.uk)
o GEODE – how social scientists use data on occupations
o DAMES – extending GEODE resources
Stage 1 - Collecting Occupational Data (and making a mess)
Example 1: BHPS
Occ description | Employment status | SOC-2000 | EMPST
Miner (coal) | Employee | 8122 | 7
Police officer (Serg.) | Supervisor | 3312 | 6
Electrical engineer | Employee | 2123 | 7
Retail dealer (cars) | Self-employed w/e | 1234 | 2
Example 2: European Social Survey, parent’s data
Occ description | SOC-2000 | EMPST
Miner | ?8122 | ?6/7
Police officer | ?3312 | ?6/7
Engineer | ?? | ??
Self employed businessman | ?? | ?1/2
29
www.geode.stir.ac.uk/ougs.html
30
Occupations: we agree on what we should do:
Preserve two levels of data
o Source data: Occupational unit groups, employment status
o Social classifications and other outputs
Use transparent (published) methods [i.e. OIR’s]
o for classifying index units
o for translating index units into social classifications
for instance..
o Bechhofer, F. 1969. 'Occupations' in Stacey, M. (ed.) Comparability in Social Research. London: Heinemann.
o Jacoby, A. 1986. 'The Measurement of Social Class' Proceedings from the Social Research Association seminar on "Measuring Employment Status and Social Class". London: Social Research Association.
o Lambert, P.S. 2002. 'Handling Occupational Information'. Building Research Capacity 4: 9-12.
o Rose, D. and Pevalin, D.J. 2003. 'A Researcher's Guide to the National Statistics Socio-economic Classification'. London: Sage.
31
…in practice we don’t keep to this...
Inconsistent preservation of source data
• Alternative OUG schemes
o SOC-90; SOC-2000; ISCO; SOC-90 (my special version)
• Inconsistencies in other index factors
o ‘employment status’; supervisory status; number of employees
o Individual or household; current job or career
Inconsistent exploitation of Occupational Information
• Numerous alternative occupational information files
o (time; country; format)
• Inconsistent translations to social classifications – ‘by file or by fiat’
• Dynamic updates to occupational information resources
• Strict security constraints on users’ micro-social survey data
32
GEODE provides services to help social scientists deal with occupational information resources
1) disseminate, and access other, Occupational Information Resources
2) Link together their (secure) micro-data with OIR’s
External user (micro-social data)
id | oug | sex
1 | 110 | 1
2 | 320 | 1
3 | 320 | 2
4 | 874 | 1
5 | 874 | 2
Occ info (index file) (aggregate)
oug | CS-M | CS-F | EGP
110 | 60 | 58 | I
320 | 69 | 71 | II
874 | 39 | 51 | VIIa
User’s output (micro-social data)
id | oug | CS
1 | 110 | 60
2 | 320 | 69
3 | 320 | 71
4 | 874 | 39
5 | 874 | 51
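The linkage shown on this slide, matching the user's micro-data to an occupational index file by occupational unit group, can be sketched in plain Python. The values are copied from the slide; the coding of sex (1 = male, 2 = female) is an assumption inferred from the slide's output column, and `link_cs` is a hypothetical helper name rather than part of any GEODE service.

```python
# Index file (occupational information resource), keyed by occupational
# unit group (oug); values copied from the slide's example.
occ_info = {
    110: {"CS-M": 60, "CS-F": 58, "EGP": "I"},
    320: {"CS-M": 69, "CS-F": 71, "EGP": "II"},
    874: {"CS-M": 39, "CS-F": 51, "EGP": "VIIa"},
}

# User's micro-data; sex is assumed to be coded 1 = male, 2 = female.
micro = [
    {"id": 1, "oug": 110, "sex": 1},
    {"id": 2, "oug": 320, "sex": 1},
    {"id": 3, "oug": 320, "sex": 2},
    {"id": 4, "oug": 874, "sex": 1},
    {"id": 5, "oug": 874, "sex": 2},
]

def link_cs(rows, index):
    """Attach the gender-specific CS score to each record by oug.

    The source oug is kept alongside the derived score; records whose
    oug is not in the index get CS = None rather than being dropped.
    """
    linked = []
    for row in rows:
        info = index.get(row["oug"])
        column = "CS-M" if row["sex"] == 1 else "CS-F"
        linked.append({"id": row["id"], "oug": row["oug"],
                       "CS": info[column] if info else None})
    return linked

output = link_cs(micro, occ_info)
for row in output:
    print(row["id"], row["oug"], row["CS"])
```

Keeping the index file separate from the micro-data means the same survey records can be re-linked when the occupational information resource is updated or when a different classification is wanted.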
33
Existing resources on occupations
Popular websites:
o http://www2.warwick.ac.uk/fac/soc/ier/publications/software/cascot/
o http://home.fsw.vu.nl/~ganzeboom/pisa/
o www.iser.essex.ac.uk/esec/
o www.camsis.stir.ac.uk/occunits/distribution.html
Emerging resource: http://www.geode.stir.ac.uk/
Some papers:
o Chan, T. W., & Goldthorpe, J. H. (2007). Class and Status: The Conceptual Distinction and its Empirical Relevance. American Sociological Review, 72, 512-532.
o Rose, D., & Harrison, E. (2007). The European Socio-economic Classification: A New Social Class Scheme for Comparative European Research. European Societies, 9(3), 459-490.
o Lambert, P. S., Tan, K. L. L., Gayle, V., Prandy, K., & Bergman, M. M. (2008). The importance of specificity in occupation-based social classifications. International Journal of Sociology and Social Policy, 28(5/6), 179-192.
34
Using data on occupations – further speculation
Growing interest in longitudinal analysis and use of longitudinal summary data on occupations
o Intuitive measures (e.g. ever in Class I) Lampard, R. (2007).
o Empirical career trajectories / sequences Halpin, B., & Chan, T. W. (1998).
Growing cross-national comparisons
o Ganzeboom, H. B. G. (2005).
Treatment of the non-working populations
o Seldom adequate to treat non-working as a category
o ‘Selection modelling’ approaches expanding
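An 'intuitive' longitudinal summary measure of the kind mentioned on this slide, such as 'ever in Class I', can be derived from career histories with very little code. The sketch below uses made-up sequences of class positions; the person identifiers, the class labels and the second summary (`share_in`) are hypothetical additions for illustration.

```python
# Hypothetical career histories: one class position per observed year.
careers = {
    "p1": ["III", "II", "I", "I"],
    "p2": ["VIIa", "VIIa", "VI"],
    "p3": ["II", None, "II", "I"],   # None = not working / not observed
}

def ever_in(sequence, target="I"):
    """Intuitive summary: was the person ever observed in the target class?"""
    return any(state == target for state in sequence)

def share_in(sequence, target="I"):
    """Share of observed (non-missing) spells spent in the target class."""
    observed = [s for s in sequence if s is not None]
    if not observed:
        return None
    return sum(s == target for s in observed) / len(observed)

summary = {
    pid: {"ever_I": ever_in(seq), "share_I": share_in(seq)}
    for pid, seq in careers.items()
}
```

Note that this sketch simply skips non-working spells, which is exactly the kind of treatment of the non-working population that may need more careful handling in a real analysis.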
35
Occupations as key variables
Extensive debate about occupation-based social classifications
o Document your procedures..
o ..as you may be asked to do something different..
When choosing between occupation-based measures…
o They all measure, mostly, the same things
o Don’t assume measures measure concepts
o Lambert, P. S., & Bihagen, E. (2007). Concepts and Measures: Empirical evidence on the interpretation of ESeC and other occupation-based social classifications. Paper presented at the ISA RC28 conference, Montreal (14-17 August), www.camsis.stir.ac.uk/stratif/archive/lambert_bihagen_2007_version1.pdf .
36
Data management and key variables
In DAMES, we identify three important categorical variables (occupations, educational qualifications, ethnicity), and collect information about them in order to improve ‘data management’ and hence exploitation of such data
o ‘Key’ social science variables
o Existing resources (and metadata & support on those resources)
o UK and beyond
37
‘Occupational Information Resources’
Small databases (square electronic files) linking lists of occupational positions (occupational unit groups) with information about those positions
Many existing resources already used in academic research (> 1000)
38
Educational information resources
Small databases (often on paper) linking lists of educational qualifications with information about them
Many existing resources (>500), but less communication between them
[Part of UK scheme from ONS (2008)]
39
Ethnic Minority/Migration Information Resources
Data which links measures of ethnicity / migration status with other information
In high demand, but few existing resources (? < 500)
40
Summary – Variable operationalisations and social science
We argue that the route to better critical understanding of variable effects combines complex analysis with many mundane, prosaic tasks in checking data
ANALYSIS: Coefficient effects in multivariate models; multi-process models; understanding interactions; etc
DATA MANAGEMENT: Re-coding data; linking data; missing data mechanisms; reviewing literature
o Seldom central to previous methodological reviews
o Cf. www.dames.org.uk
41
Appendix
42
Existing resources (i): Data providers - a) Documentation and metadata files
43
Existing resources (i): Data providers
b) Resources for variables
o CESSDA PPP on key variables http://www.nsd.uib.no/cessda/project/
o UK Question Bank http://qb.soc.surrey.ac.uk/
o ONS Harmonisation http://www.statistics.gov.uk/about/data/
c) Resources for datasets
o UK Census data portal, http://census.ac.uk/
o IPUMS international census data facilities, www.ipums.org
o European Social Survey, www.europeansocialsurvey.org
d) Data manipulations prior to data release
o Missing data imputation / documentation
o Survey design / weighting information
o Influential – most analysts use ‘the archive version’
44
Existing resources (ii) Resource projects / infrastructures
- UK ESDS www.esds.ac.uk
- ESDS International | ESDS Government | ESDS Longitudinal | ESDS Qualidata
- Helpdesks; online instructions; user support..
- UK ESRC NCRM / NCeSS / RDI initiatives
- Longitudinal data – www.longitudinal.stir.ac.uk
- Linking micro/macro - www.mimas.ac.uk/limmd/
- Other resources / projects / initiatives
- EDACwowe - http://recwowe.vitamib.com/datacentre
- ….
45
Existing resources (iii) Analytical and software support
Textbooks featuring data management
o [Levesque 2008] [Sarantakos 2007] [Long 2009]
Software training covering DM
o Stata’s ‘data management’ manual
o SPSS user group course on syntax and data management, www.spssusers.co.uk
But generally, sustained marginalisation of DM as a topic
o Advanced methods texts use simplistic data
o Advanced software for analysis isn’t usually combined with extended DM requirements
46
Existing resources (iv) Data analysts’ contributions
Academic researchers often generate and publish their own DM resources, e.g.
Harry Ganzeboom on education and occupations, http://home.fsw.vu.nl/~ganzeboom/pisa/
Provision of whole or partial syntax programming examples
Analysts often drive wider resource provisions related to DM
CAMSIS project on occupational scales, www.camsis.stir.ac.uk
CASMIN project on education and social class
47
Existing resources (v) Literatures on harmonisation and standardisation
National Statistics Institutes’ principles and practices
E.g. ONS www.statistics.gov.uk/about/data/harmonisation/
Cross-national organisations
E.g. UNSTATS - http://unstats.un.org/unsd/class/
Academic studies
E.g. [Harkness et al 2003]; [Hoffmeyer-Zlotnik & Wolf 2003]; [Jowell et al. 2007]
48
References
Abbott, A. (2006). Mobility: What? When? How? In S. L. Morgan, D. B. Grusky & G. S. Fields (Eds.), Mobility and Inequality. Stanford University Press.
Bosveld, K., Connolly, H., & Rendall, M. S. (2006). A guide to comparing 1991 and 2001 Census ethnic group data. London: Office for National Statistics.
Burgess, R. G. (Ed.). (1986). Key Variables in Social Investigation. London: Routledge.
Crouchley, R., & Fligelstone, R. (2004). The Potential for High End Computing in the Social Sciences. Lancaster: Centre for Applied Statistics, Lancaster University, and http://redress.lancs.ac.uk/document-pool/hecsspotential.pdf.
Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
Dorling, D., & Simpson, S. (Eds.). (1999). Statistics in Society: The Arithmetic of Politics. London: Arnold.
Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2).
Ganzeboom, H. B. G. (2005). On the Cost of Being Crude: A Comparison of Detailed and Coarse Occupational Coding. In J. H. P. Hoffmeyer-Zlotnik & J. Harkness (Eds.), Methodological Aspects in Cross-National Research (pp. 241-257). Mannheim: ZUMA, Nachrichten Spezial.
Halpin, B., & Chan, T. W. (1998). Class Careers as Sequences: An optimal matching analysis of work-life histories. European Sociological Review, 14(2), 111-130.
Harkness, J., van de Vijver, F. J. R., & Mohler, P. P. (Eds.). (2003). Cross-Cultural Survey Methods. New York: Wiley.
Hoffmeyer-Zlotnik, J. H. P., & Wolf, C. (Eds.). (2003). Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables. Berlin: Kluwer Academic / Plenum Publishers.
Irvine, J., Miles, I., & Evans, J. (Eds.). (1979). Demystifying Social Statistics. London: Pluto Press.
Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (2007). Measuring Attitudes Cross-Nationally. London: Sage.
Lambert, P. S., Prandy, K., & Bottero, W. (2007). By Slow Degrees: Two Centuries of Social Reproduction and Mobility in Britain. Sociological Research Online, 12(1).
Lambert, P. S., Tan, K. L. L., Gayle, V., Prandy, K., & Bergman, M. M. (2008). The importance of specificity in occupation-based social classifications. International Journal of Sociology and Social Policy, 28(5/6), 179-192.
Lampard, R. (2007). Is Social Mobility an Echo of Educational Mobility? Parents' Educations and Occupations and Their Children's Occupational Attainment. Sociological Research Online, 12(5).
Li, Y., & Heath, A. F. (2008). Socio-Economic Position and Political Support of Black and Ethnic Minority Groups in the United Kingdom, 1972-2005 [computer file]. 2nd Edition. Colchester, Essex: UK Data Archive [distributor], SN: 5666.
Prandy, K. (2002). Measuring quantities: the qualitative foundation of quantity. Building Research Capacity, 2, 3-4.
Procter, M. (2001). Analysing Survey Data. In G. N. Gilbert (Ed.), Researching Social Life, Second Edition (pp. 252-268). London: Sage.
Schneider, S. L. (2008). The International Standard Classification of Education (ISCED-97). An Evaluation of Content and Criterion Validity for 15 European Countries. Mannheim: MZES.
Simpson, L., & Akinwale, B. (2006). Quantifying Stability and Change in Ethnic Group. Manchester: University of Manchester, CCSR Working Paper 2006-05.
Stacey, M. (Ed.). (1969). Comparability in Social Research. London: Heinemann.
Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass.