Linking the DAMES & e-Stat Nodes
Paul Lambert, 26 Feb 2010, Bristol, e-Stat review meeting
DAMES is the ‘Data Management through e-Social Science’ research Node, www.dames.org.uk
1. Some background on DAMES
2. First thoughts on linking DAMES and e-Stat
3. Some proposals on usability / services
1) Data Management through e-Social Science
DAMES – www.dames.org.uk
ESRC Node funded 2008-2011
Aim: Useful social science provisions
• Specialist data topics – occupations; education qualifications; ethnicity; social care; health
• Mainstream packages and accessible resources
Aim: To exploit/engage with existing DM resources
• In social science – e.g. ESDS, CESSDA
• In e-Science – e.g. OGSA-DAI; OMII
To us ‘Data management’ means…
‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES Node..]
Usually performed by social scientists themselves
• Pre-analysis tasks (though often revised/updated)
• Inputs also from data providers
Usually a substantial component of the work process
• But may not be explicitly rewarded (and sometimes penalised)
Differentiate from archiving / controlling data itself
Some components…
Manipulating data
• Recoding categories / ‘operationalising’ variables
Linking data
• Linking related data (e.g. longitudinal studies)
• Combining / enhancing data (e.g. linking micro- and macro-data)
Secure access to data
• Linking data with different levels of access permission
• Detailed access to micro-data cf. access restrictions
Harmonisation standards
• Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
• Recommendations on particular ‘variable constructions’
Cleaning data
• ‘Missing values’; implausible responses; extreme values
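Example – recoding data

Count: highest educational qualification (rows) by educ4 (columns)

| Highest educational qualification | -9.00 | 1.00 Degree | 2.00 Diploma | 3.00 Higher school or vocational | 4.00 School level or below | Total |
|---|---|---|---|---|---|---|
| -9 Missing or wild | 323 | 0 | 0 | 0 | 0 | 323 |
| -7 Proxy respondent | 982 | 0 | 0 | 0 | 0 | 982 |
| 1 Higher Degree | 0 | 425 | 0 | 0 | 0 | 425 |
| 2 First Degree | 0 | 1597 | 0 | 0 | 0 | 1597 |
| 3 Teaching QF | 0 | 0 | 340 | 0 | 0 | 340 |
| 4 Other Higher QF | 0 | 0 | 3434 | 0 | 0 | 3434 |
| 5 Nursing QF | 0 | 0 | 161 | 0 | 0 | 161 |
| 6 GCE A Levels | 0 | 0 | 0 | 1811 | 0 | 1811 |
| 7 GCE O Levels or Equiv | 0 | 0 | 0 | 0 | 2518 | 2518 |
| 8 Commercial QF, No O Levels | 0 | 0 | 0 | 331 | 0 | 331 |
| 9 CSE Grade 2-5, Scot Grade 4-5 | 0 | 0 | 0 | 0 | 421 | 421 |
| 10 Apprenticeship | 0 | 0 | 0 | 257 | 0 | 257 |
| 11 Other QF | 102 | 0 | 0 | 0 | 0 | 102 |
| 12 No QF | 0 | 0 | 0 | 0 | 2787 | 2787 |
| 13 Still At School No QF | 138 | 0 | 0 | 0 | 0 | 138 |
| Total | 1545 | 2022 | 3935 | 2399 | 5726 | 15627 |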
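The recode behind this cross-tabulation can be sketched in a few lines of Python. This is a minimal illustration rather than the syntax used on the slide: the mapping is read off the non-zero cells of the table, the counts are the table's row totals, and the names `EDUC4_MAP`, `recode_educ4` and `tabulate` are hypothetical.

```python
from collections import Counter

# Source category -> educ4 code, read off the non-zero cells of the table.
# Code -9 collects the missing, proxy, 'Other QF' and 'Still at school' rows.
EDUC4_MAP = {
    -9: -9, -7: -9, 11: -9, 13: -9,
    1: 1, 2: 1,          # Degree
    3: 2, 4: 2, 5: 2,    # Diploma
    6: 3, 8: 3, 10: 3,   # Higher school or vocational
    7: 4, 9: 4, 12: 4,   # School level or below
}

# Row totals of the table (cases per source category).
SOURCE_COUNTS = {
    -9: 323, -7: 982, 1: 425, 2: 1597, 3: 340, 4: 3434, 5: 161, 6: 1811,
    7: 2518, 8: 331, 9: 421, 10: 257, 11: 102, 12: 2787, 13: 138,
}


def recode_educ4(value):
    """Return the educ4 code for one source value (assumption: unknown codes go to -9)."""
    return EDUC4_MAP.get(value, -9)


def tabulate(source_counts):
    """Add up source-category counts within each educ4 category."""
    totals = Counter()
    for value, n in source_counts.items():
        totals[recode_educ4(value)] += n
    return dict(totals)


educ4_counts = tabulate(SOURCE_COUNTS)
print(educ4_counts)
```

Keeping the mapping as one explicit lookup table, rather than scattered conditions, makes the recode easy to document and to re-run.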
Example – Linking data
Linking via ‘ojbsoc00’: c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk
Matching files (‘deterministic’)
Complex data (complex research) is distributed across different files. In surveys, use key linking variables for...
One-to-one matching
SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
Stata: merge pid using file2.dta
One-to-many matching (‘table distribution’)
SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
Stata: merge pid using file2.dta
Many-to-one matching (‘aggregation’)
SPSS: aggregate outfile="file3.sav" /break=pid /meaninc=mean(income).
Stata: collapse (mean) meaninc=income, by(pid)
Many-to-Many matches
Related cases matching
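The first three match types can be sketched in plain Python. This is an illustration under stated assumptions, not a DAMES tool: the function names and the tiny example files are invented, and both joins keep only the rows of the first file (a left join), which is simpler than the SPSS and Stata defaults.

```python
from collections import defaultdict
from statistics import mean


def index_unique(rows, key):
    """Index rows by key, refusing duplicates (deterministic matching needs unique keys)."""
    lookup = {}
    for row in rows:
        if row[key] in lookup:
            raise ValueError(f"duplicate {key}={row[key]!r}")
        lookup[row[key]] = row
    return lookup


def match_one_to_one(file1, file2, key):
    """One-to-one: each key value occurs at most once in both files."""
    index_unique(file1, key)  # called only to check uniqueness on the left side
    lookup = index_unique(file2, key)
    return [{**row, **lookup.get(row[key], {})} for row in file1]


def match_table(file1, table, key):
    """One-to-many: each table row is spread over every file1 row sharing its key."""
    lookup = index_unique(table, key)
    return [{**row, **lookup.get(row[key], {})} for row in file1]


def aggregate_mean(rows, key, var, out):
    """Many-to-one: collapse repeated rows to one row per key holding the mean of var."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[var])
    return [{key: k, out: mean(values)} for k, values in groups.items()]


# Invented example files: persons, their jobs, their households, income records.
persons = [{"pid": 1, "hid": 10}, {"pid": 2, "hid": 10}, {"pid": 3, "hid": 11}]
jobs = [{"pid": 1, "soc": 2311}, {"pid": 2, "soc": 4150}, {"pid": 3, "soc": 5231}]
households = [{"hid": 10, "region": "Scotland"}, {"hid": 11, "region": "Wales"}]
incomes = [{"pid": 1, "income": 100}, {"pid": 1, "income": 300}, {"pid": 2, "income": 250}]

person_jobs = match_one_to_one(persons, jobs, "pid")
person_regions = match_table(persons, households, "hid")
mean_incomes = aggregate_mean(incomes, "pid", "income", "meaninc")
```

The uniqueness check is the design point: a match that silently tolerates duplicate keys is one of the easier ways to corrupt a linked file.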
A bit of focus…
I tend to emphasise two data management activities:
1) Variable constructions
o Coding and re-coding values
2) Linking datasets
o Internal and external linkages
..plus the centrality of keeping clear records of DM activities
Reproducible (for self)
Replicable (for all)
Paper trail for whole lifecycle
Cf. Dale 2006; Freese 2007
In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax Examples: www.longitudinal.stir.ac.uk
Principal DAMES services (current status)
GESDE specialist data environments (prototypes)
Occupations, educational qualifications, ethnicity
Data curation tool (prototype)
Data fusion tool (prototype)
Secure data demonstrator for e-Health research (complete)
Micro-simulation model for social care data (prototype)
Training workshops and events (in progress)
GESDE – Grid Enabled Specialist Data Environments
GEODE – Occupational data
Data curation tool
The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way
Data fusion tool
2. Linking DAMES and e-Stat
High level vision is to ingrain data management functionality and uptake within e-Stat modelling capabilities
- Using/adapting DAMES contributions
- DAMES services for data linking
- DAMES resources for recoding variables
- Making replication central to the data story
Data and variables
DAMES does not in general provide routes to new/alternative microdata, but to relevant supplementary data (e.g. aggregate data)
Anything on educational qualifications, occupations, ethnicity is of particular interest
Generic tools for merging micro-data
Generic tools for other variable processes
Data oriented review
Applied research perspective
Range of data resources
Accessing and documenting data resource options
The implementation for e-Stat
This is mostly a blank space…
…and we’ve not hitherto used Python
Data curation tool and GEODE/GEEDE use iRODS
GEMDE uses a bespoke SQL database
Data fusion tool uses R (and some Stata) scripts accessed via a Liferay portal
3. A pitch for specific e-Stat facilities
..harvest the best of data analysis packages from applied data perspective
Replication in ‘human readable syntax’
Something like Stata’s ‘est store’ for multiple model comparisons
Fluency in data oriented options
Training resources in data
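To make the ‘est store’ request concrete, here is a rough Python sketch of the idea: fitted models are saved under a name and later laid out side by side, loosely in the spirit of Stata's `estimates store` and `estimates table`. Everything in it is an assumption for illustration (the `EstimateStore` class, the `fit_ols` helper, the toy data); it is not an e-Stat or DAMES interface.

```python
from statistics import mean


def fit_ols(x, y, xname):
    """Closed-form simple linear regression of y on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    cons = my - slope * mx
    sst = sum((yi - my) ** 2 for yi in y)
    sse = sum((yi - cons - slope * xi) ** 2 for xi, yi in zip(x, y))
    return {xname: slope, "_cons": cons, "r2": 1 - sse / sst, "N": len(y)}


class EstimateStore:
    """Keep named sets of estimates and print them side by side."""

    def __init__(self):
        self._models = {}

    def store(self, name, estimates):
        self._models[name] = dict(estimates)

    def get(self, name):
        return self._models[name]

    def table(self, *names):
        names = names or tuple(self._models)
        terms = []
        for name in names:  # collect row labels in first-seen order
            for term in self._models[name]:
                if term not in terms:
                    terms.append(term)
        lines = [f"{'':<8}" + "".join(f"{n:>10}" for n in names)]
        for term in terms:
            cells = []
            for name in names:
                value = self._models[name].get(term)
                if value is None:
                    text = ""
                elif isinstance(value, int):
                    text = str(value)
                else:
                    text = f"{value:.3f}"
                cells.append(f"{text:>10}")
            lines.append(f"{term:<8}" + "".join(cells))
        return "\n".join(lines)


# Toy data, invented for the example.
educ = [10, 12, 16, 18]
age = [20, 30, 40, 50]
income = [10, 14, 22, 26]

store = EstimateStore()
store.store("m_educ", fit_ols(educ, income, "educ"))
store.store("m_age", fit_ols(age, income, "age"))
print(store.table("m_educ", "m_age"))
```

The point of the pattern is that each model run leaves a named record, so comparisons come from stored results rather than from re-reading output logs.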
Est store demo here
Appendix items
[Interface mock-up with two panels: ‘Data file specification’ and ‘Variable manipulation & analysis’]
DAMES most common commands:
-> usedataset{UKDA_5151}
-> usedatafile{individuals wave A}
-> matchdata{individuals wave A;individuals wave B; link variable=pid; format=wide}
Commands invoking other packages:
-> SPSS{match files file="aindresp.sav" /file="bindresp.sav" /by=pid}
-> SPSS{fre var=ajbrgsc}
-> Stata{recode ageb 16/30=1 31/50=2 *=.}
-> R{..}
-> Stata{do $path2\part1_analysis.do}
Text interface: invoked manually or in response to manipulating graphs
Graphics (diagram labels): BHPS, wave A individuals; BHPS wave B individuals; Wave C; Analytical file; Gender; Current job RGSC; Spouse CAMSIS; Age (yrs); Age bands; Spouse SOC; Model 1
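One way to read the mock-up is as a small command language in which each line names a target and passes it a payload in braces. The sketch below parses lines of that shape; the grammar is inferred from the examples on the slide, and the function names and the split into ‘external’ packages are assumptions rather than a specification.

```python
import re

# "-> Target{payload}" with the payload taken verbatim between the outer braces.
COMMAND = re.compile(r"^->\s*(\w+)\{(.*)\}\s*$", re.DOTALL)

# Targets that hand the payload to another package (assumption from the slide).
EXTERNAL_PACKAGES = {"SPSS", "Stata", "R"}


def parse_command(line):
    """Split one command line into (target, payload)."""
    found = COMMAND.match(line.strip())
    if found is None:
        raise ValueError(f"not a command: {line!r}")
    return found.group(1), found.group(2).strip()


def parse_matchdata(payload):
    """Split a matchdata payload into file names and key=value options."""
    files, options = [], {}
    for part in payload.split(";"):
        part = part.strip()
        if "=" in part:
            name, value = part.split("=", 1)
            options[name.strip()] = value.strip()
        elif part:
            files.append(part)
    return files, options


def is_external(target):
    """True when the command is passed through to SPSS, Stata or R."""
    return target in EXTERNAL_PACKAGES


target, payload = parse_command(
    "-> matchdata{individuals wave A;individuals wave B; link variable=pid; format=wide}"
)
files, options = parse_matchdata(payload)
```

Treating the payload as opaque text for external packages keeps their native syntax intact, which fits the slide's emphasis on human readable syntax.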
‘The significance of data management for social survey research’
(see http://www.esds.ac.uk/news/eventdetail.asp?id=2151)
The data manipulations described above are a major component of the social survey research workload
Pre-release manipulations performed by distributors / archivists
• Coding measures into standard categories
• Dealing with missing records
Post-release manipulations performed by researchers
• Re-coding measures into simple categories
We do have existing tools, facilities and expert experience to help us… but we don’t make a good job of using them efficiently or consistently
So the ‘significance’ of DM is about how much better research might be if we did things more effectively…
Some provocative examples for the UK…
Social mobility is increasing, not decreasing!
− Popularity of controversial findings associated with Blanden et al (2004)
− Contradicted by wider ranging datasets and/or better measures of stratification position
− DM: researchers ought to be able to more easily access wider data and better variables
Degrees, MScs and PhDs are getting easier!
− {or at least, more people are getting such qualifications}
− Correlates with measures of education are changing over time
− DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn’t, but should, and could, be widespread
‘Black-Caribbeans’ are not disappearing!
− As the 1948-70 immigrant cohort ages, the ‘Black-Caribbean’ group is decreasingly prominent due to return migration and social integration of immigrant descendants
− Data collectors under pressure to measure large groups only
− DM: It ought to remain easy to access and analyse survey data on Black-Caribbeans, such as by merging survey data sources and/or linking with suitable summary measures
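As one illustration of ‘standardising relative value within cohort distributions’ from the qualifications example, the sketch below scores each qualification level by the midpoint of its cumulative share within a cohort (a ridit-style score). It is only one possible approach, not a method confirmed by the slides, and the cohort counts are invented.

```python
def relative_positions(counts, ordered_levels):
    """Score each level by the midpoint of its cumulative share within one cohort.

    ordered_levels runs from lowest to highest qualification.
    """
    total = sum(counts.get(level, 0) for level in ordered_levels)
    scores, below = {}, 0.0
    for level in ordered_levels:
        share = counts.get(level, 0) / total
        scores[level] = below + share / 2  # everyone below, plus half of this level
        below += share
    return scores


LEVELS = ["school", "degree"]

# Invented counts: degrees are rarer in the earlier cohort.
COHORTS = {
    "born 1940s": {"school": 90, "degree": 10},
    "born 1980s": {"school": 60, "degree": 40},
}

positions = {cohort: relative_positions(n, LEVELS) for cohort, n in COHORTS.items()}
for cohort, scores in positions.items():
    print(cohort, {level: round(score, 2) for level, score in scores.items()})
```

With these invented counts a degree scores about 0.95 in the earlier cohort and about 0.80 in the later one: the same category label sits at a different relative position once more people hold it.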
Comment – growing interest in data management..?
Historically, references covering DM were few and far between
• Dale, A., Arber, S., & Procter, M. (1988). Doing Secondary Analysis. London: Unwin Hyman Ltd.
Recently, there’s been a small burst of relevant references
• Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS Statistics 17.0. Chicago, Il.: SPSS Inc.
• Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
• Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass.
• http://www.esds.ac.uk/support/onlineguides.asp
• http://www.longitudinal.stir.ac.uk/
..and growing interest re. ‘documentation for replication’
• Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
• Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2).
E-Science and Data Management
E-Science isn’t essential to good DM, but it has capacity to improve and support conduct of DM…
1. Concern with standards setting in communication and enhancement of data
2. Linking distributed/heterogeneous/dynamic data
   Coordinating disparate resources; interrogating live resources
3. Contribution of metadata tools/standards for variable harmonisation and standardisation
4. Linking data subject to different security levels
5. The workflow nature of many DM tasks