Exploring New Methods for Protecting and Distributing Confidential Research Data
Bryan Beecher, Felicia LeClere
ICPSR / University of Michigan
Today’s Talk
• What’s ICPSR?
• How do organizations distribute confidential research data today?
• What are the problems?
• What can we improve?
What’s ICPSR?
• Inter-university Consortium for Political and Social Research
  – JSTOR for social science data
• Serving billions since 1962
Who does ICPSR serve?
• Research universities
  – Discover and download data
• Teaching universities and colleges
  – On-line analysis
• Federal agencies
  – Data management, preservation, and dissemination
Distributing data
• Most of our content is public-use
  – Anonymized public opinion
  – Aggregate government data
• Little risk of disclosure
• But what about the good stuff?
Distributing sensitive data
• Higher risk of breach of confidentiality
  – Variables that give geographic information that might be combined with other data sources to identify a respondent
• Requires special handling
Distributing sensitive data
• Researcher agrees to protect the data and identities
• Delivered securely
• Harsh penalty deterrent
http://www.flickr.com/photos/lwr/521394398
National Longitudinal Study of Adolescent Health
• Add Health
  – Highly used and cited study
• Very frank questions
  – Kids in 7th through 12th grade
• Carolina Population Center
• Gold standard in data protection
Traditional Approach
http://www.flickr.com/photos/videolux/2389320345/
http://www.flickr.com/photos/curiousexpeditions/3767246490/
Traditional Approach
Confidential research data
Apply for access
Write security plan
Repeat
Can we improve upon it?
• Paperwork
  – How do we speed the application process?
• Security
  – How do we ensure the data are going to a good home?
Paperwork
• Web portal
  – Research plan
  – IRB approval
  – CVs
  – Confidentiality agreements
Paperwork
• Web portal
  – Behavioral questionnaire
  – Electronic copy of contract (HTML, PDF)
  – Database back-end to drive workflow systems
Restricted Data Contracting System
• Integrated with ICPSR’s existing Web download mechanism
• Collects information that would ordinarily be provided through paper
• “Tickler” system to send reminders, nag about missing items
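The "tickler" idea above can be illustrated with a small sketch. This is not ICPSR's contracting system; the checklist simply reuses the items the portal slides say are collected, and the names `REQUIRED_ITEMS`, `Application`, `missing_items`, and `reminder` are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical checklist built from the items named on the "Paperwork" slides.
REQUIRED_ITEMS = (
    "research plan",
    "IRB approval",
    "CVs",
    "confidentiality agreements",
    "behavioral questionnaire",
)


@dataclass
class Application:
    """One researcher's application and the items received so far."""
    applicant: str
    received: set = field(default_factory=set)


def missing_items(app):
    """Return the required items not yet received, in checklist order."""
    return [item for item in REQUIRED_ITEMS if item not in app.received]


def reminder(app):
    """Build a reminder message, or None when the application is complete."""
    missing = missing_items(app)
    if not missing:
        return None
    return f"Reminder for {app.applicant}: still missing " + ", ".join(missing)


# Made-up example application.
app = Application("researcher@example.edu", {"research plan", "CVs"})
print(reminder(app))
```

A real workflow would read the received items from the database back-end and send the message on a schedule; the sketch only shows the "what is still missing" step.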
Security
• Current system relies on…
  – The data provider to maintain security templates
  – The researcher to write an IT security plan
  – The data provider to read and understand the plan
  – The researcher to execute the plan
Current access model
[Diagram labels: Researcher, Workstation, ICPSR, Secure Area]
A new access model?
[Diagram labels: Researcher, Workstation, ICPSR]
Secure area = the cloud?
• Cloud-based access
  – Convenient
  – Scalable
  – Economical
  – Perfect?
http://www.flickr.com/photos/docbudie/2240764187/
What could go wrong?
Almost everything
• Is the cloud reliable?
• Will the data be safe?
• We are building an analytic environment for a researcher; how will we know what to provide?
• Will this perform well for the researcher?
• This is the main story…
Cloud reliability
• Using the cloud for disaster recovery (DR) since January 2009
• The Merit Network Operations Center monitors all of our stuff
• Ping, HTTP GET every minute, 24 x 7
• Results?
Local v. cloud – CY 2009
Conclusion
• Cloud has been more reliable than local environment
• If local power were more reliable, the cloud would still come out ahead, but only slightly
• Certainly seems to be good enough
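The once-a-minute probes described above can be reduced to an availability figure for comparing two environments. The sketch below is illustrative only: the probe data is made up, and `availability` and `downtime_minutes` are hypothetical helper names, not part of any monitoring tool mentioned in the talk.

```python
def availability(probe_results):
    """Fraction of probes that succeeded.

    probe_results is a sequence of booleans, one per probe (True = reachable).
    """
    results = list(probe_results)
    if not results:
        raise ValueError("no probe results")
    return sum(results) / len(results)


def downtime_minutes(probe_results, interval_minutes=1):
    """Approximate downtime: failed probes times the probe interval."""
    failed = sum(1 for ok in probe_results if not ok)
    return failed * interval_minutes


# Made-up data: one day of minute-by-minute probes with a 30-minute outage.
day = [True] * 1440
day[600:630] = [False] * 30

print(f"availability: {availability(day):.4%}")
print(f"downtime: {downtime_minutes(day)} minutes")
```

Running the same calculation over the local and cloud probe logs for a year would give the kind of side-by-side comparison the "Local v. cloud" slide refers to.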
Cloud security
• Absolute security?
  – Who cares?
• More secure than the typical WinTel desktop of a social science researcher?
  – That’s the goal
http://www.flickr.com/photos/amagill/235453953
Current practice
• Data archive maintains per-platform guidelines on IT security
• Researcher downloads a template and writes his/her own IT security plan
• Data provider reviews plan; approves or iterates until approved or rejected
Sample items
– I secured the computer on which the Add Health data resides in a locked room, or secured the computer to a table with a lock and cable (locking the case so the battery cannot be removed).
– I turned off all unneeded services and disabled unneeded network protocols.
Brutal facts
• Data providers are not IT experts
• Researchers are not experts in IT security
• Even if the system is secure on Day One, what assurance is there that it continues to be secure?
http://www.flickr.com/photos/42dreams/1878611309
Our approach to security
• Leverage tools from the cloud provider (AWS access control lists)
• Leverage tools from UMich (regular Retina and Nessus scans)
• Engage a white hat hacker to probe and evaluate the system
Conclusion
• Expecting researchers to build and maintain secure IT environments is not reasonable
• We think we can build something at least as secure in the cloud
• We’ll evaluate our environment using outside evaluators
What to deploy?
• The new access model means we need to distribute a working analytic environment, not just the data
• Also gives the researcher the opportunity to limit access to only a subset of contractees
May I Take Your Order?
• Operating system?
• Analysis software?
• Who’s allowed to use the system?
• Anything else?
http://www.flickr.com/photos/stephenpougas/2267503544
The ACI Chooser
• Analytic Cloud Instance
  – Cumulus
• The ACI Chooser
• Takes your order
• Brings your ACI to your table (in the cloud)
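One way to picture the "order" the ACI Chooser takes is as a small record covering the questions on the "May I Take Your Order?" slide. This is a sketch under assumptions, not the actual Chooser: `AciOrder`, `validate`, the field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AciOrder:
    """A hypothetical order for an Analytic Cloud Instance."""
    operating_system: str
    analysis_software: tuple   # packages to pre-install
    allowed_users: tuple       # who may log in to the instance
    instance_size: str = "S"   # start small; can be re-launched larger


def validate(order, contractees):
    """Return a list of problems; an empty list means the order can be filled."""
    problems = []
    if not order.analysis_software:
        problems.append("no analysis software selected")
    if not order.allowed_users:
        problems.append("no users listed")
    # Access may be limited to a subset of contractees, never extended beyond them.
    outsiders = [u for u in order.allowed_users if u not in contractees]
    if outsiders:
        problems.append("not on the contract: " + ", ".join(outsiders))
    return problems


# Made-up contract and order.
contract = {"alice", "bob", "carol"}
order = AciOrder("Windows", ("R",), ("alice", "bob"))
print(validate(order, contract))
```

Keeping the order as plain data makes it easy to store alongside the contract record and to replay when an instance is re-launched at a different size.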
Conclusion
• We’re building this now
• Issues to resolve
  – How do we get passwords to people?
  – Remote access mechanism?
    • Citrix? Terminal Services?
  – Should we encrypt the data?
Performance
• Will a cloud-based analysis system meet the expectations of a researcher?
• Will one size fit all?
Amazon EC2
• Regular
  – S (1 CPU, 2GB, $0.12)
  – L (4 CPU, 7GB, $0.48)
  – XL (8 CPU, 15GB, $0.96)
• High memory
  – XXL (13 CPU, 34GB, $1.44)
  – XXXXL (26 CPU, 68GB, $2.88)
• High CPU
  – M (5 CPU, 2GB, $0.29)
  – XL (20 CPU, 7GB, $1.16)
Strategy
• Balance cost and performance
• Start small, but give opportunity to grow
  – Easy to move an image from one instance size to another
• Measure performance via researcher’s experience
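The "start small, grow if needed" strategy can be sketched as picking the cheapest instance that meets a stated minimum. The figures below are copied from the "Amazon EC2" slide; treating the dollar amounts as hourly prices is an assumption (the slide gives no unit), and `Instance`, `cheapest`, and `cost` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Instance(NamedTuple):
    family: str
    size: str
    cpu: int
    memory_gb: int
    price: float  # assumed to be US dollars per hour


# Values as listed on the "Amazon EC2" slide.
INSTANCES = [
    Instance("Regular", "S", 1, 2, 0.12),
    Instance("Regular", "L", 4, 7, 0.48),
    Instance("Regular", "XL", 8, 15, 0.96),
    Instance("High memory", "XXL", 13, 34, 1.44),
    Instance("High memory", "XXXXL", 26, 68, 2.88),
    Instance("High CPU", "M", 5, 2, 0.29),
    Instance("High CPU", "XL", 20, 7, 1.16),
]


def cheapest(min_cpu: int, min_memory_gb: int) -> Optional[Instance]:
    """Cheapest instance meeting both minimums, or None if nothing fits."""
    fits = [i for i in INSTANCES
            if i.cpu >= min_cpu and i.memory_gb >= min_memory_gb]
    return min(fits, key=lambda i: i.price) if fits else None


def cost(instance: Instance, hours: float) -> float:
    """Estimated charge for running the instance, under the hourly-price assumption."""
    return round(instance.price * hours, 2)


start = cheapest(1, 1)    # the smallest workable instance
bigger = cheapest(2, 8)   # re-launch target if the researcher needs more memory
print(start.size, bigger.size, cost(start, 40))
```

Because an image can be moved from one instance size to another, the initial choice only needs to be adequate; researcher feedback then decides whether to re-launch larger.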
Conclusion
• Partners
  – Panel Study of Income Dynamics (PSID)
  – Los Angeles Family and Neighborhood Study (LA FANS)
• Start small; re-launch larger
• Ask how well it works
Thanks and Final Thoughts
• Could preserve machine image + data + software + “program” for replication purposes
• enclavecloud.blogspot.com charts our adventures
• Cloud-related work sponsored by a recent NIH Challenge Grant