ERICA
A cloud orchestration meta-framework for secure health data
analytics
Tim Churches
SW Sydney Clinical School & Centre for Big Data Research in Health
UNSW Medicine
Why we need secure platforms for health
data analysis
Tran B, Straka P, Falster MO, Douglas KA, Britz T, Jorm LR. Overcoming the data
drought: exploring general practice in Australia by network analysis of big data.
Med J Aust 2018; 209(2):68-73
Overcoming the GP data drought
• No systematically reported national data
on the size and structure of general
practices in Australia
• Network analysis of 21 years of Medicare
claims shows:
• general practices have increased in
size
• continuity of care and patient loyalty
have remained stable
• greater sharing of patients by GPs is
associated with greater patient loyalty
• This new approach allows continuous
monitoring of the characteristics of
Australian general practices
Re-operation after breast-conserving surgery
• Linked hospital inpatient and death data for NSW
• Primary unilateral or bilateral BCS
• 90-day reoperation (re-excision or mastectomy)
• 29% overall re-operation
• 17% BCS
• 12% mastectomy
• ↑ BCS over time, ↓ mastectomy over time
• Significant variation by hospital
van Leeuwen MT, Falster MO, Vajdic CM, Crowe PJ, Lujic S, Klaes E, Jorm L,
Sedrakyan A. Reoperation after breast-conserving surgery for cancer in Australia:
statewide cohort study of linked hospital data. BMJ Open 2018, vol. 8, pp. e020858.
Sydney Morning Herald
Why we need secure platforms for health
data analysis
• Current research at UNSW Medicine (CBDRH, SPHCM, SWSCS, MRIs) for
health services research, clinical epidemiology and ML research
• whole-of-NSW-Health administrative data (hospital admissions, ED visits,
cancer registry, death certificates) linked at person level over a 15-year
span
• linked MBS-PBS data (not the retracted dataset!)
• EMR and cancer information system data from specific hospitals
• DVA linked data
• 25% subset of NPS MedicineWise data
• All of these are de-identified
• But all of these are potentially re-identifiable
• Must be kept safe!
• Security requirements that exceed “…data will be stored on a password-
protected file server..”
ERICA: key features
• Provides up to 256 secure remote-access analysis project spaces per
instance
• “Enclave” model: each project space is completely self-contained and
disconnected from other projects, from the internet, and from the users’
desktops
• Provides an invigilated gateway for data coming in and research results going
out, with complete audit trail
• Uses Amazon Web Services (AWS) commercial cloud computing
• Leverages the features and scalability of AWS
• Different OS and workspace configurations
• High performance computing
• Multiple storage and pricing options
ERICA: key features
• Is institution-based
• Governed and managed by a host institution and its policies and procedures
• Multiple instances (‘clones’), governed by different host institutions, can be
established (anywhere that AWS operates), currently:
• UNSW
• Australian Institute of Health and Welfare (AIHW SRAE)
• NSW Government Data Analytics Centre
• A code-driven ‘orchestration framework’
• Testable and tested for correct behaviour
• System administrators do not manually configure resources
• Project space configuration is point-and-click
• Minimises human error
• Accredited by eHealth NSW under their Privacy and Security Assessment
Framework (PSAF) to hold fully-identified NSW Health data
Typical ERICA virtual workstation
ERICA virtual workstations
• Most current users run Windows 7 or Windows 10
• Linux workstations available
• Software can be pre-installed in workstation images (up to 100)
• e.g. MS Office, SAS, SPSS, Stata, R, Python, TensorFlow etc. pre-installed
• System administrators can define additional HPC resources via templates,
restricted to specific project spaces e.g.
• Linux compute server with multiple high-end GPU cards
• Apache Spark cluster with many nodes
• End-users in the project space can start and stop these on demand and are
warned if resources are left running!
‘Five safes’ framework
1. Safe Projects: Is this use of the data appropriate?
2. Safe People: Can the researchers be trusted to use it appropriately?
3. Safe Data: Is there a disclosure risk in the data itself?
4. Safe Settings: Does the access facility limit unauthorised use?
5. Safe Outputs: Are the statistical results non-disclosive?
Desai T, et al. Five Safes: designing data access for research. Economics
working paper series 1601. Bristol: University of the West of England, 2016
ERICA: Safe projects
• Policies set by host institution
• UNSW ERICA
• Projects must have data custodian and ethics approvals
• Projects must therefore meet NHMRC guidelines for human research
• ERICA must be named as data storage and analysis facility on HREC
applications
ERICA: Safe people
• Roles defined in ERICA code and assigned to individuals according to policies
of host institution
• System Administrator
• Project Chief Investigator
• Project Controller
• Project Manager
• Project Researcher
• Online training module for researchers
• With an exam that must be passed…
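The per-project role assignment described above can be sketched in a few lines of Python. This is an illustration only, not ERICA's code: the role names come from the slides, but the `PERMISSIONS` table, the action names and the `is_allowed` helper are hypothetical. The only mapping taken from the slides is that the Project Controller approves inbound and outbound files; the researcher actions are assumed for the sake of the example.

```python
from enum import Enum


class Role(Enum):
    # Role names as listed on the slide.
    SYSTEM_ADMINISTRATOR = "System Administrator"
    PROJECT_CHIEF_INVESTIGATOR = "Project Chief Investigator"
    PROJECT_CONTROLLER = "Project Controller"
    PROJECT_MANAGER = "Project Manager"
    PROJECT_RESEARCHER = "Project Researcher"


# Hypothetical permission table: action names are invented for illustration.
PERMISSIONS = {
    Role.PROJECT_CONTROLLER: {"approve_inbound_file", "approve_outbound_file"},
    Role.PROJECT_RESEARCHER: {"request_import", "request_export"},
}


def is_allowed(assignments, user, project, action):
    """Return True if any role the user holds in this project grants the action."""
    roles = assignments.get((user, project), set())
    return any(action in PERMISSIONS.get(role, set()) for role in roles)


# Roles are assigned per (user, project space) pair, so a role held in one
# project confers nothing in another.
assignments = {
    ("alice", "project-a"): {Role.PROJECT_CONTROLLER},
    ("bob", "project-a"): {Role.PROJECT_RESEARCHER},
}
```

Keying the assignments on the (user, project) pair is one simple way to reflect the enclave model, in which each project space is self-contained.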
ERICA: Safe data
• Designed for research using sensitive microdata
• The datasets, variables, level of detail and any suppression or perturbation
are governed by host institution’s policies and data provider policies
• UNSW ERICA: governed according to data custodian and ethics approvals
• Project Controller checks and approves all inbound files
• Role can be assigned to data custodian nominee (e.g. AIHW staff member)
or research team member
• Data custodians can upload encrypted data themselves through eHub or
large file ingress facility
• By carefully attending to the other four “Safes”, ERICA and similar secure
analysis platforms dramatically reduce the level of anonymisation which data
providers and data custodians need to apply
• Data anonymisation is the enemy of quality research and effective ML
model development
ERICA: Safe settings - threat model
• Basic premise: researchers are honest-but-sloppy
• Ignorant of IT security
• Reliant on institutional IT security
• Driven by convenience
• Designed to protect against
• Innocent acts-of-omission by researchers
• Acts-of-carelessness by researchers
• Malicious acts by non-users (i.e. external hackers)
• But not necessarily malicious acts-of-commission by researchers
• e.g. Filming the screen as they scroll through data
ERICA: Safe settings - identity and authentication
• Authentication and authorisation use a Microsoft Active Directory instance
specific to ERICA
• ERICA user accounts are assigned to one or more roles (e.g. Project
Controller, Project Researcher) for each project space
• At all external access points, users authenticate themselves using a single
set of login credentials (account name and password) plus mandatory multi-
factor authentication code (using smartphone)
• External access points can be further restricted to specific IP address ranges
or source networks, or client-side digital certificates can be used to restrict
access to specific devices (e.g. specific laptop or desktop computers)
• e.g. the UNSW Medicine ERICA instance is accessible only from the UNSW
internal network, behind the main UNSW firewall, so it has no Internet-facing
interface
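Restricting an access point to specific source networks amounts to an allow-list check on the client address. A minimal sketch using Python's standard `ipaddress` module follows. The network ranges are documentation-reserved examples, not real UNSW or ERICA ranges, and `is_source_allowed` is a hypothetical helper; in practice this kind of rule would normally be enforced in firewall or cloud network configuration rather than in application code.

```python
import ipaddress

# Hypothetical allow-list using documentation-reserved address ranges.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_source_allowed(source_ip, allowed=ALLOWED_NETWORKS):
    """Return True if the client address falls inside any permitted network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in allowed)
```

For example, `is_source_allowed("192.0.2.17")` is true, while an address outside both ranges, such as `203.0.113.5`, is rejected.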
Logging into ERICA
AWS Desktop client
ERICA: Safe settings - movement of data
• All research data held in ERICA are encrypted both at rest and in transit
• AWS key management and encryption services are used to strongly encrypt
all EBS and S3 data stores used by ERICA
• Secure protocols, including HTTPS (TLS v1.3), LDAPS, scp and encrypted
SMB/CIFS are used for all communications and data movement
• Users can only import or export data via a controlled gateway mechanism
known as the Hub
• All other file or data ingress and egress mechanisms, including clipboard,
email, messenger services, printing services and internet access, are blocked
by two independent and redundant layers in the system network architecture.
• Project workspaces are isolated from each other, and no data can be
transferred between them (except via the Hub)
Importing and exporting
ERICA: Safe settings - logging and audit
• All data movements inbound to and outbound from ERICA are fully logged
and subject to full-copy audit trails
• An activity trail displays the time, project and the action that a particular user
has taken within the system regarding data movement
• A checksum of the imported/exported file is maintained and logged to ensure
the file has not been modified during the ingress/egress.
• All logging is aggregated into Amazon CloudWatch, which provides a single
unalterable and digitally signed and timestamped source of information for
auditing purposes
• Key security event logs include those generated by: border routing devices,
network and application firewalls, intrusion detection, anti-virus and malicious
code protection services, internet-connected services
• Automated log analysis and notification using industry-standard tools is
currently being implemented
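The checksum step in the list above can be illustrated as follows. The slides do not say which hash algorithm ERICA uses, so SHA-256 is an assumption here, and the function names and file contents are made up for the example. The idea is simply that a checksum logged when a file enters the gateway is compared against one computed after the transfer.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so large extracts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(logged_checksum, received_path):
    """Return True if the received file matches the checksum logged at the source."""
    return sha256_of_file(received_path) == logged_checksum


# Demonstration with throwaway files (contents are invented).
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "extract.csv")
    with open(source, "wb") as handle:
        handle.write(b"id,value\n1,42\n")
    logged = sha256_of_file(source)

    intact = os.path.join(workdir, "intact.csv")
    with open(intact, "wb") as handle:
        handle.write(b"id,value\n1,42\n")
    altered = os.path.join(workdir, "altered.csv")
    with open(altered, "wb") as handle:
        handle.write(b"id,value\n1,43\n")

    intact_ok = verify_transfer(logged, intact)    # identical bytes
    altered_ok = verify_transfer(logged, altered)  # one byte differs
```

A single changed byte produces a different digest, which is what lets the audit trail show that a file was not modified during ingress or egress.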
ERICA: Safe outputs
• Project Controller checks and approves all outbound files
• Project Controller role is assigned according to policies of host institution
• Can be assigned to data custodian nominee (e.g. AIHW staff member) or
research team member
• Confidentialisation applied according to policies of host institution
• Users are trained in the principles of Statistical Disclosure Control (SDC)
• SDC tools provided
• Expert SDC help available to end users on-demand
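One common SDC technique is primary suppression of small cells in a frequency table before it is released. The sketch below is a generic illustration, not a description of the SDC tools provided in ERICA: the threshold of 5, the `suppress_small_cells` name and the example counts are all assumptions, and actual rules are set by the host institution and data custodians.

```python
def suppress_small_cells(counts, threshold=5):
    """Mask non-zero counts below the threshold; leave other cells unchanged."""
    marker = f"<{threshold}"
    return {
        category: (marker if 0 < count < threshold else count)
        for category, count in counts.items()
    }


# Made-up counts, for illustration only.
admissions_by_region = {"Region A": 120, "Region B": 3, "Region C": 0, "Region D": 5}
safe_table = suppress_small_cells(admissions_by_region)
```

This handles primary suppression only. If row or column totals are also published, a suppressed cell can sometimes be recovered by subtraction, so real SDC checks usually also consider secondary suppression.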
ERICA licensing model
• Institutions manage and operate their own ERICA instances
• Employ their own system administrator/s
• Responsible for user accounts
• Responsible for end-user software licenses (e.g. SAS)
• Provide tier one user support
• Set up user accounts, projects
• Triage user issues
• Apply their own policies
• Control the allocation of project roles, auditing etc
• Pay a license fee to UNSW
• Participate in user community and development roadmap
• Shared training and help desk (including SDC) resources and services
ERICA licensing model
• UNSW
• Has no access to other instances at all
• Manages the ERICA master code repository
• Manages development and testing
• Provides tier two support: escalates to AWS, or engages developers if a
code fix is required
• Other ERICA instances are ‘clones’
• Updates from master repository are pulled by each instance
• DevOps model for easy deployment of updates
ERICA: future plans
• Expand user base
• UNSW ERICA: cross-Faculty projects
• AIHW SRAE: soft launch 2019, hard launch 2020
• Additional ERICA instances
• eHealth NSW/NSW Ministry of Health
• Being evaluated by NSW govt Data Analytics Centre
• Four other Australian universities considering instances
• New and enhanced features
• Re-engineer some components to use microservices to further reduce
costs
• Further streamline setup of new instances of ERICA
• Further streamline on-demand end-user access to HPC
• Possibly diversify to other cloud providers that meet on-shore and security
standards (e.g. Australian government IRAP accreditation)
Australasia’s First Postgraduate Programs
in Health Data Science
Find out more about the programs:
+61 2 9385 9064
cbdrh.med.unsw.edu.au/study-with-us
Master of Science, Graduate Diploma, Graduate Certificate