Internet2 Technology Exchange 2018 October, 2018
Kris Steinhoff
PrivaScope: Enabling Data Analytics
Goals: An Ethical, Privacy Preserving Platform
● Enable researchers to ask aggregate questions across multiple data sets in an ethical, privacy-preserving manner. Allow for a privacy and ethics body review to ensure that only appropriate, aggregate questions are asked.
● Allow researchers to ask aggregate questions across multiple data sets while no researcher has direct access to the data sets.
● Enable U-M ITS to support such queries in a scalable, effective manner.
Wi-Fi Mobility Data
[Diagram: Wi-Fi mobility data. Recoverable labels: device location/time series (at rest/in transit), AP location, building, room, GIS (x, y, z), path, device location, device, signal strength, AP direction, multiple-AP triangulation, collision, cohort, identity, MAC address, unique ID, role, home base, time, AP name, campus, sub-campus. Several elements are tagged "GIS".]
PrivaScope 1.0 Portal
Overview
[Diagram: PrivaScope Infrastructure and PrivaScope Secure Enclave. Components: Data Sources (People, Wifi, ...), Data Loader, Enclave Database, Sandbox Database, Processing Node (running code), Researcher (running code), Researcher Portal. Arrow labels: direct query, anonymized subset, request study, schedule code run, results reviewed before release.]
Researcher Portal
- Study request
- Study approval
- Code run scheduling
- Results approval
Technical Architecture
[Diagram: Researcher Portal (web application), RabbitMQ, and Processing Node (Linux VM) running Docker.]
The Researcher Portal is a web application written in Django, using the django-fsm library to manage workflow. It is deployed outside the PrivaScope enclave, currently in an on-premises OpenShift cluster.
Job queueing is handled with the Celery Python library using RabbitMQ.
Jobs are run in Docker containers to achieve process isolation.
Horizontal Scaling
[Diagram: Researcher Portal (web application) and RabbitMQ, with three processing nodes: Linux VM, Kubernetes cluster, HPC VM.]
This architecture allows for horizontal scaling at the processing node level.
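The scaling idea can be illustrated with a minimal, self-contained sketch. It uses Python's standard `queue` and `threading` modules as a stand-in for RabbitMQ and Celery, so it shows only the pattern (several nodes consuming from one shared queue), not PrivaScope's actual code. The names `worker`, `run_jobs`, and the node labels are hypothetical.

```python
import queue
import threading


def worker(node_name, jobs, completed, lock):
    """Pull job IDs from the shared queue until a None sentinel arrives."""
    while True:
        job_id = jobs.get()
        if job_id is None:  # sentinel: no more work for this node
            break
        with lock:
            completed.append((node_name, job_id))


def run_jobs(job_ids, node_names):
    """Distribute jobs across nodes that all consume from one queue."""
    jobs = queue.Queue()
    completed = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(name, jobs, completed, lock))
        for name in node_names
    ]
    for t in threads:
        t.start()
    for job_id in job_ids:
        jobs.put(job_id)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker, queued after all jobs
    for t in threads:
        t.join()
    return completed


# Hypothetical node labels mirroring the three node types on the slide.
results = run_jobs(range(6), ["linux-vm", "kubernetes", "hpc-vm"])
```

Adding capacity in this pattern means starting another consumer on the same queue; the producer side does not change.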
Workflow
1. Researcher: submits algorithm/code through PrivaScope portal
2. PrivaScope Review Board: reviews privacy protection attributes of the code
IF APPROVED
3. PrivaScope staging processing: queues algorithm for execution in secure enclave
4. PrivaScope query engine: runs algorithm in secure enclave
5. PrivaScope Review Board: reviews the output to ensure privacy protection compliance
IF APPROVED
6. Output is released to researcher for publishing
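The two approval gates in this workflow can be modeled as a small state machine. The sketch below is plain Python and does not use the django-fsm API that the portal relies on; the state names, action names, and the `Job` class are hypothetical and chosen only to mirror the six steps.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"          # step 1: code submitted
    QUEUED = "queued"                # steps 2-3: code approved and queued
    RESULTS_READY = "results_ready"  # step 4: run finished, output awaits review
    RELEASED = "released"            # steps 5-6: output approved and released
    REJECTED = "rejected"


# Allowed transitions: (current state, action) -> next state.
TRANSITIONS = {
    (State.SUBMITTED, "approve_code"): State.QUEUED,
    (State.SUBMITTED, "reject"): State.REJECTED,
    (State.QUEUED, "finish_run"): State.RESULTS_READY,
    (State.RESULTS_READY, "approve_results"): State.RELEASED,
    (State.RESULTS_READY, "reject"): State.REJECTED,
}


class Job:
    def __init__(self):
        self.state = State.SUBMITTED

    def apply(self, action):
        """Move to the next state, or raise if the action is not allowed."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"{action!r} not allowed in state {self.state.value}")
        self.state = TRANSITIONS[key]
        return self.state


job = Job()
for action in ("approve_code", "finish_run", "approve_results"):
    job.apply(action)
```

Because every move goes through the transition table, results cannot reach the released state without passing both reviews.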
[Diagram: job workflow. Researcher Portal (web application): job submitted, job approved (yes), results approved (yes), results released. RabbitMQ sits between the portal and the job runner on the Processing Node (Linux VM), where Docker is used to build, run, and collect results.]
Researcher submits job code and dependencies.
Code is reviewed by the PrivaScope team.
If approved, the job is queued for execution.
The runner retrieves the job from the queue and builds the image in Docker.
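As a rough sketch of the build step, a runner could assemble a `docker build` command from the job's directory. The function name `build_command`, the image naming scheme, and the example path are assumptions for illustration; only `docker build --tag` is a real Docker CLI option.

```python
def build_command(job_id, context_dir):
    """Return the image tag and the argv for building a job's image."""
    image_tag = f"privascope-job-{job_id}"  # hypothetical naming scheme
    # The build context is the directory holding the job's Dockerfile and code.
    return image_tag, ["docker", "build", "--tag", image_tag, context_dir]


# Hypothetical job ID and path, for illustration only.
tag, argv = build_command(42, "/tmp/jobs/42")
# A runner could then execute it with: subprocess.run(argv, check=True)
```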
Job Format
Dockerfile (required)
FROM python3:latest
RUN mkdir /usr/src/app
WORKDIR /usr/src/app
COPY . /usr/src/app/
CMD venv/bin/python3 analysis.py
analysis.py
import os
from mongo import Connection
import pandas as pd

wifi = Connection(os.getenv('MONGODB_URL')).wifi
df = pd.DataFrame(list(wifi.find()))
# ... analysis ...
df.to_csv('results.csv')
The Dockerfile is used by PrivaScope to create a Docker image.
The researcher can include dependencies with their job to support their analysis code.
PrivaScope will populate several variables into the environment of the running container to allow the analysis code to connect to data in the enclave.
analysis.py
import os
from mongo import Connection
import pandas as pd

wifi = Connection(os.getenv('MONGODB_URL')).wifi
df = pd.DataFrame(list(wifi.find()))
# ... analysis ...
df.to_csv('/srv/data/results.csv')
The analysis code can output results to a standard location which will be collected by PrivaScope for review.
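A minimal sketch of analysis code that writes only aggregate output to that location might look like the following. It uses the standard library instead of pandas and MongoDB so that it runs anywhere; the `building` field, the `write_aggregate` function, and the sample records are hypothetical, and the demo writes to a temporary directory in place of `/srv/data`.

```python
import csv
import os
import tempfile
from collections import Counter


def write_aggregate(records, output_dir="/srv/data"):
    """Count records per building and write only the aggregate counts."""
    counts = Counter(r["building"] for r in records)
    path = os.path.join(output_dir, "results.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["building", "record_count"])
        for building, n in sorted(counts.items()):
            writer.writerow([building, n])
    return path


# Hypothetical records standing in for rows queried inside the enclave.
sample = [{"building": "A"}, {"building": "B"}, {"building": "A"}]
with tempfile.TemporaryDirectory() as tmp:
    with open(write_aggregate(sample, tmp), newline="") as f:
        rows = list(csv.reader(f))
```

Writing per-group counts in place of the raw records keeps the output in the aggregate form that the review step expects.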
The job is run in a Docker container. The container is not given any network access outside the PrivaScope enclave.
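One way such a run step could be expressed is sketched below. It assumes a Docker network named `privascope-enclave` that was created as an internal network (`docker network create --internal`), and a host directory mounted at `/srv/data` for result collection. The network name, paths, and `MONGODB_URL` value are hypothetical; the slides do not say how PrivaScope configures this.

```python
def run_command(image_tag, results_dir, env, network="privascope-enclave"):
    """Return the argv for running a job container on an enclave-only network."""
    # --rm removes the container after it exits; --network attaches it to the
    # named network only (assumed here to be an internal Docker network).
    argv = ["docker", "run", "--rm", "--network", network]
    for name, value in sorted(env.items()):
        argv += ["--env", f"{name}={value}"]
    # Mount a host directory at /srv/data so results can be collected afterwards.
    argv += ["--volume", f"{results_dir}:/srv/data", image_tag]
    return argv


# Hypothetical values, for illustration only.
argv = run_command(
    "privascope-job-42",
    "/var/privascope/results/42",
    {"MONGODB_URL": "mongodb://db.enclave.example/wifi"},
)
```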
The job results are returned to the web application workflow.
The results are reviewed by the PrivaScope team to ensure that they only contain aggregate results.
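A reviewer could be assisted by simple automated checks. The sketch below flags MAC-address-like strings in a results file, since MAC addresses are one of the identity fields in the Wi-Fi data. It is an illustrative helper only, not something the slides describe PrivaScope as using, and it does not replace human review; `find_mac_addresses` and the sample text are hypothetical.

```python
import re

# Matches MAC addresses written as six hex pairs separated by ':' or '-'.
MAC_PATTERN = re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b")


def find_mac_addresses(text):
    """Return (line_number, match) pairs for every MAC-like string in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in MAC_PATTERN.finditer(line):
            hits.append((line_number, match.group()))
    return hits


# Hypothetical results file containing one row that leaks an identifier.
sample = "building,count\nA,12\n00:1A:2B:3C:4D:5E,1\n"
flagged = find_mac_addresses(sample)
```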
If approved, the results are made available to the researcher.
Future Plans
● Refine PrivaScope 1.0 workflows and administration.
● Integration with Git (GitLab merge requests and/or CI/CD).
● Our goal for PrivaScope 2.0 is to build an API that allows users to query arbitrarily and have the API enforce privacy preservation.
Questions