Kris Steinhoff October, 2018 Internet2 Technology Exchange ...

Post on 04-Jul-2020

Internet2 Technology Exchange 2018 October, 2018

Kris Steinhoff

PrivaScope: Enabling Data Analytics

Goals: An Ethical, Privacy Preserving Platform

● Enable researchers to ask aggregate questions across multiple data sets in an ethical, privacy-preserving manner. Allow for review by a privacy and ethics body to ensure that only appropriate, aggregate questions are asked.

● Allow researchers to ask aggregate questions across multiple data sets while no researcher has direct access to the data sets.

● Enable U-M ITS to support such queries in a scalable, effective manner.

Wi-Fi Mobility Data

[Diagram labels: DEVICE LOCATION/TIME SERIES (AT REST/IN TRANSIT); AP LOCATION; BUILDING; ROOM; GIS (X, Y, Z); PATH; DEVICE LOCATION; DEVICE; SIGNAL STRENGTH; AP DIRECTION; MULTIPLE APs TRIANGULATION; COLLISION; COHORT; IDENTITY; MAC ADDRESS; UNIQUE ID; ROLE; HOME BASE; TIME; AP NAME; CAMPUS; SUB-CAMPUS. Many of the items are tagged GIS.]

PrivaScope 1.0 Portal

Overview

[Diagram: PrivaScope Infrastructure]

Components: Data Sources (People, Wifi, . . .); Data Loader Enclave; PrivaScope Secure Enclave; Database; Sandbox Database; Processing Node (running code); Researcher (running code); Researcher Portal.

Researcher Portal functions:

- Study request
- Study approval
- Code run scheduling
- Results approval

Flow labels: direct query; anonymized subset; request study; schedule code run; results reviewed before release.

Technical Architecture

[Diagram: Researcher Portal (web application); RabbitMQ; Processing Node (Linux VM) running Docker.]

Web application written in Django using the django-fsm library to manage workflow. Deployed outside the PrivaScope Enclave, currently in an on-prem OpenShift cluster.

Job queueing is handled with the Celery Python library, using RabbitMQ as the message broker.

Jobs are run in Docker containers to achieve process isolation.

Horizontal Scaling

[Diagram: Researcher Portal (web application) and RabbitMQ, with three processing nodes: Linux VM, Kubernetes cluster, and HPC VM.]

This architecture allows for horizontal scaling at the processing node level.
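A minimal sketch of why a shared queue makes this kind of scaling straightforward. It is not PrivaScope code: the standard-library `queue.Queue` stands in for RabbitMQ and Celery, threads stand in for processing nodes, and the job and worker names are hypothetical.

```python
import queue
import threading


def run_workers(jobs, worker_names):
    """Process every job exactly once, using one thread per worker name."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    handled = {}  # job -> name of the worker that processed it
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, so this worker exits
            with lock:
                handled[job] = name

    threads = [threading.Thread(target=worker, args=(n,)) for n in worker_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled


# Hypothetical jobs and node names, for illustration only.
assignments = run_workers(
    ["job-1", "job-2", "job-3", "job-4"],
    ["linux-vm", "kubernetes", "hpc-vm"],
)
```

Adding capacity in this model only means adding another consumer of the same queue; the producer side does not change.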

Technical Architecture

1. Researcher: submits algorithm/code through the PrivaScope portal.
2. PrivaScope Review Board: reviews privacy protection attributes of the code.

IF APPROVED

3. PrivaScope staging processing: queues algorithm for execution in the secure enclave.
4. PrivaScope query engine: runs algorithm in the secure enclave.
5. PrivaScope Review Board: reviews the output to ensure privacy protection compliance.

IF APPROVED

6. Output is released to the researcher for publishing.
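The six steps above can be read as a small state machine with two approval gates. The sketch below is plain Python rather than django-fsm, and the state names and the `REJECTED` state are assumptions made for illustration; the slides only describe the approved path.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    CODE_APPROVED = "code approved"
    QUEUED = "queued"
    EXECUTED = "executed"
    RESULTS_APPROVED = "results approved"
    RELEASED = "released"
    REJECTED = "rejected"  # assumed outcome when a review does not approve


# Allowed transitions; both review steps may also end in REJECTED.
TRANSITIONS = {
    State.SUBMITTED: {State.CODE_APPROVED, State.REJECTED},
    State.CODE_APPROVED: {State.QUEUED},
    State.QUEUED: {State.EXECUTED},
    State.EXECUTED: {State.RESULTS_APPROVED, State.REJECTED},
    State.RESULTS_APPROVED: {State.RELEASED},
}


class Job:
    def __init__(self):
        self.state = State.SUBMITTED

    def advance(self, new_state):
        """Move to new_state, refusing any transition that skips a gate."""
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        self.state = new_state


job = Job()
for step in (
    State.CODE_APPROVED,
    State.QUEUED,
    State.EXECUTED,
    State.RESULTS_APPROVED,
    State.RELEASED,
):
    job.advance(step)
```

Encoding the transitions as data means a job cannot reach `RELEASED` without passing through both approval states.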

Workflow

[Diagram: workflow. Researcher Portal (web application): job submitted; job approved? (yes); results approved? (yes); results released. RabbitMQ sits between the portal and the Processing Node (Linux VM), where the job runner uses Docker to build, run, and collect results.]

Researcher submits job code and dependencies.

Code is reviewed by the PrivaScope team.

If approved, the job is queued for execution.

The runner retrieves the job from the queue and builds the image in Docker.

Job Format

Dockerfile (required)

FROM python3:latest
RUN mkdir /usr/src/app
WORKDIR /usr/src/app
COPY . /usr/src/app/
CMD venv/bin/python3 analysis.py

analysis.py

import os
from mongo import Connection
import pandas as pd

# MONGODB_URL is read from the container environment.
wifi = Connection(os.getenv('MONGODB_URL')).wifi
# Load the Wi-Fi records into a DataFrame.
df = pd.DataFrame(list(wifi.find()))
# ... analysis ...
df.to_csv('results.csv')

The Dockerfile is used by PrivaScope to create a Docker image.

The researcher can include dependencies with their job to support their analysis code.

PrivaScope populates several variables in the environment of the running container so that the analysis code can connect to data in the enclave.
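A short sketch of how analysis code might read these variables defensively. Only `MONGODB_URL` appears in the slides; the helper name `enclave_settings` and the example URL are hypothetical.

```python
import os


def enclave_settings(environ=os.environ, required=("MONGODB_URL",)):
    """Return the required connection settings, failing early if any is unset."""
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in required}


# Example with an explicit mapping instead of the real process environment.
settings = enclave_settings({"MONGODB_URL": "mongodb://db.example/wifi"})
```

Failing at startup with a clear message is usually easier to debug than a connection error deep inside the analysis, especially when the researcher cannot inspect the enclave directly.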

analysis.py

import os
from mongo import Connection
import pandas as pd

# MONGODB_URL is read from the container environment.
wifi = Connection(os.getenv('MONGODB_URL')).wifi
# Load the Wi-Fi records into a DataFrame.
df = pd.DataFrame(list(wifi.find()))
# ... analysis ...
# Write results to the standard location.
df.to_csv('/srv/data/results.csv')

The analysis code can output results to a standard location which will be collected by PrivaScope for review.
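As an illustration of the kind of aggregate output that could be written to that location, the sketch below counts observations per building and writes them to `results.csv` in a results directory. It uses only the standard library; the `building` field, the column names, and the sample records are hypothetical, and a temporary directory stands in for `/srv/data`.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def write_building_counts(records, results_dir):
    """Write one row per building with the number of observations seen there."""
    counts = Counter(record["building"] for record in records)
    out = Path(results_dir) / "results.csv"
    with out.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["building", "observation_count"])
        for building, n in sorted(counts.items()):
            writer.writerow([building, n])
    return out


# Hypothetical records; real jobs would read them from the enclave database.
sample = [
    {"building": "Library"},
    {"building": "Union"},
    {"building": "Library"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = write_building_counts(sample, tmp)
    written = path.read_text()
```

Only the per-building totals leave the function; no individual record is written to the output file.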

Workflow

The job is run in a Docker container. The container is not given any network access outside the PrivaScope enclave.
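The slides do not say how the network restriction is implemented. One possibility, shown here purely as an assumption, is to attach the container to a dedicated Docker network (for example one created with `docker network create --internal`). The sketch only builds the `docker run` argument list; the image tag, network name, and host path are hypothetical.

```python
def docker_run_command(image, network, env, results_dir):
    """Build a `docker run` argument list for one job (nothing is executed)."""
    cmd = ["docker", "run", "--rm", "--network", network]
    for name, value in sorted(env.items()):
        cmd += ["--env", f"{name}={value}"]
    # Mount a host directory at the standard results location.
    cmd += ["--volume", f"{results_dir}:/srv/data", image]
    return cmd


command = docker_run_command(
    image="privascope-job-42",       # hypothetical image tag
    network="privascope-enclave",    # hypothetical restricted network
    env={"MONGODB_URL": "mongodb://db.example/wifi"},
    results_dir="/var/lib/jobs/42",  # hypothetical host path
)
```

Passing the list to `subprocess.run(command, check=True)` would start the container; building the list separately keeps the options easy to review and test.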

The job results are returned to the web application workflow.

The results are reviewed by the PrivaScope team to ensure that they only contain aggregate results.
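The review described here is done by people. As a hedged illustration of a check that could assist such a review, the sketch below flags result rows whose count falls below a minimum cell size, a common disclosure-control heuristic that the slides do not mention. The column names and the threshold are hypothetical.

```python
import csv
import io


def small_cells(csv_text, count_column, minimum):
    """Return the rows whose count is below the minimum cell size."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row[count_column]) < minimum]


# Hypothetical results file contents.
results = "building,observation_count\nLibrary,25\nUnion,1\n"
flagged = small_cells(results, "observation_count", minimum=5)
```

A row with a count of 1 describes a single observation, so it is effectively not an aggregate; flagging such rows gives reviewers a starting point rather than a decision.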

If approved, the results are made available to the researcher.

Future Plans

● Refine PrivaScope 1.0 workflows and administration.

● Integration with Git (GitLab merge requests and/or CI/CD).

● Our goal for PrivaScope 2.0 is to build an API that allows users to query arbitrarily and have the API enforce privacy preservation.

Questions