Kris Steinhoff, October 2018, Internet2 Technology Exchange

Internet2 Technology Exchange 2018 October, 2018 Kris Steinhoff


Page 1

PrivaScope: Enabling Data Analytics

Internet2 Technology Exchange 2018 October, 2018

Kris Steinhoff

Page 2

Goals: An Ethical, Privacy-Preserving Platform

● Enable researchers to ask aggregate questions across multiple data sets in an ethical, privacy-preserving manner. Allow for review by a privacy and ethics body to ensure that only appropriate, aggregate questions are asked.

● Allow researchers to ask aggregate questions across multiple data sets while no researcher has direct access to the data sets.

● Enable U-M ITS to support such queries in a scalable, effective manner.

Page 3

Wi-Fi Mobility Data

[Diagram: Wi-Fi mobility data elements, many of them tagged GIS. Labels: device location/time series (at rest/in transit), AP location, building, room, GIS (x, y, z), path, device location, device, signal strength, AP direction, multiple APs triangulation, collision, cohort, identity, MAC address, unique ID, role, home base, time, AP name, campus, sub-campus.]

Page 4

PrivaScope 1.0 Portal

Page 5

Overview

[Diagram: PrivaScope infrastructure. Components: Data Sources (People, Wifi, . . .), Data Loader Enclave, PrivaScope Secure Enclave, Database, Sandbox Database, Processing Node (running code), Researcher (running code), Researcher Portal. Arrow labels: direct query, anonymized subset, request study, schedule code run, results reviewed before release.]

Researcher Portal:

- Study request
- Study approval
- Code run scheduling
- Results approval

Page 6

Technical Architecture

[Diagram: the PrivaScope infrastructure diagram from the Overview slide, shown again.]

Page 7

Technical Architecture

[Diagram: Processing Node (Linux VM) running Docker; RabbitMQ; Researcher Portal (web application).]

Page 8

Technical Architecture

Researcher Portal: a web application written in Django, using the django-fsm library to manage workflow. It is deployed outside the PrivaScope enclave, currently in an on-prem OpenShift cluster.

Page 9

Technical Architecture

Job queueing is handled with the Celery Python library, using RabbitMQ as the message broker.

Page 10

Technical Architecture

Jobs are run in Docker containers to achieve process isolation.

Page 11

Horizontal Scaling

[Diagram: Researcher Portal (web application) and RabbitMQ, with three processing nodes: a Linux VM, a Kubernetes cluster, and an HPC VM.]

This architecture allows for horizontal scaling at the processing node level.
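One way to picture this: every processing node is a worker pulling from the same job queue, so adding a node adds capacity without changing the portal. The sketch below is only an illustration. It uses Python's standard `queue` and `threading` modules as stand-ins for RabbitMQ and the processing nodes, and the function name and worker names are hypothetical rather than part of PrivaScope.

```python
import queue
import threading


def run_workers(job_ids, worker_names):
    """Drain one shared queue with several workers; return {job_id: worker}."""
    jobs = queue.Queue()
    for job_id in job_ids:
        jobs.put(job_id)

    handled = {}
    lock = threading.Lock()

    def worker(name):
        # Each worker keeps taking jobs until the shared queue is empty.
        while True:
            try:
                job_id = jobs.get_nowait()
            except queue.Empty:
                return
            with lock:
                handled[job_id] = name

    threads = [threading.Thread(target=worker, args=(name,)) for name in worker_names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled


# Hypothetical node names, mirroring the three node types on the slide.
handled = run_workers(range(20), ["linux-vm", "kubernetes", "hpc-vm"])
```

Each job is handled exactly once, and which worker picks up which job depends on timing, which is the property that lets nodes be added or removed freely.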

Page 12

Technical Architecture

[Diagram: Processing Node (Linux VM) running Docker; RabbitMQ; Researcher Portal (web application).]

Page 13

Workflow

1. Researcher: submits algorithm/code through PrivaScope portal
2. PrivaScope Review Board: reviews privacy protection attributes of the code

IF APPROVED
3. PrivaScope staging processing: queues algorithm for execution in secure enclave
4. PrivaScope query engine: runs algorithm in secure enclave
5. PrivaScope Review Board: reviews the output to ensure privacy protection compliance

IF APPROVED
6. Output is released to researcher for publishing
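The gated steps in this workflow can be read as a small state machine. The sketch below is plain Python, not the portal's django-fsm code. The state and action names are hypothetical, and the rejected state is an assumption, since the slides do not say what happens when a review fails.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"              # step 1: code submitted
    QUEUED = "queued"                    # step 3: approved and queued
    RESULTS_PENDING = "results_pending"  # step 4 done, awaiting output review
    RELEASED = "released"                # step 6: output released
    REJECTED = "rejected"                # assumption: a failed review ends here


# Allowed transitions, keyed by (current state, action).
TRANSITIONS = {
    (State.SUBMITTED, "approve_code"): State.QUEUED,
    (State.SUBMITTED, "reject_code"): State.REJECTED,
    (State.QUEUED, "run"): State.RESULTS_PENDING,
    (State.RESULTS_PENDING, "approve_results"): State.RELEASED,
    (State.RESULTS_PENDING, "reject_results"): State.REJECTED,
}


def advance(state, action):
    """Return the next state, or raise if the action is not allowed here."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not allowed from {state.value!r}") from None


# Walk the fully approved path.
state = State.SUBMITTED
for action in ("approve_code", "run", "approve_results"):
    state = advance(state, action)
```

Keeping the transitions in one table means results cannot be released without passing through both review gates.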

Page 14

Technical Architecture

[Diagram: Researcher Portal (web application): Job submitted, Job approved? (yes), Results approved? (yes), Results released. RabbitMQ sits between the portal and the job runner on the Processing Node (Linux VM). The job runner uses Docker to Build, Run, and Collect Results.]

Page 15

Workflow

Researcher submits job code and dependencies.

Page 16

Workflow

Code is reviewed by the PrivaScope team.

Page 17

Workflow

If approved, the job is queued for execution.

Page 18

Workflow

The runner retrieves the job from the queue and builds the image in Docker.
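A minimal sketch of that runner step, under stated assumptions: a standard-library `queue.Queue` stands in for RabbitMQ, the job is a plain dict, and the `privascope-job-<id>` tag scheme and the function names are hypothetical. The only Docker detail relied on is the `docker build --tag <name> <context>` command form.

```python
import queue


def build_command(job):
    """Return the `docker build` argument list for a queued job."""
    tag = f"privascope-job-{job['id']}"  # hypothetical naming scheme
    return ["docker", "build", "--tag", tag, job["context_dir"]]


def next_build(jobs):
    """Take the next job off the queue and describe how to build its image."""
    job = jobs.get()
    return job, build_command(job)


# Hypothetical job: the submitted code and Dockerfile unpacked into a directory.
jobs = queue.Queue()
jobs.put({"id": 42, "context_dir": "/tmp/job-42"})
job, cmd = next_build(jobs)
```

On a host with Docker installed, passing `cmd` to `subprocess.run(cmd, check=True)` would perform the build.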

Page 19

Job Format

analysis.py

import os

import pandas as pd
from pymongo import MongoClient

client = MongoClient(os.getenv('MONGODB_URL'))
# Assumption: the URL names the database, and `wifi` is a collection in it.
wifi = client.get_default_database()['wifi']

df = pd.DataFrame(list(wifi.find()))

# ... analysis …

df.to_csv('results.csv')

Dockerfile (required)

FROM python3:latest
RUN mkdir /usr/src/app
WORKDIR /usr/src/app
COPY . /usr/src/app/
CMD venv/bin/python3 analysis.py

Page 20

Job Format

The Dockerfile is used by PrivaScope to create a Docker image.

Page 21

Job Format

The researcher can include dependencies with their job to support their analysis code.

Page 22

Job Format

PrivaScope will populate several variables into the environment of the running container (such as MONGODB_URL, read by analysis.py) to allow the analysis code to connect to data in the enclave.
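Because the job depends on variables that PrivaScope injects, analysis code may want to fail early with a clear message when one is missing. The helper below is a hypothetical sketch. Only the `MONGODB_URL` name comes from the slides; the function, the exception class, and the example URL are illustrative assumptions.

```python
import os


class MissingSettingError(RuntimeError):
    """Raised when an expected environment variable is absent or empty."""


def require_env(name, environ=None):
    """Return the value of `name`, or raise a descriptive error."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise MissingSettingError(f"{name} is not set in the container environment")
    return value


# Example with an explicit mapping; inside a job, omit the second argument.
mongodb_url = require_env("MONGODB_URL", {"MONGODB_URL": "mongodb://db.example/privascope"})
```

Passing the mapping explicitly also makes the analysis code easy to exercise outside the enclave.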

Page 23

Job Format

The analysis code can output results to a standard location, which will be collected by PrivaScope for review.

analysis.py

import os

import pandas as pd
from pymongo import MongoClient

client = MongoClient(os.getenv('MONGODB_URL'))
# Assumption: the URL names the database, and `wifi` is a collection in it.
wifi = client.get_default_database()['wifi']

df = pd.DataFrame(list(wifi.find()))

# ... analysis …

df.to_csv('/srv/data/results.csv')
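As one illustration of writing to that standard location without pandas, the sketch below uses only the standard library. The `/srv/data` directory is taken from the slide; the helper name, its parameters, and the example column names are hypothetical.

```python
import csv
from pathlib import Path

RESULTS_DIR = Path("/srv/data")  # standard results location shown on the slide


def write_results(rows, fieldnames, results_dir=RESULTS_DIR, filename="results.csv"):
    """Write aggregate rows as CSV into the results directory; return the path."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / filename
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

A job would call, for example, `write_results([{"building": "A", "device_count": 12}], ["building", "device_count"])`, while a local test run can pass a temporary directory as `results_dir`.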

Page 24

Workflow

The job is run in a Docker container. The container is not given any network access outside the PrivaScope enclave.
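The slides do not say how the network restriction is enforced. One possible approach, shown here purely as an assumption, is to attach the container to a user-defined Docker network that has no route outside the enclave (for instance one created with `docker network create --internal`). The function name, the network name, the example values, and the results volume mount are all hypothetical.

```python
def run_command(image, network, env, results_dir):
    """Return a `docker run` argument list for one job container."""
    cmd = ["docker", "run", "--rm", "--network", network]
    # Pass the connection settings into the container environment.
    for name, value in sorted(env.items()):
        cmd += ["--env", f"{name}={value}"]
    # Assumption: results are collected from a host directory mounted at /srv/data.
    cmd += ["--volume", f"{results_dir}:/srv/data", image]
    return cmd


cmd = run_command(
    image="privascope-job-42",
    network="enclave-only",
    env={"MONGODB_URL": "mongodb://db.example/privascope"},
    results_dir="/var/privascope/results/42",
)
```

Building the argument list as data, rather than a shell string, keeps the network and environment settings easy to audit before anything is executed.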

Page 25

Workflow

The job results are returned to the web application workflow.

Page 26

Workflow

The results are reviewed by the PrivaScope team to ensure that they contain only aggregate results.

Page 27

Workflow

If approved, the results are made available to the researcher.

Page 28

Future Plans

● Refine PrivaScope 1.0 workflows and administration.

● Integration with Git (GitLab merge requests and/or CI/CD).

● Our goal for PrivaScope 2.0 is to build an API that allows users to query arbitrarily and have the API enforce privacy preservation.

Page 29

Questions