
Big Data & Faster Research

How can SWITCHengines boost your research?

April 27th 2017

Content

-  SWITCH and SWITCHengines at a glance: Konrad Jaggi and Jens-Christian Fischer, SWITCH
-  Use of SWITCHengines for Big Data: Anthony Strittmatter, University of St. Gallen
-  SCALE-UP Project: Patrik Schnellmann, SWITCH
-  How to get my Big Data cluster: Piyush Harsh, ZHAW ICClab
-  How to use SWITCHengines @HSG: Christian Lazur, University of St. Gallen
-  Discussion and Hands-on

How can I use SWITCHengines

Order a new SWITCHengine@HSG:

Step 1: Sign up for Cloud Services at https://cloud-id.switch.ch/
Step 2: Write an email to [email protected]. The email must contain a valid billing address.
Step 3: You will receive an email with a link to your SWITCHengines end-user portal. Click on it.
Step 4: Log in and start your engines.

Today: Only steps 3 and 4 are relevant :-)

Costs

A «running» SWITCHengine is billed for:
-  CPU
-  RAM
-  Disk storage
-  IP address

A SWITCHengine that is «shut down» is billed only for:
-  Disk storage
-  IP address

Item Pricing (current as of Feb. 22nd 2017)

Item                  CHF / day   CHF / month   CHF / year
CPU Core              0.4932      15.00         180.00
RAM (1 GB)            0.2466      7.50          90.00
Disk Storage (1 GB)   0.0015      0.04583       0.55
SSD Storage (1 GB)    0.0066      0.20          2.40
IPv4 Address          0.0274      0.83          10.00

https://www.switch.ch/engines/

Item                                                          CHF / day   CHF / month   CHF / year
Small (1 CPU, 1 GB RAM, 20 GB Disk, 1 IPv4)                   0.80        24.25         291.00
Medium (2 CPU, 2 GB RAM, 20 GB Disk, 1 IPv4)                  1.54        46.75         561.00
Compute Intensive (8 CPU, 16 GB RAM, 50 GB Disk, 1 IPv4)      7.99        243.13        2’917.50
High IOPS DB Server (4 CPU, 32 GB RAM, 200 GB SSD, 1 IPv4)    11.10       337.50        4’050.00
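The per-item prices can be combined to estimate what a given machine costs per year. The following is a minimal Python sketch, assuming the price of a machine is simply the sum of its components and that a shut-down machine is billed for storage and IP address only. The function and dictionary names are illustrative and not part of any SWITCH tooling.

```python
# Yearly unit prices in CHF, copied from the item pricing list (Feb. 22nd 2017).
YEARLY_PRICE_CHF = {
    "cpu_core": 180.00,
    "ram_gb": 90.00,
    "disk_gb": 0.55,
    "ssd_gb": 2.40,
    "ipv4": 10.00,
}


def yearly_cost(cpu=0, ram_gb=0, disk_gb=0, ssd_gb=0, ipv4=0, running=True):
    """Estimate the yearly cost in CHF of one virtual machine.

    Assumption: the price is the plain sum of the components, and a
    shut-down machine is billed for storage and IP address only.
    """
    storage_and_ip = (
        disk_gb * YEARLY_PRICE_CHF["disk_gb"]
        + ssd_gb * YEARLY_PRICE_CHF["ssd_gb"]
        + ipv4 * YEARLY_PRICE_CHF["ipv4"]
    )
    if not running:
        return round(storage_and_ip, 2)
    compute = cpu * YEARLY_PRICE_CHF["cpu_core"] + ram_gb * YEARLY_PRICE_CHF["ram_gb"]
    return round(compute + storage_and_ip, 2)


# The "Small" configuration: 1 CPU, 1 GB RAM, 20 GB disk, 1 IPv4 address.
small = yearly_cost(cpu=1, ram_gb=1, disk_gb=20, ipv4=1)
small_shut_down = yearly_cost(cpu=1, ram_gb=1, disk_gb=20, ipv4=1, running=False)
print(small, small_shut_down)
```

Under these assumptions the Small configuration comes to 291.00 CHF per year while running, and 21.00 CHF per year if it stays shut down.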

Hands-on

For Beginners…

1.  Setting up a SWITCHengine (guide in German): http://www.intranet.unisg.ch/~/media/Internet/Content/Dateien/Intranet/Services_Richtlinien/IT/Services/Anleitung%20SWITCHengine%20Einrichten.pdf?fl=de

2.  Tips and tricks for daily work (guide in German): http://www.intranet.unisg.ch/~/media/Internet/Content/Dateien/Intranet/Services_Richtlinien/IT/Services/Tipps%20und%20Tricks%20f%c3%bcr%20die%20t%c3%a4gliche%20Arbeit.pdf?fl=de

Big Data and Faster Research

How can SWITCHengines boost your research?

[email protected] [email protected]

University of St. Gallen, April 27, 2017

© 2017 SWITCH

Agenda

• SWITCH
• SCALE-UP project
• SWITCHengines
• Use Cases
• Roadmap for the coming months

Foundation Purpose

Excerpt from the Deed of Foundation, Berne, 22 October 1987

“The foundation has as its objective to create, promote and offer the necessary basis for the effective use of modern methods of telecomputing in teaching and research in Switzerland, to be involved in and to support such methods. It is a non-profit foundation that does not pursue commercial aims.”

Integrated Offer

• Video Management
• Collaboration
• Procurement
• Infrastructure & Data Services
• Network
• Registry
• Trust & Identity
• Security

Infrastructure and Data Services

SWITCH made – Swiss made
•  Swiss law and data location
•  In accordance with the needs of the institutions, and controlled by them
•  Flexible usage and charging model
•  Simple administration; integrated into the academic network of SWITCH; security and identity services included
•  Support for academic use cases
•  Created together with you

The SCALE-UP project

Goals
• Create academic services on the cloud infrastructure
• Target user groups are researchers and lecturers

Duration
• August 2015 – December 2017

Project partners
• 8 project partners from universities

Funding
• Co-funded by the program “Scientific Information” of swissuniversities, with matching funds from the institutions

Project partners

Topics

• Research in the Cloud
• Classroom in the Cloud
• Big Data Analytics
• Statistical Workbench
• Scientific Data Pools
• Collaborative Apps
• Container Technologies
• Reporting / Accounting / Billing
• VM Management Tools
• Virtual Private Cloud
• Marketplace

Big Data Analytics

Goal
• Learn how to use Big Data analysis tools
• Provision a ready-to-use Hadoop and Apache Spark cluster

Work Package Lead
• Piyush Harsh, ZHAW

Sandbox on SWITCHengines with Zeppelin, Spark, Jupyter: https://help.switch.ch/engines/documentation/switch-official-images/zeppelin/

Scientific Data Pools

Goal
• Store large quantities of data
• Manage your data sets and share them with other researchers

Work Package Lead
• Sofiane Sarni, EPFL

Examples
• 1000 genomes, Common crawl, Google ngrams

See http://datasets.cloud.switch.ch

Virtual Private Cloud

Goal
• Integrate SWITCHengines virtual machines into the campus network
• Easily access systems behind the firewall
• Benefit from scalability and redundancy

Work Package Lead
• Tom Schönenberger, FHS St.Gallen

• Currently in beta phase: we are looking for institutions that want to pilot this service during the 2nd half of 2017.

SWITCHengines Status

By Joydeep (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Engines across Switzerland

Institutions shown on the map: UniGE, UNIL, UniBE, UZH, ETH Zurich, UniFR, ZHAW, EPFL, USI, UniBAS, UniSG, HSR, SUPSI, NTB, BFH, HSLU, FHSG, FHNW, PHSG, PSI, HES-SO, PHZH, FFHS

• 29+ core institutions
• 10+ extended entities

Datacenters in Zurich and Lausanne

• At Swiss universities
• Fully integrated into the SWITCH network infrastructure

Current Status

• Long-term “native” cloud DevOps know-how (design and operations)
• SWITCHengines in production internally since 2014
• Public services since 2015/2016
• Several SWITCH services run on it
• Over 1000 individual users and around 100 research & education projects online

Expansion of Datacenters

• The Lausanne datacenter of SWITCHengines was expanded with 16 compute and 16 storage nodes, adding:
    – 768 vCPU cores for a total of 1216 cores
    – 4 TB of RAM for a total of 7.4 TB
    – 768 TB raw storage for a total of 1.5 PB
• The Zurich datacenter added another 16 compute nodes:
    – 768 vCPU cores for a total of 2248 cores
    – 4 TB of RAM for a total of 12.6 TB
• Plus planning for 3 PB in Q2 2017

Software Stack

Use Cases: Projects & National Services

• 5 “Scientific Information” projects by swissuniversities
    – DLCM: Data Lifecycle Management
    – GeoData 4 Swiss Edu: GIS
    – HEPIA: Simulation
    – NEICH: High Energy Physics
    – SCITAS: eScience
• Various SWITCH internal services
    – SWITCHdrive (52 TB data, 22’000 users)
    – SWITCHfilesender
    – SWITCHtube…

Use Case: Machine Learning

“We require a secure IT environment which enables us to compute the estimation results in a reasonable amount of time. A big advantage is that the data remains in Switzerland.”

– Anthony Strittmatter, Assistant Professor for Econometrics

Use Case: Geodata

“In our research we ask ourselves how to organize the spatial pattern to ensure e.g. a quality of service public in the future … (by using) SWITCHengines, we can run our systems optimally and have enough power and performance.”

– Dirk Engelke, Professor of Spatial Development at HSR

Use Case: Managing Big Datasets

“I could imagine going for SWITCHengines … So we would not have to care about hardware maintenance and still get the compute power when we need it. Furthermore, I could profit from the scalability: With a higher number of machines I get the results for my research much faster!”

– Andri Lareida, Institute for Informatics, University of Zurich

Engineering (2016/2017)

•  Local SSD VMs
•  OpenStack upgrades: Kilo / Liberty / Mitaka
•  Expansion ZH/LS (total 48 Compute Nodes, 32 Storage Nodes)
•  Control plane split
•  IPv6 for private network & VMs
•  Swift Object Store
•  Network performance improvements
•  Sending Bills
•  Storage improvements: Snapshot creation/deletion
•  17 patches upstream
•  Monitoring / Grafana
•  Centralized logging (ELK)
•  FIWARE – cloud as a service
•  Network redundancy
•  Ceph storage software upgrade from Hammer to Jewel

Roadmap 2017

Local SSD storage / Containers

VMs with local SSD storage
• For high-IOPS tasks, you can use VMs that have local SSD storage (instead of shared Ceph storage)
• Size up to 400 GB

Containers
• For many projects, it is less a question of configuration than of how many machines are needed
• Generally, our users envision Docker containers (that then start subcontainers from the application)

Reporting / Administration

Reporting
• Reporting to customers about usage and costs of VMs and storage

Administration
• In 2017, we will be able to delegate admin tasks to the institutions (Projects / Subprojects / Roles / Groups / Quota Management)

Use Case: SWITCHengines for Empirical Economic Research

April 27, 2017

Anthony Strittmatter

Swiss Institute for Empirical Economic Research (SEW)

Prof. Christina Felfe Dr. Alex Krumer

Prof. Michael Lechner

Michael Knaus

Carina Steckenleiter Daniel Goller

Gabriel Okasa

Michael Zimmert

Examples of Research Questions

1.  Active labour market policy (ALMP) evaluations (rich social security data)
    –  Who benefits most from participation?
    –  Selection of sequential training courses?

2.  Industrial organisation (unstructured data scraped from the internet)
    –  Are the trade reactions to the Volkswagen emission scandal coherent with classical models of adverse selection?

3.  Sport economics (minute-by-minute soccer data)
    –  Prediction of Bundesliga outcomes (www.sew.unisg.ch/soccer_analytics)

Experience with SWITCHengines

•  32 GB RAM, 4 cores, Windows operating system.
•  Remote desktop connection.
•  Installation of the statistical software.
•  SWITCHengines is responsible for maintenance.
•  SWITCHfilesender & SWITCHdrive.
•  In our experience, the connection to the virtual machine is stable and fast.
•  After use, the instance can be paused to save costs and relaunched within 5 minutes.
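How much pausing saves depends on how many days per year the instance actually runs. The following is a rough Python sketch under clearly stated assumptions: unit prices are taken from the SWITCHengines price list (Feb. 22nd 2017), billing is prorated per day over a 365-day year, a paused instance is billed like a shut-down one (storage and IP address only), and the 200 GB SSD size is a hypothetical value chosen for illustration. The function name is likewise illustrative.

```python
# Yearly unit prices in CHF from the SWITCHengines price list (Feb. 22nd 2017).
CPU_CORE, RAM_GB, SSD_GB, IPV4 = 180.00, 90.00, 2.40, 10.00


def yearly_cost_with_pauses(cores, ram_gb, ssd_gb, running_days, ipv4=1):
    """Estimate the yearly cost in CHF of a VM that runs only part of the year.

    Assumptions: billing is prorated per day over a 365-day year, and a
    paused machine is billed like a shut-down one (storage and IP only).
    """
    if not 0 <= running_days <= 365:
        raise ValueError("running_days must be between 0 and 365")
    always_billed = ssd_gb * SSD_GB + ipv4 * IPV4
    compute = (cores * CPU_CORE + ram_gb * RAM_GB) * running_days / 365
    return round(always_billed + compute, 2)


# Hypothetical example: 4 cores, 32 GB RAM, an assumed 200 GB SSD.
full_year = yearly_cost_with_pauses(4, 32, 200, running_days=365)
hundred_days = yearly_cost_with_pauses(4, 32, 200, running_days=100)
print(full_year, hundred_days)
```

Whether pausing is actually billed this way should be checked against the current SWITCHengines terms; the sketch only illustrates the shape of the trade-off.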

Possible Current Disadvantages (of SWITCHengines and Cloud Computing in General)

1.  Very expensive (approx. 4,000 CHF/year)
    –  Much computing power for a short time period
    –  Flexible working environment
    –  No maintenance

2.  Scalability of instance?
    –  Automatic shutdown when CPU is not used?
    –  Automatic (re-)scaling of RAM and cores?

3.  Data security?
    –  Security leak when remote connection is used?
    –  Is SWITCHengines ‘safe enough’ to store social security data?

Computational Social Science Workshop

•  Jointly with the University of Konstanz
•  Next date: September 25
•  http://bigdata.unisg.ch

Thank you for your attention! [email protected] www.anthonystrittmatter.com

Distributed Computing as a Service

What is DISCO?

Cluster administration is hard!
● Hadoop + Spark has hundreds of configuration parameters
● It takes time to master them!

DISCO provides a one-stop platform for provisioning Big Data clusters on demand
➔ Create clusters on demand, and set the resource size as per your needs
➔ Use it, analyze data, evaluate results
➔ Once done, delete it to minimize your costs!
➔ Optimized defaults based on Hortonworks, Cloudera and Amazon implementations

Ideally suited for
★ Students, researchers, lecturers
★ Executing transfer projects, consultancies, etc.
★ Analytics and ML labs: clusters shared with teams of students

What is DISCO?

★ Cloud orchestration framework for deploying distributed computing clusters on SWITCHengines
★ Automatic one-click provisioning of Big Data processing frameworks
★ Dashboard tailored to professor-student interaction
    ○ But you can generalize the roles: admin, user groups!
★ Easily extensible if new components are needed
★ Privileged end user (professor) can select from a list what is to be installed and make the cluster available to a group of students

DISCO Dashboard & Backend

Frameworks

● Currently, the following Big Data processing frameworks can be provisioned automatically via DISCO:

… but new components can be added in just minutes

Ex. Use Case: ZHAW (Prof. Kurt Stockinger)

DISCO is being used to provision and manage the data-analytics clusters for lab exercises

● Professor: Dr. Kurt Stockinger
● Name of the course: Information Engineering 2
● Number of students enrolled: 35
● Tools used in the lab: Apache Spark + Hadoop (DataFrames, RDD, machine learning, HDFS)

Demo

● Walk through the professor’s interface
● Walk through the student’s interface
● Key entities in DISCO
    ○ Infrastructure
    ○ Clusters
    ○ Groups

Further information

● https://icclab.github.io/disco/
● Wiki: https://github.com/icclab/disco/wiki