Distributed Computing Using...

Albert-Ludwigs-Universität Freiburg Practical / Praktikum WS17/18 October 18th, 2017 Distributed Computing Using Spark Prof. Dr. Georg Lausen Anas Alzogbi Victor Anthony Arrascue Ayala


Page 1: Distributed Computing Using Sparkdbis.informatik.uni-freiburg.de/content/courses/WS1718/Praktikum... · Albert-Ludwigs-Universität Freiburg Practical / Praktikum WS17/18 October

Albert-Ludwigs-Universität Freiburg

Practical / Praktikum WS17/18

October 18th, 2017

Distributed Computing Using Spark

Prof. Dr. Georg Lausen

Anas Alzogbi

Victor Anthony Arrascue Ayala


Agenda

Introduction to Spark

Case-study: Recommender system for scientific papers

Organization

Hands-on session

18.10.2017 Distributed Computing Using Spark WS17/18 2



Introduction to Spark

Distributed programming

MapReduce

Spark


Distributed programming - problem

Data grows faster than processing capabilities

- Web 2.0: users generate content

- Social networks, online communities, etc.

Source: https://www.flickr.com/photos/will-lion/2595497078

Big Data

Source: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/

Source: http://www.bigdata-startups.com/open-source-tools/

Big Data

Buzzword

Often less-structured

Requires different techniques, tools, approaches

- To solve new problems or old ones in a better way


Network Programming Models

Requires a communication protocol for programming parallel computers (slow)

- MPI (Message Passing Interface)

Placement of data and code across the network has to be managed manually

No failure management

Network problems not solved (e.g. stragglers)


Data Flow Models

Higher level of abstraction: algorithms are parallelized on large clusters

Fault-recovery by means of data replication

Job divided into a set of independent tasks

- Code is shipped to where the data is located

Good scalability


MapReduce – Key ideas

1. Problem is split into smaller problems (map step)

2. Smaller problems are solved in a parallel fashion

3. Finally, the solutions to the smaller problems are combined into a solution of the original problem (reduce step)
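The three steps can be sketched in plain Python on a single machine. This is only an illustration of the idea, not how a MapReduce framework is implemented: the helper names `split_into_chunks` and `sum_of_squares`, the toy input, and the thread pool standing in for cluster workers are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Step 1: split the problem into smaller, independent problems."""
    size = (len(data) + n_chunks - 1) // n_chunks  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """Step 2: solve one small problem independently of the others."""
    return sum(x * x for x in chunk)


numbers = list(range(1, 11))  # toy input: 1..10
chunks = split_into_chunks(numbers, 3)

# A thread pool stands in for the worker machines of a cluster (assumption).
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(sum_of_squares, chunks))

# Step 3: combine the partial solutions into the final answer.
total = sum(partial_results)
print(chunks, partial_results, total)
```

The approach only works because the partial sums do not depend on each other, so they can be computed in any order and on any worker.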


MapReduce – Overview

[Figure: the input data is divided into splits (split 0, split 1, split 2); each split is processed by a Map task that emits <k,v> data; Reduce tasks turn this intermediate data into output 0 and output 1]

A target problem has to be parallelizable!!!

MapReduce – Wordcount example

Google Maps charts new territory into businesses

Google selling new tools for businesses to build their own maps

Google promises consumer experience for businesses with Maps Engine Pro

Google is trying to get its Maps service used by more businesses

Google 4

Maps 4

Businesses 4

Engine 1

Charts 1

Territory 1

Tools 1

MapReduce – Wordcount’s map

Google Maps charts new territory into businesses

Google selling new tools for businesses to build their own maps

Google promises consumer experience for businesses with Maps Engine Pro

Google is trying to get its Maps service used by more businesses

Map 1 (headlines 1 and 2) emits: Google 2, Charts 1, Maps 2, Territory 1, ...

Map 2 (headlines 3 and 4) emits: Google 2, Businesses 2, Maps 2, Service 1, ...


MapReduce – Wordcount’s reduce

Reduce input: Google 2, Google 2, Maps 2, Maps 2, Businesses 2, Businesses 2, Charts 1, Territory 1

Reduce output: Google 4, Maps 4, Businesses 4, Charts 1, Territory 1
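The word count above can be reproduced with a small plain-Python sketch. It is not Hadoop code: the function names `map_split` and `reduce_counts` are hypothetical, the two splits are processed one after the other instead of on different machines, and words are lowercased so that "Maps" and "maps" are counted together, as on the slide.

```python
from collections import Counter
from functools import reduce

headlines = [
    "Google Maps charts new territory into businesses",
    "Google selling new tools for businesses to build their own maps",
    "Google promises consumer experience for businesses with Maps Engine Pro",
    "Google is trying to get its Maps service used by more businesses",
]


def map_split(lines):
    """Map step: count the words of one split, independently of other splits."""
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in line.split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results by adding counts per word."""
    return left + right


# Two splits of two headlines each, matching the two Map boxes on the slide.
splits = [headlines[:2], headlines[2:]]
partial_counts = [map_split(split) for split in splits]
word_counts = reduce(reduce_counts, partial_counts, Counter())

for word in ("google", "maps", "businesses", "engine"):
    print(word, word_counts[word])
```

Each mapper already sums the words inside its own split, so the reducer only has to add one number per word and per split.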


MapReduce

Automatic

- Partition and distribution of data

- Parallelization and assignment of tasks

- Scalability, fault-tolerance, scheduling


Apache Hadoop

Open-source implementation of MapReduce

Source: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Ecosystem.php

MapReduce – Parallelizable algorithms

Matrix-vector multiplication

Power iteration (e.g. PageRank)

Gradient descent methods

Stochastic SVD

Matrix Factorization (Tall skinny QR)

etc…


MapReduce – Limitations

Inefficient for multi-pass algorithms

No efficient primitives for data sharing

State between steps is materialized and distributed

Slow due to replication and storage

Source: http://stanford.edu/~rezab/sparkclass

Limitations – PageRank

Requires repeated multiplication of a sparse matrix by a vector


Source: http://stanford.edu/~rezab/sparkclass
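To make the iterative structure concrete, here is a minimal single-machine sketch of PageRank by power iteration. The function name `pagerank`, the three-page toy graph, the damping factor of 0.85, and the fixed number of iterations are assumptions chosen for illustration. In a MapReduce setting each pass of the loop would be a separate job whose result is written to and read back from storage, which is the cost the limitations above refer to.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration over a sparse link structure given as adjacency lists.

    Assumes every page has at least one outgoing link (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # One sparse matrix-vector multiplication: every page distributes
        # its current rank evenly over its outgoing links.
        contributions = {page: 0.0 for page in pages}
        for page, targets in links.items():
            share = ranks[page] / len(targets)
            for target in targets:
                contributions[target] += share
        ranks = {
            page: (1.0 - damping) / n + damping * contributions[page]
            for page in pages
        }
    return ranks


# Hypothetical toy graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```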


Limitations – PageRank

MapReduce sometimes requires asymptotically more communication or I/O

Iterations are handled very poorly

Reading and writing to disk is a bottleneck

- In some cases 90% of time is spent on I/O


Spark Processing Framework

Developed in 2009 at UC Berkeley’s AMPLab

Open sourced in 2010, donated to the Apache Software Foundation in 2013

- Most active big data community

- Industrial contributions: over 50 companies

Written in Scala

- Good at serializing closures

Clean APIs in Java, Scala, Python, R


Spark Processing Framework

Contributors (2014)

Spark – High Level Architecture

[Figure: Spark high-level architecture diagram, including HDFS]

Source: https://mapr.com/ebooks/spark/

Spark - Running modes

Local mode: for debugging

Cluster mode

- Standalone mode

- Apache Mesos

- Hadoop Yarn


Spark – Programming model

SparkContext: the entry point

SparkSession: since Spark 2.0

- New unified entry point. It combines SQLContext, HiveContext and, in the future, StreamingContext

SparkConf: to initialize the context

Spark’s interactive shell

- Scala: spark-shell

- Python: pyspark


Spark – RDDs, the game changer

Resilient distributed datasets

A typed data structure (RDD[T]) that is not language specific

Each element of type T is stored locally on a machine

- It has to fit in memory

An RDD can be cached in memory


Resilient Distributed Datasets

Immutable collections of objects, spread across cluster

User controlled partitioning and storage

Automatically rebuilt on failure

Since Spark 2.0, Dataset is the recommended replacement for RDDs; it is strongly typed like an RDD, and the RDD API remains supported


Spark – Wordcount example

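# sc is the SparkContext (predefined in the pyspark shell)
text_file = sc.textFile("...")
# Split lines into words, pair each word with 1, then sum the 1s per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("...")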


http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext


Spark – Data manipulation

Transformations: always yield a new RDD instance (RDDs are immutable)

- filter, map, flatMap, etc.

Actions: trigger a computation on the RDD’s elements

- count, foreach, etc.

Lazy evaluation of transformations
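The difference between transformations and actions can be imitated with Python generators. The class `LazyDataset` below is a hypothetical toy, not part of Spark and not how RDDs are implemented; it only mirrors the behavior described above, where transformations return a new object without computing anything and an action runs the whole chain.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, make_iterator):
        # make_iterator is a zero-argument function that produces the elements
        self._make_iterator = make_iterator

    # Transformations: return a new dataset, compute nothing yet
    def map(self, func):
        return LazyDataset(lambda: (func(x) for x in self._make_iterator()))

    def filter(self, predicate):
        return LazyDataset(
            lambda: (x for x in self._make_iterator() if predicate(x))
        )

    # Actions: trigger the computation
    def collect(self):
        return list(self._make_iterator())

    def count(self):
        return sum(1 for _ in self._make_iterator())


evaluated = []  # records when the mapped function actually runs


def square(x):
    evaluated.append(x)
    return x * x


numbers = LazyDataset(lambda: iter(range(5)))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

print(evaluated)               # → [] (no action yet, nothing was computed)
print(even_squares.collect())  # → [0, 4, 16]
print(evaluated)               # → [0, 1, 2, 3, 4]
```

In this toy every further action reruns the whole chain from the source, which hints at why keeping an intermediate result in memory pays off when it is reused.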


Spark – DataFrames

DataFrame API introduced in Spark 1.3

Handles table-like representation with named columns and declared column types

Not to be confused with Python’s pandas DataFrames

DataFrames translate SQL code into low-level RDD operations

Since Spark 2.0, DataFrame is implemented as a special case of Dataset


DataFrames – How to create DFs

1. Converting existing RDDs

2. Running SQL queries

3. Loading external data


Spark SQL

SQL context

http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

# Run a SQL statement; returns a DataFrame
students = sqlContext.sql("SELECT name FROM people WHERE occupation = 'student'")


Spark – DataFrames

Source: Spark in Action (book, see literature)

Machine Learning (ML) with Spark

ML project steps

1. Data collection

2. Data cleaning and preparation

3. Data analysis and feature extraction

4. Model training

5. Model evaluation

6. Model application

Source: Spark in Action (book, see literature)

Machine Learning (ML) with Spark

ML with Spark

- Perfect for parallelizable ML algorithms!

- A single platform (the same system and the same API) for performing most tasks:

• Collect, prepare, analyze the data

• Train, evaluate, use the model

- Training and applying ML algorithms on very large datasets

- Offers most of the popular ML algorithms


Machine Learning (ML) with Spark

MLlib

- Spark’s machine learning library

- Provides a generalized API for training and tuning different algorithms in the same way (influenced by scikit-learn)

- Relies on several low-level libraries for performing optimized linear algebra operations:

• Breeze and jblas for Scala and Java

• NumPy for Python


Machine Learning (ML) with Spark

MLlib has two APIs

- RDD-based API (spark.mllib)

• In maintenance mode since Spark 2.0 and expected to be removed in Spark 3.0

- DataFrame-based API (spark.ml), which keeps receiving new features

• More user-friendly API than RDDs

• A uniform API across ML algorithms and across multiple languages

• Facilitates practical ML Pipelines (feature transformations)


MLlib abstractions

Transformer

- Main method: transform

- Examples:

• ML model

• Feature transformer

Estimator

- Main method: fit

- Example: ML algorithm

Evaluator

- Example: RMSE metric

[Figure: an Estimator is fit on the input dataset and yields a Transformer; the Transformer turns a dataset into a transformed dataset; an Evaluator computes evaluation results from the transformed dataset]

Source: Spark in Action (book, see literature)
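The interplay of the three abstractions can be mimicked in plain Python. The classes below are hypothetical stand-ins, not MLlib classes: `MeanRatingEstimator` plays the role of an ML algorithm, `MeanRatingModel` is the model it produces, and `RmseEvaluator` computes the RMSE metric. Rows are simple dictionaries rather than DataFrame rows, and the ratings are made up for the example.

```python
import math


class MeanRatingModel:
    """Transformer: adds a 'prediction' to every row."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [dict(row, prediction=self.mean) for row in rows]


class MeanRatingEstimator:
    """Estimator: fit() learns from data and returns a Transformer (the model)."""

    def fit(self, rows):
        mean = sum(row["rating"] for row in rows) / len(rows)
        return MeanRatingModel(mean)


class RmseEvaluator:
    """Evaluator: scores a transformed dataset with the RMSE metric."""

    def evaluate(self, rows):
        squared = [(row["rating"] - row["prediction"]) ** 2 for row in rows]
        return math.sqrt(sum(squared) / len(squared))


train = [{"rating": 4.0}, {"rating": 2.0}, {"rating": 3.0}]
test = [{"rating": 5.0}, {"rating": 1.0}]

model = MeanRatingEstimator().fit(train)      # Estimator -> Transformer
predictions = model.transform(test)           # Transformer -> transformed dataset
rmse = RmseEvaluator().evaluate(predictions)  # Evaluator -> evaluation result
print(model.mean, rmse)
```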


MLlib Pipelines

A pipeline chains multiple Transformers and Estimators together to specify an ML workflow

Example: learn a prediction model using features extracted from text documents

[Figures: the pipeline in the training phase and in the test phase]

Source: http://spark.apache.org/docs/latest/ml-pipeline.html#properties-of-pipeline-components
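A pipeline of this kind can be sketched in plain Python. None of the classes below are the MLlib ones: `Tokenizer`, `KeywordEstimator`, `KeywordModel`, `Pipeline` and `PipelineModel` are hypothetical toys that only follow the same pattern, namely that fitting a pipeline fits each Estimator in order and returns a chain that consists of Transformers only. The keyword rule and the example texts are made up.

```python
class Tokenizer:
    """Transformer: splits the 'text' field into lowercase tokens."""

    def transform(self, rows):
        return [dict(row, tokens=row["text"].lower().split()) for row in rows]


class KeywordModel:
    """Transformer produced by KeywordEstimator: predicts 1 if a keyword occurs."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            dict(row, prediction=int(any(t in self.keywords for t in row["tokens"])))
            for row in rows
        ]


class KeywordEstimator:
    """Estimator: learns the tokens that occur only in documents labeled 1."""

    def fit(self, rows):
        positive = {t for row in rows if row["label"] == 1 for t in row["tokens"]}
        negative = {t for row in rows if row["label"] == 0 for t in row["tokens"]}
        return KeywordModel(positive - negative)


class PipelineModel:
    """Fitted pipeline: a chain of Transformers only."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains Transformers and Estimators; fit() returns a PipelineModel."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):     # Estimator: fit it, keep the model
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)  # output feeds the next stage
        return PipelineModel(fitted)


training = [
    {"text": "spark cluster computing", "label": 1},
    {"text": "cooking pasta at home", "label": 0},
]
# Training phase: fit the whole chain on labeled documents
model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(training)
# Test phase: apply the fitted chain to new documents
result = model.transform([{"text": "Spark at home"}, {"text": "pasta recipes"}])
print([row["prediction"] for row in result])
```

Because the fitted pipeline applies exactly the same feature steps to new documents as were used in training, the two phases cannot drift apart.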


Case-study: Recommender system for scientific papers

Motivation

- Recommend relevant papers to users

Dataset

- Set of papers (~172 K)

• Textual content: Title + abstract

• Attributes: type, journal, pages, year,…

- Set of users (~28 K)

- Ratings (~828 K ratings)
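A quick back-of-the-envelope calculation shows how sparse this dataset is. The figures are the rounded sizes given above, so the results are approximate; the variable names are chosen for this example only.

```python
n_papers = 172_000   # ~172 K papers
n_users = 28_000     # ~28 K users
n_ratings = 828_000  # ~828 K ratings

# Share of all possible (user, paper) pairs that actually carry a rating
possible_pairs = n_users * n_papers
density = n_ratings / possible_pairs

ratings_per_user = n_ratings / n_users
ratings_per_paper = n_ratings / n_papers

print(f"density: {density:.4%}")
print(f"ratings per user: {ratings_per_user:.1f}")
print(f"ratings per paper: {ratings_per_paper:.1f}")
```

With these numbers fewer than 0.02% of all user-paper pairs are rated, so almost the whole rating matrix is empty.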


Organization

Team

Educational goals

Requirements

ILIAS

Experiments’ submissions

Assessment

Discussion with the tutors

Schedule


Team

Prof. Georg Lausen

Assistants

- Anas

- Anthony

Tutors

- Polina Koleva

- Matteo Cossu


Educational goals

Distributed programming paradigm

Recommender Systems (use case)

Theoretical and practical training

- Master project and thesis

Data science profile for the job market


Requirements

Mandatory

- Registration via HisInOne

- Attendance at the kick-off meeting

Recommended

- Attendance of DAQL, SIDS or ML lectures

- Basics of Python programming


ILIAS

Distributed Computing Using Spark - WS1718

https://ilias.uni-freiburg.de/goto.php?target=crs_878841

Access with course password

Forum for clarifying questions about the tasks

- Do not post solutions or suggestions


Experiments’ submissions

6 experiments, 2-3 weeks of working time

Submissions in groups of 2 students (Form your group)

Submissions via ILIAS


Assessment

Each experiment: 50 points. Overall: 300 points.

At least 70% of the points (210 of 300) are required to pass

Corrections done by tutors


Discussion of solutions with tutors

Mandatory attendance

Each member has to be able to explain all tasks!

- Otherwise: 0 points for that task

Copied solutions

- First time: 0 points for that experiment

- Second time: failure of the practical


Schedule

Experiment | Content | Release | Submission | Discussion
1 | Familiarizing with Tools, Loading Data, and Basic Analysis of Data | 18.10.2017 | 01.11.2017, 11h | 08.11.2017
2 | Experiment 2 | 01.11.2017 | 15.11.2017, 11h | 22.11.2017
3 | Experiment 3 | 15.11.2017 | 29.11.2017, 11h | 06.12.2017
4 | Experiment 4 | 29.11.2017 | 13.12.2017, 11h | 20.12.2017
5 | Experiment 5 | 13.12.2017 | 10.01.2018, 11h | 17.01.2018
6 | Experiment 6 | 10.01.2018 | 31.01.2018, 11h | 07.02.2018


Literature

Spark in Action [book] by Petar Zečević and Marko Bonaći

Machine Learning with Spark [book] by Nick Pentreath

Apache Spark documentation: http://spark.apache.org/docs/latest
