© 2014 MapR Technologies 1
An Overview of Apache Spark
© 2014 MapR Technologies 2
Agenda
• MapReduce Refresher
• What is Spark?
• The Difference with Spark
• Examples and Resources
© 2014 MapR Technologies 3
MapReduce Refresher
© 2014 MapR Technologies 4
MapReduce: A Programming Model
• MapReduce: Simplified Data Processing on Large Clusters (published 2004)
• Parallel and distributed algorithm providing:
– Data Locality
– Fault Tolerance
– Linear Scalability
© 2014 MapR Technologies 5
The Hadoop Strategy
http://developer.yahoo.com/hadoop/tutorial/module4.html
Distribute data (share nothing)
Distribute computation (parallelization without synchronization)
Tolerate failures (no single point of failure)
[Diagram: mapping processes run on Nodes 1-3; their output feeds reducing processes on Nodes 1-3]
© 2014 MapR Technologies 6
Distribute Data: HDFS
• HDFS splits large data files into chunks (64 MB)
• Chunks are replicated across the cluster
• The NameNode holds the location metadata; a user process consults it for metadata access
• DataNodes store & retrieve the physical data over the network
© 2014 MapR Technologies 7
Distribute Computation
[Diagram: a MapReduce program and its data sources are submitted to the Hadoop cluster, which produces the result]
© 2014 MapR Technologies 8
MapReduce Basics
• Foundational model is based on a distributed file system
– Scalability and fault-tolerance
• Map
– Loading of the data and defining a set of keys
• Reduce
– Collects the organized key-based data to process and output
– Many use cases do not utilize a reduce task
• Performance can be tweaked based on known details of your source files and cluster shape (size, total number)
© 2014 MapR Technologies 9
MapReduce Execution and Data Flow
[Diagram: on each node, files loaded from HDFS are divided into splits; an InputFormat and RecordReaders (RR) turn each split into input (k, v) pairs for the map tasks; map output becomes intermediate (k, v) pairs, which a Partitioner assigns and the "shuffle" exchanges between all nodes; after a sort, reduce tasks produce the final (k, v) pairs, and an OutputFormat writes the result files back to the local HDFS store]
© 2014 MapR Technologies 10
MapReduce Example: Word Count
Input:
"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax
Map output: the, 1 · time, 1 · has, 1 · come, 1 · … · and, 1 · …
Shuffle and Sort: and, [1, 1, 1] · come, [1,1,1] · has, [1,1] · the, [1,1,1] · time, [1,1,1,1] · …
Reduce output: and, 12 · come, 6 · has, 8 · the, 4 · time, 14 · …
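To make the two phases concrete, here is a minimal sketch of word count with the mapper and reducer as small Python functions, in the style of Hadoop Streaming. The function names, the tiny in-memory demo, and the choice of Streaming-style text records (rather than the Java API) are illustrative assumptions, not part of the original slide.

import sys
from io import StringIO

def mapper(lines, out):
    # Map phase: emit one tab-separated (word, 1) pair per word.
    for line in lines:
        for word in line.strip().split():
            out.write("%s\t1\n" % word)

def reducer(lines, out):
    # Reduce phase: input arrives grouped/sorted by key, so a single
    # pass can sum the counts for each word.
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current and current is not None:
            out.write("%s\t%d\n" % (current, total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        out.write("%s\t%d\n" % (current, total))

if __name__ == "__main__":
    mapped = StringIO()
    mapper(["the time has come the walrus said the"], mapped)
    # sorting the mapper output stands in for the shuffle-and-sort step
    reducer(sorted(mapped.getvalue().splitlines(True)), sys.stdout)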
© 2014 MapR Technologies 11
Tolerate Failures
Failures are expected & managed gracefully:
• DataNode fails -> the NameNode locates a replica
• MapReduce task fails -> the JobTracker schedules another one
© 2014 MapR Technologies 12
MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Use a higher level language or DSL that does this for you
© 2014 MapR Technologies 13
MapReduce Design Patterns
• Summarization – Inverted index, counting
• Filtering – Top ten, distinct
• Aggregation
• Data Organization – partitioning
• Join – Join data sets
• Metapattern – Job chaining
© 2014 MapR Technologies 14
Inverted Index Example
Input files:
alice.txt: "The time has come," the Walrus said
macbeth.txt: tis time to do it
Map output: time, alice.txt · has, alice.txt · come, alice.txt · … · tis, macbeth.txt · time, macbeth.txt · do, macbeth.txt · …
Index: come, (alice.txt) · do, (macbeth.txt) · has, (alice.txt) · time, (alice.txt, macbeth.txt) · …
© 2014 MapR Technologies 15
MapReduce Example: Inverted Index
• Input: (filename, text) records
• Output: list of files containing each word
• Map: foreach word in text.split(): output(word, filename)
• Combine: uniquify filenames for each word
• Reduce: def reduce(word, filenames): output(word, sort(filenames))
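A self-contained Python sketch of this pattern; the two hard-coded documents stand in for the (filename, text) records and are taken from the example slide above.

from collections import defaultdict

records = {
    "alice.txt": '"The time has come," the Walrus said',
    "macbeth.txt": "tis time to do it",
}

# Map: for each word in the text, output (word, filename).
pairs = [(word.strip('",.').lower(), fname)
         for fname, text in records.items()
         for word in text.split()]

# Combine/Shuffle: group filenames by word, de-duplicating as we go.
index = defaultdict(set)
for word, fname in pairs:
    index[word].add(fname)

# Reduce: output each word with its sorted list of files.
for word in sorted(index):
    print(word, sorted(index[word]))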
© 2014 MapR Technologies 18
MapReduce: The Good
• Built in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Relatively simple API
© 2014 MapR Technologies 19
MapReduce: The Bad
• Optimized for disk IO
– Doesn't leverage memory well
– Iterative algorithms go through the disk IO path again and again
• Primitive API
– Simple abstraction: key/value in, key/value out
– Basic things like joins require extensive code
• Results are often many files that need to be combined appropriately
© 2014 MapR Technologies 20
Free Hadoop MapReduce On Demand Training
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
© 2014 MapR Technologies 21
What is Hive?
• Data Warehouse on top of Hadoop
– Gives ability to query without programming
– Used for analytical querying of data
• SQL-like execution for Hadoop
• SQL evaluates to MapReduce code
– Submits jobs to your cluster
© 2014 MapR Technologies 22
Using HBase as a MapReduce/Hive Source
EXAMPLE: Data Warehouse for Analytical Processing queries
[Diagram: Hive runs a MapReduce application; a Hive SELECT … JOIN reads from the HBase database and files (HDFS/MapR-FS) and writes the query result to a file]
© 2014 MapR Technologies 23
Using HBase as a MapReduce or Hive Sink
EXAMPLE: bulk load data into a table
[Diagram: Hive runs a MapReduce application; a Hive INSERT … SELECT loads data from files (HDFS/MapR-FS) into the HBase database]
© 2014 MapR Technologies 24
Using HBase as a Source & Sink
EXAMPLE: calculate and store summaries (a pre-computed, materialized view)
[Diagram: Hive runs a MapReduce application; a Hive SELECT … JOIN reads from the HBase database and writes the summaries back to it]
© 2014 MapR Technologies 25
[Diagram: Hive architecture. A command line interface, a web interface, and JDBC/ODBC clients connect through a Thrift server to the Driver (compiler, optimizer, executor). The Driver submits jobs to Hadoop (MapReduce + HDFS): the JobTracker, NameNode, and DataNode + TaskTracker nodes. The schema metadata is stored in the Hive metastore; a Hive table definition maps to the HBase trades_tall table]
© 2014 MapR Technologies 26
Hive HBase
[Diagram: the Hive metastore points to HBase tables, either existing (external) or Hive managed]
© 2014 MapR Technologies 27
Hive HBase – External Table
CREATE EXTERNAL TABLE trades(key string, price bigint, vol bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:price#b,cf1:vol#b")
TBLPROPERTIES ("hbase.table.name" = "/usr/user1/trades_tall");
[Diagram: the Hive table definition trades (key string, price bigint, vol bigint) points to the external HBase table /usr/user1/trades_tall, whose rows map key -> cf1:price, cf1:vol, e.g. AMZN_986186008 -> 12.34, 1000 and AMZN_986186007 -> 12.00, 50]
© 2014 MapR Technologies 28
Hive HBase – Hive Query
SQL evaluates to MapReduce code
SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
[Diagram: queries pass through the parser, planner, and execution engine against HBase tables]
© 2014 MapR Technologies 29
Hive HBase – External Table
SQL evaluates to MapReduce code:
SELECT AVG(price) FROM trades WHERE key LIKE "AMZN%";
• Selection: WHERE key LIKE "AMZN%"
• Projection: SELECT price
• Aggregation: AVG(price)
Sample rows (key, cf1:price, cf1:vol): AMZN_986186008, 12.34, 1000 · AMZN_986186007, 12.00, 50
© 2014 MapR Technologies 30
Hive Query Plan
• EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
        TableScan
          Filter Operator
            predicate: (key like 'GOOG%') (type: boolean)
          Select Operator
            Group By Operator
      Reduce Operator Tree:
        Group By Operator
          Select Operator
            File Output Operator
© 2014 MapR Technologies 31
Hive Query Plan – (2)
hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
[Diagram: map() tasks scan the trades table, filter key like 'GOOG%', and select price; reduce() tasks group and compute the aggregation avg(price), producing the output column _c0]
© 2014 MapR Technologies 32
Hive Map Reduce
[Diagram: a Hive SELECT … JOIN query runs as Map() tasks that scan (key, row) from HBase regions, a shuffle, and reduce() tasks whose results are written to the query result file]
© 2014 MapR Technologies 33
Some Hive Design Patterns
• Summarization
– SELECT min(delay), max(delay), count(*) FROM flights GROUP BY carrier;
• Filtering
– SELECT * FROM trades WHERE key LIKE "GOOG%";
– SELECT price FROM trades ORDER BY price DESC LIMIT 10;
• Join
SELECT tableA.field1, tableB.field2 FROM tableA
JOIN tableB
ON tableA.field1 = tableB.field2;
© 2014 MapR Technologies 34
What is a Directed Acyclic Graph (DAG)?
• Graph
– vertices (points) and edges (lines)
• Directed
– Only in a single direction
• Acyclic
– No looping
• This supports fault-tolerance
© 2014 MapR Technologies 35
Hive Query Plan Map Reduce Execution
[Diagram: an operator tree (FS1, AGG2, JOIN1, AGG1, reduce sinks RS1-RS4 over table t1) initially maps to three MapReduce jobs; after optimization, the operator tree collapses into fewer jobs]
Iteration: the bane of MapReduce (slow!)
© 2014 MapR Technologies 37
Typical MapReduce Workflows
[Diagram: each job's maps and reduces read their input from HDFS and write their output back as SequenceFiles; the output from Job 1 is the input to Job 2, and so on through the last job, so every step pays the disk I/O cost]
© 2014 MapR Technologies 38
Iterations
[Diagram: a chain of steps]
In-memory Caching
• Data partitions are read from RAM instead of disk
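As a minimal PySpark sketch of what this buys you; the input path and the per-step work are placeholders, not part of the original slide.

from pyspark import SparkContext

sc = SparkContext(appName="IterationSketch")
values = sc.textFile("hdfs:///data/values.txt").map(float).cache()

result = 0.0
for step in range(5):
    # After the first pass materializes the cache, each later pass
    # reads the partitions from RAM rather than re-reading from disk.
    result += values.sum()
print(result)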
© 2014 MapR Technologies 39
Free HBase On Demand Training (includes Hive and MapReduce with HBase)
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
© 2014 MapR Technologies 40
Lab – Query HBase airline data with Hive
Import mapping to row key and columns:
• Row key: Carrier-FlightNumber-Date-Origin-Destination (e.g., AA-1-2014-01-01-JFK-LAX)
• Column families delay, info, stats, and timing hold columns such as aircraft delay, arr delay, carrier delay, cncl, cncl code, tailnum, distance, elaptime, arrtime, and dep time
• Sample row AA-1-2014-01-01-JFK-LAX: 13, 0, N7704, 2475, 385.00, 359, …
© 2014 MapR Technologies 41
Count number of cancellations by reason (code)
$ hive
hive> explain select count(*) as
cancellations, cnclcode from flighttable
where cncl=1 group by cnclcode order by
cancellations asc limit 100;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
Filter Operator
Select Operator
Group By Operator
aggregations: count()
Reduce Output Operator
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
Select Operator
File Output Operator
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
Reduce Operator Tree:
Extract
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
Limit
File Output Operator
Stage: Stage-0
Fetch Operator
limit: 100
© 2014 MapR Technologies 42
2 MapReduce jobs
$ hive
hive> select count(*) as cancellations, cnclcode from flighttable where
cncl=1 group by cnclcode order by cancellations asc limit 100;
Total jobs = 2
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 13.3 sec MAPRFS Read: 0
MAPRFS Write: 0 SUCCESS
Job 1: Map: 1 Reduce: 1 Cumulative CPU: 1.52 sec MAPRFS Read: 0
MAPRFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 820 msec
OK
4598 C
7146 A
© 2014 MapR Technologies 43
Find the longest airline delays
$ hive
hive> select arrdelay,key from flighttable where arrdelay > 1000 order by
arrdelay desc limit 10;
MapReduce Jobs Launched:
Map: 1 Reduce: 1
OK
1530.0 AA-385-2014-01-18-BNA-DFW
1504.0 AA-1202-2014-01-15-ONT-DFW
1473.0 AA-1265-2014-01-05-CMH-LAX
1448.0 AA-1243-2014-01-21-IAD-DFW
1390.0 AA-1198-2014-01-11-PSP-DFW
1335.0 AA-1680-2014-01-21-SLC-DFW
1296.0 AA-1277-2014-01-21-BWI-DFW
1294.0 MQ-2894-2014-01-02-CVG-DFW
1201.0 MQ-3756-2014-01-01-CLT-MIA
1184.0 DL-2478-2014-01-10-BOS-ATL
© 2014 MapR Technologies 44
Apache Spark
© 2014 MapR Technologies 45
Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally developed in 2009 in UC Berkeley’s AMP Lab
• Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation
© 2014 MapR Technologies 46
Spark: Fast Big Data
– Rich APIs in Java, Scala, Python
– Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage
• 2-5× less code
© 2014 MapR Technologies 47
The Spark Community
© 2014 MapR Technologies 48
Spark is the Most Active Open Source Project in Big Data
[Chart: project contributors in the past year; Spark far exceeds Giraph, Storm, and Tez (scale 0-140)]
© 2014 MapR Technologies 49
Unified Platform
Spark (general execution engine) with Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph computation), running on Mesos or Hadoop YARN over a distributed file system (HDFS, MapR-FS, S3, …)
© 2014 MapR Technologies 50
Spark Use Cases
• Iterative algorithms on large amounts of data
• Anomaly detection
• Classification
• Predictions
• Recommendations
© 2014 MapR Technologies 51
Why Iterative Algorithms
• Algorithms that need iterations
– Clustering (K-Means, Canopy, …)
– Gradient descent (e.g., Logistic Regression, Matrix Factorization)
– Graph algorithms (e.g., PageRank, LineRank, components, paths, reachability, centrality)
– Alternating Least Squares (ALS)
– Graph communities / dense sub-components
– Inference (belief propagation)
– …
© 2014 MapR Technologies 52
Example: Logistic Regression
• Goal: find the best line separating two sets of points
[Figure: scattered points; a random initial line converges to the target line]
© 2014 MapR Technologies 53
Logistic Regression

data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient
print "Final w: %s" % w

Iteration!
© 2014 MapR Technologies 54
Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Filesystem
– Regular files, sequence files, any other Hadoop InputFormat
• HBase
• other NoSQL data stores
© 2014 MapR Technologies 55
How Spark Works
© 2014 MapR Technologies 56
Spark Programming Model
sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.map(…)
[Diagram: the Driver Program's SparkContext distributes tasks to Worker Nodes in the cluster]
© 2014 MapR Technologies 57
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs:
• Fault-tolerant
• Read-only collections of elements
• Operated on in parallel
• Cached in memory or on disk
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
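A short PySpark sketch of the "in memory or on disk" choice; the paths and variable names are illustrative.

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="PersistSketch")

hot = sc.textFile("hdfs:///logs/current").cache()  # shorthand for MEMORY_ONLY
big = sc.textFile("hdfs:///logs/archive").persist(
    StorageLevel.MEMORY_AND_DISK)                  # spill to disk when RAM is short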
© 2014 MapR Technologies 58
Working With RDDs
RDD
textFile = sc.textFile("SomeFile.txt")
© 2014 MapR Technologies 59
Working With RDDs
RDD RDD RDD RDD
Transformations
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
textFile = sc.textFile("SomeFile.txt")
© 2014 MapR Technologies 60
Working With RDDs
RDD RDD RDD RDD
Transformations
Action Value
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()
74
linesWithSpark.first()
# Apache Spark
textFile = sc.textFile("SomeFile.txt")
© 2014 MapR Technologies 61
MapR Tutorial: Getting Started with Spark on MapR Sandbox
• https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/spark-tutorial
© 2014 MapR Technologies 62
Example Spark Word Count in Java
[Diagram: input lines ("The time has come," the Walrus said, … Of shoes—and ships—and sealing-wax) flow through map to (word, 1) pairs (the, 1 · time, 1 · and, 1 · and, 1) and through reduce to counts (and, 12 · time, 4 · the, 20 · …)]
JavaRDD<String> input = sc.textFile(inputFile);
// Split each line into words
JavaRDD<String> words = input.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String x) {
return Arrays.asList(x.split(" "));
}});
// Turn the words into (word, 1) pairs
JavaPairRDD<String, Integer> word1s = words.mapToPair(
new PairFunction<String, String, Integer>(){
public Tuple2<String, Integer> call(String x){
return new Tuple2(x, 1);
}});
// Reduce: add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts =word1s.reduceByKey(
new Function2<Integer, Integer, Integer>(){
public Integer call(Integer x, Integer y){
return x + y;
}});
© 2014 MapR Technologies 63
Example Spark Word Count in Scala
[Diagram: the same input lines flow through map to (word, 1) pairs and through reduceByKey to counts (and, 12 · time, 4 · the, 20 · …)]
// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
// Transform into pairs and count.
val counts = words
.map(word => (word, 1))
.reduceByKey{case (x, y) => x + y}
// Save the word count back out to a text file,
counts.saveAsTextFile(outputFile)
© 2014 MapR Technologies 64
Example Spark Word Count in Scala
// Load input data.
val input = sc.textFile(inputFile)
[Diagram: textFile creates a HadoopRDD with partitions, mapped to a MapPartitionsRDD]
© 2014 MapR Technologies 65
Example Spark Word Count in Scala
// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
[Diagram: textFile -> HadoopRDD -> MapPartitionsRDD; flatMap -> MapPartitionsRDD]
© 2014 MapR Technologies 66
FlatMap
flatMap(line => line.split(" ")) is a 1-to-many mapping: the line "Ships and wax" becomes the elements "Ships", "and", "wax".
Result: JavaRDD<String> words
© 2014 MapR Technologies 67
Example Spark Word Count in Scala
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
// Transform into pairs
val counts = words.map(word => (word, 1))
[Diagram: textFile -> HadoopRDD; flatMap -> MapPartitionsRDD; map -> MapPartitionsRDD]
© 2014 MapR Technologies 68
Map
map(word => (word, 1)) is a 1-to-1 mapping: "and" becomes (and, 1).
Result: JavaPairRDD<String, Integer> word1s
© 2014 MapR Technologies 69
Example Spark Word Count in Scala
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words
.map(word => (word, 1))
.reduceByKey{case (x, y) => x + y}
[Diagram: textFile -> HadoopRDD; flatMap -> MapPartitionsRDD; map -> MapPartitionsRDD; reduceByKey -> ShuffledRDD -> MapPartitionsRDD]
© 2014 MapR Technologies 70
reduceByKey
reduceByKey{case (x, y) => x + y} merges the values for each key: (and, 1) and (and, 1) become (and, 2); (wax, 1) stays (wax, 1).
Result: JavaPairRDD<String, Integer> counts
© 2014 MapR Technologies 71
Example Spark Word Count in Scala
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words
.map(word => (word, 1))
.reduceByKey{case (x, y) => x + y}
val countArray = counts.collect()
[Diagram: textFile -> HadoopRDD; flatMap -> MapPartitionsRDD; map -> MapPartitionsRDD; reduceByKey -> ShuffledRDD -> MapPartitionsRDD; collect -> Array]
© 2014 MapR Technologies 72
Components Of Execution
© 2014 MapR Technologies 73
MapR Blog: Getting Started with the Spark Web UI
• https://www.mapr.com/blog/getting-started-spark-web-ui
© 2014 MapR Technologies 74
Spark RDD DAG -> Physical Execution Plan
[Diagram: sc.textFile(…) -> HadoopRDD -> flatMap -> MapPartitionsRDD -> map -> MapPartitionsRDD -> reduceByKey -> ShuffledRDD -> MapPartitionsRDD -> collect. The RDD graph maps to a physical plan of Stage 1 (everything before the shuffle) and Stage 2 (after it)]
© 2014 MapR Technologies 75
Physical Execution Plan -> Stages and Tasks
[Diagram: the DAG's physical plan is split into Stage 1 and Stage 2, and each stage is split into tasks, one per partition. The Task Scheduler submits task sets; on each worker node, an executor runs task threads against cached partitions of the HDFS blocks served by the co-located data node]
© 2014 MapR Technologies 76
Summary of Components
• Task : unit of execution
• Stage: Group of Tasks
– Based on partitions of the RDD
– Tasks run in parallel
• DAG : Logical Graph of RDD operations
• RDD : Parallel dataset with partitions
© 2014 MapR Technologies 77
How a Spark Application Runs on a Hadoop Cluster
[Diagram: the driver program on a client node (sc = new SparkContext; rDD = sc.textFile("hdfs://…"); rDD.map) registers with the YARN Resource Manager, coordinated via ZooKeeper. YARN Node Managers on each worker node launch executors, which run tasks against cached partitions of the HDFS blocks stored on the co-located data nodes]
© 2014 MapR Technologies 78
Deploying Spark – Cluster Manager Types
• Standalone mode
• Mesos
• YARN
• EC2
• GCE
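The cluster manager is selected through the master URL when the SparkContext is created; a hedged sketch in PySpark, where the hostnames and ports are placeholders and "yarn-client" reflects the Spark 1.x syntax.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("DeploySketch")
        .setMaster("local[*]"))            # local mode, for development
# .setMaster("spark://host:7077")         # Spark standalone cluster
# .setMaster("mesos://host:5050")         # Mesos
# .setMaster("yarn-client")               # YARN, Spark 1.x client-mode syntax

sc = SparkContext(conf=conf)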
© 2014 MapR Technologies 79
Example: Log Mining
© 2014 MapR Technologies 80
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns. (Based on slides from Pat McDonough.)

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()           # action

[Diagram sequence: a driver program and three workers, each holding one HDFS block (Block 1-3)]
1. The transformations only build up the RDD lineage; nothing executes until the count() action.
2. The driver ships tasks to the workers, one per block.
3. Each worker reads its HDFS block, processes it, and caches the resulting partition (Cache 1-3).
4. The workers return their results to the driver, which combines them into the count.

A second query reuses the cache:

messages.filter(lambda s: "php" in s).count()

5. The driver ships tasks again, but this time each worker processes its partition directly from cache, with no HDFS reads, and returns its results.

Cache your data -> faster results. Full-text search of Wikipedia: 60 GB on 20 EC2 machines; 0.5 sec from cache vs. 20 s on-disk.
© 2014 MapR Technologies 98
Transformations and Actions
© 2014 MapR Technologies 99
RDD Transformations and Actions
[Diagram: RDD -> RDD -> RDD -> RDD via transformations, ending in an action that returns a value]
Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Actions (return a value): reduce, collect, count, save, lookupKey, …
© 2014 MapR Technologies 100
Basic Transformations
> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x*x)   # {1, 4, 9}
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # {4}
# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))   # => {0, 0, 1, 0, 1, 2}
# range(x) is a sequence of numbers 0, 1, …, x-1
© 2014 MapR Technologies 101
Basic Actions
> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]
# Return first K elements
> nums.take(2)   # => [1, 2]
# Count number of elements
> nums.count()   # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
© 2014 MapR Technologies 102
RDD Fault Recovery
• RDDs track lineage information
• Lineage can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])

[Diagram: HDFS File -> filter(func = startswith(…)) -> Filtered RDD -> map(func = split(…)) -> Mapped RDD]
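One way to see the lineage Spark tracks is toDebugString(); a sketch assuming an active SparkContext sc and an illustrative log path.

textFile = sc.textFile("hdfs:///logs/app.log")
msgs = (textFile
        .filter(lambda s: s.startswith("ERROR"))
        .map(lambda s: s.split("\t")[2]))
print(msgs.toDebugString())   # shows the chain of parent RDDs back to the file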
© 2014 MapR Technologies 103
Passing a Function to Spark
• Spark is based on Scala's anonymous function syntax:
(x: Int) => x * x
• which is shorthand for:
new Function1[Int, Int] {
def apply(x: Int) = x * x
}
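For comparison, in PySpark the same idea needs no special syntax: any lambda or named function can be passed to a transformation. A sketch assuming an active SparkContext sc.

def square(x):
    return x * x

nums = sc.parallelize([1, 2, 3])
print(nums.map(lambda x: x * x).collect())   # [1, 4, 9]
print(nums.map(square).collect())            # same result with a named function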
© 2014 MapR Technologies 104
DataFrames
© 2014 MapR Technologies 105
DataFrame
Distributed collection of data organized into named columns.
// Create the DataFrame
val df = sqlContext.read.json("person.json")
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// |-- height: string (nullable = true)
// Select only the "name" column
df.select("name").show()
https://spark.apache.org/docs/latest/sql-programming-guide.html
© 2014 MapR Technologies 106
DataFrame RDD
# data frame style
lineitems.groupby('customer').agg(Map(
'units' -> 'avg',
'totalPrice' -> 'std'
))
# or SQL style
SELECT AVG(units), STD(totalPrice) FROM lineitems GROUP BY customer
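A runnable PySpark rendering of the pseudocode above; the lineitems rows are invented sample data, and "sum" stands in for "std", which the dictionary-style agg of early Spark releases does not offer.

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
lineitems = sqlContext.createDataFrame(
    [("alice", 3, 21.0), ("alice", 5, 35.5), ("bob", 2, 10.0)],
    ["customer", "units", "totalPrice"])

# DataFrame style
lineitems.groupBy("customer").agg({"units": "avg", "totalPrice": "sum"}).show()

# SQL style
lineitems.registerTempTable("lineitems")
sqlContext.sql("SELECT customer, AVG(units), SUM(totalPrice) "
               "FROM lineitems GROUP BY customer").show()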
© 2014 MapR Technologies 107
Demo Interactive Shell
• Iterative Development
– Cache those RDDs
– Open the shell and ask questions
• We have all wished we could do this with MapReduce
– Compile / save your code for scheduled jobs later
• Scala – spark-shell
• Python – pyspark
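A sketch of such a session in pyspark, where the shell pre-creates sc; the README.md path is illustrative.

# In the pyspark shell, sc already exists:
> lines = sc.textFile("README.md").cache()       # path illustrative
> lines.count()                                  # first run reads disk, fills the cache
> lines.filter(lambda l: "Spark" in l).count()   # answered from the cache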
© 2014 MapR Technologies 108
MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data
• https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data
© 2014 MapR Technologies 109
The physical plan for DataFrames
© 2014 MapR Technologies 110
DataFrame Execution Plan
// Print the physical plan to the console
auction.select("auctionid").distinct.explain()
== Physical Plan ==
Distinct false
 Exchange (HashPartitioning [auctionid#0], 200)
  Distinct true
   Project [auctionid#0]
    PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
© 2014 MapR Technologies 111
There’s a lot more!
© 2014 MapR Technologies 112
Unified Platform
Spark (general execution engine) with Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph computation), running on Mesos or Hadoop YARN over a distributed file system (HDFS, MapR-FS, S3, …)
© 2014 MapR Technologies 113
Soon to Come
• Spark On Demand Training
– https://www.mapr.com/services/mapr-academy/
• Blogs and Tutorials:
– Movie Recommendations with Collaborative Filtering
– Spark Streaming
© 2014 MapR Technologies 114
Soon to Come
• Blogs and Tutorials:
– Re-write this Mahout example with Spark
© 2014 MapR Technologies 115
Examples and Resources
© 2014 MapR Technologies 116
Spark on MapR
• Certified Spark Distribution
• Fully supported and packaged by MapR in partnership with
Databricks
– mapr-spark package with Spark, Shark, and Spark Streaming today
– Spark-python, GraphX, and MLlib soon
• YARN integration
– Spark can then allocate resources from cluster when needed
© 2014 MapR Technologies 117
References
• Spark web site: http://spark.apache.org/
• https://databricks.com/
• Spark on MapR:
– http://www.mapr.com/products/apache-spark
• Spark SQL and DataFrame Guide
• Apache Spark vs. MapReduce – Whiteboard Walkthrough
• Learning Spark - O'Reilly Book
• Apache Spark
© 2014 MapR Technologies 118
Q & A
@mapr maprtech
kbotzum@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies