Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" •...

Introduc+on to Parallel Compu+ng with Apache Spark

Why Parallel Processing? Divide and Conquer

Problem Single machine cannot complete the computa+on at hand Solu+on Parallelize the job and distribute work among a network of machines

Issues Arise in Parallel Processing View the world from the eyes of a single worker

•  How do I parallelize an algorithm? •  How do I par))on my dataset? •  How do I maintain a single consistent view of a shared state?

•  How do I recover from machine failures? •  How do I allocate cluster resources? •  …..

Finding majority element in a single machine

List(20, 18, 20, 18, 20)

Think distributed

Finding majority element in a distributed dataset Think distributed

List(1, 18, 1, 18, 1)

List(2, 18, 2, 18, 2)

List(3, 18, 3, 18, 3)

List(4, 18, 4, 18, 4)

List(5, 18, 5, 18, 5)

Finding majority element in a distributed dataset Think distributed

List(1, 18, 1, 18, 1)

List(2, 18, 2, 18, 2)

List(3, 18, 3, 18, 3)

List(4, 18, 4, 18, 4)

List(5, 18, 5, 18, 5)

Spark Background

•  Amplab UC Berkeley

•  Project Lead: Dr. Matei Zaharia

•  First paper published on RDD’s was in 2012 •  Open sourced from day one, growing number of contributors

•  Released its 1.0 version May 2014. Currently in 1.4.1

•  Databricks company established to support Spark and all its related technologies. Matei currently sits as its CTO

•  Amazon, Alibaba, Baidu, eBay, Groupon, Ooyala, OpenTable, Box, Shopify, TechBase, Yahoo!

Arose from an academic sebng

Example: Log Mining Load error messages from a log into memory, then interac+vely search for various paderns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(_.startsWith(“ERROR”))

messages = errors.map(_.split(‘\t’)(2))

messages.persist()

Block 1

Block 2

Block 3

Worker

Driver

messages.filter(_.contains(“foo”)).count

messages.filter(_.contains(“bar”)).count

results

Msgs. 1

Msgs. 2

Msgs. 3

Base RDD Transformed RDD

Result: full-‐text search of Wikipedia in <1 sec (vs 20 sec for on-‐disk data)

Result: scaled to 1 TB data in 5-‐7 sec (vs 170 sec for on-‐disk data)

Slide courtesy of Matei Zaharia, Introduction to Spark

Resilient Distributed Datasets (RDDs)

• Main object in Spark’s universe

•  Think of it as represen+ng the data at that stage in the opera+on

• Allows for coarse-‐grained transforma+ons (e.g. map, group-‐by, join)

• Allows for efficient fault recovery using lineage ‒ Log one opera+on to apply to many elements ‒ Recompute lost par++ons of dataset on failure ‒ No cost if nothing fails

RDD Ac+ons and Transforma+ons

• Transforma+ons ‒  Lazy opera+ons applied on an RDD ‒  Creates a new RDD from an exis+ng RDD ‒  Allows Spark to perform op+miza+ons ‒  e.g. map, filter, flatMap, union, intersec+on, dis+nct, reduceByKey, groupByKey

• Ac+ons ‒  Returns a value to the driver program aner computa+on ‒  e.g. reduce, collect, count, first, take, saveAsFile

Transforma+ons are realized when an ac+on is called

RDD Representa+on

• Simple common interface: – Set of par++ons – Preferred loca+ons for each par++on – List of parent RDDs – Func+on to compute a par++on given parents – Op+onal par++oning info

• Allows capturing wide range of transforma+ons

Slide courtesy of Matei Zaharia, Introduction to Spark

Spark Cluster

Driver •  Entry point of Spark applica+on •  Main Spark applica+on is ran here •  Results of “reduce” opera+ons are aggregated here

Driver

Spark Cluster

Driver Master

Master •  Distributed coordina+on of Spark workers including:

§  Health checking workers §  Reassignment of failed tasks §  Entry point for job and cluster metrics

Spark Cluster

Driver Master

Worker

Worker •  Spawns executors to perform tasks on par++ons of data

The Spark Family Cheaper by the dozen

•  Aside from its performance and API, the diverse tool set available in Spark is the reason for its wide adop+on 1.  Spark SQL 2.  Spark Streaming 3.  MLlib 4.  GraphX

Disk Based vs Memory Based Frameworks

• Disk Based Frameworks ‒ Persists intermediate results to disk ‒ Data is reloaded from disk with every query ‒ Easy failure recovery ‒ Best for ETL like work-‐loads ‒ Examples: Hadoop, Dryad

Acyclic data flow

Reduce

Input Output

Image courtesy of Matei Zaharia, Introduction to Spark

Disk Based vs Memory Based Frameworks

• Memory Based Frameworks ‒ Circumvents heavy cost of I/O by keeping intermediate results in memory

‒ Sensi+ve to availability of memory ‒ Remembers opera+onsapplied to dataset ‒ Best for itera+ve workloads ‒ Examples: Spark, Flink

Reuse working data set in memory

iter. 1 iter. 2 . . .

Input Image courtesy of Matei Zaharia, Introduction to Spark

Spark versus Scalding (Hadoop) Clear win for itera+ve applica+ons

Ad-‐hoc batch queries

Lambda Architecture

Lambda Architecture Unified Framework

Spark Core

Spark Streaming

Spark SQL

Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" •...

Documents

Transcript of Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" •...

The Doors of Perceptionmural.uv.es/vicordo/firstpaper/etexts/ahdoorsofperception… · Web viewby Aldous Huxley. For M "If the doors of perception were cleansed everything would.

! Office!of!Innovation!and!Technology! …planphilly.com/uploads/rfps/11/oit-documents.pdf · The!purpose!of!this!template!is!to!guide!the!projectlead!to!complete!all!of!the!information!thatwill

Brief introduction of UTA - LEAD projectlead-project.org/sites/default/files/2018-08/11. UTA_LEAD Project... · Brief introduction of UTA • The University of Tampere (UTA) is a

Using a Base-Ten Blocks Learning/Teaching …webschoolpro.com/home/projectlead/Research Articles and links/Base...Journal for Research in Mathematics Education 1990, Vol. 21, No. 3,

2016-01-28 UvA HPC Spark · Transformations RDD’s are created from other RDD’s using transformations: map(f) => pass every element through function f reduceByKey(f) => aggregate

Heaven and Hell - mural.uv.esmural.uv.es/vicordo/firstpaper/etexts/ahheavenandhell.doc · Web viewby Aldous Huxley. In the history of science the collector of specimens preceded

· Web viewSenior Projectleader and IT/Business Consultant. SIGNAL IDUNA Gruppe in Hamburg (Germany) 05/17 - 04/18. Senior Projectlead, IT-/Business Consultant. Deutsche Post in

THE ART OF SEEING - UVmural.uv.es/vicordo/firstpaper/etexts/1941artofseeing.doc · Web viewI feel certain that, if I learn the art of seeing, I can improve my own defective vision.

A Presentation About: Discretized Streams: A Fault ...istoica/classes/... · Fault Tolerance and Stagger Handling Fault Tolerance Parallel Recovery Recompute failed RDD’s data in

CAD Automation with Rulestream at Siemens Large Drives...Project Organization Projectlead André Hieke (GS IT PD LD PLM-1) Business Project Lead QMiP/Controlling Steering Committee

Introduction to Apache Sparkcs451/slides/big-data-part03b.pdf · Key Concept: RDD’s Resilient Distributed Datasets •Collections of objects spread across a cluster, stored in RAM

Introduction to Apache Spark€¦ · •The Spark programming model •Language and deployment choices •Example algorithm (PageRank) Key Concept: RDD’s Resilient Distributed Datasets