Spark - IPTricardo/ficheiros/BD - Spark.pdf · When needed, store intermediate results in memory...

Spark Ricardo Campos Mestrado EI-IC Análise e Processamento de Grandes Volumes de Dados Tomar, Portugal, 2018

Page 1

Spark

Ricardo Campos

Mestrado EI-IC – Análise e Processamento de Grandes Volumes de Dados Tomar, Portugal, 2018

Instituto Politécnico de Tomar

Page 2

What is Information Retrieval?

Part of the slides used in this presentation were adapted from presentations found on the internet and from the reference bibliography:

• Felipe Ortega (Introduction to Big Data Analysis – ASDM Summer School 2018, UPM)

Page 3

Page 4

AGENDA – What is this talk about?

Why?

• Motivation
• RDDs
• DataFrames

How does it work?

• Application Components
• Deployment Modes
• Q&A


Page 20

Spark came out of UC Berkeley. Matei Zaharia and a few other students were working on big data problems and soon realized the limitations of existing platforms like Hadoop.

Until then, Hadoop had been at the forefront of big data solutions and was being adopted by many major enterprises like Yahoo, Facebook and Twitter. Hadoop's distributed architecture made it possible to store terabytes and petabytes of data and to run MapReduce algorithms to process large volumes of data. The storage component of Hadoop is called HDFS (Hadoop Distributed File System) and the processing component is MapReduce.

The way MapReduce is designed on Hadoop, the mappers read data from HDFS and send it to the reducers, and the reducers write the result back to HDFS.

This works well for algorithms that involve a single stage or very few stages. But for algorithms with hundreds or thousands of iterations it does not work well, as every MapReduce stage writes its intermediate data to HDFS. Reading and writing through a file system, which ultimately writes to disk, makes processing extremely slow.

Page 21

The other key issue was that MapReduce on Hadoop is not designed for interactive processing; it is primarily designed for batch processing. But interactive processing is very popular in the data analysis world.

So the team at UC Berkeley designed a new framework that would make interactive and multi-stage iterative processing easier. Spark would load data from sources like HDFS and keep it in RAM.

The other key driver for developing Spark was to provide an abstract, higher-level programming interface to make analysis easier. There is a lot of heavy lifting to be done when coding MapReduce on the Hadoop framework, i.e. the code is very verbose.

Spark made this easier by providing abstract interfaces, without forcing the programmer to think in the MapReduce paradigm.

Page 22

RDD – Resilient Distributed Dataset. RDDs are distributed datasets represented in memory. Each RDD is split into multiple partitions and loaded into memory. These partitions can be on multiple nodes in an underlying cluster or on a single node, depending on how Spark is configured to run.

Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph, Spark is able to recompute missing or damaged partitions after node failures.

Distributed with data residing on multiple nodes in a cluster.

Dataset is a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects (that represent records of the data you work with).

Page 23

An RDD is a collection of data items that can be so huge in size that they cannot fit on a single node and have to be partitioned across various nodes.

Spark automatically partitions RDDs and distributes the partitions across different nodes. A partition in Spark is an atomic chunk of data (a logical division of the data) stored on a node in the cluster.

Partitions are the basic units of parallelism in Apache Spark; RDDs in Apache Spark are collections of partitions.

Page 24

In-memory, i.e. the data inside an RDD is stored in memory as much (size) and as long (time) as possible.

Immutable or read-only, i.e. an RDD does not change once created; it can only be transformed, using transformations, into new RDDs.

Lazily evaluated, i.e. the data inside an RDD is not available or transformed until an action is executed that triggers the execution.

Cacheable, i.e. you can hold all the data in persistent "storage" like memory (the default and the most preferred) or disk (the least preferred due to access speed).

Partitioned — records are partitioned (split into logical partitions) and distributed across nodes in a cluster.

Page 26

There are three ways to create RDDs in Spark:

• Parallelized collections – we can create parallelized collections by invoking the parallelize method in the driver program on Python collections (e.g. lists, tuples, etc.).

• External datasets – by calling the textFile method one can create RDDs from files or data directories (local or distributed). This method takes the URL of the file and reads it as a collection of lines. Spark is not tightly coupled to any data source: it can work with local file systems like ext3, XFS or NTFS, and also with distributed file systems like HDFS.

• Existing RDDs – by applying a transformation operation to an existing RDD we can create a new RDD.

Page 27

How data is loaded from sources and how many partitions are created depends on the data source and on how the data is read and parsed. For example, if data is loaded from HDFS, by default each block becomes a partition.

The number of partitions can be controlled in Spark.

Trade-off: higher parallelism vs. the available cores and memory in our nodes.

Page 30

A standard RDD contains a collection of elements (lines of text, Python objects, etc.) distributed among partitions. But an RDD can also contain key-value pairs.

Page 31

RDDs support two types of operations:

• Transformations, which create a new dataset from the previous RDD (http://spark.apache.org/docs/latest/programming-guide.html#transformations)

• Actions, which return a value to the driver program after performing the computation on the dataset (http://spark.apache.org/docs/latest/programming-guide.html#actions)

In a typical analysis, a new RDD is created and stored in RAM when the source data is loaded, and the dataset then goes through multiple transformations to arrive at an insight. The result will be another RDD. Since the dataset goes through many transformations along the way, the intermediate results will also be RDDs.

Page 32

1. Create an initial RDD (from a data source, or by parallelizing a Python collection).

2. Perform transformations to create new RDDs, applying functions to the distributed data.

3. When needed, store intermediate results in memory (cache) to speed up calculations (avoid re-computing data).

4. Execute an action to start the parallel execution, optimized and managed by Spark.

Workflow

Page 33

DAG

Page 34

Operations performed on an RDD to create another RDD.

Transformations

Lazy evaluation: Only computed when we finally call an action to obtain results.

While we apply only transformations, nothing is executed yet. We are just composing the chain of operations that we will apply to our data (the DAG).

When we finally call an action, Spark analyzes the complete graph of transformations to optimize its execution.

It is convenient to persist data in memory (cache) if there are intermediate results frequently accessed by subsequent phases (otherwise they are re-computed each time they are needed for another operation).

Page 35

List of Transformations

Page 36

List of Transformations (Key-Value RDDs)

Page 37

List of Actions

Page 38

List of Actions

Page 39

Fault Tolerance

Hadoop and Spark both provide fault tolerance, but each takes a different approach.

As Hadoop uses commodity hardware, HDFS ensures fault tolerance by replicating data. The master daemon (i.e. the NameNode) checks the heartbeats of the slave daemons (i.e. the DataNodes).

If any slave daemon fails, the master daemon reschedules all pending and in-progress operations to another slave. This method is effective, but it can significantly increase completion times.

Page 40

In contrast, Spark maintains a DAG (Directed Acyclic Graph), a one-way graph connecting nodes, where the nodes depict the intermediate results you get from your transformations.

If a partition of an RDD is lost, it will automatically be recomputed by re-applying the original transformations. This is how Spark provides fault tolerance.

Page 41

Spark maintains a DAG (Directed Acyclic Graph), a one-way graph connecting nodes, where the nodes depict the intermediate results you get from your transformations. Stage 0 depicts transformations, while Stage 1 shows actions.

If any RDD partition is lost because of a worker node failure, it can be recomputed using the lineage graph.

That is, any time your job fails, the DAG is re-run from the node nearest the failure to re-compute the lost RDD.

Assuming that all of the RDD transformations are deterministic, the data in the final transformed RDD will always be the same irrespective of failures in the Spark cluster.

Page 42

Spark DataFrames are essentially the result of thinking: Spark RDDs are a good way to do distributed data manipulation, but (usually) we need a more tabular data layout and richer query/manipulation operations.

In Spark, a DataFrame is a distributed collection of data organized into named columns.

The basic data structure we'll be using here is the DataFrame, inspired by Pandas' DataFrames.

It is inherently tabular: it has a fixed schema (≈ set of columns) with types, like a database table.

Like an RDD, a DataFrame is an immutable distributed collection of data with lazy evaluation (which means that a task is not executed until an action is performed). Unlike an RDD, the data is organized into named columns, like a table in a relational database. Designed to make processing large data sets even easier, the DataFrame allows developers to impose a structure onto a distributed collection of data, enabling higher-level abstraction.

Page 43

• DataFrames are designed for processing large collections of structured or semi-structured data.

• Observations in a Spark DataFrame are organised under named columns, which helps Apache Spark understand the schema of the DataFrame.

• A DataFrame in Apache Spark has the ability to handle petabytes of data.

• DataFrames have support for a wide range of data formats and sources.

• It has API support for different languages like Python, R, Scala, Java.

Why DataFrames are Useful?

Page 44

A DataFrame in Apache Spark can be created in multiple ways:

• It can be created using different data formats, for example by loading the data from JSON or CSV files.

• By loading data from an existing RDD.

• By programmatically specifying a schema.

How to create a DataFrame?

Page 45

• Operations on a PySpark DataFrame run in parallel on the different nodes of the cluster; with pandas this is not possible.

• Operations on a PySpark DataFrame are lazy in nature; with pandas we get the result as soon as we apply any operation.

• The pandas API supports more operations than the PySpark DataFrame API, and in that sense pandas is still more expressive than Spark.

DataFrames in Spark vs Pandas

Page 46

Stages – jobs are divided into stages. Stages are classified as map or reduce stages (this is easier to understand if you have worked with Hadoop and want to correlate).

Job – a piece of code which reads some input from HDFS or a local source, performs some computation on the data and writes some output data.

Driver – the entry point of a Spark program. It acts like a coordinator, which starts and controls a set of distributed processes across multiple nodes, called executors. The driver, along with the executors, is responsible for running Spark applications. Each application has its own driver and its own set of executors.

Tasks – each stage consists of some tasks. A task is the execution of a single stage on a data partition.

Executor - The process responsible for executing a task.

Page 47

Master – the machine on which the Driver program runs.

Slave/Worker – the machine on which the Executor program runs.

Page 48

Spark works with the system to distribute data across the cluster and process it in parallel. Spark uses a master/slave architecture, i.e. one central coordinator and many distributed workers. The central coordinator is called the driver.

The driver runs in its own Java process. It communicates with a potentially large number of distributed workers called executors. Each executor is a separate Java process.

A Spark application is the combination of a driver and its own executors. With the help of a cluster manager, a Spark application is launched on a set of machines.

Page 49

Spark can run in two different modes:

• Local Mode

• Cluster Mode

In Local Mode, Spark runs all its components, the driver and the executors, on the same local machine. This is primarily used for analyzing comparatively smaller datasets and for development purposes.

In Cluster Mode, Spark runs the driver and the executors on different nodes. It can start as many executors as it wants, provided enough nodes and resources are available in the cluster. In cluster mode, Spark can run standalone or can integrate with a cluster management framework like YARN or Mesos.
