Apache Spark: A Unified Engine for Big Data Processing€¦ · Resilient Distributed Datasets...

Apache Spark:A Unified Engine for Big Data Processing

Presented by: Huanyi Chen

Apache Spark:A Unified Engine for Big Data Processing§ Engine?

§ Unified?

Apache Spark: A Unified Engine for Big Data Processing PAGE 2

Apache Spark:A Unified Engine for Big Data Processing§ Engine?

§ convert one form of data into other useful forms

§ Unified?§ Multiple types of conversions

§ What is Apache Spark? (Engine)

§ How can it make multiple types of conversions over big data? (Unified)

What is Apache Spark?§ A framework like MapReduce

§ Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs)

I/O I/O

§ An RDD is a read-only, partitioned collection of records

§ Transformations§ create RDDs (map, filter, join, etc.)

§ Actions§ return a value to the application

§ or export data to a storage system

§ Persistence§ Users can indicate which RDDs they will reuse and choose a storage strategy for them

(e.g., in-memory storage).

§ Partitioning§ Users can ask that an RDD’s elements be partitioned across machines based on a key

in each record.

§ Lineage§ An RDD has enough information about how it was derived from

other datasets.

§ Narrow dependencies§ each partition of the parent RDD is used by at most one partition of

the child RDD

§ Wide dependencies:§ multiple child partitions

Example: run an action on RDD G

MapReduce Ecosystem Spark Ecosystem

MapReduce vs Spark

Higher-Level Libraries

SQL and DataFrames

§ !"#"$%"&'( = *!!( + ,-ℎ'&" = /"01'(§ Spark SQL’s DataFrame API supports inline definition of

user-defined functions (UDFs), without the complicated packaging and registration process found in other database systems.

UDF in MySQL

UDF in Spark SQL

Spark Streaming

Continuous operator processing model Discretized stream processing model

Spark Streaming

GraphX

§ Not able to beat specialized graph-parallel systems itself

§ But outperform them in graph analytics pipeline

§ More than 50 common algorithms for distributed model training

§ Support pipeline construction on Spark

§ Integrate with other Spark libraries well

Why use Apache Spark?

§ Ecosystem

§ Competitive performance

§ Low cost in sharing data

§ Low latency of MapReduce Steps

§ Control over bottleneck resources

Apache Spark in 2016

§ Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.

§ Apache Spark has grown to 1,000 contributors and thousands of deployments from 2010 to 2016.

Apache Spark Today

§ What is Apache Spark§ Apache Spark = MapReduce + RDDs

§ How can it make multiple types of conversions over big data§ Higher-level libraries enable Apache Spark to handle different types

of big data workload

“Try Apache Spark if you are new to the big data processing world”

Huanyi Chen

Q&A§ What issues will it cause by persisting data in memory? For

example, garbage collection?

§ What are Parallel Random Access Machine model and Bulk Synchronous Parallel model? Are these two models able to model any computation in distributed world?

§ Will optimizing one library cause other libraries to lose performance?

§ Is using memory as the storage really the next generation of storage?

Apache Spark: A Unified Engine for Big Data Processing€¦ · Resilient Distributed Datasets...

Documents

Transcript of Apache Spark: A Unified Engine for Big Data Processing€¦ · Resilient Distributed Datasets...

Using Apache Spark Pat McDonough - Databricks. Apache Spark spark.incubator.apache.org github.com/apache/incubator- spark user@spark.incubator.apache.or.

Using Apache Spark, Apache Kafka and Apache Cassandra...USING APACHE SPARK, APACHE KAFKA AND APACHE CASSANDRA TO POWER INTELLIGENT APPLICATIONS | 02 Apache Cassandra is well known

Unified Big Data Processing with Apache Spark · Unified Big Data Processing with Apache Spark ... MapReduce was great for batch processing, ... Spark can leverage most of the latest

Apache Spark - Courses€¦ · Apache Spark Introduction to Data Science DATA11001 Nitinder Mohan CollaborativeNetworking (CoNe) nitinder.mohan@helsinki.fi. What is Apache Spark?

Apache Spark Operations

Introduction to Cassandra • Why Spark - Apache Cassandra | Apache Kafka | Apache Spark · 2017. 12. 20. · • Introduction to Cassandra • Why Spark + Cassandra • Problem background

Spark SQL | Apache Spark

Apache Spark & Hadoop

Apache Spark PDF

Apache spark meetup

Apache Spark - Yandex

Apache Spark - LMU

Apache Ignite and Apache Spark - GridGain Systems · Ignite and Spark Integration Spark Application Spark Worker Spark Job Spark Job Yarn Mesos Docker HDFS Spark Worker Spark Job

Session #2442: Flash-Optimized Apache Spark: Expanding In ... · R Scala SQL Python Java Spark SQL Streaming MLlib GraphX #ibmedge Apache Spark 6 • Unified Analytics Platform –

Developing Apache Spark Applications · Apache Spark Introduction Introduction Apache Spark enables you to quickly develop applications and process jobs. Apache Spark is designed

Apache Spark: A Unified Engine for Big Data Processingtozsu/courses/CS848/W19/presentations/Huanyi... · Apache Spark in 2016 §Apache Spark applications range from finance to scientific

with Apache Spark Getting Started · 2020-06-07 · Apache Spark API’s Introducing Delta Lake A New Standard for Building Data Lakes. Apache Spark - A Unified Analytics Engine 7.

TeachYourself Apache Spark...HOUR 1 Introducing Apache Spark..... 1 2 Understanding Hadoop ... Part II: Programming with Apache Spark HOUR 6: Learning the Basics of Spark Programming

MANAGED SOLUTIONS Apache Spark€¦ · 01/10/2015 · Collocated Data Engine Performance Monitoring Unified Memory Management Reliability Automated Provisioning of Spark Jobserver

Developing Apache Spark Applications - Cloudera · Apache Spark Quick Start Apache Spark Overview Apache Spark Programming Guide Using the Spark DataFrame API A DataFrame is a distributed