Hadoop With Spark

ACADGILD presents Spark
Presented by: Sandy

Introduction to ACADGILD

You can also click this link to view the video of this webinar: https://www.youtube.com/watch?v=7nipSdxv2Uo


Introduction of Mentor
The mentor for this webinar is Mr. Sandy. His qualifications:
- 15 years of experience in IT, focusing on Big Data, Data Science, and IoT solutions and implementations.
- Expert in the Apache Spark ecosystem, including Spark 1.6, Scala, Spark SQL, Spark Streaming, MLlib, SparkR, and GraphX.
- Extensive experience in Hadoop framework solutions, including YARN/Mesos, HDFS, MapReduce, Pig Latin, Hive, HBase/MongoDB/Cassandra, Mahout, Flume, ZooKeeper, Oozie, and Sqoop.
- Knowledge of Machine Learning, covering both supervised and unsupervised learning algorithms.


Agenda

1. What is Big data?
2. MapReduce Limitations
3. Introduction to Spark
4. Spark in Hadoop Ecosystem
5. Why In-memory Processing?
6. In-memory Caching
7. Resilient Distributed Dataset
8. Creating RDDs
9. Spark Unified Platform
10. Popular Use Cases
11. Apache Spark Case Studies
12. Get Your Feet Wet with Spark APIs


What is Big data?

MapReduce Limitations

MapReduce is based on disk-based computing. It is well suited to single-pass computations but poorly suited to iterative computations, because every pass reads its input from disk and writes its output back to disk.

Programming model limitations: developing efficient MapReduce applications requires advanced programming skills and a deep understanding of the system architecture.

Every problem has to be broken down into Map and Reduce phases.


Introduction to Spark
Apache Spark is a fast and general-purpose cluster computing system.

Spark is a framework for scheduling, monitoring, and distributing applications.

Spark is a general unified engine that can replace many specialized systems such as Mahout, Tez, GraphLab, and Storm.


Some key points about Spark:
- Handles batch, interactive, and real-time workloads within a single framework.
- Native integration with Java, Python, and Scala.
- Programming at a higher level of abstraction.
- More general: map/reduce is just one set of supported constructs, as the sketch below shows.
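To make that higher level of abstraction concrete, here is a minimal word-count sketch in Spark's Scala API, where map and reduce appear as ordinary methods alongside many other operators. It assumes an existing SparkContext `sc`; "input.txt" is a placeholder path.

// Minimal word-count sketch; assumes an existing SparkContext `sc`.
val counts = sc.textFile("input.txt")      // "input.txt" is a placeholder path
  .flatMap(line => line.split(" "))        // split each line into words
  .map(word => (word, 1))                  // pair each word with a count of 1
  .reduceByKey(_ + _)                      // sum the counts per word
counts.take(5).foreach(println)            // trigger the computation, print a sample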

Spark in Hadoop Ecosystem
[Diagram: Spark's libraries and APIs (SQL, GraphX, MLlib, Streaming) running across the Hadoop ecosystem: distributions (CDH, HDP, MapR, DSE), databases (RDBMS), file systems, streaming sources, and resource managers.]

Why In-memory Processing? A drastic change in hardware:

Earlier | Now
RAM was very costly; disk was comparatively cheap, so disk was the primary home of data. | The cost of RAM has fallen sharply while its performance has increased, so RAM is now the primary home of data, with disk as the fallback.
The network was costly, so data locality mattered. | The network is much faster.
Single-core machines were dominant. | Multi-core machines are commonplace.


In-memory Caching

Resilient Distributed Dataset
A Resilient Distributed Dataset (RDD) represents an immutable collection of objects, partitioned across a set of machines, that can be rebuilt if a partition is lost. It is a distributed memory abstraction.

Features:
- Cache an RDD in memory across machines.
- Reuse it in multiple MapReduce-like parallel operations.
- Fault tolerant through lineage.
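A minimal sketch of caching and reuse in Scala, assuming an existing SparkContext `sc`; "events.log" is a placeholder path:

// Cache an RDD in memory so several actions reuse the same data.
val events = sc.textFile("events.log").cache()            // mark for in-memory caching
val errors = events.filter(_.contains("ERROR")).count()   // first action materializes the cache
val warnings = events.filter(_.contains("WARN")).count()  // second action reads from memory
println(s"errors=$errors, warnings=$warnings")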


An RDD is the basic abstraction in Spark: a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. An RDD shards its data over the cluster, like a virtualized, distributed collection.

RDDs are partitioned, locality-aware, distributed collections, and they are immutable.

RDDs are data structures that either point to a direct data source (e.g. HDFS) or apply transformations to their parent RDD(s) to generate new data elements. Computations on RDDs are represented by lazily evaluated lineage DAGs composed of chained RDDs.

RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.

In both cases, keeping data in memory can improve performance by an order of magnitude.

RDD Lineage: if a partition of an RDD is lost, the RDD retains enough information about how it was derived from other RDDs to rebuild just that partition.
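Spark's Scala API exposes the recorded lineage through `toDebugString`; a minimal sketch, assuming an existing SparkContext `sc`:

// Inspect the lineage Spark records for fault recovery.
val base = sc.parallelize(1 to 100)
val derived = base.map(_ * 2).filter(_ % 3 == 0)
println(derived.toDebugString)  // prints the chain of parent RDDs (the lineage DAG)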

An RDD can be created in two ways:
- Parallelize a collection.
- Read data from an external source (S3, C*, HDFS, etc.).

Creating RDDs
Turn a collection into an RDD:
val a = sc.parallelize(Array(1, 2, 3))

Load a text file from the local FS, HDFS, or S3:
val a = sc.textFile("file.txt")
val b = sc.textFile("directory/*.txt")
val c = sc.textFile("hdfs://namenode:9000/path/file")


There are currently two types:
- Parallelized collections take an existing Scala collection and run functions on it in parallel.
- Hadoop datasets run functions on each record of a file in the Hadoop Distributed File System, or any other storage system supported by Hadoop.

Spark can create RDDs from any file stored in HDFS or other storage systems supported by Hadoop, e.g., local file system, Amazon S3, Hypertable, HBase, etc.

Spark supports text files, SequenceFiles, and any other Hadoop InputFormat, and can also take a directory or a glob.
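For instance, a sketch of reading a SequenceFile with the Scala API; the path and the (String, Int) key/value types are placeholder assumptions, with Spark converting the underlying Writable types automatically:

// Read a SequenceFile of (String, Int) pairs; assumes an existing SparkContext `sc`.
val pairs = sc.sequenceFile[String, Int]("hdfs://namenode:9000/path/pairs")
println(pairs.first())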

There are two types of operations on RDDs: transformations and actions. Transformations are lazy and only extend the lineage; actions trigger the actual computation, as in the sketch below.
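A minimal sketch of the distinction, assuming an existing SparkContext `sc`:

// Transformations (map, filter) are lazy: nothing runs yet.
val squares = sc.parallelize(1 to 10).map(n => n * n)
val evens = squares.filter(_ % 2 == 0)
// Actions (count, collect) trigger execution of the whole chain.
println(evens.count())
println(evens.collect().mkString(", "))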

Spark Unified Platform
[Diagram: the Spark Core Engine with its libraries on top: Spark SQL (DataFrames), Spark Streaming (streaming), MLlib (machine learning), GraphX (graph computation), and SparkR (R on Spark).]

Popular Use Cases
[Bar chart: adoption percentages ranging from 29% to 68% across six use cases: Business Intelligence, Data Warehousing, Recommendation, Log Processing, User-Facing Services, and Fraud Detection/Security.]

- Data integration and ETL
- Interactive analytics or business intelligence
- High-performance batch computation
- Machine learning and advanced analytics
- Real-time stream processing


Apache Spark Case Studies
- Credit Card Fraud Detection
- Network Security
- Genomic Sequencing
- Real-Time Ad Processing

Get Your Feet Wet with Spark APIs

A quick tour of the Scala, Python, and Java APIs.
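The Scala flavor, as a minimal sketch; Python and Java expose the same operations. It assumes an existing SparkContext `sc`, and "README.md" is a placeholder path:

// Count the lines mentioning "Spark"; the pipeline reads almost identically in Python and Java.
val lines = sc.textFile("README.md")
val sparkLines = lines.filter(_.contains("Spark"))
println(sparkLines.count())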


Any Questions?

Get in Touch with Us

Contact Info:
Website: http://www.acadgild.com
LinkedIn: https://www.linkedin.com/company/acadgild
Facebook: https://www.facebook.com/acadgild
Support: [email protected]

Thank You
