Module: Data Ingestion On Sqoop

Transcript of training slides: repo.kagesenshi.org/hdptraining/training-slides-extended.pdf · (c) 2017 Abyres Enterprise Technologies Sdn Bhd

Page 1 / 48

Module: Data Ingestion On Sqoop

Page 2 / 48

Objectives

● At the end of this lesson, students shall be able to:
– Understand what Sqoop is, its uses and its strengths
– Understand how Sqoop ingests data into HDFS
– Understand Sqoop ‘direct’ mode functionality
– Understand how to implement full and incremental RDBMS ingestion using Sqoop
– Use the Sqoop CLI to ingest data from an RDBMS

Page 3 / 48

Introduction To Sqoop

● Distributed data ingestion tool for extracting large RDBMS tables

● Distributes ingestion by assigning each mapper a different section/partition of the source data

● High-performance connectors to sources
– Sqoop provides highly optimized data extraction strategies for different RDBMSs:
● Oracle
● MySQL
● PostgreSQL

Page 4 / 48

Sqoop Data Ingestion Architecture

[Diagram] The Sqoop CLI triggers a Sqoop AppMaster, which manages the Sqoop Mappers. Each mapper reads a different data block from the source through optimized connectors and writes its output as a file in HDFS.

Page 5 / 48

Sqoop Command

● Getting help
– sqoop help

● Ingesting from MySQL into Hive with ORC storage

– sqoop import \
    --connect jdbc:mysql://server/dbname \
    --table tablename \
    --hcatalog-database destdbname \
    --hcatalog-table desttablename \
    --create-hcatalog-table \
    --hcatalog-storage-stanza "stored as orc"
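● Before running an import, it can help to explore the source database first. sqoop list-databases and sqoop list-tables are standard companion commands; the connection details below are illustrative placeholders (-P prompts for the password).

– sqoop list-databases \
    --connect jdbc:mysql://server \
    --username dbuser -P

– sqoop list-tables \
    --connect jdbc:mysql://server/dbname \
    --username dbuser -P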

Page 6 / 48

Sqoop ‘Direct’ Mode

● The default sqoop command uses a standard JDBC connection to ingest data.
● This can be quite slow for large data sources, because the source may take time to unpack and prepare data for transmission through JDBC.
● Sqoop provides a ‘direct’ mode which attempts to ingest data through a database-specific, more optimized ingestion strategy.
● To use direct mode, add the --direct option to your sqoop command (a sketch follows below).
● Note: direct mode may have specific requirements for different databases before it can be used.
– Refer to https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_compatibility_notes for more information
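● As a hedged illustration, the earlier MySQL import could be run in direct mode by adding the flag. For MySQL, direct mode relies on the mysqldump utility being available on the worker nodes, and it is shown here writing to a plain HDFS target directory (the paths and names are placeholders).

– sqoop import --direct \
    --connect jdbc:mysql://server/dbname \
    --table tablename \
    --target-dir /staging/tablename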

Page 7 / 48

Implementing Incremental Ingestion With Sqoop

● Incremental ingestion of transactional / log-like data can be done through the incremental-append strategy, which requires an incremental identifier such as:
– an incrementing running number, OR
– an entry-creation timestamp

● Incremental ingestion of operational tables with updates can be done through the incremental-merge strategy, which requires the following two (2) fields:
– a unique identifier column
– a modification timestamp

● Both strategies map onto Sqoop's --incremental options, as sketched below.
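The sketches below show one way each strategy maps onto Sqoop 1.4.x's --incremental flags; the connection string, table, column and directory names are illustrative placeholders, and the --last-value is typically recorded between runs by the calling workflow or a Sqoop saved job.

– Incremental-append (new rows only):
  sqoop import \
    --connect jdbc:mysql://server/dbname \
    --table tablename \
    --target-dir /staging/tablename \
    --incremental append \
    --check-column id \
    --last-value 12345

– Incremental import of updated rows (lastmodified):
  sqoop import \
    --connect jdbc:mysql://server/dbname \
    --table tablename \
    --target-dir /staging/tablename \
    --incremental lastmodified \
    --check-column last_modified \
    --last-value "2017-01-01 00:00:00" \
    --merge-key id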

Page 8 / 48

Full Ingestion

[Diagram] Full ingestion workflow: the workflow triggers a Sqoop full ingestion through the Sqoop CLI; Sqoop starts its mappers, which query all data from the source table; the source returns all data, and the mappers write it into the Hive table.

Page 9 / 48

Incremental-Append

[Diagram] Incremental-append workflow: the workflow triggers Sqoop with a filtered query (select * where incremental_field > last_incremental_value); the Sqoop CLI starts the mappers, which query the filtered data from the source table and write it into a staging Hive table; the workflow then triggers a Hive append from the staging table into the main Hive table, followed by a Hive drop of the staging table.

Page 10 / 48

Incremental-Merge

[Diagram] Incremental-merge workflow: the workflow triggers Sqoop with a filtered query (select * where last_modified > last_ingest_date); the Sqoop CLI starts the mappers, which query the filtered data from the source table and write it into a staging Hive table; the workflow then triggers a Hive merge into the main Hive table (taking the latest value per unique ID), followed by a Hive drop of the staging table.

Page 11 / 48

Incremental Merge SQL

CREATE VIEW RECONCILE_VIEW AS
SELECT t2.* FROM
  (SELECT *,
          ROW_NUMBER() OVER (PARTITION BY UNIQUE_ID_COLUMN
                             ORDER BY LAST_MODIFIED_COLUMN DESC) hive_rn
   FROM
     (SELECT * FROM HIVE_TABLE
      WHERE LAST_MODIFIED_COLUMN <= ${LAST_MODIFIED_TIMESTAMP}
         OR LAST_MODIFIED_COLUMN IS NULL
      UNION ALL
      SELECT * FROM STAGING_TABLE
      WHERE LAST_MODIFIED_COLUMN > ${LAST_MODIFIED_TIMESTAMP}) t1) t2
WHERE t2.hive_rn = 1;
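The view only defines the reconciled result. A common follow-up step, shown here as a sketch (REPORTING_TABLE is a hypothetical target name), is to materialize the view into a fresh table and then drop the staging table, matching the "Hive Drop" step in the workflow above:

DROP TABLE IF EXISTS REPORTING_TABLE;
CREATE TABLE REPORTING_TABLE AS SELECT * FROM RECONCILE_VIEW;
DROP TABLE STAGING_TABLE;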

Page 12 / 48

Better ways to Incremental-Merge?

● With Hive ACID enabled, it is possible to merge through the ACID MERGE command:

MERGE INTO HIVE_TABLE AS T
USING STAGING_TABLE AS S
ON T.UNIQUE_ID_COLUMN = S.UNIQUE_ID_COLUMN
WHEN MATCHED AND S.DELETE_COLUMN IS NOT NULL THEN DELETE
WHEN MATCHED THEN UPDATE SET VAL1_COLUMN = S.VAL1_COLUMN, VAL2_COLUMN = S.VAL2_COLUMN
WHEN NOT MATCHED THEN INSERT VALUES (S.UNIQUE_ID_COLUMN, S.VAL1_COLUMN, S.VAL2_COLUMN, S.LASTMODIFIED_COLUMN, S.DELETE_TIMESTAMP_COLUMN);

● With Hive ACID enabled, a Spark / Python program can also be written to upsert new values from the staging table by iterating over the data.

● Alternatively, Change Data Capture / data replication solutions such as SymmetricDS can be made to work with Hive ACID tables.

Page 13 / 48

LAB: INGESTING DATA WITH SQOOP

Page 14 / 48

Module: Data Ingestion On NiFi

Page 15 / 48

Objectives

● At the end of this lesson, students shall be able to:
– Understand the key components and concepts in a NiFi flow
– Understand the NiFi Expression Language
– Use NiFi to build a data ingestion flow

Page 16 / 48

Introduction To NiFi

● Centralized, web-based data flow management tool for moving data from various sources to various destinations

● Over 200 processors for:
– Extracting data
– Filtering data
– Transforming data formats
– Loading (saving) data

● Highly configurable
– Loss tolerant vs guaranteed delivery
– Low latency vs high throughput
– Dynamic prioritization
– Flows can be modified at runtime
– Back pressure

● Data Provenance
– Track dataflow from beginning to end

Page 17 / 48

NiFi Use Case

● What Apache NiFi is good at:
– Reliable and secure transfer of data between systems
– Delivery of data from sources to analytic platforms
– Enrichment and preparation of data:
● Conversion between formats
● Extraction / parsing
● Routing decisions

● What Apache NiFi shouldn’t be used for:
– Distributed computation
– Complex event processing
– Joins, rolling windows, aggregate operations

Page 18 / 48

Key Concept: FlowFile

● A FlowFile is basically the data itself.
● It consists of 2 components:
– Header attributes
– Content body

● The attributes store metadata about the received file

● The content body stores the actual data itself

[Diagram] A FlowFile shown as its header attributes plus a content body holding the data.

Page 19 / 48

Processors

● The actual component that does the work

● Generates FlowFiles, or receives FlowFiles and acts on them

● Can be parallelized and load balanced across nodes

● Right-click on a processor and select Configure to configure it

Page 20 / 48

Processor Configuration: Settings

● Name
– Human-readable name for the processor

● ID
– UUID of the processor object. Can be used in the NiFi REST API

● Automatically Terminate Relationships
– Check to terminate the output relationships, i.e. you are not going to configure an output connection for the relationship.

Page 21 / 48

Processor Configuration: Scheduling

● Scheduling Strategy
– Timer driven – periodic
– Cron driven – cron-like scheduling

● Concurrent Tasks
– Number of parallel threads this task will run as

● Run Schedule
– Scheduling interval for the task

Page 22 / 48

Processor Configuration: Properties

● Processor-specific configuration

● Refer to the processor documentation to learn what each property is for, and how to use it

Page 23 / 48

Connections

● Represents a data flow queue from one processor to another

● Right-clicking and selecting Configure will load the connection settings page.

● Right-clicking and selecting List Queue will load the queue’s FlowFile listing page.

Page 24 / 48

Connection Settings

● Name
– Human-readable name of the connection

● FlowFile Expiration
– How long a FlowFile may remain in the queue before it is deleted

● Back Pressure Object Threshold
– Maximum number of FlowFiles that will be queued in this queue

● Back Pressure Data Size Threshold
– Maximum total size of FlowFiles that can be queued in this queue

● Selected Prioritizers
– Queue priority algorithm

Page 25 / 48

Connection Item List

● This view lists all FlowFiles queued in the connection

● Clicking the information icon in the leftmost column will load an information page for that specific FlowFile

Page 26 / 48

FlowFile Details

Page 27 / 48

FlowFile Content

● Clicking the "View" button from the FlowFile details view will load a page that shows the content of the FlowFile

Page 28 / 48

Process Group

● Processors and connections can be grouped together into a single unit called a Process Group

● Process Groups may define variables for configuring the Processors inside them through the NiFi Expression Language

● Input / Output ports may be created inside a Process Group to allow connections from outside the Process Group to flow into it

● A Remote Process Group is a connection to a separate NiFi cluster

Page 29 / 48

Funnel

● The output of multiple connections can be merged into a single flow using a Funnel

● A Funnel can also be used to temporarily stage data while downstream processors are still being developed

Page 30 / 48

Expression Language

● Certain NiFi Processor property fields can be configured using the NiFi Expression Language (EL).

● In simple terms, EL is a text templating language that fills a property field with values acquired programmatically from FlowFile attributes or from Process Group variables.

● Examples:
– ${now():format("yyyy/MM/dd")} ← returns the current date in a format such as 2018/09/01
– ${filename:substring(0,1)} ← returns the first character of the filename attribute / variable

● https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html

Page 31 / 48

LAB: INGESTING DATA WITH NiFi

Page 32 / 48

Module: Data Transformation With Spark

Page 33 / 48

Objectives

● At the end of this lesson, students shall be able to:
– Understand the difference between Hive and Spark
– Know the components of Apache Spark
– Understand the architecture of job submission for Spark programs on HDP
– Understand what an RDD is and how to manipulate it using Spark data manipulation / transformation functions
– Understand what a DataFrame is and how to manipulate it using the DataFrame API
– Understand how to register a DataFrame as a temporary in-memory table for querying using Spark SQL
– Understand how to register custom transformation functions (UDFs) and use them in Spark SQL
– Understand how to save a Spark temporary table as a Hive table

Page 34 / 48

Apache Spark

● Apache Spark is a general-purpose in-memory data analytics tool that can run on Hadoop YARN.
● Spark consists of 5 key components:

– Spark Core
● Core of Spark. Base RDD operations.

– Spark SQL
● DataFrames and HiveQL support on Spark

– Spark MLlib
● Distributed machine learning algorithms on Spark

– Spark GraphX
● Graph computation engine

– Spark Streaming
● Stream processing engine using micro-batching

● The Spark API is supported in multiple languages:
– Java, Scala, Python, R & SQL

Page 35 / 48

Hive vs Spark

Hive:
● Primarily a SQL engine
● Option to use Tez (default), Spark, or MapReduce (old) as the execution engine
● Performance is tied to the execution engine chosen
● Tez, the default execution engine, provides performance comparable to Spark, if not better
– Data is loaded into memory based on need
● MapReduce, the older engine, is slower due to high I/O, but more robust in handling failures
● Requires Hadoop core components to function, as it runs on YARN

Spark:
● General-purpose data processing tool, supporting:
– RDD
– DataFrame
– SQL
– graph processing
– machine learning
– stream processing
● Data is loaded into memory before processing
● Can run standalone, without running on Hadoop YARN

Page 36 / 48

Spark Job Submission Architecture

[Diagram] Spark job submission paths: a user / client tool connects through Knox (port 8443), which proxies to the Spark Livy Server (port 8998) or the Spark2 Livy Server (port 8999); alternatively, a shell on an edge node runs spark-submit directly. Both paths launch a Spark on-demand cluster on YARN.

Page 37 / 48

Spark Job Submission Tools

● spark-submit command
– Command-line tool for submitting code to Spark
– Default method, well supported by all major Spark and Hadoop distributions
– Requires shell access to a node with spark-submit installed

● Livy Server
– REST-API-based code submission (sketched below)
– Newer tools may use Livy for job submission
– Allows REST-based security for controlling access to the Hadoop cluster
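As a hedged sketch of the Livy path, the following submits a PySpark script already stored in HDFS as a Livy batch over REST. The server URL, script path and X-Requested-By value are illustrative assumptions; a secured cluster may additionally require Knox or Kerberos authentication.

Submitting a PySpark script through Livy (sketch)

import json
import requests

livy_url = 'http://livy-server:8998'                  # assumed Livy endpoint
headers = {'Content-Type': 'application/json',
           'X-Requested-By': 'student'}               # Livy requires this header when CSRF protection is enabled
payload = {'file': 'hdfs:///user/student/script.py'}  # PySpark script already uploaded to HDFS

# Create the batch; Livy responds with the batch id and its initial state
resp = requests.post(livy_url + '/batches', data=json.dumps(payload), headers=headers)
batch = resp.json()
print(batch['id'], batch['state'])

# Poll the batch state until the job completes
state = requests.get('{0}/batches/{1}/state'.format(livy_url, batch['id']), headers=headers).json()
print(state['state'])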

Page 38 / 48

Introduction To RDD

● A Resilient Distributed Dataset (RDD) is an abstraction used in Spark for manipulating in-memory data.
● An RDD represents a copy of data in the memory of the Spark cluster which can be manipulated, transformed, analyzed, etc.
● An RDD can be created either by parallelizing data (uploading data) from the driver program, or by referencing an existing dataset in HDFS, HBase or any filesystem offering a Hadoop InputFormat.
● An RDD can be manipulated using 2 types of functions:
– Transform functions
– Action functions

● Transform functions are used to transform the RDD and are lazy; no actual processing is executed until an Action function is triggered.
● Action functions trigger RDD processing and are used to get the transformation results for further processing. When an Action function completes, the results are transferred from the in-memory RDD to the driver program. A small illustration follows.
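A minimal illustration of the lazy/eager split, assuming an existing SparkContext sc and a hypothetical HDFS path:

rdd = sc.textFile('/data/events.log')               # lazy: nothing is read yet
errors = rdd.filter(lambda line: 'ERROR' in line)   # lazy: only extends the DAG
print(errors.count())                               # action: the job runs here and the count returns to the driver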

Page 39 / 48

Spark DAG Processing Flow

[Diagram] The user runs spark-submit / a driver program, which submits work to Spark; file partitions are read through the HDFS client, each partition is processed by a task, and the tasks together build the in-memory RDD.

Page 40 / 48

Writing RDD Spark Programs With PySpark

Page 41 / 48

Spark RDD Transformation Functions

● Transformation functions are used to modify RDDs into a different structure.
● Transformation functions are 'lazy', i.e. only the DAG definition is added; no data is processed until an action function is triggered.
● Data manipulation is done through manipulating key-value tuples.
● Common functions (a short example follows after the list):
– map(func) – Return a new distributed dataset formed by passing each element of the source through a function func.
– filter(func) – Return a new dataset formed by selecting those elements of the source on which func returns true.
– flatMap(func) – Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
– distinct() – Return a new dataset that contains the distinct elements of the source dataset.
– groupByKey() – When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
– reduceByKey(func) – When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
– sortByKey([ascending]) – When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
– join(otherDataset) – When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

● Full list of functions: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
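A small PySpark sketch of the pair-RDD functions above, using made-up shop data and assuming an existing SparkContext sc:

sales = sc.parallelize([('shopA', 10), ('shopB', 5), ('shopA', 7)])
regions = sc.parallelize([('shopA', 'north'), ('shopB', 'south')])

totals = sales.reduceByKey(lambda a, b: a + b)   # [('shopA', 17), ('shopB', 5)]
joined = totals.join(regions)                    # [('shopA', (17, 'north')), ('shopB', (5, 'south'))]
print(joined.collect())                          # nothing runs until this action is called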

Page 42 / 48

Spark RDD Action Functions

● Action functions are used to get results out of RDDs.
● An action triggers the data transformation DAG defined on the RDD.
● Common RDD actions are (a short example follows after the list):
– reduce(func) – Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
– collect() – Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
– count() – Return the number of elements in the dataset.
– first() – Return the first element of the dataset (similar to take(1)).
– take(n) – Return an array with the first n elements of the dataset.

● Full list of functions: https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions
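A quick sketch of these actions on a toy dataset, again assuming an existing SparkContext sc:

rdd = sc.parallelize([3, 1, 4, 1, 5, 9])

print(rdd.count())                      # 6
print(rdd.first())                      # 3
print(rdd.take(3))                      # [3, 1, 4]
print(rdd.reduce(lambda a, b: a + b))   # 23
print(rdd.collect())                    # [3, 1, 4, 1, 5, 9]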

Page 43 / 48

Initializing Spark Context

● Before any operations can be done using PySpark, you need to initialize a Spark context in your driver program.
– This is only necessary for jobs submitted through spark-submit and Livy; it is not needed when working with the interactive shell, as the SparkContext is initialized by the shell itself.

Initializing Spark Context

from pyspark import SparkContext, SparkConf

appName = 'MyApp'
master = 'yarn'  # e.g. 'yarn' when running on the cluster, or 'local[*]' for local testing
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)

Page 44 / 48

Parallelizing data from driver program to the cluster

● The driver program can read data from local files and load it into the Spark cluster's memory. To do this, the sc.parallelize() function is called with the dataset.

Parallelizing dataset

data = [1,2,3,4,5,6,7,8,9,0]

rdd = sc.parallelize(data)

Page 45 / 48

Reading Data From HDFS

● Spark supports reading various formats from HDFS, such as text files, SequenceFiles, Avro, ORC and Parquet. Pure RDD operations, however, are meant to work with schema-less formats like text files and SequenceFiles.

Reading text file from HDFS

rdd1 = sc.textFile('/path/to/textfile')     # reading a single text file
rdd2 = sc.textFile('/path/to/folder/*')     # reading multiple files
rdd3 = sc.textFile('/path/to/folder/*.gz')  # reading compressed text files

Reading SequenceFiles from HDFS

rdd1 = sc.sequenceFile('/path/to/file')      # reading a single file
rdd2 = sc.sequenceFile('/path/to/folder/*')  # reading multiple files

Page 46 / 48

Processing RDD

● Once data has been loaded into an RDD, we can begin to manipulate it using transformation and action functions.

● All transformation functions return a new RDD with the function added to its DAG, so you can chain RDD transform functions together to create a more complex DAG.

Transforming RDD (style #1)

rdd = sc.parallelize(['hello'])
rdd = rdd.flatMap(lambda x: list(x))
# ['h', 'e', 'l', 'l', 'o']
rdd = rdd.map(lambda x: (x, 1))
# [('h',1),('e',1),('l',1),('l',1),('o',1)]
rdd = rdd.reduceByKey(lambda a,b: a+b)
# [('h',1),('e',1),('l',2),('o',1)]
print(rdd.collect())

Transforming RDD (style #2)

rdd = sc.parallelize(['hello'])
rdd = (rdd.flatMap(lambda x: list(x))
          .map(lambda x: (x, 1))
          .reduceByKey(lambda a,b: a+b))
# [('h',1),('e',1),('l',2),('o',1)]
print(rdd.collect())

Page 47 / 48

Submitting Job To Cluster

● To submit your Spark program to the cluster, run the spark-submit command

Submitting a PySpark program through spark-submit using Spark 2

export SPARK_MAJOR_VERSION=2
spark-submit --master yarn --deploy-mode client script.py
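In client mode the driver runs on the submitting machine. As a hedged variant, the same script can be submitted in cluster mode so the driver runs inside YARN, optionally with explicit executor sizing (the values below are illustrative):

Submitting in cluster mode (sketch)

export SPARK_MAJOR_VERSION=2
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 --executor-memory 2g \
  script.py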

Page 48 / 48

Lab: Spark RDD Programming