Module: Data Ingestion On Sqoop

Transcript of training slides: repo.kagesenshi.org/hdptraining/training-slides-extended.pdf · (c) 2017 Abyres Enterprise Technologies Sdn Bhd

Page 1 / 48

Module: Data Ingestion On Sqoop

Page 2 / 48

Objectives

● At the end of this lesson, students shall be able to:
– Understand what Sqoop is, its uses and its strengths
– Understand how Sqoop ingests data into HDFS
– Understand Sqoop ‘direct’ mode functionality
– Understand how to implement full and incremental RDBMS ingestion using Sqoop
– Use the Sqoop CLI to ingest data from an RDBMS

Page 3 / 48

Introduction To Sqoop

● Distributed data ingestion tool for extracting large RDBMS tables

● Distributes ingestion by assigning each mapper a different section/partition of the source data

● High-performance connectors to sources
– Sqoop provides highly optimized data extraction strategies for different RDBMSs:
● Oracle
● MySQL
● PostgreSQL

Page 4 / 48

Sqoop Data Ingestion Architecture

[Diagram] The Sqoop CLI triggers a Sqoop AppMaster, which manages the Sqoop Mappers. Each mapper reads a different data block from the source through optimized connectors and writes its output as a file in HDFS.

Page 5 / 48

Sqoop Command

● Getting help
– sqoop help

● Ingesting from MySQL into Hive with ORC storage

– sqoop import \
    --connect jdbc:mysql://server/dbname \
    --table tablename \
    --hcatalog-database destdbname \
    --hcatalog-table desttablename \
    --create-hcatalog-table \
    --hcatalog-storage-stanza "stored as orc"
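● Before running an import, it can help to explore the source database first. sqoop list-databases and sqoop list-tables are standard companion commands; the connection details below are illustrative placeholders (-P prompts for the password).

– sqoop list-databases \
    --connect jdbc:mysql://server \
    --username dbuser -P

– sqoop list-tables \
    --connect jdbc:mysql://server/dbname \
    --username dbuser -P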

Page 6 / 48

Sqoop ‘Direct’ Mode

● The default sqoop command uses a standard JDBC connection to ingest data.
● This can be quite slow for large data sources, because the source may take time to unpack and prepare data for transmission through JDBC.
● Sqoop provides a ‘direct’ mode which attempts to ingest data through a database-specific, more optimized ingestion strategy.
● To use direct mode, add the --direct option to your sqoop command (a sketch follows below).
● Note: direct mode may have specific requirements for different databases before it can be used.
– Refer to https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_compatibility_notes for more information
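● As a hedged illustration, the earlier MySQL import could be run in direct mode by adding the flag. For MySQL, direct mode relies on the mysqldump utility being available on the worker nodes, and it is shown here writing to a plain HDFS target directory (the paths and names are placeholders).

– sqoop import --direct \
    --connect jdbc:mysql://server/dbname \
    --table tablename \
    --target-dir /staging/tablename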

Page 7 / 48

Implementing Incremental Ingestion With Sqoop

● Incremental ingestion of transactional / log-like data can be done through the incremental-append strategy, which requires an incremental identifier such as:
– an incrementing running number, OR
– an entry-creation timestamp

● Incremental ingestion of operational tables with updates can be done through the incremental-merge strategy, which requires the following two (2) fields:
– a unique identifier column
– a modification timestamp

● Both strategies map onto Sqoop's --incremental options, as sketched below.
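The sketches below show one way each strategy maps onto Sqoop 1.4.x's --incremental flags; the connection string, table, column and directory names are illustrative placeholders, and the --last-value is typically recorded between runs by the calling workflow or a Sqoop saved job.

– Incremental-append (new rows only):
  sqoop import \
    --connect jdbc:mysql://server/dbname \
    --table tablename \
    --target-dir /staging/tablename \
    --incremental append \
    --check-column id \
    --last-value 12345

– Incremental import of updated rows (lastmodified):
  sqoop import \
    --connect jdbc:mysql://server/dbname \
    --table tablename \
    --target-dir /staging/tablename \
    --incremental lastmodified \
    --check-column last_modified \
    --last-value "2017-01-01 00:00:00" \
    --merge-key id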

Page 8 / 48

Full Ingestion

[Diagram] Full ingestion workflow: the workflow triggers a Sqoop full ingestion through the Sqoop CLI; Sqoop starts its mappers, which query all data from the source table; the source returns all data, and the mappers write it into the Hive table.

Page 9 / 48

Incremental-Append

[Diagram] Incremental-append workflow: the workflow triggers Sqoop with a filtered query (select * where incremental_field > last_incremental_value); the Sqoop CLI starts the mappers, which query the filtered data from the source table and write it into a staging Hive table; the workflow then triggers a Hive append from the staging table into the main Hive table, followed by a Hive drop of the staging table.

Page 10 / 48

Incremental-Merge

[Diagram] Incremental-merge workflow: the workflow triggers Sqoop with a filtered query (select * where last_modified > last_ingest_date); the Sqoop CLI starts the mappers, which query the filtered data from the source table and write it into a staging Hive table; the workflow then triggers a Hive merge into the main Hive table (taking the latest value per unique ID), followed by a Hive drop of the staging table.

Page 11 / 48

Incremental Merge SQL

CREATE VIEW RECONCILE_VIEW AS
SELECT t2.* FROM
  (SELECT *,
          ROW_NUMBER() OVER (PARTITION BY UNIQUE_ID_COLUMN
                             ORDER BY LAST_MODIFIED_COLUMN DESC) hive_rn
   FROM
     (SELECT * FROM HIVE_TABLE
      WHERE LAST_MODIFIED_COLUMN <= ${LAST_MODIFIED_TIMESTAMP}
         OR LAST_MODIFIED_COLUMN IS NULL
      UNION ALL
      SELECT * FROM STAGING_TABLE
      WHERE LAST_MODIFIED_COLUMN > ${LAST_MODIFIED_TIMESTAMP}) t1) t2
WHERE t2.hive_rn = 1;
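The view only defines the reconciled result. A common follow-up step, shown here as a sketch (REPORTING_TABLE is a hypothetical target name), is to materialize the view into a fresh table and then drop the staging table, matching the "Hive Drop" step in the workflow above:

DROP TABLE IF EXISTS REPORTING_TABLE;
CREATE TABLE REPORTING_TABLE AS SELECT * FROM RECONCILE_VIEW;
DROP TABLE STAGING_TABLE;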

Page 12 / 48

Better ways to Incremental-Merge?

● With Hive ACID enabled, it is possible to merge through the ACID MERGE command:

MERGE INTO HIVE_TABLE AS T
USING STAGING_TABLE AS S
ON T.UNIQUE_ID_COLUMN = S.UNIQUE_ID_COLUMN
WHEN MATCHED AND S.DELETE_COLUMN IS NOT NULL THEN DELETE
WHEN MATCHED THEN UPDATE SET VAL1_COLUMN = S.VAL1_COLUMN, VAL2_COLUMN = S.VAL2_COLUMN
WHEN NOT MATCHED THEN INSERT VALUES (S.UNIQUE_ID_COLUMN, S.VAL1_COLUMN, S.VAL2_COLUMN, S.LASTMODIFIED_COLUMN, S.DELETE_TIMESTAMP_COLUMN);

● With Hive ACID enabled, a Spark / Python program can also be written to upsert new values from the staging table by iterating over the data.

● Alternatively, Change Data Capture / data replication solutions such as SymmetricDS can be made to work with Hive ACID tables.

Page 13 / 48

LAB: INGESTING DATA WITH SQOOP

Page 14 / 48

Module: Data Ingestion On NiFi

Page 15 / 48

Objectives

● At the end of this lesson, students shall be able to:
– Understand the key components and concepts in a NiFi flow
– Understand the NiFi Expression Language
– Use NiFi to build a data ingestion flow

Page 16 / 48

Introduction To NiFi

● Centralized, web-based data flow management tool for moving data from various sources to various destinations

● Over 200 processors for:
– Extracting data
– Filtering data
– Transforming data formats
– Loading (saving) data

● Highly configurable
– Loss tolerant vs guaranteed delivery
– Low latency vs high throughput
– Dynamic prioritization
– Flows can be modified at runtime
– Back pressure

● Data Provenance
– Track dataflow from beginning to end

Page 17 / 48

NiFi Use Case

● What Apache NiFi is good at:
– Reliable and secure transfer of data between systems
– Delivery of data from sources to analytic platforms
– Enrichment and preparation of data:
● Conversion between formats
● Extraction / parsing
● Routing decisions

● What Apache NiFi shouldn’t be used for:
– Distributed computation
– Complex event processing
– Joins, rolling windows, aggregate operations

Page 18 / 48

Key Concept: FlowFile

● A FlowFile is basically the data itself.
● It consists of 2 components:
– Header attributes
– Content body

● The attributes store metadata about the received file

● The content body stores the actual data itself

[Diagram] A FlowFile shown as its header attributes plus a content body holding the data.

Page 19 / 48

Processors

● The actual component that does the work

● Generates FlowFiles, or receives FlowFiles and acts on them

● Can be parallelized and load balanced across nodes

● Right-click on a processor and select Configure to configure it

Page 20 / 48

Processor Configuration: Settings

● Name
– Human-readable name for the processor

● ID
– UUID of the processor object. Can be used in the NiFi REST API

● Automatically Terminate Relationships
– Check to terminate the output relationships, i.e. you are not going to configure an output connection for the relationship.

Page 21 / 48

Processor Configuration: Scheduling

● Scheduling Strategy
– Timer driven – periodic
– Cron driven – cron-like scheduling

● Concurrent Tasks
– Number of parallel threads this task will run as

● Run Schedule
– Scheduling interval for the task

Page 22 / 48

Processor Configuration: Properties

● Processor-specific configuration

● Refer to the processor documentation to learn what each property is for, and how to use it

Page 23 / 48

Connections

● Represents a data flow queue from one processor to another

● Right-clicking and selecting Configure will load the connection settings page.

● Right-clicking and selecting List Queue will load the queue’s FlowFile listing page.

Page 24 / 48

Connection Settings

● Name
– Human-readable name of the connection

● FlowFile Expiration
– How long a FlowFile may remain in the queue before it is deleted

● Back Pressure Object Threshold
– Maximum number of FlowFiles that will be queued in this queue

● Back Pressure Data Size Threshold
– Maximum total size of FlowFiles that can be queued in this queue

● Selected Prioritizers
– Queue priority algorithm

Page 25 / 48

Connection Item List

● This view lists all FlowFiles queued in the connection

● Clicking the information icon in the leftmost column will load an information page for that specific FlowFile

Page 26 / 48

FlowFile Details

Page 27 / 48

FlowFile Content

● Clicking the "View" button from the FlowFile details view will load a page that shows the content of the FlowFile

Page 28 / 48

Process Group

● Processors and connections can be grouped together into a single unit called a Process Group

● Process Groups may define variables for configuring the Processors inside them through the NiFi Expression Language

● Input / Output ports may be created inside a Process Group to allow connections from outside the Process Group to flow into it

● A Remote Process Group is a connection to a separate NiFi cluster

Page 29 / 48

Funnel

● The output of multiple connections can be merged into a single flow using a Funnel

● A Funnel can also be used to temporarily stage data while downstream processors are still being developed

Page 30 / 48

Expression Language

● Certain NiFi Processor property fields can be configured using the NiFi Expression Language (EL).

● In simple terms, EL is a text templating language that fills a property field with values acquired programmatically from FlowFile attributes or from Process Group variables.

● Examples:
– ${now():format("yyyy/MM/dd")} ← returns the current date in a format such as 2018/09/01
– ${filename:substring(0,1)} ← returns the first character of the filename attribute / variable

● https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html

Page 31 / 48

LAB: INGESTING DATA WITH NiFi

Page 32 / 48

Module: Data Transformation With Spark

Page 33 / 48

Objectives

● At the end of this lesson, students shall be able to:
– Understand the difference between Hive and Spark
– Know the components of Apache Spark
– Understand the architecture of job submission for Spark programs on HDP
– Understand what an RDD is and how to manipulate it using Spark data manipulation / transformation functions
– Understand what a DataFrame is and how to manipulate it using the DataFrame API
– Understand how to register a DataFrame as a temporary in-memory table for querying using Spark SQL
– Understand how to register custom transformation functions (UDFs) and use them in Spark SQL
– Understand how to save a Spark temporary table as a Hive table

Page 34 / 48

Apache Spark

● Apache Spark is a general-purpose in-memory data analytics tool that can run on Hadoop YARN.
● Spark consists of 5 key components:

– Spark Core
● Core of Spark. Base RDD operations.

– Spark SQL
● DataFrames and HiveQL support on Spark

– Spark MLlib
● Distributed machine learning algorithms on Spark

– Spark GraphX
● Graph computation engine

– Spark Streaming
● Stream processing engine using micro-batching

● The Spark API is supported in multiple languages:
– Java, Scala, Python, R & SQL

Page 35 / 48

Hive vs Spark

Hive:
● Primarily a SQL engine
● Option to use Tez (default), Spark, or MapReduce (old) as the execution engine
● Performance is tied to the execution engine chosen
● Tez, the default execution engine, provides performance comparable to Spark, if not better
– Data is loaded into memory based on need
● MapReduce, the older engine, is slower due to high I/O, but more robust in handling failures
● Requires Hadoop core components to function, as it runs on YARN

Spark:
● General-purpose data processing tool, supporting:
– RDD
– DataFrame
– SQL
– graph processing
– machine learning
– stream processing
● Data is loaded into memory before processing
● Can run standalone, without running on Hadoop YARN

Page 36 / 48

Spark Job Submission Architecture

[Diagram] Spark job submission paths: a user / client tool connects through Knox (port 8443), which proxies to the Spark Livy Server (port 8998) or the Spark2 Livy Server (port 8999); alternatively, a shell on an edge node runs spark-submit directly. Both paths launch a Spark on-demand cluster on YARN.

Page 37 / 48

Spark Job Submission Tools

● spark-submit command
– Command-line tool for submitting code to Spark
– Default method, well supported by all major Spark and Hadoop distributions
– Requires shell access to a node with spark-submit installed

● Livy Server
– REST-API-based code submission (sketched below)
– Newer tools may use Livy for job submission
– Allows REST-based security for controlling access to the Hadoop cluster
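As a hedged sketch of the Livy path, the following submits a PySpark script already stored in HDFS as a Livy batch over REST. The server URL, script path and X-Requested-By value are illustrative assumptions; a secured cluster may additionally require Knox or Kerberos authentication.

Submitting a PySpark script through Livy (sketch)

import json
import requests

livy_url = 'http://livy-server:8998'                  # assumed Livy endpoint
headers = {'Content-Type': 'application/json',
           'X-Requested-By': 'student'}               # Livy requires this header when CSRF protection is enabled
payload = {'file': 'hdfs:///user/student/script.py'}  # PySpark script already uploaded to HDFS

# Create the batch; Livy responds with the batch id and its initial state
resp = requests.post(livy_url + '/batches', data=json.dumps(payload), headers=headers)
batch = resp.json()
print(batch['id'], batch['state'])

# Poll the batch state until the job completes
state = requests.get('{0}/batches/{1}/state'.format(livy_url, batch['id']), headers=headers).json()
print(state['state'])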

Page 38 / 48

Introduction To RDD

● A Resilient Distributed Dataset (RDD) is an abstraction used in Spark for manipulating in-memory data.
● An RDD represents a copy of data in the memory of the Spark cluster which can be manipulated, transformed, analyzed, etc.
● An RDD can be created either by parallelizing data (uploading data) from the driver program, or by referencing an existing dataset in HDFS, HBase or any filesystem offering a Hadoop InputFormat.
● An RDD can be manipulated using 2 types of functions:
– Transform functions
– Action functions

● Transform functions are used to transform the RDD and are lazy; no actual processing is executed until an Action function is triggered.
● Action functions trigger RDD processing and are used to get the transformation results for further processing. When an Action function completes, the results are transferred from the in-memory RDD to the driver program. A small illustration follows.
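A minimal illustration of the lazy/eager split, assuming an existing SparkContext sc and a hypothetical HDFS path:

rdd = sc.textFile('/data/events.log')               # lazy: nothing is read yet
errors = rdd.filter(lambda line: 'ERROR' in line)   # lazy: only extends the DAG
print(errors.count())                               # action: the job runs here and the count returns to the driver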

Page 39 / 48

Spark DAG Processing Flow

[Diagram] The user runs spark-submit / a driver program, which submits work to Spark; file partitions are read through the HDFS client, each partition is processed by a task, and the tasks together build the in-memory RDD.

Page 40 / 48

Writing RDD Spark Programs With PySpark

Page 41 / 48

Spark RDD Transformation Functions

● Transformation functions are used to modify RDDs into a different structure.
● Transformation functions are 'lazy', i.e. only the DAG definition is added; no data is processed until an action function is triggered.
● Data manipulation is done through manipulating key-value tuples.
● Common functions (a short example follows after the list):
– map(func) – Return a new distributed dataset formed by passing each element of the source through a function func.
– filter(func) – Return a new dataset formed by selecting those elements of the source on which func returns true.
– flatMap(func) – Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
– distinct() – Return a new dataset that contains the distinct elements of the source dataset.
– groupByKey() – When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
– reduceByKey(func) – When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
– sortByKey([ascending]) – When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
– join(otherDataset) – When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

● Full list of functions: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
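A small PySpark sketch of the pair-RDD functions above, using made-up shop data and assuming an existing SparkContext sc:

sales = sc.parallelize([('shopA', 10), ('shopB', 5), ('shopA', 7)])
regions = sc.parallelize([('shopA', 'north'), ('shopB', 'south')])

totals = sales.reduceByKey(lambda a, b: a + b)   # [('shopA', 17), ('shopB', 5)]
joined = totals.join(regions)                    # [('shopA', (17, 'north')), ('shopB', (5, 'south'))]
print(joined.collect())                          # nothing runs until this action is called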

Page 42 / 48

Spark RDD Action Functions

● Action functions are used to get results out of RDDs.
● An action triggers the data transformation DAG defined on the RDD.
● Common RDD actions are (a short example follows after the list):
– reduce(func) – Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
– collect() – Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
– count() – Return the number of elements in the dataset.
– first() – Return the first element of the dataset (similar to take(1)).
– take(n) – Return an array with the first n elements of the dataset.

● Full list of functions: https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions
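A quick sketch of these actions on a toy dataset, again assuming an existing SparkContext sc:

rdd = sc.parallelize([3, 1, 4, 1, 5, 9])

print(rdd.count())                      # 6
print(rdd.first())                      # 3
print(rdd.take(3))                      # [3, 1, 4]
print(rdd.reduce(lambda a, b: a + b))   # 23
print(rdd.collect())                    # [3, 1, 4, 1, 5, 9]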

Page 43 / 48

Initializing Spark Context

● Before any operations can be done using PySpark, you need to initialize a Spark context in your driver program.
– This is only necessary for jobs submitted through spark-submit and Livy; it is not needed when working with the interactive shell, as the SparkContext is initialized by the shell itself.

Initializing Spark Context

from pyspark import SparkContext, SparkConf

appName = 'MyApp'
master = 'yarn'  # e.g. 'yarn' when running on the cluster, or 'local[*]' for local testing
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)

Page 44 / 48

Parallelizing data from driver program to the cluster

● The driver program can read data from local files and load it into the Spark cluster's memory. To do this, the sc.parallelize() function is called with the dataset.

Parallelizing dataset

data = [1,2,3,4,5,6,7,8,9,0]

rdd = sc.parallelize(data)

Page 45 / 48

Reading Data From HDFS

● Spark supports reading various formats from HDFS, such as text files, SequenceFiles, Avro, ORC and Parquet. Pure RDD operations, however, are meant to work with schema-less formats like text files and SequenceFiles.

Reading text file from HDFS

rdd1 = sc.textFile('/path/to/textfile')     # reading a single text file
rdd2 = sc.textFile('/path/to/folder/*')     # reading multiple files
rdd3 = sc.textFile('/path/to/folder/*.gz')  # reading compressed text files

Reading SequenceFiles from HDFS

rdd1 = sc.sequenceFile('/path/to/file')      # reading a single file
rdd2 = sc.sequenceFile('/path/to/folder/*')  # reading multiple files

Page 46 / 48

Processing RDD

● Once data has been loaded into an RDD, we can begin to manipulate it using transformation and action functions.

● All transformation functions return a new RDD with the function added to its DAG, so you can chain RDD transform functions together to create a more complex DAG.

Transforming RDD (style #1)

rdd = sc.parallelize(['hello'])
rdd = rdd.flatMap(lambda x: list(x))
# ['h', 'e', 'l', 'l', 'o']
rdd = rdd.map(lambda x: (x, 1))
# [('h',1),('e',1),('l',1),('l',1),('o',1)]
rdd = rdd.reduceByKey(lambda a,b: a+b)
# [('h',1),('e',1),('l',2),('o',1)]
print(rdd.collect())

Transforming RDD (style #2)

rdd = sc.parallelize(['hello'])
rdd = (rdd.flatMap(lambda x: list(x))
          .map(lambda x: (x, 1))
          .reduceByKey(lambda a,b: a+b))
# [('h',1),('e',1),('l',2),('o',1)]
print(rdd.collect())

Page 47 / 48

Submitting Job To Cluster

● To submit your Spark program to the cluster, run the spark-submit command

Submitting a PySpark program through spark-submit using Spark 2

export SPARK_MAJOR_VERSION=2
spark-submit --master yarn --deploy-mode client script.py
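In client mode the driver runs on the submitting machine. As a hedged variant, the same script can be submitted in cluster mode so the driver runs inside YARN, optionally with explicit executor sizing (the values below are illustrative):

Submitting in cluster mode (sketch)

export SPARK_MAJOR_VERSION=2
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 --executor-memory 2g \
  script.py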

Page 48 / 48

Lab: Spark RDD Programming