Everything you wanted to know, but were afraid to ask about Oozie
Posted 19-Oct-2014 · Category: Technology
Everything that you ever wanted to know about Oozie, but were
afraid to ask
B Lublinsky, A Yakubovich
Apache Oozie
• Oozie is a workflow/coordination system to manage Apache Hadoop jobs.
• A single Oozie server implements all four functional Oozie components:
– Oozie workflow
– Oozie coordinator
– Oozie bundle
– Oozie SLA
Main components

[Architecture diagram: clients (the Oozie Command Line Interface and 3rd-party applications) submit and monitor jobs through the Oozie server's WS API. The server hosts bundles, coordinators (performing time-condition and data-condition monitoring) and workflows, where workflow logic is a graph of actions. The server persists definitions and states, uses the Oozie shared libraries, and executes work on Hadoop (HDFS, MapReduce).]
Oozie workflow
Workflow Language

Flow-control nodes (XML element type – description):
– Decision (workflow:DECISION) – expresses "switch-case" logic
– Fork (workflow:FORK) – splits one path of execution into multiple concurrent paths
– Join (workflow:JOIN) – waits until every concurrent execution path of a previous fork node arrives at it
– Kill (workflow:KILL) – forces a workflow job to kill (abort) itself

Action nodes (XML element type – description):
– Java (workflow:JAVA) – invokes the main() method of the specified Java class
– Fs (workflow:FS) – manipulates files and directories in HDFS; supports the commands move, delete, mkdir
– MapReduce (workflow:MAP-REDUCE) – starts a Hadoop map/reduce job; this can be a Java MR job, a streaming job or a pipe job
– Pig (workflow:PIG) – runs a Pig job
– Sub-workflow (workflow:SUB-WORKFLOW) – runs a child workflow job
– Hive * (workflow:HIVE) – runs a Hive job
– Shell * (workflow:SHELL) – runs a shell command
– Ssh * (workflow:SSH) – starts a shell command on a remote machine as a remote secure shell
– Sqoop * (workflow:SQOOP) – runs a Sqoop job
– Email * (workflow:EMAIL) – sends emails from an Oozie workflow application
– DistCp – under development (Yahoo)
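To make the node types concrete, a minimal hPDL sketch wiring a fork/join pair with a kill node might look like this (names and schema version are illustrative; the action bodies are elided):

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="fork-work"/>
  <fork name="fork-work">
    <path start="mr-step"/>
    <path start="pig-step"/>
  </fork>
  <action name="mr-step">
    <!-- map-reduce action body elided -->
    <ok to="join-work"/>
    <error to="fail"/>
  </action>
  <action name="pig-step">
    <!-- pig action body elided -->
    <ok to="join-work"/>
    <error to="fail"/>
  </action>
  <join name="join-work" to="end"/>
  <kill name="fail">
    <message>Work failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Note how every concurrent path started by the fork must transition to the same join node.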
Workflow actions
[Sequence diagram: ActionStartCommand, JavaActionExecutor, WorkflowStore, Services, JobClient, ActionExecutorContext]
1: workflow := getWorkflow()
2: action := getAction()
3: context := init<>()
4: executor := get()
5: start()
6: submitLauncher()
7: jobClient := get()
8: runningJob := submit()
9: setStartData()
• Oozie workflow supports two types of actions:
– Synchronous, executed inside the Oozie runtime
– Asynchronous, executed as a Map Reduce job
Workflow lifecycle

[State diagram: PREP → RUNNING → SUCCEEDED / KILLED / FAILED, with RUNNING ⇄ SUSPENDED]
Oozie execution console
Extending Oozie workflow
• Oozie provides a "minimal" workflow language, which contains only a handful of control and action nodes.
• Oozie supports a very elegant extensibility mechanism – custom action nodes. Custom action nodes allow you to extend Oozie's language with additional actions (verbs).
• Creating a custom action requires implementing the following:
– A Java action implementation, which extends the ActionExecutor class
– The action's XML schema, defining the action's configuration parameters
– Packaging of the Java implementation and configuration schema into an action jar, which has to be added to the Oozie war
– Extending oozie-site.xml to register the custom executor with the Oozie runtime
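The registration step, for instance, amounts to an entry like the following in oozie-site.xml (the executor class name is hypothetical):

```xml
<property>
  <name>oozie.service.ActionService.executor.ext.classes</name>
  <value>com.example.oozie.MyActionExecutor</value>
</property>
```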
Oozie Workflow Client
• Oozie provides an easy way to integrate with enterprise applications through the Oozie client APIs. It provides two types of APIs:
• REST HTTP API – a number of HTTP requests:
– Info requests (job status, job configuration)
– Job management (submit, start, suspend, resume, kill)
Example: job definition info request
GET /oozie/v0/job/job-ID?show=definition
• Java API – package org.apache.oozie.client:
– OozieClient – start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
– WorkflowJob, WorkflowAction
– CoordinatorJob, CoordinatorAction
– SLAEvent
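The REST API needs nothing beyond URL construction plus any HTTP client; a minimal sketch of building the request URLs shown above (the server address is an assumption, job IDs are placeholders):

```java
public class OozieRestUrls {
    // Builds the job-definition info request, e.g. GET .../v0/job/<id>?show=definition
    static String jobDefinitionUrl(String oozieBase, String jobId) {
        return oozieBase + "/v0/job/" + jobId + "?show=definition";
    }

    // Builds a job-management request, e.g. action = "suspend", "resume", "kill"
    static String jobActionUrl(String oozieBase, String jobId, String action) {
        return oozieBase + "/v0/job/" + jobId + "?action=" + action;
    }

    public static void main(String[] args) {
        String base = "http://oozie-host:11000/oozie"; // assumed server address
        System.out.println(jobDefinitionUrl(base, "job-ID"));
        System.out.println(jobActionUrl(base, "job-ID", "suspend"));
    }
}
```

Info requests are HTTP GETs; management requests are PUTs against the same job resource.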
Oozie workflow: the good, the bad and the ugly
• Good
– Nice integration with the Hadoop ecosystem, making it easy to build processes encompassing synchronized execution of multiple Map Reduce, Hive, Pig, etc. jobs
– Nice UI for tracking execution progress
– Simple APIs for integration with other applications
– Simple extensibility APIs
• Bad
– A process has to be expressed directly in hPDL, with no visual support
– No support for uber jars (but we added our own)
• Ugly
– Static forking (but you can regenerate the workflow and invoke it on the fly)
– No support for loops
Oozie Coordinator
Coordinator language

Element type – description – attributes and sub-elements:
– coordinator-app – top-level element in a coordinator instance – frequency, start, end
– controls – specify the execution policy for the coordinator and its elements (workflow actions) – timeout (actions), concurrency (actions), execution order (workflow instances)
– action – required singular element specifying the associated workflow; the jobs specified in the workflow consume and produce dataset instances – workflow name
– datasets – collection of data referred to by a logical name; datasets serve to specify data dependencies between workflow instances
– input event – specifies the input conditions (in the form of present datasets) that are required in order to execute a coordinator action
– output event – specifies the dataset that should be produced by a coordinator action
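Pulling these elements together, a coordinator application definition might look roughly like this (names, paths, dates and schema version are illustrative):

```xml
<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2014-10-01T00:00Z" end="2014-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="probes" frequency="${coord:days(1)}"
             initial-instance="2014-10-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/probes/${YEAR}${MONTH}${DAY}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="probesIn" dataset="probes">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/demo-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```

The coordinator materializes an action once per day and holds it until the current instance of the probes dataset is present.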
Coordinator lifecycle
Oozie Bundle
Bundle lifecycle
[State diagram with states: PREP, PREPPAUSED, PREPSUSPENDED, RUNNING, PAUSED, SUSPENDED, SUCCEEDED, KILLED, FAILED]
Oozie SLA

SLA Navigation

[Diagram: relationships between the Oozie database tables]
– SLA_EVENT: event_id, alert_contact, alert_frequency, …, sla_id, …
– COORD_JOBS: id, app_name, app_path, …
– COORD_ACTIONS: id, action_number, action_xml, …, external_id, …
– WF_JOBS: id, app_name, app_path, …
– WF_ACTIONS: id, conf, console_url, …
Using Probes to analyze/monitor Places
• Select probe data for a specified time/location
• Validate – filter – transform probe data
• Calculate statistics on available probe data
• Distribute data per geo-tiles
• Calculate place statistics (e.g. attendance index)
If an exception condition happens, report failure; if all steps succeed, report success.
Workflow as acyclic graph
Workflow – fragment 1
Workflow – fragment 2
Oozie tips and tricks
Configuring workflow
• Oozie provides three overlapping mechanisms to configure a workflow: config-default.xml, a job properties file, and job arguments that can be passed to Oozie as part of the command-line invocation.
• Oozie resolves these three sets of parameters as follows:
– Use all of the parameters from the command-line invocation
– For remaining unresolved parameters, use the job properties file
– Use config-default.xml for everything else
• Although the documentation does not clearly describe when to use which, the overall recommendation is:
– Use config-default.xml for parameters that never change for a given workflow
– Use job properties for parameters that are common for a given deployment of a workflow
– Use command-line arguments for parameters that are specific to a given workflow invocation
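For example, a job properties file for one deployment might look like this (host names and paths are illustrative):

```
# job.properties – values common to this deployment of the workflow
nameNode=hdfs://namenode:8020
jobTracker=jobtracker:8032
oozie.wf.application.path=${nameNode}/apps/demo-wf
```

It would be used as `oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run`, with invocation-specific parameters appended as `-Dname=value` on the command line, which take precedence over the file.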
Accessing and storing process variables
• Accessing – through the arguments passed to the Java main() method
• Storing – write a properties file to the location Oozie provides:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;
...
// Oozie tells the action where to write its output properties
String ooziePropFileName = System.getProperty("oozie.action.output.properties");
OutputStream os = new FileOutputStream(new File(ooziePropFileName));
Properties props = new Properties();
props.setProperty(key, value);
props.store(os, "");
os.close();
Validating data presence
• Oozie provides two possible approaches to validating the presence of resource file(s):
– Using the Oozie coordinator's input events, based on a dataset. Technically the simplest implementation approach, but it does not provide the more complex decision support that might be required: it either runs the corresponding workflow or it does not.
– Using a custom Java node inside the Oozie workflow. This allows the decision logic to be extended, for example to send notifications about data absence, run execution on partial data under certain timing conditions, etc.
• Additional configuration parameters for the Oozie coordinator, for example the ability to wait for file arrival, can expand the usage of the Oozie coordinator.
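The custom-Java-node approach can be sketched as follows. For brevity this checks the local filesystem via java.nio; a real node would use Hadoop's FileSystem API against HDFS, and the path names and return values here are invented:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DataPresenceCheck {
    // Decide how to proceed based on which inputs exist:
    // both present -> full run, primary only -> partial run, otherwise skip.
    static String decide(Path primary, Path secondary) {
        boolean p = Files.exists(primary);
        boolean s = Files.exists(secondary);
        if (p && s) return "run-full";
        if (p) return "run-partial";
        return "skip";
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("probe", ".dat"); // stands in for an arrived data file
        System.out.println(decide(tmp, Paths.get("/nonexistent/input"))); // prints run-partial
        Files.deleteIfExists(tmp);
    }
}
```

The returned decision would drive a workflow:DECISION node's transitions, which is exactly the extra flexibility the coordinator-only approach lacks.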
Invoking Map Reduce jobs
• Oozie provides two different ways of invoking a Map Reduce job: the MapReduce action and the Java action.
• Invoking a Map Reduce job with a Java action is somewhat similar to invoking the job with the Hadoop command line from an edge node: you specify a driver as the class for the Java action, and Oozie invokes the driver. This approach has two main advantages:
– The same driver class can be used both for running the Map Reduce job from an edge node and as a Java action in an Oozie process.
– A driver provides a convenient place for executing additional code, for example clean-up required for Map Reduce execution.
• The driver requires a proper shutdown hook to ensure that there are no lingering Map Reduce jobs.
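The shutdown-hook idea can be sketched like this (the job handle is simulated; a real driver would keep the Hadoop RunningJob reference and call its kill method inside the hook):

```java
public class DriverShutdownSketch {
    // Stand-in for a running MR job handle (a real driver would hold
    // the object returned by JobClient.submitJob and kill it).
    static volatile boolean jobRunning = true;

    // The clean-up a real hook would delegate to runningJob.killJob().
    static String killLingering() {
        jobRunning = false;
        return "killing lingering MR job";
    }

    public static void main(String[] args) {
        // Register the hook before submitting, so even an abnormal driver
        // exit (e.g. the Oozie launcher being killed) still cleans up.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            if (jobRunning) System.out.println("shutdown hook: " + killLingering());
        }));
        System.out.println("driver: job submitted");
        // Driver exits here while the job is still "running"; the hook fires.
    }
}
```

Without such a hook, killing the launcher leaves the already-submitted MR job running on the cluster.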
Implementing predefined looping and forking
• hPDL is an XML document with a well-defined schema.
• This means that the actual workflow can easily be manipulated using JAXB objects, which can be generated from the hPDL schema using the xjc compiler.
• This means that we can create the complete workflow programmatically, based on a calculated number of fork branches, or implement loops as repeated actions.
• The other option is to create a template process and modify it based on calculated parameters.
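For illustration, the programmatic-generation idea can be sketched with the JDK's DOM API standing in for JAXB classes generated from the hPDL schema; the element names follow hPDL, the rest is invented:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ForkGenerator {
    // Build a fork node with a calculated number of concurrent paths,
    // each transitioning to a generated action name.
    static Document buildFork(int branches) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element fork = doc.createElement("fork");
        fork.setAttribute("name", "parallel-steps");
        for (int i = 0; i < branches; i++) {
            Element path = doc.createElement("path");
            path.setAttribute("start", "action-" + i);
            fork.appendChild(path);
        }
        doc.appendChild(fork);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildFork(3);
        // Number of generated fork paths:
        System.out.println(doc.getDocumentElement()
                .getElementsByTagName("path").getLength()); // prints 3
    }
}
```

With JAXB the same loop would populate the generated FORK type's path list, then be marshalled back to hPDL before submission.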
Oozie client security (or lack of it)
• By default the Oozie client reads the client's identity from the local machine OS and passes it to the Oozie server, which uses this identity for MR job invocation.
• Impersonation can be implemented by overriding the OozieClient class' createConfiguration() method, where client variables can be set through a new constructor:

public Properties createConfiguration() {
    Properties conf = new Properties();
    if (user == null)
        conf.setProperty(USER_NAME, System.getProperty("user.name"));
    else
        conf.setProperty(USER_NAME, user);
    return conf;
}
Uber jars with Oozie
• An uber jar contains resources: other jars, .so libraries, zip files.

<java>
  ...
  <main-class>${wfUberLauncher}</main-class>
  <arg>-appStart=${wfAppMain}</arg>
  ...
</java>

[Diagram: the Oozie server runs a launcher Java action, which unpacks the resources into the current uber-jar directory, sets an inverse classloader, invokes the MR driver passing the arguments, and sets a 'wait for complete' shutdown hook. The uber jar contains the launcher classes plus the jar, .so and zip resources used by the mappers.]