Data Pipeline Management Framework on Oozie

Post on 10-May-2015

Architecture of campaign analytics. The issues in the old campaign analytics processes. Building a pipeline management framework for a robust computing environment.

Data Pipeline Management Framework on Oozie

Kun Lu

Overview

Architecture of Campaign Analytics

What are the issues in the old Campaign Analytics processes

Build Pipeline Management Framework for robust computing environment

Architecture of Campaign Analytics

What are the issues the framework needs to solve

Consistent and robust framework

Adding a new analytics job should be easier

Ability to coordinate complex workflows (serialized and parallel processing)

It should support the catch-up feature

It should make debugging and tracing easier

What does Oozie provide?

Workflow Engine

Workflow definition: a DAG with control flow nodes or action nodes (connected with transition arrows)

Workflow nodes: control flow nodes (start, end, decision, fork, join, kill node) and action nodes (Map-reduce, Pig, Java, Script, etc.)

Parameterization of workflow: job properties and EL functions (Basic EL, WF EL, Hadoop EL, HDFS EL)
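
To make the DAG and parameterization ideas concrete, here is a minimal Python sketch. It is not Oozie's own format (real workflows are defined in XML) and not an Oozie API. The node names, the dictionary layout, and the helper names validate, execution_order, and resolve are all hypothetical. It models a workflow with a fork/join pair, checks that every transition points at a defined node, derives one valid order along the success path, and substitutes ${name} placeholders from job properties.

```python
import re
from collections import deque

# Hypothetical in-memory model of an Oozie-style workflow DAG.
# "to" holds success transitions; "error" is the failure transition.
WORKFLOW = {
    "start":       {"type": "start",  "to": ["fork-load"]},
    "fork-load":   {"type": "fork",   "to": ["load-clicks", "load-views"]},
    "load-clicks": {"type": "action", "to": ["join-load"], "error": "fail"},
    "load-views":  {"type": "action", "to": ["join-load"], "error": "fail"},
    "join-load":   {"type": "join",   "to": ["aggregate"]},
    "aggregate":   {"type": "action", "to": ["end"], "error": "fail"},
    "fail":        {"type": "kill",   "to": []},
    "end":         {"type": "end",    "to": []},
}


def validate(workflow):
    """Raise ValueError if any transition targets an undefined node."""
    for name, node in workflow.items():
        targets = list(node["to"])
        if "error" in node:
            targets.append(node["error"])
        for target in targets:
            if target not in workflow:
                raise ValueError(f"{name} transitions to unknown node {target}")


def execution_order(workflow, start="start"):
    """Return the success-path nodes in a topological order (Kahn's algorithm)."""
    # Collect the nodes reachable from start through success transitions.
    reachable, stack = set(), [start]
    while stack:
        name = stack.pop()
        if name not in reachable:
            reachable.add(name)
            stack.extend(workflow[name]["to"])
    # Count incoming success transitions for each reachable node.
    indegree = {name: 0 for name in reachable}
    for name in reachable:
        for target in workflow[name]["to"]:
            indegree[target] += 1
    # A node becomes ready once all of its predecessors have run,
    # which is how a join waits for every forked path.
    ready, order = deque([start]), []
    while ready:
        name = ready.popleft()
        order.append(name)
        for target in workflow[name]["to"]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    if len(order) != len(reachable):
        raise ValueError("workflow contains a cycle")
    return order


def resolve(text, properties):
    """Replace ${name} placeholders with values from the job properties."""
    return re.sub(r"\$\{(\w+)\}", lambda m: properties[m.group(1)], text)


validate(WORKFLOW)
ORDER = execution_order(WORKFLOW)
INPUT_DIR = resolve("${nameNode}/campaign/${runDate}",
                    {"nameNode": "hdfs://namenode:8020", "runDate": "2015-05-10"})
```

In this sketch the join node is released only after both forked actions have finished, which mirrors the serialized and parallel processing the framework needs to coordinate.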

Oozie Console

Oozie Client and API

Workflow Design Pattern

Campaign Analytics Pipeline Management Framework

Campaign Analytics Pipeline Management Framework (PMF) is built on top of Oozie.

PMF defines campaign analytics processing pipeline. Each pipeline includes a set of workflows.

PMF organizes, schedules and coordinates the campaign analytics jobs. It also provides the built-in catch-up feature to make the pipeline robust.

The Oozie workflow engine executes workflows and sends job status to the Oozie server.

Jobs are monitored and traced through the Oozie console.
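
The slides do not describe how the catch-up feature is implemented. The sketch below is one plausible reading, under the assumption that catch-up means re-running every scheduled interval that was missed while a pipeline was down. The function name missed_runs and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def missed_runs(last_success, now, interval):
    """List the nominal run times after last_success, up to and including now."""
    runs = []
    next_run = last_success + interval
    while next_run <= now:
        runs.append(next_run)
        next_run += interval
    return runs


# Hypothetical outage: the hourly pipeline last succeeded at 03:00
# and the scheduler comes back at 07:30 on the same day.
LAST_SUCCESS = datetime(2015, 5, 10, 3, 0)
NOW = datetime(2015, 5, 10, 7, 30)
PENDING = missed_runs(LAST_SUCCESS, NOW, timedelta(hours=1))
```

Here PENDING holds the 04:00, 05:00, 06:00, and 07:00 runs, which a scheduler could submit in order before resuming the normal cadence. The same function would cover the daily and weekly pipelines by passing a different interval.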

PMF & Oozie Execution Env.

PMF Servers: own the pipeline definition; pass workflow tasks to Oozie through the Oozie client

Oozie Server: executes workflow tasks; manages task status

Hadoop Cluster: workflow definition deployed in HDFS; M/R processes run on the cluster
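
As a rough illustration of the hand-off from a PMF server to Oozie, the following sketch prepares a job properties file body and the matching Oozie command-line invocation. The slides do not say whether PMF uses the command-line client or the client API, so this is an assumption. The host names, paths, and the helper names job_properties and submit_command are hypothetical. oozie.wf.application.path is the standard property that points at the workflow definition in HDFS, while nameNode and jobTracker are only conventional variable names.

```python
def job_properties(app_path, name_node, job_tracker, extra=None):
    """Render the text of a job.properties file for one workflow submission."""
    props = {
        "nameNode": name_node,        # conventional variable name, not required by Oozie
        "jobTracker": job_tracker,    # conventional variable name, not required by Oozie
        "oozie.wf.application.path": app_path,  # HDFS location of the workflow definition
    }
    props.update(extra or {})
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


def submit_command(oozie_url, properties_file):
    """Build the argument list for submitting and starting a workflow job."""
    return ["oozie", "job", "-oozie", oozie_url, "-config", properties_file, "-run"]


# Hypothetical hosts and paths, for illustration only.
PROPERTIES_TEXT = job_properties(
    app_path="hdfs://namenode:8020/apps/campaign/hourly",
    name_node="hdfs://namenode:8020",
    job_tracker="jobtracker:8021",
    extra={"runDate": "2015-05-10"},
)
COMMAND = submit_command("http://oozie-host:11000/oozie", "job.properties")
```

The sketch only builds the file contents and the argument list. Writing the file and running the command (for example with subprocess.run) is left out so the example stays free of side effects.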

Oozie Console

Workflow Console

Current Workflows

PMF manages three pipelines (hourly pipeline, daily pipeline, and weekly pipeline)

Includes 12 workflows

Map/Reduce Jobs run per month: ~100,000 jobs