Data Pipeline Management Framework on Oozie

Kun Lu

Description

Architecture of campaign analytics

The issues in the old campaign analytics processes

Building a pipeline management framework for a robust computing environment

Transcript of Data Pipeline Management Framework on Oozie

Page 1: Data Pipeline Management Framework on Oozie

Data Pipeline Management Framework on Oozie

Kun Lu

Page 2: Data Pipeline Management Framework on Oozie

Overview

Architecture of Campaign Analytics

What are the issues in the old Campaign Analytics processes

Build Pipeline Management Framework for robust computing environment

Page 3: Data Pipeline Management Framework on Oozie

Architecture of Campaign Analytics

Page 4: Data Pipeline Management Framework on Oozie

What are the issues the framework needs to solve

Consistent and robust framework

Adding a new analytics job should be easier

Ability to coordinate complex workflows (serialized and parallel processing)

It should support the catch-up feature

It should make debugging and tracing easier
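The coordination requirement above (serialized and parallel processing) can be illustrated with a small dependency graph. This is only a sketch: the job names are hypothetical and the staging logic is not taken from the framework itself. It groups jobs into stages, where stages run one after another and the jobs inside a stage could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each job maps to the set of jobs it depends on.
dependencies = {
    "load_logs": set(),
    "count_impressions": {"load_logs"},
    "count_clicks": {"load_logs"},
    "build_report": {"count_impressions", "count_clicks"},
}


def execution_stages(deps):
    """Group jobs into stages.

    Stages run one after another (serialized); jobs inside one stage
    have no dependencies on each other, so they may run in parallel.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(dependencies)
for number, jobs in enumerate(stages, start=1):
    print(f"stage {number}: {', '.join(jobs)}")
```

In Oozie terms, a stage with more than one job roughly corresponds to a fork/join pair, and consecutive stages correspond to plain transitions.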

Page 5: Data Pipeline Management Framework on Oozie

What does Oozie provide?

Workflow Engine

Workflow definition: a DAG with control flow nodes or action nodes (connected with transition arrows)

Workflow nodes: control flow nodes (start, end, decision, fork, join, kill node) and action nodes (Map-reduce, Pig, Java, Script, etc.)

Parameterization of workflows: job properties and EL functions (Basic EL, WF EL, Hadoop EL, HDFS EL)

Oozie Console

Oozie Client and API
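To make the workflow definition concrete, the sketch below generates a small workflow document with a start node, a fork, two Pig actions, a join, a kill node, and an end node. The workflow name, action names, script names, schema version, and the `jobTracker`, `nameNode`, and `inputDir` properties are all assumptions for illustration. The `${...}` placeholders are left for Oozie to resolve from job properties and EL functions, and the output is not validated against Oozie's XML schema.

```python
import xml.etree.ElementTree as ET

NS = "uri:oozie:workflow:0.5"  # assumed schema version


def build_workflow(name, parallel_actions):
    """Return workflow XML shaped as: start -> fork -> actions -> join -> end."""
    app = ET.Element("workflow-app", {"name": name, "xmlns": NS})
    ET.SubElement(app, "start", {"to": "split"})

    # Control flow node: fork into one path per action.
    fork = ET.SubElement(app, "fork", {"name": "split"})
    for action_name in parallel_actions:
        ET.SubElement(fork, "path", {"start": action_name})

    # Action nodes: one Pig action per path, parameterized by job properties.
    for action_name in parallel_actions:
        action = ET.SubElement(app, "action", {"name": action_name})
        pig = ET.SubElement(action, "pig")
        ET.SubElement(pig, "job-tracker").text = "${jobTracker}"
        ET.SubElement(pig, "name-node").text = "${nameNode}"
        ET.SubElement(pig, "script").text = f"{action_name}.pig"  # hypothetical script
        ET.SubElement(pig, "param").text = "INPUT=${inputDir}"
        ET.SubElement(action, "ok", {"to": "merge"})
        ET.SubElement(action, "error", {"to": "fail"})

    # Control flow nodes: join the paths, or kill the workflow on error.
    ET.SubElement(app, "join", {"name": "merge", "to": "end"})
    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Failed at ${wf:lastErrorNode()}"
    ET.SubElement(app, "end", {"name": "end"})

    ET.indent(app)
    return ET.tostring(app, encoding="unicode")


workflow_xml = build_workflow("hourly-report", ["count_impressions", "count_clicks"])
print(workflow_xml)
```

Each transition arrow of the DAG appears as a `to` or `start` attribute, which is why the node names must be unique within one workflow.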

Page 6: Data Pipeline Management Framework on Oozie

Workflow Design Pattern

Page 7: Data Pipeline Management Framework on Oozie

Campaign Analytics Pipeline Management Framework

Campaign Analytics Pipeline Management Framework (PMF) is built on top of Oozie.

PMF defines the campaign analytics processing pipelines. Each pipeline includes a set of workflows.

PMF organizes, schedules, and coordinates the campaign analytics jobs. It also provides a built-in catch-up feature to make the pipelines robust.

The Oozie workflow engine executes the workflows and sends job status to the Oozie server.

Jobs are monitored and traced through the Oozie console.
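The slides do not describe how the catch-up feature works. One plausible reading is that, after an outage, the framework finds every period that has not been processed yet and runs the pipeline's workflows for each of them in order. The sketch below follows that reading; the `Pipeline` class, the function names, and the workflow names are hypothetical and are not PMF's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Pipeline:
    name: str
    interval: timedelta   # length of one processing period
    workflows: tuple      # workflow names run for every period


def missed_periods(pipeline, last_success, now):
    """Return the start times of all complete, unprocessed periods.

    last_success is the start time of the last period that was processed.
    A period is only returned once it has fully elapsed.
    """
    periods = []
    start = last_success + pipeline.interval
    while start + pipeline.interval <= now:
        periods.append(start)
        start += pipeline.interval
    return periods


def catch_up_runs(pipeline, last_success, now):
    """List (period_start, workflow) pairs to run, oldest period first."""
    return [
        (period, workflow)
        for period in missed_periods(pipeline, last_success, now)
        for workflow in pipeline.workflows
    ]


hourly = Pipeline("hourly", timedelta(hours=1), ("aggregate", "report"))
runs = catch_up_runs(
    hourly,
    last_success=datetime(2024, 1, 1, 0, 0),
    now=datetime(2024, 1, 1, 4, 30),
)
for period, workflow in runs:
    print(period.isoformat(), workflow)
```

Processing the oldest period first keeps any workflow that depends on earlier output consistent during the catch-up.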

Page 8: Data Pipeline Management Framework on Oozie

PMF & Oozie Execution Environment

PMF Servers: own the pipeline definition; pass workflow tasks to Oozie through the Oozie client

Oozie Server: executes workflow tasks; manages task status

Hadoop Cluster: workflow definitions are deployed in HDFS; M/R processes run on the cluster

Oozie Console
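The hand-off from the PMF servers to the Oozie server can be sketched as a job submission. The slides mention the Oozie client. This example instead prepares a request for Oozie's web services API from Python, which is a substitution made here for illustration. The server URL, user name, HDFS path, and `inputDir` property are hypothetical, and the endpoint should be checked against the deployed Oozie version. The request is only built, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def job_configuration(properties):
    """Serialize job properties as a Hadoop-style <configuration> document."""
    config = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(config, encoding="utf-8", xml_declaration=True)


def submit_request(oozie_url, properties):
    """Build (but do not send) a request that submits and starts a workflow job."""
    return urllib.request.Request(
        f"{oozie_url}/v1/jobs?action=start",
        data=job_configuration(properties),
        headers={"Content-Type": "application/xml;charset=UTF-8"},
        method="POST",
    )


request = submit_request(
    "http://oozie.example.com:11000/oozie",  # hypothetical Oozie server
    {
        "user.name": "analytics",  # hypothetical user
        # HDFS directory holding the deployed workflow definition (hypothetical path)
        "oozie.wf.application.path": "hdfs://namenode.example.com/apps/hourly-report",
        "inputDir": "/data/campaign/2024-01-01-00",  # hypothetical job property
    },
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. The workflow definition itself stays in HDFS, and only the properties travel with the submission.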

Page 9: Data Pipeline Management Framework on Oozie

Workflow Console

Page 10: Data Pipeline Management Framework on Oozie

Current Workflows

PMF manages three pipelines (hourly pipeline, daily pipeline, and weekly pipeline)

Includes 12 workflows

Map/Reduce Jobs run per month: ~100,000 jobs
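For a sense of scale, the monthly figure can be converted into average rates. The 30-day month is an assumption, and the real load is unlikely to be evenly spread across hours.

```python
JOBS_PER_MONTH = 100_000  # approximate figure from the slide
DAYS_PER_MONTH = 30       # assumption for the estimate

jobs_per_day = JOBS_PER_MONTH / DAYS_PER_MONTH
jobs_per_hour = jobs_per_day / 24

print(f"about {jobs_per_day:.0f} jobs per day, {jobs_per_hour:.0f} jobs per hour")
```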