Nova: Continuous Pig/ Hadoop Workflows


Page 1: Nova: Continuous Pig/ Hadoop  Workflows

NOVA: CONTINUOUS PIG/HADOOP WORKFLOWS

Page 2: Nova: Continuous Pig/ Hadoop  Workflows

storage & processing

Software stack, top to bottom:

- workflow manager, e.g. Nova
- dataflow programming framework, e.g. Pig
- distributed sorting & hashing, e.g. Map-Reduce
- scalable file system, e.g. HDFS

Page 3: Nova: Continuous Pig/ Hadoop  Workflows

Nova Overview

Nova: a system for batched incremental processing.

Scenarios (Yahoo):
- Ingesting and analyzing user behavior logs
- Building and updating a search index from a stream of crawled web pages
- Processing semi-structured data (news, blogs, etc.)

Key features:
- Two-layer programming model (Nova over Pig)
- Continuous processing
- Independent scheduling
- Cross-module optimization
- Manageability features

Page 4: Nova: Continuous Pig/ Hadoop  Workflows

Continuous Processing

- Nova: the outer workflow manager layer. It deals with graphs of interconnected Pig programs, with data passing between them in a continuous fashion.
- Pig/Hadoop: the inner layer. It merely transforms static input data into static output data.

Nova keeps track of “delta” data and routes it to the workflow components in the right order.

[Figure labels: Input, Output, Delta]

Page 5: Nova: Continuous Pig/ Hadoop  Workflows

Independent Scheduling

Different portions of a workflow may be scheduled at different times and rates.
- Global link analysis algorithms may be run only occasionally, because they are costly and consumers tolerate some staleness.
- The components that ingest, tag, and index new news articles need to operate continuously.

Page 6: Nova: Continuous Pig/ Hadoop  Workflows

Cross-module optimization
Nova can identify and exploit certain optimization opportunities, e.g.:
- Two components read the same input data at the same time.
- Pipelining: the output of one module is the input of a subsequent module => avoid materializing the intermediate result.

Manageability features
- Manage workflow programming and execution.
- Support debugging and keep track of versions of workflow components.
- Capture data provenance and emit notifications of key events.

Page 7: Nova: Continuous Pig/ Hadoop  Workflows

Workflow Model

Workflow
- Two kinds of vertices: tasks (processing steps) and channels (data containers).
- Edges connect tasks to channels and vice versa.

[Task] Consumption mode:
- ALL: read a complete snapshot
- NEW: read only data that is new since the last invocation

[Task] Production mode:
- B: emit a new complete snapshot
- Delta: emit new data that augments any existing data

Page 8: Nova: Continuous Pig/ Hadoop  Workflows

Workflow Model

[Task] Four common patterns of processing

- Non-incremental (template detection): process data from scratch every time.

- Stateless incremental (shingling): process new data only; each data item is handled independently.

- Stateless incremental with lookup table (template tagging): process new data independently, possibly consulting a side lookup table for reference.

- Stateful incremental (de-duping): process new data while maintaining and referencing some state derived from prior input data.
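
The difference between the stateless and stateful incremental patterns can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `shingle` and `DeDuper` are hypothetical names, shingling is reduced to word k-grams, and de-duping is reduced to exact-match filtering. None of this is Nova or Pig code.

```python
def shingle(doc, k=3):
    """Stateless incremental: the result depends only on the item itself."""
    words = doc.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


class DeDuper:
    """Stateful incremental: keeps state derived from prior input data."""

    def __init__(self):
        self.seen = set()  # state carried across invocations

    def process(self, new_records):
        """Return only the records not seen in this or any earlier call."""
        fresh = []
        for rec in new_records:
            if rec not in self.seen:
                self.seen.add(rec)
                fresh.append(rec)
        return fresh
```

A stateless task such as `shingle` can be invoked on each new item in isolation, whereas `DeDuper.process` gives different answers for the same input depending on what earlier invocations have already consumed.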

Page 9: Nova: Continuous Pig/ Hadoop  Workflows

Workflow Model (Cont.)

Data and Update Model
Blocks: a channel’s data is divided into blocks, which vary in size.
- Blocks are atomic units (a block is either processed entirely or discarded).
- Blocks are immutable.

Base block
- Contains a complete snapshot of the data on a channel as of some point in time.
- Base blocks are assigned increasing sequence numbers ($B_0, B_1, B_2, \ldots, B_n$).

Delta block
- Used in conjunction with incremental processing.
- Contains instructions for transforming a base block into a new base block: $\Delta_{i \to j}$ transforms $B_i$ into $B_j$ ($i < j$).

Page 10: Nova: Continuous Pig/ Hadoop  Workflows

Workflow Model (Cont.)

Data and Update Model
Operators:
- Merging: combine a base block with a delta block to produce a new base block.
- Diffing: compare two base blocks to create a delta block.
- Chaining: combine multiple delta blocks.

Upsert model: leverages the presence of a primary key attribute to encode updates and inserts in a uniform way. With upserts, delta blocks are comprised of records to be inserted, with each one displacing any pre-existing record with the same key => only the most recent record with a given key is retained.
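
Under the upsert model, the three operators reduce to simple key-based operations. The sketch below assumes each block is a Python dict from primary key to record. The function names `merge`, `diff`, and `chain` are illustrative, and `diff` additionally assumes that no key is ever deleted, since upserts alone cannot encode deletions.

```python
def merge(base, delta):
    """Merging: apply a delta block to a base block, giving a new base block."""
    # Records in the delta displace base records with the same key.
    return {**base, **delta}


def chain(*deltas):
    """Chaining: combine consecutive delta blocks into a single delta block."""
    combined = {}
    for delta in deltas:  # later deltas win for a given key
        combined.update(delta)
    return combined


def diff(old_base, new_base):
    """Diffing: build the delta that turns old_base into new_base.

    Assumption: no key present in old_base is missing from new_base.
    """
    return {
        key: rec
        for key, rec in new_base.items()
        if key not in old_base or old_base[key] != rec
    }
```

Because every function returns a fresh dict and never mutates its arguments, the sketch also respects the rule that blocks are immutable.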

Page 11: Nova: Continuous Pig/ Hadoop  Workflows

Workflow Model (Cont.)

Task/Data Interface:

[Task] Consumption mode:
- ALL: read a complete snapshot
- NEW: read only data that is new since the last invocation

[Task] Production mode:
- B: emit a new complete snapshot
- Delta: emit new data that augments any existing data
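
One way to picture the two consumption modes is a per-edge cursor over an append-only channel. The sketch below is an assumption-laden illustration, not Nova's implementation: `Channel` and `InputEdge` are hypothetical names, and every block is treated as a plain batch of new records.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Hypothetical append-only channel holding immutable blocks."""
    blocks: list = field(default_factory=list)

    def append(self, records):
        # Store each block as a tuple so it cannot be modified later.
        self.blocks.append(tuple(records))


@dataclass
class InputEdge:
    """Hypothetical channel-to-task edge with a consumption mode and a cursor."""
    channel: Channel
    mode: str        # "ALL" or "NEW"
    cursor: int = 0  # number of blocks consumed so far

    def read(self):
        if self.mode == "ALL":
            start = 0            # complete snapshot every time
        elif self.mode == "NEW":
            start = self.cursor  # only blocks added since the last invocation
        else:
            raise ValueError(f"unknown consumption mode: {self.mode}")
        selected = self.channel.blocks[start:]
        self.cursor = len(self.channel.blocks)
        return [rec for block in selected for rec in block]
```

Two tasks can read the same channel through separate edges, each with its own cursor, so a NEW consumer never affects what an ALL consumer sees.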

Page 12: Nova: Continuous Pig/ Hadoop  Workflows

Workflow Model (Cont.)

Workflow Programming and Scheduling

Workflow programming starts with task definitions, which are then composed into “workflowettes”. Workflowettes have ports to which input and output channels may be connected. A workflowette with channels attached to its input and output ports is a bound workflowette.

Three types of trigger can be associated with a workflowette:

- Data-based trigger
- Time-based trigger
- Cascade trigger
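
The slide only names the three trigger types, so the following sketch is a guess at what each name suggests rather than a description of Nova's scheduler. All class names, the `should_fire` method, and the layout of the `state` dict are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataTrigger:
    """Assumed meaning: fire when new data has arrived on an input port."""
    port: str

    def should_fire(self, state):
        return state["new_blocks"].get(self.port, 0) > 0


@dataclass
class TimeTrigger:
    """Assumed meaning: fire once at least `period` seconds have elapsed."""
    period: float

    def should_fire(self, state):
        return state["now"] - state["last_run"] >= self.period


@dataclass
class CascadeTrigger:
    """Assumed meaning: fire after another workflowette has completed."""
    upstream: str

    def should_fire(self, state):
        return self.upstream in state["completed"]
```

A scheduler loop would evaluate the trigger of each bound workflowette against the current state and launch those whose `should_fire` returns true.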

Page 13: Nova: Continuous Pig/ Hadoop  Workflows

Workflow Model (Cont.)

Data Compaction and Garbage Collection

Data blocks are immutable, and channels accumulate data blocks, so a channel can grow without bound.

- If a channel has blocks $B_0, \Delta_{0 \to 1}, \Delta_{1 \to 2}, \Delta_{2 \to 3}$, the compaction operation computes $B_3$ and adds it to the channel.

- After compaction has added $B_3$ to the channel, if the current cursor is at sequence number 2, then $B_0, \Delta_{0 \to 1}, \Delta_{1 \to 2}$ can be garbage-collected.
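
The compaction and garbage-collection example can be traced with a small sketch. It assumes the upsert model (blocks are dicts keyed by primary key), a single consumer cursor, and at most one delta starting at each sequence number. The names `compact` and `garbage_collect` and the dict-based channel layout are hypothetical.

```python
def compact(bases, deltas):
    """Fold the deltas that follow the newest base into a new base block.

    bases:  {sequence number: snapshot dict}
    deltas: {(i, j): upsert dict turning base i into base j}
    Returns the sequence number of the base block that was added.
    """
    seq = max(bases)
    snapshot = dict(bases[seq])  # copy, since blocks are immutable
    next_seq = {i: j for (i, j) in deltas}
    while seq in next_seq:
        nxt = next_seq[seq]
        snapshot.update(deltas[(seq, nxt)])
        seq = nxt
    bases[seq] = snapshot
    return seq


def garbage_collect(bases, deltas, cursor):
    """Drop blocks that are no longer needed once a newer base exists."""
    newest = max(bases)
    for seq in [s for s in bases if s < newest]:
        del bases[seq]  # older snapshots are superseded by the newest base
    for key in [k for k in deltas if k[1] <= cursor]:
        del deltas[key]  # deltas at or before the cursor are already consumed
```

With a base at sequence number 0 and deltas for 0→1, 1→2, and 2→3, `compact` adds the base for sequence number 3. Calling `garbage_collect` with the cursor at 2 then leaves only that new base and the 2→3 delta, which a NEW-mode consumer at cursor 2 still has to read.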

Page 14: Nova: Continuous Pig/ Hadoop  Workflows

Tying the model to Pig/Hadoop

• Each data block resides in an HDFS file. Metadata maintains the mapping from blocks to files.

• The notion of a channel exists only in metadata.

• Each task is a Pig program.
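
A minimal sketch of this arrangement follows, assuming the metadata is just two lookup tables. The class name `Metadata`, its methods, and the example paths are invented for illustration and say nothing about how Nova actually stores its metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Hypothetical metadata store: channels exist only as entries here."""
    block_to_file: dict = field(default_factory=dict)   # block id -> HDFS path
    channel_blocks: dict = field(default_factory=dict)  # channel -> block ids

    def register(self, channel, block_id, hdfs_path):
        """Record that a block of the given channel lives in the given file."""
        self.block_to_file[block_id] = hdfs_path
        self.channel_blocks.setdefault(channel, []).append(block_id)

    def input_files(self, channel):
        """Resolve a channel to the files a Pig program would be pointed at."""
        return [self.block_to_file[b] for b in self.channel_blocks.get(channel, [])]
```

Since a channel is nothing more than a list of block ids, running a task amounts to resolving its input channels to file paths and handing those paths to the Pig program.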


Page 16: Nova: Continuous Pig/ Hadoop  Workflows

Nova System Architecture