Big Data for Dummies using DataStage
By Peter Bjelvert, InfoSphere Architect, Middlecon AB


ETL – Relational DB

Extract, Transform in DataStage, Load

Your powerful DataStage server handles all complex transformations; the database is used only for reading and writing.
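The ETL pattern above can be sketched in a few lines of Python. This is only an illustration, not DataStage itself: SQLite stands in for the relational database, and the table names `branch_src` and `branch_tgt` and the sample rows are assumptions made for the example. The point is that the database only answers a plain read and a plain write, while all transformation logic runs in the ETL process.

```python
import sqlite3


def run_etl(conn):
    """Extract rows, transform them in the ETL process, load the result."""
    # Extract: the database only serves a plain read.
    rows = conn.execute("SELECT city, state, zip FROM branch_src").fetchall()

    # Transform: trimming, upper-casing and de-duplication happen here,
    # outside the database.
    cleaned = sorted({(city.strip().upper(), state.upper(), zip_code)
                      for city, state, zip_code in rows})

    # Load: the database only receives a plain write.
    conn.executemany("INSERT INTO branch_tgt VALUES (?, ?, ?)", cleaned)
    return len(cleaned)


# Hypothetical sample data, used only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE branch_src (city TEXT, state TEXT, zip TEXT)")
conn.execute("CREATE TABLE branch_tgt (city TEXT, state TEXT, zip TEXT)")
conn.executemany(
    "INSERT INTO branch_src VALUES (?, ?, ?)",
    [(" Malmo", "se", "21119"), ("malmo ", "SE", "21119"), ("Lund", "se", "22100")],
)
loaded = run_etl(conn)
```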

ELT – Relational DB

Extract, Load with Transform

If you have powerful database servers, you can push much of the work down to the database; DataStage then mostly controls the flow.
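For contrast, a minimal ELT sketch under the same assumptions (SQLite as a stand-in database, hypothetical tables `branch_stage` and `branch_tgt`): the raw rows are loaded first, and a single set-based SQL statement does the transformation inside the database. The client code only controls the flow.

```python
import sqlite3


def run_elt(conn):
    """Let the database transform rows that were already loaded raw."""
    # One set-based statement: the database does the trimming,
    # upper-casing and de-duplication.
    conn.execute("""
        INSERT INTO branch_tgt (city, state, zip)
        SELECT DISTINCT UPPER(TRIM(city)), UPPER(state), zip
        FROM branch_stage
    """)
    return conn.execute("SELECT COUNT(*) FROM branch_tgt").fetchone()[0]


# Hypothetical sample data, used only to make the sketch runnable.
elt_conn = sqlite3.connect(":memory:")
elt_conn.execute("CREATE TABLE branch_stage (city TEXT, state TEXT, zip TEXT)")
elt_conn.execute("CREATE TABLE branch_tgt (city TEXT, state TEXT, zip TEXT)")
elt_conn.executemany(
    "INSERT INTO branch_stage VALUES (?, ?, ?)",
    [(" Malmo", "se", "21119"), ("malmo ", "SE", "21119"), ("Lund", "se", "22100")],
)
elt_count = run_elt(elt_conn)
```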

Balanced Optimization

Bal. Opt. creates a second copy of the job that pushes everything into the target. It creates one big SQL statement.

Bal. Opt. creates a new copy of the job that pushes the load into the source and the target.

Use DataStage Balanced Optimization to select how to push the load:
- To Source
- To Target
- To Both

The DataStage job is re-written into SQL code.
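To make the idea of rewriting a job into SQL concrete, here is a toy sketch. It is not how Balanced Optimization works internally; the function `to_pushdown_sql` and its parameters are hypothetical and only show how a few declarative job steps (select columns, remove duplicates, filter, write to a target) can collapse into one statement that the database executes.

```python
def to_pushdown_sql(source_table, columns, distinct=False, where=None,
                    target_table=None):
    """Build one SQL statement from a simple, declarative job description."""
    select = "SELECT DISTINCT" if distinct else "SELECT"
    column_list = ", ".join(columns)
    sql = f"{select} {column_list} FROM {source_table}"
    if where:
        # Filter step pushed into the source query.
        sql += f" WHERE {where}"
    if target_table:
        # Write step pushed into the target: read and write in one statement.
        sql = f"INSERT INTO {target_table} ({column_list}) {sql}"
    return sql


# Table and column names taken from the SQL fragment shown on the slides.
sql = to_pushdown_sql(
    "JK_BANK2.BANK_BRANCH",
    ["BRANCH_CITY", "BRANCH_STATE", "BRANCH_ZIP"],
    distinct=True,
)
```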

Balanced Optimization feature of DataStage

ETL: DataStage is doing the main work.

ELT – PushDown (DB): the DB server is doing the main job. Bal. Opt. creates a new copy of the job with SQL code: SELECT * FROM (SELECT DISTINCT BRANCH_CITY, BRANCH_STATE, BRANCH_ZIP FROM JK_BANK2.BANK_BRANCH) AS A, (SELECT DISTINCT BRANCH_CITY,

Hadoop Distributed File System - HDFS

Application Layer

Workload mgmt Layer

Data Layer

One file, 3 copies
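The "one file, 3 copies" idea can be sketched as follows: the file is split into blocks and every block is stored on three different nodes. The round-robin placement, the node names and the sizes below are simplifying assumptions for illustration; real HDFS placement is rack-aware and more involved.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Assign each block of a file to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Simplified round-robin choice; consecutive nodes are always distinct.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


# Hypothetical six-node cluster and a 300-unit file with 128-unit blocks.
nodes = [f"node{i}" for i in range(1, 7)]
placement = place_blocks(file_size=300, block_size=128, nodes=nodes)
```

With these inputs the file is cut into three blocks, and losing any single node still leaves two copies of every block.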

MapReduce example
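As a small stand-in for the MapReduce example, here is the classic word count written as plain Python functions. It runs in a single process, so it only shows the shape of the map, shuffle and reduce phases, not the distribution over nodes that Hadoop provides. The input lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


lines = ["big data for dummies", "big data using DataStage"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```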

Hadoop application stack

Application Layer

Workload mgmt Layer

Data Layer: HDFS

MapReduce

Jaql, AQL …

IBM’s Hadoop implementation

ETL – HDFS

Extract, Transform in DataStage, Load

[Diagram: two HDFS clusters, six nodes each]

Your powerful DataStage server can read and write to the distributed file system.

DataStage HDFS example

Read and write to a Hadoop system using the new BDFS stage

ELT – Hadoop system

Extract, Load with Transform

Use DataStage Balanced Optimization to select how to push the load:
- To Source
- To Target
- To Both

The DataStage job is re-written into Jaql code.

[Diagram: two Hadoop clusters, six nodes each]

DataStage Jaql example

Bal. Opt. creates a second copy of the job that pushes everything into the target. It creates one big Jaql statement.


HDFS

ETL: DataStage is doing the main work.

ELT – PushDown: Bal. Opt. creates a new copy of the job with Jaql code:

setOptions({conf:{"mapred.job.name":"DataStage BalOp job BIGDATA:dstage1 ff_read_write_to_hadoop_jaql_balopt_join CustomerTarget 16_#DSJobInvocationId#"}});
setOptions({conf:{"mapred.reduce.tasks":1}});

The Hadoop application server executes the Jaql code on all nodes.

[Diagram: BDFS (six nodes) and Hadoop (six nodes)]

Extract, transform and filter in DataStage. Load good data into HDFS.

[Diagram: BDFS, six nodes]

DataStage can read from many different sources. Convert common data (such as time/date) to facilitate the queries that follow. Send unwanted data to garbage.
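The cleansing step described above can be sketched like this: dates arriving in different formats are converted to one ISO format, and records that cannot be converted are routed to a reject list instead of being loaded. The accepted formats in `DATE_FORMATS`, the record layout and the sample records are assumptions for the example, not something taken from the slides.

```python
from datetime import datetime

# Assumed input formats; a real job would list the formats its sources use.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def split_good_and_rejects(records):
    """Route convertible records to `good` and the rest to `rejects`."""
    good, rejects = [], []
    for record in records:
        iso = normalize_date(record.get("date", ""))
        if iso is None:
            rejects.append(record)  # unwanted data goes to garbage
        else:
            good.append({**record, "date": iso})
    return good, rejects


# Hypothetical records from different sources.
records = [
    {"id": 1, "date": "2013-05-17"},
    {"id": 2, "date": "17/05/2013"},
    {"id": 3, "date": "20130517"},
    {"id": 4, "date": "not a date"},
]
good, rejects = split_good_and_rejects(records)
```

Only the `good` list would then be written to HDFS, so later queries can rely on a single date format.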

A good scenario for DS customers

Analytic functions: AQL …

LIVE DEMO

Handling Big Data without angst