Luncheon Webinar Series May 13, 2013 Sponsored By -...


Luncheon Webinar Series May 13, 2013

InfoSphere DataStage is Big Data Integration

Sponsored By:

Presented by: Tony Curcio, InfoSphere Product Management


Questions and suggestions regarding presentation topics? Send them to [email protected]

Downloading the presentation

– Click Presentation YES on Poll Question

– A replay will be available within one day; details will follow by email

Bonus Offer – Free premium membership for your DataStage management! Submit your management’s email address and we will offer him/her access on your behalf.

– Email [email protected] with the subject line “Managers special”.

– Join us all on LinkedIn: http://tinyurl.com/DSXmembers

– ISXchange will sponsor trial membership for new requests at the LinkedIn DSX members site


© 2013 IBM Corporation


Tony Curcio

InfoSphere Product Management


Bigger Data Integration Challenges

New types of data stores

– Big Data introduces additional data stores that need to be integrated, both Hadoop-based and NoSQL-based

– These data stores don’t easily lend themselves to conventional methods for data movement

New data types and formats

– Unstructured data; poly-structured data stores; JSON, Avro, and more to come

– Video, docs, web logs, …

Larger volumes

– Solutions need to move, transform, cleanse and otherwise prepare huge data volumes

– Big Data requires data scalability

Benefits of InfoSphere DataStage

Speeds Productivity: Graphical design is easier to use than hand coding

Promotes Object Reuse: Build once, share, and run anywhere (ETL/ELT/real-time)

Simplifies Heterogeneity: Common method for diverse data sources

Reduces Operational Cost: Provides a robust framework to manage data integration

Shortens Project Cycles: Pre-built components reduce cost and timelines

Protects from Changes: Isolation from underlying technologies as they continue to evolve

Big Data is part of the Information Supply Chain

[Figure: Information Supply Chain. Stages: Integrate, Manage, Analyze, Govern. Labels: Transactional & Collaborative Applications; Content; Streaming Information; External Information Sources; Data; Data Warehouses; Master Data; Big Data; Cubes; Streams; Business Analytics Applications. Information Governance: Quality, Security & Privacy, Lifecycle, Standards.]

Gartner Magic Quadrant

“IBM is the only DBMS vendor that can offer an information architecture across the entire organization, covering information on all systems”

4 Key Analytical Use Cases for Big Data

• Operations Analysis: Analyze a variety of machine data for improved business results

• Enhanced 360° View of the Customer: Extend existing customer views by incorporating additional information sources

• Data Warehouse Augmentation: Integrate big data and data warehouse capabilities to increase operational efficiency

• Big Data Exploration: Find, visualize, understand all big data to improve decision making

Data Warehouse Augmentation

Integrate big data and data warehouse capabilities to increase operational efficiency

Challenges

• Leveraging structured, unstructured, and streaming data sources for deep analysis

• Low latency requirements

• Query access to data

• Optimizing warehouse for big data volumes

• Metadata management to support impact analysis and data lineage

Required capabilities

• Data Integration Hub Processing

– High-speed, massively scalable read from and write to big data sources and new data

• Big Data Expert

– Automatically build MapReduce logic through simple data flow design and coordinate workflow across traditional and big data platforms


“Connectivity Hub”

InfoSphere DataStage

Effectively handle the complexity of enterprise information sources and types with a common design paradigm across a heterogeneous landscape, and with a high-speed, scalable solution that speeds the delivery of analytics.

[Figure: the same data flow (Source Data, Transform, Cleanse, Enrich, EDW) shown on three topologies: Uniprocessor (sequential), SMP System (4-way parallel), and MPP Clustered System (64-way parallel). Hardware labels: Disk, CPU, Memory, Shared Memory.]

Dynamic

Instantly get better performance as hardware resources are added to any topology.

Extendable

Add a new server to scale out through a simple text file edit (or, in a grid configuration, automatically via integration with grid management software).

Data Partitioned

In true MPP fashion (like Hadoop), data persisted in the data integration platform is stored in parallel to scale out the I/O.

Hadoop Integrated

Push all or parts of the process out to Hadoop to take advantage of its scalability in ELT fashion.
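The "Data Partitioned" point can be illustrated with a minimal Python sketch of hash partitioning. This is not DataStage code; the function names, the `customer` field and the CRC32 hash are assumptions chosen for illustration. It shows only the general idea that the same rows can be spread over 4 or 8 partitions without changing the logic.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32), so every run agrees."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(rows, key_field, num_partitions):
    """Split rows into num_partitions lists; equal keys always share a partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_of(row[key_field], num_partitions)].append(row)
    return [partitions[i] for i in range(num_partitions)]


# Hypothetical sample data: 100 rows spread over 10 customers.
rows = [{"customer": f"c{i % 10}", "amount": i} for i in range(100)]

# The same call scales from 4-way to 8-way by changing one number.
four_way = hash_partition(rows, "customer", 4)
eight_way = hash_partition(rows, "customer", 8)
```

Because rows with the same key always land in the same partition, each partition can be processed independently, which is what lets the I/O and the computation scale out together.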


Big Data Source Types

Hadoop Distributed File System: massively scalable and resilient storage

NoSQL (not-only SQL): record storage optimized for read (or write)

InfoSphere Streams: massive real-time analytics

Blazing Fast HDFS

• Available since v8.7 in 2011

• Extends the simple flat file paradigm: just add your Hadoop server name and port number

• Parallelization techniques to pipe data in and out at massive scale

• A performance study ran at up to 5.2 TB/hr before the HDFS disks were completely saturated (5-node Hadoop cluster)
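To put the 5.2 TB/hr figure in more familiar units, here is a small worked calculation in Python. It assumes decimal units (1 TB = 10^12 bytes) and an even spread of load across the 5 nodes. Neither assumption is stated on the slide.

```python
TB = 10**12  # assumption: decimal terabytes
reported_tb_per_hour = 5.2  # figure quoted in the performance study
nodes = 5  # size of the Hadoop cluster in the study

bytes_per_second = reported_tb_per_hour * TB / 3600
gb_per_second = bytes_per_second / 10**9
mb_per_second_per_node = bytes_per_second / nodes / 10**6  # assumes an even spread

print(f"{gb_per_second:.2f} GB/s aggregate")
print(f"{mb_per_second_per_node:.0f} MB/s per node")
```

Under those assumptions the rate works out to roughly 1.44 GB/s in aggregate, or about 289 MB/s per node.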

Simple data flow design for HDFS

• Read from an HDFS file in parallel

• Transform/restructure the data

• Join two HDFS files

• Create new HDFS file, fully parallelized

Agile Connector Accelerators for NoSQL

• New connectors available on developerWorks

• Plug into InfoSphere DataStage and operate just like any other stage

• Include features to exploit specific data sources

Open Code

Sample Job with MongoDB and Hive

• Selects what HDFS data to send downstream

• Writing data to Hive

• Writing data to MongoDB

• Accepts specific MongoDB directives

Parse and Compose JSON (beta)

• Parsing and composing of JSON data format

• Builds on the advanced transformation framework already provided for XML capabilities

• Beta available on InfoSphere DataStage 9.1 FP1
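As a rough illustration of what "parse" and "compose" mean for JSON, the sketch below uses Python's standard `json` module. It does not show the DataStage stage itself. The document layout (`customer`, `items`, `sku`, `qty`) and the function names are hypothetical.

```python
import json


def parse_orders(document):
    """Parse a nested JSON document into flat rows, one per line item."""
    data = json.loads(document)
    return [
        {"customer_id": data["customer"]["id"], "sku": item["sku"], "qty": item["qty"]}
        for item in data["items"]
    ]


def compose_orders(rows):
    """Compose flat rows back into nested JSON, one document per customer."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["customer_id"], []).append(
            {"sku": row["sku"], "qty": row["qty"]}
        )
    docs = [{"customer": {"id": cid}, "items": items} for cid, items in grouped.items()]
    return json.dumps(docs, sort_keys=True)


# Hypothetical sample document.
sample = '{"customer": {"id": 42}, "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}'
rows = parse_orders(sample)
round_trip = json.loads(compose_orders(rows))
```

Parsing flattens the hierarchy into rows that relational-style stages can process; composing rebuilds the hierarchy on the way out.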

“Big Data Expert”

InfoSphere DataStage

Automatically push transformational processing close to where the data resides, both SQL for DBMS and MapReduce for Hadoop, leveraging the same simple data flow design process, and coordinate workflow across all platforms.

Automated MapReduce Job Generation

• New in 9.1: leverage the same UI and the same stages to build MapReduce.

• Drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming.

• Push the processing to Hadoop for patterns where you don’t want to transport the data over the network.

• Build integration jobs with the same data flow tool and stages.

• Automatically creates MapReduce code.
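For readers unfamiliar with the MapReduce pattern that such generated jobs follow, here is a minimal in-memory Python sketch of the three phases. It is a conceptual illustration only and not the code DataStage generates. The `region` and `amount` fields are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for record in records:
        yield record["region"], record["amount"]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values to a single result."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input records.
sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]
totals = reduce_phase(shuffle(map_phase(sales)))
```

An aggregation drawn as a single stage on the canvas corresponds to this whole map, shuffle and reduce sequence, which is the hand coding the graphical design is meant to avoid.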



• Job includes other database on separate system

• Recognizes what processing can run natively in Hadoop and what requires DataStage engine to move the data
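The split between "runs natively in Hadoop" and "requires the engine" can be sketched with a toy rule in Python. The rule here is an assumption made for illustration and is not the product's actual optimizer logic: a stage is pushed to Hadoop only if all of its inputs already live there. The stage and system names are hypothetical.

```python
def plan_execution(stages):
    """Label each stage 'hadoop' when all of its inputs live in Hadoop, else 'engine'."""
    return {
        stage["name"]: "hadoop"
        if all(source == "hadoop" for source in stage["inputs"])
        else "engine"
        for stage in stages
    }


# Hypothetical job: two Hadoop-only stages and one lookup against a separate database.
job = [
    {"name": "filter_clicks", "inputs": ["hadoop"]},
    {"name": "join_clicks_sessions", "inputs": ["hadoop", "hadoop"]},
    {"name": "lookup_customer", "inputs": ["hadoop", "warehouse_db"]},
]
plan = plan_execution(job)
```

In this sketch the lookup against the separate database is the only stage left to the engine, since it has to move data between systems.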

Architecture for Warehouse Landing Zone

[Figure: warehouse landing zone architecture. Inputs: all sources (clickstream, sensors, transactions, content). Landing Zone: BigInsights / Hadoop (JAQL, Hive, HBase, Custom MR). Other zones: Operational Warehouse Zone, Analytics Warehouse Zone. Other labels: Information Server, ETL, Replication, Lineage, Quality, Optim, Masking, Guardium.]

Use Case Requirements: Data Warehouse Landing Zone

• Large Scale – large data volumes; scale-out requires an open MPP platform

• Low Cost – low-cost storage, compute and commodity hardware

• Many Data Types – un/semi-structured and social data type coverage

• Many Access Patterns – exploratory, iterative and discovery oriented

Combined Workflows for Big Data

Oozie Integration

– Same design paradigm for workflows as for job design.

– Directly call an Oozie activity that is invoking custom MapReduce code.

End-to-end Workflows

– Sequence right alongside other data integration and analytics activities.

– Allows users to have the data sourcing, ETL, analytics and delivery of information all controlled through a single process.

– Monitor all stages through the Operations Console’s web-based interface.

Cross Tool Impact Analysis and Traceability

• Understand how traditional and big data sources are being used

• Assess impact of change and mitigate risks

• Show impact on downstream applications and BI reports

• Navigate through impacted areas and drill down
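Impact analysis of this kind amounts to walking a lineage graph downstream from the asset that changes. The Python sketch below assumes lineage is available as a simple "feeds into" mapping. The asset names and the `downstream_impact` helper are hypothetical and do not reflect the metadata repository's real API.

```python
from collections import deque


def downstream_impact(lineage, changed):
    """Breadth-first walk of 'feeds into' edges; returns every asset reachable from changed."""
    impacted = set()
    queue = deque([changed])
    while queue:
        asset = queue.popleft()
        for consumer in lineage.get(asset, []):
            if consumer not in impacted:
                impacted.add(consumer)
                queue.append(consumer)
    return impacted


# Hypothetical lineage spanning Hadoop, a database, a warehouse and a BI report.
lineage = {
    "hdfs:/landing/clicks": ["job:sessionize"],
    "job:sessionize": ["hive:sessions"],
    "hive:sessions": ["job:load_edw"],
    "db:customers": ["job:load_edw"],
    "job:load_edw": ["edw:customer_activity"],
    "edw:customer_activity": ["report:weekly_engagement"],
}
impacted = downstream_impact(lineage, "hive:sessions")
```

Here a change to the Hive table reaches the load job, the warehouse table and the BI report, while assets that only sit upstream are left out.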

Wrap-up

The IBM Big Data Platform

New analytic applications drive the requirements for a big data platform

• Integrate and manage the full variety, velocity and volume of data

• Apply advanced analytics to information in its native form

• Visualize all available data for ad-hoc analysis

• Development environment for building new analytic applications

• Workload optimization and scheduling

• Security and Governance

[Figure: BIG DATA PLATFORM. Components: Systems Management, Application Development, Discovery, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Information Integration & Governance. Data types: Data, Media, Content, Machine, Social.]

Information Integration & Governance for Big Data

Integrate & Link Big Data

• Big Data as a Source

• Big Data as a Target

• Data Transformations

• Data Movement

• Integrate w/existing Enterprise

• Data Lineage & Impact Analysis

• Metadata Integration w/Analytics

• Realtime & Data Federation

Protect Big Data

• Activity Monitoring

• Data Masking

• Data Encryption

• On-Demand / In-Place Protection

• In-Line Protection (w/ETL etc.)

• Active Detection & Alerting

Audit & Archive Big Data

• Queryable Archive

• Structured and Semi-Structured

• Optimized Connectors to existing Apps

• Hot-Restorable On-the-Fly

• Immutable and Secure Access

• Automated Legal Hold Capability for Data Freeze

Cleanse and Validate Big Data

• Accuracy and Entity Matching with Social Data

• De-duplication and Standardization of Machine Data

• In-line Cleansing with Integration

• Trusted Data Dashboard and Reporting on Data Quality

Master Big Data

• Big Data as a Supplier

• Big Data as a Consumer

• Links between Big Data and Trusted Golden Records

• Leverage Master Data in Big Data Analytics

• Entity Resolution at Extreme Scale Out Levels

• Probabilistic Entity Matching

Where to go to learn more…

If you’d like to explore this topic further…

– Contact your IBM account team or your preferred IBM Partner.

If you’d like to explore more about InfoSphere DataStage and the Information Server platform…

– http://www-01.ibm.com/software/data/integration/info_server/

If you’re looking for an enterprise-level Hadoop distribution…

– InfoSphere BigInsights: http://www-01.ibm.com/software/data/infosphere/biginsights/




http://w3.tap.ibm.com/medialibrary/media_view?id=178596





Thanks
