Luncheon Webinar Series May 13, 2013
InfoSphere DataStage is Big Data Integration
Sponsored By:
Presented by :
Tony Curcio, InfoSphere
Product Management
Questions and suggestions regarding presentation topics? - send
Downloading the presentation
– Click Presentation YES on Poll Question
– Replay will be available within one day; details will follow by email
Bonus Offer – Free premium membership for your DataStage
Management! Submit your management’s email address and we
will offer him/her access on your behalf.
– Email [email protected] subject line “Managers special”.
– Join us all at LinkedIn http://tinyurl.com/DSXmembers
– ISXchange will sponsor trial membership for new requests at the LinkedIn DSX members site
© 2013 IBM Corporation
Bigger Data Integration Challenges
New types of data stores
• Big Data introduces additional data stores that need to be integrated, both Hadoop based and noSQL based
• These data stores don’t easily lend themselves to conventional methods for data movement
New data types and formats
• Unstructured data; poly-structured data stores; JSON, Avro, and more to come
• Video, docs, web logs, …
Larger volumes
• Solutions need to move, transform, cleanse and otherwise prepare huge data volumes
• Big Data requires data scalability
Benefits of InfoSphere DataStage
• Speeds Productivity – Graphical design easier to use than hand coding
• Promotes Object Reuse – Build once, share, and run anywhere (etl/elt/real-time)
• Simplifies Heterogeneity – Common method for diverse data sources
• Reduces Operational Cost – Provides a robust framework to manage data integration
• Shortens Project Cycles – Pre-built components reduce cost and timelines
• Protects from Changes – Isolation from changes in underlying technologies as they continue to evolve
Big Data is part of the Information Supply Chain
[Diagram labels: Analyze, Integrate, Manage, Govern; Business Analytics Applications; External Information Sources; Cubes; Streams; Big Data; Master Data; Content; Data; Streaming Information; Data Warehouses; Transactional & Collaborative Applications; Information Governance: Quality, Security & Privacy, Lifecycle, Standards]
Gartner Magic Quadrant
“IBM is the only DBMS vendor that can offer an information architecture across the
entire organization, covering information on all systems”
4 Key Analytical Use Cases for Big Data
• Analyze a variety of machine data for improved business results
• Extend existing customer views by incorporating additional information sources
• Integrate big data and data warehouse capabilities to increase operational efficiency
• Find, visualize, understand all big data to improve decision making
Use cases:
• Big Data Exploration
• Data Warehouse Augmentation
• Operations Analysis
• Enhanced 360° View of the Customer
Integrate big data and data warehouse capabilities to increase operational efficiency
Challenges
• Leveraging structured, unstructured, and streaming data sources for deep analysis
• Low latency requirements
• Query access to data
• Optimizing warehouse for big data volumes
• Metadata management to support impact analysis and data lineage
Required capabilities
• Data Integration Hub Processing
– High-speed, massively scalable read from and write to big data sources and new data
• Big Data Expert
– Automatically build MapReduce logic through simple data flow design and coordinate workflow across traditional and big data platforms
Data Warehouse Augmentation
Data Integration Hub Processing
“Connectivity Hub”
InfoSphere DataStage
Effectively handle the complexity of enterprise information sources and types with a common design paradigm across a heterogeneous landscape, and with a high-speed, scalable solution that speeds the delivery of analytics.
[Diagram: the same data flow (Source Data, Transform, Cleanse, Enrich, EDW) shown on three topologies: Uniprocessor (Sequential), SMP System with shared memory (4-way Parallel), and MPP Clustered System (64-way Parallel)]
Dynamic
Instantly get better performance as hardware resources are added to any topology.
Extendable
Add a new server to scale out through a simple text file edit (or, in a grid configuration, automatically via integration with grid management software).
Data Partitioned
In true MPP fashion (like Hadoop), data persisted in the data integration platform is stored in parallel to scale out the I/O.
Hadoop Integrated
Push all or parts of the process out to Hadoop to take advantage of its scalability in ELT fashion.
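To make the "Data Partitioned" idea concrete, here is a minimal Python sketch of hash partitioning, one common way to spread records across parallel partitions. It is an illustration only and not how the DataStage engine is implemented; the function name `hash_partition`, the `customer_id` key, and the partition count of 4 are all assumptions made for the example.

```python
from collections import defaultdict
from zlib import crc32


def hash_partition(records, key, num_partitions):
    """Assign each record to a partition by hashing its key (illustrative only)."""
    partitions = defaultdict(list)
    for record in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings
        index = crc32(str(record[key]).encode("utf-8")) % num_partitions
        partitions[index].append(record)
    return partitions


# Hypothetical sample data: eight records spread over four partitions
records = [{"customer_id": i, "amount": i * 10} for i in range(8)]
parts = hash_partition(records, "customer_id", 4)
for index in sorted(parts):
    print(index, [r["customer_id"] for r in parts[index]])
```

Because records with the same key always land in the same partition, each partition can be read, transformed, and written independently, which is what lets the I/O scale out.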
Big Data Source Types
• Hadoop Distributed File System – massively scalable and resilient storage
• noSQL (not-only SQL) – record storage optimized for read (or write)
• InfoSphere Streams – massive real-time analytics
Blazing Fast HDFS
• Available since v8.7 in 2011
• Extends the simple flat file paradigm: just add your Hadoop server name and port number
• Parallelization techniques to pipe data in and out at massive scale
• Performance study ran up to 5.2 TB/hr before the HDFS disks were completely saturated (5-node Hadoop cluster)
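For a sense of scale, the 5.2 TB/hr figure can be converted to per-second and per-node rates. The short Python calculation below is a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes) and an even spread of load across the 5 nodes, neither of which is stated in the presentation.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes


def throughput_mb_per_s(tb_per_hour):
    """Convert a TB/hr rate into MB/s."""
    return tb_per_hour * TB / 3600 / MB


total = throughput_mb_per_s(5.2)
per_node = total / 5  # assumption: load is spread evenly over the 5 nodes
print(round(total), round(per_node))  # prints "1444 289"
```

Under those assumptions the study corresponds to roughly 1.4 GB/s across the cluster, or a little under 300 MB/s per node.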
Simple data flow design for HDFS
• Read from an HDFS file in parallel
• Transform/restructure the data
• Join two HDFS files
• Create new HDFS file, fully parallelized
Agile Connector Accelerators for noSQL
• New connectors available on developerWorks
• Plugs into InfoSphere DataStage and operates just like any other stage.
• Includes features to exploit specific data sources
Open
Code
Sample Job with MongoDB and Hive
• Selects what HDFS data to send downstream.
• Writing data to Hive
• Writing data to MongoDB
• Accepts specific MongoDB directives
Parse and Compose JSON (beta)
• Parsing and composing of JSON data format
• Includes the advanced transformation framework already provided for XML capabilities
• Beta available on InfoSphere DataStage 9.1 FP1
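As a rough illustration of what "parse" and "compose" mean for JSON in a data flow, the Python sketch below flattens a nested document into rows and then rebuilds the document from those rows. It uses only Python's standard `json` module and is not the DataStage stage itself; the order document and the function names `parse_order` and `compose_order` are hypothetical.

```python
import json

# Hypothetical input document with a nested, repeating "items" group
sample = '{"order_id": "A1", "items": [{"sku": "X", "qty": 2}, {"sku": "Y", "qty": 1}]}'


def parse_order(text):
    """Parse: flatten one nested JSON order into a list of flat rows."""
    doc = json.loads(text)
    return [
        {"order_id": doc["order_id"], "sku": item["sku"], "qty": item["qty"]}
        for item in doc["items"]
    ]


def compose_order(order_id, rows):
    """Compose: rebuild the nested JSON document from flat rows."""
    items = [
        {"sku": row["sku"], "qty": row["qty"]}
        for row in rows
        if row["order_id"] == order_id
    ]
    return json.dumps({"order_id": order_id, "items": items}, sort_keys=True)


rows = parse_order(sample)
print(rows)
print(compose_order("A1", rows))
```

Parsing turns the hierarchical document into relational-style rows that ordinary stages can process; composing does the reverse when the target expects JSON.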
Big Data Expert
“Big Data Expert”
InfoSphere DataStage
Automatically push transformational processing close to where the data resides, both SQL for DBMS and MapReduce for Hadoop, leveraging the same simple data flow design process, and coordinate workflow across all platforms.
Automated MapReduce Job Generation
• New in 9.1, leverage the same UI and the same stages to build MapReduce.
• Drag and drop stages to the canvas to create a job, rather than having to learn MapReduce programming.
• Push the processing to Hadoop for patterns where you don’t want to transport the data over the network.
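For readers who have not written MapReduce by hand, the sketch below simulates the map, shuffle, and reduce phases for a simple group-by-and-sum in plain Python. It only illustrates the programming model that the generated jobs target; it is not code produced by DataStage, and the record layout (`region`, `amount`) and function names are assumptions for the example.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for record in records:
        yield record["region"], record["amount"]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input records
records = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 20},
]
totals = reduce_phase(shuffle(map_phase(records)))
print(totals)
```

Even this small aggregation needs three hand-written phases, which is the effort the drag-and-drop design approach is meant to remove.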
Automated MapReduce Job Generation
• Build integration jobs with the same data flow tool and stages
• Automatically creates MapReduce code.
Automated MapReduce Job Generation
• Job includes another database on a separate system
• Recognizes what processing can run natively in Hadoop and what requires the DataStage engine to move the data
Architecture for Warehouse Landing Zone
[Diagram labels: sources (clickstream, sensors, transactions, content, all sources); BigInsights / Hadoop Landing Zone (JAQL, Hive, HBase, Custom MR); Information Server (ETL, Replication, Lineage, Quality); Optim; Guardium; Masking; Operational Warehouse Zone; Analytics Warehouse Zone]
Use Case Requirements: Data Warehouse Landing Zone
• Large Scale – large data volumes, scale out requires open MPP platform
• Low Cost – low cost storage, compute and commodity hardware
• Many Data Types – un/semi structured and social datatype coverage
• Many Access Patterns – exploratory, iterative and discovery oriented
Combined Workflows for Big Data
Oozie Integration
– Same design paradigm for workflows as for job design.
– Directly call an Oozie activity that is invoking custom MapReduce code.
End-to-end Workflows
– Sequence right alongside other data integration and analytics activities
– Allows users to have the data sourcing, ETL, analytics and delivery of information all controlled through a single process.
– Monitor all stages through Operations Console’s web-based interface
Cross Tool Impact Analysis and Traceability
• Understand how traditional and big data sources are being used
• Assess impact of change and mitigate risks
• Show impact on downstream applications and BI reports
• Navigate through impacted areas and drill down
Wrap-up
New analytic applications drive the requirements for a big data platform
• Integrate and manage the full variety, velocity and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad-hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and Governance
The IBM Big Data Platform
[Diagram labels: BIG DATA PLATFORM; Systems Management; Application Development; Discovery; Accelerators; Hadoop System; Stream Computing; Data Warehouse; Information Integration & Governance; data types: Data, Media, Content, Machine, Social]
Information Integration & Governance for Big Data
Integrate & Link Big Data
• Big Data as a Source
• Big Data as a Target
• Data Transformations
• Data Movement
• Integrate w/existing Enterprise
• Data Lineage & Impact Analysis
• Metadata Integration w/Analytics
• Realtime & Data Federation
Master Big Data
• Big Data as a Supplier
• Big Data as a Consumer
• Links between Big Data and Trusted Golden Records
• Leverage Master Data in Big Data Analytics
• Entity Resolution at Extreme Scale Out Levels
• Probabilistic Entity Matching
Audit & Archive Big Data
• Queryable Archive
• Structured and Semi-Structured
• Optimized Connectors to existing Apps
• Hot-Restorable On-the-Fly
• Immutable and Secure Access
• Automated Legal Hold Capability for Data Freeze
Cleanse and Validate Big Data
• Accuracy and Entity Matching with Social Data
• De-duplication and Standardization of Machine Data
• In-line Cleansing with Integration
• Trusted Data Dashboard and Reporting on Data Quality
Protect Big Data
• Activity Monitoring
• Data Masking
• Data Encryption
• On-Demand / In-Place Protection
• In-Line Protection (w/ETL etc.)
• Active Detection & Alerting
Where to go to learn more…
If you’d like to explore this topic further…
– Contact your IBM account team or your preferred IBM Partner.
If you’d like to explore more about InfoSphere DataStage and the Information Server platform…
– http://www-01.ibm.com/software/data/integration/info_server/
If you’re looking for an Enterprise level Hadoop distribution…
– InfoSphere BigInsights http://www-01.ibm.com/software/data/infosphere/biginsights/
Thanks