CDC DataStage Integration Options R3 - IBM
© 2010 IBM Corporation
InfoSphere CDC To DataStage Integration Options
Information Management Software
Business Challenges Driving Real-Time Data Integration
…Without Impacting the Performance of Production Systems
• Yesterday’s data inadequate for inventory and purchasing decisions
• We need up-to-date information flowing between applications, with a current version always available
• Need to proactively monitor and respond to business changes
Dynamic Warehousing & Business Intelligence and Reporting
Real-time Event Detection
Data Synchronization and Replication
Accelerate capture and delivery of data changes for ETL optimization or event-driven data quality
• InfoSphere Change Data Capture provides low impact, log-based changed data capture and rapid delivery of changes
• Direct integration with InfoSphere DataStage and InfoSphere QualityStage through flat files, direct connection, message queues, or staging tables
• Extremely low impact on sourcing for ETL processing into data warehouse
• Leverage existing data ETL and data cleansing investments
IBM Information Server
Data changes for ETL and data cleansing
Change Data Capture
Differentiators and Benefits
Integrated with InfoSphere Information Server
• Technology integrated to feed real-time changed data into InfoSphere Information Server
• Extend existing InfoSphere Information Server functionality with real-time data feeds
High Performance
• Optimized native, log-based change data capture without staging on the source
• Fast and efficient; no additional hardware; no changes to databases/applications
• Less invasive to data sources and network bandwidth than alternative solutions
• Low impact to performance of source databases
Transactional Integrity
• Fault tolerant architecture maintains consistency and recovery
• Lower risk by ensuring data integrity
Breadth of Coverage
• DB2 z/LUW/iSeries, Oracle, Sybase, SQL Server, Informix, IMS, VSAM, ADABAS, IDMS
• Leverage existing investments
Four Different Integration Options
• Via Database Staging
• MQ Series Integration
• Flat File Integration
• Direct Connect
Greater flexibility to choose whichever option best fits your environment and business requirements
InfoSphere CDC & InfoSphere DataStage (ETL)
[Diagram: "CDC" continuously reads the native database log of a source system (retail point of sale, Oracle) and delivers changes to IBM Information Server for DataStage consumption and ETL load into the EDW. Delivery paths shown: Staging Table, Message Queue, Direct Connect, Flat File. Annotations: out of the box (shown twice); DataStage DSX file format; TCP via DataStage operator. Targets: Teradata, DB2, Oracle, SQL Server, Sybase…, including BalOp (ELT).]
CDC to DataStage Option 1: Database Staging
1. DataStage extracts data for initial load using standard ETL functions
2. CDC continuously captures changes made to source database
3. CDC continuously writes changes to a set of staging tables using Live Audit mappings
4. DataStage reads the changes from the staging tables, transforms and cleans the data as needed
5. Update target database with changes
6. Update internal tracking with last CDC bookmark processed
Ideal for:
• Low latency (minutes)
• High data volumes (thousands of rows per second)
• Any number of tables
[Diagram: database, InfoSphere CDC, staging area, DS/QS job, database; numbered callouts refer to the steps above]
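The staging-table flow in Option 1 (steps 4 to 6) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the staging and target databases. The table names (customer_staging, customer_target, etl_tracking), the integer bookmark column, and the I/U/D operation codes are all assumptions made for illustration; they are not the actual Live Audit layout or a DataStage job.

```python
import sqlite3

def apply_staged_changes(conn):
    """Apply staged changes newer than the last processed bookmark."""
    cur = conn.cursor()
    (last,) = cur.execute("SELECT last_bookmark FROM etl_tracking").fetchone()
    # Step 4: read only the changes that have not been processed yet.
    rows = cur.execute(
        "SELECT bookmark, op, id, name FROM customer_staging "
        "WHERE bookmark > ? ORDER BY bookmark",
        (last,),
    ).fetchall()
    for bookmark, op, row_id, name in rows:
        # Step 5: apply each change to the target (hypothetical op codes).
        if op == "D":
            cur.execute("DELETE FROM customer_target WHERE id = ?", (row_id,))
        else:  # "I" (insert) or "U" (update)
            cur.execute(
                "INSERT OR REPLACE INTO customer_target (id, name) VALUES (?, ?)",
                (row_id, name),
            )
        last = bookmark
    # Step 6: record the last bookmark processed.
    cur.execute("UPDATE etl_tracking SET last_bookmark = ?", (last,))
    conn.commit()
    return len(rows)

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer_staging (
            bookmark INTEGER PRIMARY KEY, op TEXT, id INTEGER, name TEXT);
        CREATE TABLE customer_target (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE etl_tracking (last_bookmark INTEGER);
        INSERT INTO etl_tracking VALUES (0);
        INSERT INTO customer_staging VALUES
            (1, 'I', 10, 'Ann'), (2, 'I', 11, 'Bob'),
            (3, 'U', 10, 'Anne'), (4, 'D', 11, NULL);
    """)
    first = apply_staged_changes(conn)   # processes all four staged changes
    second = apply_staged_changes(conn)  # nothing new after the bookmark
    target = conn.execute(
        "SELECT id, name FROM customer_target ORDER BY id").fetchall()
    return first, second, target
```

Because the bookmark is stored after the changes are applied, a second run with no new staged rows processes nothing, which is the behaviour the internal tracking step is meant to provide.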
CDC to DataStage Option 2: MQ-Based Integration
1. DataStage extracts data for initial load using standard ETL functions
2. CDC continuously captures changes made to remote database
3. CDC continuously writes change messages to MQ via CDC event server target
4. DataStage (via MQ connector) processes messages and passes data off to downstream stages
5. Updates written to target database
Ideal for:
• Near real-time integration (seconds)
• Low data volumes (hundreds of changes per second)
• When infrastructure utilizes MQ Series
[Diagram: database, InfoSphere CDC, MQ, DS/QS job, database; numbered callouts refer to the steps above]
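The message-driven flow in Option 2 (steps 3 to 5) can be sketched in the same spirit. Here Python's standard queue.Queue stands in for an MQ queue and a dictionary stands in for the target table. The JSON message shape (op, key, row) is a hypothetical format chosen for the example, not the format produced by the CDC event server.

```python
import json
import queue

def drain_changes(mq, target):
    """Consume change messages until the queue is empty and apply them."""
    applied = 0
    while True:
        try:
            raw = mq.get_nowait()
        except queue.Empty:
            break  # no more messages waiting
        msg = json.loads(raw)  # hypothetical message format
        key = msg["key"]
        if msg["op"] == "DELETE":
            target.pop(key, None)
        else:  # INSERT or UPDATE carries the full row image
            target[key] = msg["row"]
        applied += 1
    return applied

def demo():
    mq = queue.Queue()  # stand-in for an MQ queue
    messages = [
        {"op": "INSERT", "key": 1, "row": {"qty": 5}},
        {"op": "UPDATE", "key": 1, "row": {"qty": 7}},
        {"op": "INSERT", "key": 2, "row": {"qty": 1}},
        {"op": "DELETE", "key": 2},
    ]
    for m in messages:
        mq.put(json.dumps(m))
    target = {}  # stand-in for the target table
    count = drain_changes(mq, target)
    return count, target
```

Messages are applied strictly in arrival order, which is what keeps the target consistent when several changes hit the same key.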
CDC to DataStage Option 3: File Based
1. DataStage extracts data for initial load using standard ETL functions or CDC can be used for refresh
2. CDC continuously captures changes made to source database
3. CDC DataStage writes one file per table and periodically hardens the files
4. DataStage reads the changes from the complete files
5. Update target database with changes
Ideal for:
• Medium latency (a few minutes or more between periodic batches)
• Very high data volumes requiring parallel loading
• Up to hundreds of tables
[Diagram: database, InfoSphere CDC, File, DS/QS job, database; numbered callouts refer to the steps above]
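The key point of Option 3 is that the consumer reads only complete (hardened) files and never a file that is still being written. A minimal sketch of that idea follows. The naming convention is an assumption for illustration only: in-progress files end in .tmp, hardened files end in .dat, and processed files are renamed with a .done suffix. The real file names and layout are defined by the CDC configuration and are not reproduced here.

```python
import tempfile
from pathlib import Path

def read_hardened_files(directory):
    """Return {table: [change lines]} from hardened files, then mark them done."""
    changes = {}
    # Only "*.dat" files are treated as hardened (hypothetical convention).
    for path in sorted(Path(directory).glob("*.dat")):
        table = path.stem.split(".")[0]  # "orders.0001.dat" -> "orders"
        changes.setdefault(table, []).extend(path.read_text().splitlines())
        # Rename so the same file is not picked up by the next batch.
        path.rename(path.with_name(path.name + ".done"))
    return changes

def demo():
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        (root / "orders.0001.dat").write_text("I,1,widget\nU,1,gadget\n")
        (root / "orders.0002.tmp").write_text("I,2,part")  # still being written
        (root / "customers.0001.dat").write_text("I,7,Ann\n")
        first = read_hardened_files(root)   # picks up the two hardened files
        second = read_hardened_files(root)  # nothing new to read
        return first, second
```

Since each table has its own files, separate tables could be loaded in parallel, which fits the very high volume use case listed for this option.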
CDC to DataStage Option 4: Direct Connect
1. DataStage extracts data for initial load using standard ETL functions or CDC can be used for the refresh
2. CDC continuously captures changes made to source database and flows over TCP/IP to CDC Transaction Stage
3. CDC Transaction Stage passes data off to downstream stages
4. Updates target database with changed data. Bookmark persisted in the target database along with the client data to maintain end-to-end transactional integrity
5. Bookmark flows back to CDC source periodically, and at start of replication
Ideal for:
• Near real-time integration (seconds)
• Medium data volumes (hundreds to low thousands of rows per second)
• Less than 150 tables
Should not be used for targeting Netezza
[Diagram: source database, CDC Source, DS/QS job containing CDC Transaction Stage and Database Connector Stage, CDC DataStage Target, target database; numbered callouts refer to the steps above]
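Step 4 of Option 4 relies on writing the bookmark in the same database transaction as the changed data, so that a failure can never leave the two out of step. The sketch below illustrates that principle only, using Python's sqlite3 module as a stand-in for the target database. The orders and cdc_bookmark tables, the integer bookmark, and the simulated failure flag are assumptions for the example; this is not the CDC Transaction Stage itself.

```python
import sqlite3

def apply_transaction(conn, bookmark, changes, fail=False):
    """Apply one source transaction and its bookmark atomically."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for row_id, amount in changes:
                conn.execute(
                    "INSERT OR REPLACE INTO orders (id, amount) VALUES (?, ?)",
                    (row_id, amount),
                )
            if fail:
                raise RuntimeError("simulated failure before commit")
            # The bookmark is written in the same transaction as the data.
            conn.execute("UPDATE cdc_bookmark SET position = ?", (bookmark,))
        return True
    except RuntimeError:
        return False

def restart_position(conn):
    """Bookmark that would be sent back to the source at start of replication."""
    return conn.execute("SELECT position FROM cdc_bookmark").fetchone()[0]

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.execute("CREATE TABLE cdc_bookmark (position INTEGER)")
    conn.execute("INSERT INTO cdc_bookmark VALUES (0)")
    conn.commit()
    ok1 = apply_transaction(conn, 100, [(1, 50)])
    ok2 = apply_transaction(conn, 200, [(2, 75)], fail=True)  # rolled back
    rows = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
    return ok1, ok2, restart_position(conn), rows
```

After the simulated failure, neither the second row nor the second bookmark is present, so replication would resume from position 100 without losing or duplicating changes.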
Questions?