Real time ETL processing using Spark streaming
![Page 1: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/1.jpg)
Real Time ETL processing
By Veeramani Moorthy
![Page 2: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/2.jpg)
Agenda
Real time ETL Architecture
Why Reconciler?
Requirements for Reconciler
Reconciler Data model
Q & A
![Page 3: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/3.jpg)
Real-Time ETL Architecture
[Architecture diagram; labels recovered from the figure]
Components: Source DB, GoldenGate, Trail Files, Adapter (Read), Schema Registry, Companies Topic, Addresses Topic, Streaming Reconciler job (Spark Reconciler), Reconciled Companies Topic, Reconciled Addresses Topic, Streaming Joiner/Transformer Job (Spark Joiner), fn, Mapping service
Flows:
[1.0] Data Extract
[1.1] Data Pump
[1.2] Get/Create/Update Schema
[1.2.1] JDBC Fetch Table Schema
[2.1] CDC Events to broker
[3.1] CDC Events to broker
Get Table Schema (Spark Reconciler, Spark Joiner)
Read/Write for Reconcile Companies
Read/Write for Reconcile Addresses
Get Mapping
Write output
• Schema Registry is a repository of all schemas, which are versioned.
• GoldenGate captures the table change events.
• Kafka: distributed messaging system.
• CDC: Change Data Capture.
![Page 4: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/4.jpg)
Requirements for Reconciler
Support for idempotency
Support for immutability
Support for schema evolution
Support for handling out-of-order CDC events
![Page 5: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/5.jpg)
Challenges in Spark streaming
![Page 6: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/6.jpg)
Out-of-sequence events
The UPDATE arrives first; the INSERT arrives later
![Page 7: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/7.jpg)
Challenges in Spark streaming …
![Page 8: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/8.jpg)
Data model
Instead of

| Company_id | Company_name | Company_addr |
|---|---|---|
| 10201 | ABC Inc | EGL, BLR |

….

Go with

| Tuple Id | Source DB Timestamp | Attribute Name | Attribute value | isDelete? |
|---|---|---|---|---|
| 10201 | 12345677 | company_id | 10201 | false |
| 10201 | 12345677 | company_name | ABC Inc | false |
| 10201 | 12345677 | company_addr | EGL, BLR | false |
| 10201 | 22345677 | company_addr | Ecospace, BLR | false |

….
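The move from one wide row per company to one tuple per attribute can be sketched as below. This is only an illustration of the data model on this slide, not the presenter's code: the `AttributeTuple` type, the `explode_event` helper, and its parameters are hypothetical names, and a CDC event is assumed to arrive as a plain dictionary of changed columns.

```python
from typing import Any, Dict, List, NamedTuple


class AttributeTuple(NamedTuple):
    """One row of the attribute-level model (field names are assumptions)."""
    tuple_id: int
    source_ts: int
    attribute_name: str
    attribute_value: Any
    is_delete: bool


def explode_event(tuple_id: int, changed_columns: Dict[str, Any],
                  source_ts: int, is_delete: bool = False) -> List[AttributeTuple]:
    """Turn one CDC event into one tuple per changed column (hypothetical helper)."""
    return [
        AttributeTuple(tuple_id, source_ts, name, value, is_delete)
        for name, value in changed_columns.items()
    ]


# An INSERT carries every column of the row.
insert_tuples = explode_event(
    10201,
    {"company_id": 10201, "company_name": "ABC Inc", "company_addr": "EGL, BLR"},
    source_ts=12345677,
)

# A later UPDATE carries only the column that changed.
update_tuples = explode_event(10201, {"company_addr": "Ecospace, BLR"}, source_ts=22345677)

for t in insert_tuples + update_tuples:
    print(t)
```

Nothing is overwritten here: the update is appended as a fourth tuple next to the original address, which matches the four rows in the table above.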
![Page 9: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/9.jpg)
How does it solve?
Immutability?
Idempotency?
Out of sequence events?
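One way to read the three questions above: if tuples are only ever appended (immutability) and the current value of an attribute is taken to be the one with the highest source DB timestamp, then replaying an event or receiving events in the wrong order should not change the result. The sketch below assumes that "latest timestamp wins" rule, which the slides imply but do not spell out; `reconcile` and `AttributeTuple` are hypothetical names, and ties between different values at the same timestamp are not handled.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class AttributeTuple(NamedTuple):
    tuple_id: int
    source_ts: int
    attribute_name: str
    attribute_value: str
    is_delete: bool


def reconcile(events: Iterable[AttributeTuple]) -> Dict[int, Dict[str, str]]:
    """Rebuild the current rows from appended tuples; the latest source_ts wins."""
    latest: Dict[Tuple[int, str], AttributeTuple] = {}
    for ev in events:
        key = (ev.tuple_id, ev.attribute_name)
        current = latest.get(key)
        # An older or duplicate event never replaces a newer one.
        if current is None or ev.source_ts > current.source_ts:
            latest[key] = ev
    rows: Dict[int, Dict[str, str]] = {}
    for (tuple_id, name), ev in latest.items():
        if not ev.is_delete:
            rows.setdefault(tuple_id, {})[name] = ev.attribute_value
    return rows


insert = [
    AttributeTuple(10201, 12345677, "company_id", "10201", False),
    AttributeTuple(10201, 12345677, "company_name", "ABC Inc", False),
    AttributeTuple(10201, 12345677, "company_addr", "EGL, BLR", False),
]
update = [AttributeTuple(10201, 22345677, "company_addr", "Ecospace, BLR", False)]

in_order = reconcile(insert + update)
out_of_order = reconcile(update + insert)               # UPDATE arrives before INSERT
replayed = reconcile(insert + update + update + insert)  # duplicates delivered again

print(in_order == out_of_order == replayed)
```

Because the result depends only on the set of tuples and their timestamps, arrival order and redelivery do not matter in this sketch.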
![Page 10: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/10.jpg)
Schema Evolution
| Tuple Id | Source DB Timestamp | Attribute Name | Attribute value | isDelete? |
|---|---|---|---|---|
| 10201 | 12345677 | company_id | 10201 | false |
| 10201 | 12345677 | company_name | ABC Inc | false |
| 10201 | 12345677 | company_addr | EGL, BLR | false |
| 10201 | 22345677 | company_addr | Ecospace, BLR | false |
| 10201 | 22345900 | Registered_name | ABC India Pvt Ltd | false |

….
Do I have to change the destination schema?
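The table suggests the answer is no, at least for an added column: a new source column shows up as one more tuple with a new attribute name, while the shape of the stored tuples stays the same. A small sketch of that observation, using plain Python tuples as a stand-in for the store (the `FIELDS` constant and the `known_attributes` helper are hypothetical):

```python
# The five fields of the attribute-level model; this shape never changes.
FIELDS = ("tuple_id", "source_ts", "attribute_name", "attribute_value", "is_delete")

store = [
    (10201, 12345677, "company_id", "10201", False),
    (10201, 12345677, "company_name", "ABC Inc", False),
    (10201, 12345677, "company_addr", "EGL, BLR", False),
    (10201, 22345677, "company_addr", "Ecospace, BLR", False),
]


def known_attributes(tuples):
    """Attribute names are discovered from the data, not from a fixed schema."""
    return {name for _, _, name, _, _ in tuples}


before = known_attributes(store)

# The source table gains a Registered_name column: just append another tuple.
store.append((10201, 22345900, "Registered_name", "ABC India Pvt Ltd", False))

after = known_attributes(store)
print(sorted(after - before))
```

How column deletion and data type changes are handled is not shown on the slides, so this sketch covers only the added-column case.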
![Page 11: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/11.jpg)
Schema Evolution
Addition of new column
Deletion of an existing column
Data Type change
![Page 12: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/12.jpg)
![Page 13: Real time ETL processing using Spark streaming](https://reader033.fdocuments.in/reader033/viewer/2022052514/587138bd1a28abf0568b6421/html5/thumbnails/13.jpg)