ETL Framework
Transcript of ETL Framework
Motivations
It is nice to not have jobs fail overnight
- Automated retries
- Detect transient failures (e.g. an interrupted transfer)
- Track data quality issues without failing

It is nice to know when data is not right
- Provide edit messages for problems
- Fix / sanitise data where possible

It is nice to know what data looked like before the most recent refresh
- Track versions of rows as they are received
These are not features of SSIS or Talend
Features
- Standardise jobs without loss of flexibility
- Batch tracking
- Row count tracking
- Change capture and tracking
- Globally unique, persistent surrogate keys
- Edit messages
- Scheduling & logging
Standardised Jobs - Staging
Having a standard layout and sequence of steps helps when trying to learn what each job does.
Standardised Jobs
- All staging tables have a generated primary key comprising batch ID and row number
- All staging tables have a reject flag indicating whether reject edits were raised
- As rows are staged, edits are checked
- Edit columns are nullable: if non-null, the edit was raised and the column contains the message
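The staging conventions above can be sketched in miniature. This is an illustrative mock-up using Python's sqlite3, not the framework's real DDL: the table name echoes the example later in the deck, but the column names (etl_row_num, edit_missing_code) and the edit check itself are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical staging layout: composite key of batch ID + row number,
# a reject flag, and one nullable edit column per check.
con.execute("""
    CREATE TABLE ip_s_code_map (
        etl_batch_id      INTEGER NOT NULL,
        etl_row_num       INTEGER NOT NULL,
        etl_reject_flag   INTEGER NOT NULL DEFAULT 0,
        edit_missing_code TEXT,            -- NULL unless the edit was raised
        map_name          TEXT,
        code              TEXT,
        PRIMARY KEY (etl_batch_id, etl_row_num)
    )""")

def stage_row(batch_id, row_num, map_name, code):
    # Check edits as the row is staged: a raised reject edit sets the
    # flag and stores the message in the corresponding edit column.
    edit, reject = None, 0
    if code is None:
        edit, reject = "code is missing", 1
    con.execute(
        "INSERT INTO ip_s_code_map VALUES (?, ?, ?, ?, ?, ?)",
        (batch_id, row_num, reject, edit, map_name, code))

stage_row(102274, 1, "Delivery Mode", "1")
stage_row(102274, 2, "Delivery Mode", None)
rejects = con.execute(
    "SELECT etl_row_num, edit_missing_code FROM ip_s_code_map "
    "WHERE etl_reject_flag = 1").fetchall()
print(rejects)  # the second row was rejected, with its message preserved
```

The point of the nullable edit columns is that the message and the flag travel with the row, so bad data can still be loaded and reported on later.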
Non-standard Jobs
Batch Tracking
- Call the etl_start_batch and etl_stop_batch stored procedures
- Use the database for all ETL timestamps to get a consistent clock
- Stop batch optionally performs row count checks (not part of the stored procedure)
- Logs to the etl_r_batch table
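A minimal sketch of the start/stop pattern, assuming a plausible shape for the etl_r_batch log (the real table's columns are not shown in this deck). Timestamps come from the database, not the job host, which is what gives every job the same clock.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Assumed etl_r_batch layout; column names here are illustrative only.
con.execute("""
    CREATE TABLE etl_r_batch (
        etl_batch_id INTEGER PRIMARY KEY AUTOINCREMENT,
        job_name     TEXT NOT NULL,
        start_ts     TEXT NOT NULL DEFAULT (datetime('now')),
        stop_ts      TEXT,
        status       TEXT NOT NULL DEFAULT 'RUNNING')""")

def etl_start_batch(job_name):
    # Stand-in for the etl_start_batch stored procedure: open a batch
    # and hand its ID back to the job.
    cur = con.execute(
        "INSERT INTO etl_r_batch (job_name) VALUES (?)", (job_name,))
    return cur.lastrowid

def etl_stop_batch(batch_id, status="OK"):
    # Stand-in for etl_stop_batch: close the batch using the database
    # clock.  Row count checks would hang off this call.
    con.execute(
        "UPDATE etl_r_batch SET stop_ts = datetime('now'), status = ? "
        "WHERE etl_batch_id = ?", (status, batch_id))

batch = etl_start_batch("stage_code_map")
etl_stop_batch(batch)
row = con.execute(
    "SELECT job_name, status FROM etl_r_batch "
    "WHERE etl_batch_id = ?", (batch,)).fetchone()
```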
Row Count Tracking
- Row counters track input rows, output rows, rejects and edits – also logged to etl_r_batch
- Allows reconciliation to check that rows aren’t “lost” through unintended ETL actions
- Lets me see how jobs are progressing
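One plausible reconciliation rule is sketched below. The deck doesn't spell out the exact balance, so this assumes rejects are diverted rather than written to output, and that edits don't consume rows.

```python
def reconcile(counts):
    """Check that no rows were silently lost: every input row must be
    accounted for as either an output row or a reject.  Edits annotate
    rows rather than consuming them, so they sit outside the balance.
    (This balance is an assumption; the framework's real check may differ.)"""
    return counts["input"] == counts["output"] + counts["reject"]

assert reconcile({"input": 1000, "output": 990, "reject": 10, "edit": 37})
# Five rows vanished somewhere in the job -- reconciliation fails:
assert not reconcile({"input": 1000, "output": 985, "reject": 10, "edit": 37})
```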
Row Numbering
Surrogate keys, batch IDs, row numbers are assigned by a custom Talend component
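Roughly what that component does, sketched in Python. The class name and method are illustrative, not the component's real API; the real component also allocates surrogate keys from a global pool.

```python
import itertools

class RowNumberer:
    """Illustrative stand-in for the custom Talend component: stamp each
    passing row with the current batch ID and a monotonically increasing
    row number (together, the staging primary key)."""

    def __init__(self, batch_id):
        self.batch_id = batch_id
        self._counter = itertools.count(1)

    def stamp(self, row):
        return {"etl_batch_id": self.batch_id,
                "etl_row_num": next(self._counter),
                **row}

numberer = RowNumberer(batch_id=102274)
first = numberer.stamp({"map_name": "Delivery Mode", "code": "1"})
second = numberer.stamp({"map_name": "Delivery Mode", "code": "2"})
```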
Change Capture
We want to know when a source system row changes between snapshots:
- Permits investigation of historical reports
- Helps undo data problems in the source
- Provides triggers for re-evaluating edits

Efficient storage of changes is useful, but space is cheap, so it is not a big constraint.

Having three tables (staging, base, history) for each source table has impacts for naming.
Change Capture
- Use a left join from staging to current (yesterday) to pick up insertions (IN) and new versions of updated rows (UI)
- Union with a left join from current to staging to pick up deletions (DE) and old versions of updated rows (UD)
- Null-safe compare on all non-key columns
- Retain the previously allocated skey for updates; allocate a new skey for inserts only
- Means skeys persist while the row remains in the source
- Conserves the global pool of skeys (although bigint has plenty!)
Change Capture
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

SELECT etl_batch_id, etl_skey, etl_chg_type, [etl_reject_flag],
       [map_name], [code], [cond_desc], [full_desc], [mapped_code]
FROM (SELECT CASE WHEN dst.etl_batch_id IS NOT NULL THEN 'UI' ELSE 'IN' END AS etl_chg_type,
             dst.etl_skey, src.etl_batch_id,
             src.[etl_reject_flag], dst.[etl_reject_flag] AS [etl_cmp_etl_reject_flag],
             src.[map_name], dst.[map_name] AS [etl_cmp_map_name],
             src.[code], dst.[code] AS [etl_cmp_code],
             src.[cond_desc], dst.[cond_desc] AS [etl_cmp_cond_desc],
             src.[full_desc], dst.[full_desc] AS [etl_cmp_full_desc],
             src.[mapped_code], dst.[mapped_code] AS [etl_cmp_mapped_code]
      FROM [dbo].[ip_s_code_map] AS src
      LEFT JOIN [dbo].[ip_r_code_map] AS dst
        ON src.[map_name] = dst.[map_name] AND src.[code] = dst.[code]
Note that if you process changes while the query runs, you need to use snapshot isolation to avoid the second half of the UNION picking up changes that have already been applied.
Using separate rows for UI & UD means you do not need to double the number of columns in your output or change nullability of columns.
Left join staging to current to find new rows or the new version of existing rows.
Change Capture

UNION ALL
SELECT CASE WHEN dst.etl_batch_id IS NOT NULL THEN 'UD' ELSE 'DE' END AS etl_chg_type,
       src.etl_skey, src.etl_batch_id,
       src.[etl_reject_flag], dst.[etl_reject_flag] AS [etl_cmp_etl_reject_flag],
       src.[map_name], dst.[map_name] AS [etl_cmp_map_name],
       src.[code], dst.[code] AS [etl_cmp_code],
       src.[cond_desc], dst.[cond_desc] AS [etl_cmp_cond_desc],
       src.[full_desc], dst.[full_desc] AS [etl_cmp_full_desc],
       src.[mapped_code], dst.[mapped_code] AS [etl_cmp_mapped_code]
FROM [dbo].[ip_r_code_map] AS src
LEFT JOIN [dbo].[ip_s_code_map] AS dst
  ON src.[map_name] = dst.[map_name] AND src.[code] = dst.[code]) AS cmp
WHERE CASE WHEN etl_chg_type NOT IN ('IN', 'DE')
            AND ([etl_reject_flag] = [etl_cmp_etl_reject_flag])
            AND ([cond_desc] = [etl_cmp_cond_desc]
                 OR ([cond_desc] IS NULL AND [etl_cmp_cond_desc] IS NULL))
            AND ([full_desc] = [etl_cmp_full_desc])
            AND ([mapped_code] = [etl_cmp_mapped_code])
       THEN 1 ELSE 0 END = 0
Left join current to staging to find deleted rows or the old version of existing rows.
Use of CASE helps guard against accidental NULL-unsafe comparisons
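The same two-left-joins-plus-union pattern can be demonstrated end to end on a toy table. This sketch runs the query in SQLite via Python, with surrogate keys and most comparison columns stripped out for brevity; the table and column names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- r = current (yesterday's) repository, s = today's staging
    CREATE TABLE r (map_name TEXT, code TEXT, full_desc TEXT);
    CREATE TABLE s (map_name TEXT, code TEXT, full_desc TEXT);
    INSERT INTO r VALUES ('Delivery Mode', '4', 'Written'),
                         ('Delivery Mode', '9', 'Not Applicable');
    -- Today: code 4 was updated, code 9 deleted, code 5 inserted.
    INSERT INTO s VALUES ('Delivery Mode', '4', 'Written (paper)'),
                         ('Delivery Mode', '5', 'Email');
""")

# Same shape as the generated query: staging-to-current finds IN/UI,
# current-to-staging finds DE/UD, then a null-safe filter drops
# update pairs whose non-key columns are all unchanged.
changes = con.execute("""
    SELECT etl_chg_type, code FROM (
        SELECT CASE WHEN dst.code IS NOT NULL THEN 'UI' ELSE 'IN' END
                   AS etl_chg_type,
               src.code AS code, src.full_desc AS a, dst.full_desc AS b
        FROM s AS src LEFT JOIN r AS dst
          ON src.map_name = dst.map_name AND src.code = dst.code
        UNION ALL
        SELECT CASE WHEN dst.code IS NOT NULL THEN 'UD' ELSE 'DE' END,
               src.code, src.full_desc, dst.full_desc
        FROM r AS src LEFT JOIN s AS dst
          ON src.map_name = dst.map_name AND src.code = dst.code
    ) AS cmp
    WHERE NOT (etl_chg_type IN ('UI', 'UD')
               AND (a = b OR (a IS NULL AND b IS NULL)))
    ORDER BY code, etl_chg_type
""").fetchall()
print(changes)  # [('UD', '4'), ('UI', '4'), ('IN', '5'), ('DE', '9')]
```

Each update surfaces as a UD/UI pair sharing a key, while pure inserts and deletes appear once, mirroring the example output on the next slide.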
Change Capture
- First batch is all inserts
- Subsequently rows may be updated or deleted, and new rows inserted
- Outgoing data is also present
- Useful to join UI & UD to see the effect of an update
etl_batch_id  etl_skey   etl_chg_type  etl_reject_flag  map_name       code  cond_desc  full_desc       mapped_code
102274        276360026  IN            0                Delivery Mode
102274        276360027  IN            0                Delivery Mode  1     Y          Face-to-face    1
102274        276360028  IN            0                Delivery Mode  2     Y          Telephone       2
102274        276360029  IN            0                Delivery Mode  3     Y          Telehealth      3
102274        276360030  IN            0                Delivery Mode  4     Y          Written         4
102274        276360031  IN            0                Delivery Mode  9     Y          Not Applicable  9
129484        276360030  UD            0                Delivery Mode  4     Y          Written         4
129484        276360030  UI            0                Delivery Mode  4     N          Written         4
129484        276360031  DE            0                Delivery Mode  9     Y          Not Applicable  9
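The "join UI & UD" trick mentioned above can be sketched with the rows from this table: pairing the two change types on skey shows the before/after of each update. The row dictionaries below are hand-built for illustration.

```python
# Batch 129484 from the example above: skey 276360030 had cond_desc
# change from Y to N (a UD row for the old version, UI for the new).
rows = [
    {"etl_skey": 276360030, "etl_chg_type": "UD", "cond_desc": "Y"},
    {"etl_skey": 276360030, "etl_chg_type": "UI", "cond_desc": "N"},
]

old = {r["etl_skey"]: r for r in rows if r["etl_chg_type"] == "UD"}
new = {r["etl_skey"]: r for r in rows if r["etl_chg_type"] == "UI"}

# Join UD to UI on skey: the pair gives (before, after) per updated row.
diffs = {k: (old[k]["cond_desc"], new[k]["cond_desc"])
         for k in old.keys() & new.keys()}
print(diffs)  # {276360030: ('Y', 'N')}
```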
An Aside: Naming Schemes
- I don’t use tbl_, vw_, sp_, etc.
- Names tend to be right-trimmed, so the left-most characters are precious
- Names tend to be prefix-sorted, so the left-most characters should group
- You can fairly easily work it out without it
- If I decide to replace a table with a view (it happens!) it looks silly to have tbl_ in the view name – the corollary is that “what it is” is less relevant than “how do you use it” – e.g. does it return a row set on SELECT?
- I start the name with a functional group, e.g. ip_, op_, rt_, pl_, ph_, wl_, ad_
- I use a letter to indicate the workflow role: c = hand-entered config, s = staging (truncated daily), r = repository (current records), h = history (change tracking records), f = fact, d = dimension, a = aggregation, m = materialized view (refreshed daily)
I try to name columns consistently across systems
Standardised Jobs - Merging
Surrogate Keys
- A globally unique key for all rows
- Allows me to have a single edits table
- Allows easy merging of different data sources
- Makes it easier to identify a row in queries (vs. compound or non-integer keys)
- Issued monotonically – suitable for clustered indexes
- Not used in staging
- Allocated after change capture has identified the new rows that need a new skey
- Can remain attached to the row while cardinality doesn’t change (e.g. into facts but not aggregations)
Edit Messages
Writing data quality messages means you can massage bad data to load it, without losing track of what you originally received
Capturing your understanding of the data in DQ messages helps to provide living documentation of key business concepts
Using change capture, edits can be regenerated only when the source data changes
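The change-driven regeneration idea can be sketched as a simple filter. This is an assumption about the mechanism, keyed on the skeys that change capture surfaced, not the framework's actual implementation.

```python
def edits_to_regenerate(changed_skeys, current_edits):
    """Sketch: only rows whose skey appears in today's change set need
    their edits re-evaluated; all other rows keep yesterday's messages
    (and any acknowledgements attached to them)."""
    return {skey: msg for skey, msg in current_edits.items()
            if skey in changed_skeys}

edits = {276360027: "code not in reference list",
         276360030: "description truncated",
         276360031: None}
# Change capture reported that only skey 276360030 changed today:
todo = edits_to_regenerate({276360030}, edits)
print(todo)  # {276360030: 'description truncated'}
```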
Other Logging
Scheduler logs job output and error streams to database
Jobs can log details (e.g. file name processed, date range, etc) to journal
Jobs can create log messages but mostly don’t, relying on stdout/stderr
Web Job Monitor
- Web Job Monitor is underdeveloped: it should be integrated with the scheduler and allow more job control
- Provides a web-based listing of job outcomes and logs
Web Job Monitor
- Status, run times, rows processed, etc.
- Links to a detail screen for each job (next slide)
Web Job Monitor
Edits can be acknowledged to hide them from data quality reports
Acknowledgement persists until row changes in source (as detected by change capture)
Scheduler
Uses Oddjob http://www.rgordon.co.uk/oddjob/
- Talend does not easily express non-trivial dependencies for job sequences, e.g. six independent reference tables can run in parallel, but then the transaction table load must wait for all of them to succeed
- Talend also makes you pay for a scheduler
- SQL Server Agent requires a SQL Server licence
- Task Scheduler is not sophisticated enough
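The six-reference-tables example is a classic fan-out/fan-in dependency. A minimal sketch of the shape (with a placeholder standing in for the real Talend job invocations; names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def load(table):
    # Placeholder for launching a real Talend job and waiting on it.
    return f"{table} loaded"

reference_tables = ["ref_a", "ref_b", "ref_c", "ref_d", "ref_e", "ref_f"]

# Fan out: six independent reference loads run in parallel.  pool.map
# does not return until every one has finished, so an exception in any
# load propagates here and blocks the fan-in step.
with ThreadPoolExecutor(max_workers=6) as pool:
    results = list(pool.map(load, reference_tables))

# Fan in: the transaction table load starts only after all succeeded.
transactions = load("transactions")
```

This is the kind of structure Oddjob expresses directly as nested sequential and parallel jobs, which plain Task Scheduler cannot.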
Job Stream Structure

This node is a link to the underlying configuration file DI_All.xml and needs to be running for the daily schedule to be active.
This sequence is run (and keeps running) to load the configuration and the schedule.
This job represents the daily schedule: when it is a blue dot it is “running” and will trigger.
Sequential job containing the phases.
Sequential job representing the first phase.
Sequential job within a phase representing the logical step (stage + merge bookings).
Talend jobs that perform the actual stage and merge processing.
Batch job corresponding to the master job.
Future
- I have broken every rule I have created – it would be nice to reconsider some of those
- The change capture query builder could be turned into a component; the hard part is getting the schema to load into the Talend designer
- Edit messages should be checked and reloaded if the edit message changes
- Web Job Monitor needs love