ETL Framework
Transcript of ETL Framework
Motivations
It is nice to not have jobs fail overnight
- Automated retries
- Detect transient failures (e.g. an interrupted transfer)
- Track data quality issues without failing

It is nice to know when data is not right
- Provide edit messages for problems
- Fix / sanitise data where possible

It is nice to know what data looked like before the most recent refresh
- Track versions of rows as they are received
These are not features of SSIS or Talend
Features
- Standardise jobs without loss of flexibility
- Batch tracking
- Row count tracking
- Change capture and tracking
- Globally unique, persistent surrogate keys
- Edit messages
- Scheduling & logging
Standardised Jobs - Staging
Having a standard layout and sequence of steps helps when trying to learn what each job does.
Standardised Jobs
- All staging tables have a generated primary key comprising batch ID and row number
- All staging tables have a reject flag indicating whether reject edits were raised
- As rows are staged, edits are checked
- Edit columns are nullable: if non-null, the edit was raised and the column contains the message
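The staging conventions above can be sketched in miniature. This is an illustrative mock-up using Python's sqlite3, not the framework's real DDL: the table name echoes the example later in the deck, but the column names (etl_row_num, edit_missing_code) and the edit check itself are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical staging layout: composite key of batch ID + row number,
# a reject flag, and one nullable edit column per check.
con.execute("""
    CREATE TABLE ip_s_code_map (
        etl_batch_id      INTEGER NOT NULL,
        etl_row_num       INTEGER NOT NULL,
        etl_reject_flag   INTEGER NOT NULL DEFAULT 0,
        edit_missing_code TEXT,            -- NULL unless the edit was raised
        map_name          TEXT,
        code              TEXT,
        PRIMARY KEY (etl_batch_id, etl_row_num)
    )""")

def stage_row(batch_id, row_num, map_name, code):
    # Check edits as the row is staged: a raised reject edit sets the
    # flag and stores the message in the corresponding edit column.
    edit, reject = None, 0
    if code is None:
        edit, reject = "code is missing", 1
    con.execute(
        "INSERT INTO ip_s_code_map VALUES (?, ?, ?, ?, ?, ?)",
        (batch_id, row_num, reject, edit, map_name, code))

stage_row(102274, 1, "Delivery Mode", "1")
stage_row(102274, 2, "Delivery Mode", None)
rejects = con.execute(
    "SELECT etl_row_num, edit_missing_code FROM ip_s_code_map "
    "WHERE etl_reject_flag = 1").fetchall()
print(rejects)  # the second row was rejected, with its message preserved
```

The point of the nullable edit columns is that the message and the flag travel with the row, so bad data can still be loaded and reported on later.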
Non-standard Jobs
Batch Tracking
- Call the etl_start_batch and etl_stop_batch stored procedures
- Use the database for all ETL timestamps to get a consistent clock
- Stop batch optionally performs row count checks (not part of the stored procedure)
- Logs to the etl_r_batch table
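A minimal sketch of the start/stop pattern, assuming a plausible shape for the etl_r_batch log (the real table's columns are not shown in this deck). Timestamps come from the database, not the job host, which is what gives every job the same clock.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Assumed etl_r_batch layout; column names here are illustrative only.
con.execute("""
    CREATE TABLE etl_r_batch (
        etl_batch_id INTEGER PRIMARY KEY AUTOINCREMENT,
        job_name     TEXT NOT NULL,
        start_ts     TEXT NOT NULL DEFAULT (datetime('now')),
        stop_ts      TEXT,
        status       TEXT NOT NULL DEFAULT 'RUNNING')""")

def etl_start_batch(job_name):
    # Stand-in for the etl_start_batch stored procedure: open a batch
    # and hand its ID back to the job.
    cur = con.execute(
        "INSERT INTO etl_r_batch (job_name) VALUES (?)", (job_name,))
    return cur.lastrowid

def etl_stop_batch(batch_id, status="OK"):
    # Stand-in for etl_stop_batch: close the batch using the database
    # clock.  Row count checks would hang off this call.
    con.execute(
        "UPDATE etl_r_batch SET stop_ts = datetime('now'), status = ? "
        "WHERE etl_batch_id = ?", (status, batch_id))

batch = etl_start_batch("stage_code_map")
etl_stop_batch(batch)
row = con.execute(
    "SELECT job_name, status FROM etl_r_batch "
    "WHERE etl_batch_id = ?", (batch,)).fetchone()
```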
Row Count Tracking
- Row counters track input rows, output rows, rejects and edits – also logged to etl_r_batch
- Allows reconciliation to check that rows aren’t “lost” through unintended ETL actions
- Lets me see how jobs are progressing
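One plausible reconciliation rule is sketched below. The deck doesn't spell out the exact balance, so this assumes rejects are diverted rather than written to output, and that edits don't consume rows.

```python
def reconcile(counts):
    """Check that no rows were silently lost: every input row must be
    accounted for as either an output row or a reject.  Edits annotate
    rows rather than consuming them, so they sit outside the balance.
    (This balance is an assumption; the framework's real check may differ.)"""
    return counts["input"] == counts["output"] + counts["reject"]

assert reconcile({"input": 1000, "output": 990, "reject": 10, "edit": 37})
# Five rows vanished somewhere in the job -- reconciliation fails:
assert not reconcile({"input": 1000, "output": 985, "reject": 10, "edit": 37})
```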
Row Numbering
Surrogate keys, batch IDs, row numbers are assigned by a custom Talend component
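Roughly what that component does, sketched in Python. The class name and method are illustrative, not the component's real API; the real component also allocates surrogate keys from a global pool.

```python
import itertools

class RowNumberer:
    """Illustrative stand-in for the custom Talend component: stamp each
    passing row with the current batch ID and a monotonically increasing
    row number (together, the staging primary key)."""

    def __init__(self, batch_id):
        self.batch_id = batch_id
        self._counter = itertools.count(1)

    def stamp(self, row):
        return {"etl_batch_id": self.batch_id,
                "etl_row_num": next(self._counter),
                **row}

numberer = RowNumberer(batch_id=102274)
first = numberer.stamp({"map_name": "Delivery Mode", "code": "1"})
second = numberer.stamp({"map_name": "Delivery Mode", "code": "2"})
```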
Change Capture
We want to know when a source system row changes between snapshots:
- Permits investigation of historical reports
- Helps undo data problems in the source
- Provides triggers for re-evaluating edits

Efficient storage of changes is useful, but space is cheap, so it is not a big constraint.

Having three tables (staging, base, history) for each source table has impacts for naming.
Change Capture
- Use a left join from staging to current (yesterday) to pick up insertions (IN) and new versions of updated rows (UI)
- Union with a left join from current to staging to pick up deletions (DE) and old versions of updated rows (UD)
- Null-safe compare on all non-key columns
- Retain the previously allocated skey for updates; allocate a new skey for inserts only
- Means skeys persist while the row remains in the source
- Conserves the global pool of skeys (although bigint has plenty!)
Change Capture
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

SELECT etl_batch_id, etl_skey, etl_chg_type, [etl_reject_flag],
       [map_name], [code], [cond_desc], [full_desc], [mapped_code]
FROM (SELECT CASE WHEN dst.etl_batch_id IS NOT NULL THEN 'UI' ELSE 'IN' END AS etl_chg_type,
             dst.etl_skey, src.etl_batch_id,
             src.[etl_reject_flag], dst.[etl_reject_flag] AS [etl_cmp_etl_reject_flag],
             src.[map_name], dst.[map_name] AS [etl_cmp_map_name],
             src.[code], dst.[code] AS [etl_cmp_code],
             src.[cond_desc], dst.[cond_desc] AS [etl_cmp_cond_desc],
             src.[full_desc], dst.[full_desc] AS [etl_cmp_full_desc],
             src.[mapped_code], dst.[mapped_code] AS [etl_cmp_mapped_code]
      FROM [dbo].[ip_s_code_map] AS src
      LEFT JOIN [dbo].[ip_r_code_map] AS dst
        ON src.[map_name] = dst.[map_name] AND src.[code] = dst.[code]
Note that if you process changes while the query runs, you need to use snapshot isolation to avoid the second half of the UNION picking up changes that have already been applied.
Using separate rows for UI & UD means you do not need to double the number of columns in your output or change nullability of columns.
Left join staging to current to find new rows or the new version of existing rows.
Change Capture

UNION ALL
SELECT CASE WHEN dst.etl_batch_id IS NOT NULL THEN 'UD' ELSE 'DE' END AS etl_chg_type,
       src.etl_skey, src.etl_batch_id,
       src.[etl_reject_flag], dst.[etl_reject_flag] AS [etl_cmp_etl_reject_flag],
       src.[map_name], dst.[map_name] AS [etl_cmp_map_name],
       src.[code], dst.[code] AS [etl_cmp_code],
       src.[cond_desc], dst.[cond_desc] AS [etl_cmp_cond_desc],
       src.[full_desc], dst.[full_desc] AS [etl_cmp_full_desc],
       src.[mapped_code], dst.[mapped_code] AS [etl_cmp_mapped_code]
FROM [dbo].[ip_r_code_map] AS src
LEFT JOIN [dbo].[ip_s_code_map] AS dst
  ON src.[map_name] = dst.[map_name] AND src.[code] = dst.[code]) AS cmp
WHERE CASE WHEN etl_chg_type NOT IN ('IN', 'DE')
            AND ([etl_reject_flag] = [etl_cmp_etl_reject_flag])
            AND ([cond_desc] = [etl_cmp_cond_desc]
                 OR ([cond_desc] IS NULL AND [etl_cmp_cond_desc] IS NULL))
            AND ([full_desc] = [etl_cmp_full_desc])
            AND ([mapped_code] = [etl_cmp_mapped_code])
       THEN 1 ELSE 0 END = 0
Left join current to staging to find deleted rows or the old version of existing rows.
Use of CASE helps guard against accidental NULL-unsafe comparisons
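The same two-left-joins-plus-union pattern can be demonstrated end to end on a toy table. This sketch runs the query in SQLite via Python, with surrogate keys and most comparison columns stripped out for brevity; the table and column names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- r = current (yesterday's) repository, s = today's staging
    CREATE TABLE r (map_name TEXT, code TEXT, full_desc TEXT);
    CREATE TABLE s (map_name TEXT, code TEXT, full_desc TEXT);
    INSERT INTO r VALUES ('Delivery Mode', '4', 'Written'),
                         ('Delivery Mode', '9', 'Not Applicable');
    -- Today: code 4 was updated, code 9 deleted, code 5 inserted.
    INSERT INTO s VALUES ('Delivery Mode', '4', 'Written (paper)'),
                         ('Delivery Mode', '5', 'Email');
""")

# Same shape as the generated query: staging-to-current finds IN/UI,
# current-to-staging finds DE/UD, then a null-safe filter drops
# update pairs whose non-key columns are all unchanged.
changes = con.execute("""
    SELECT etl_chg_type, code FROM (
        SELECT CASE WHEN dst.code IS NOT NULL THEN 'UI' ELSE 'IN' END
                   AS etl_chg_type,
               src.code AS code, src.full_desc AS a, dst.full_desc AS b
        FROM s AS src LEFT JOIN r AS dst
          ON src.map_name = dst.map_name AND src.code = dst.code
        UNION ALL
        SELECT CASE WHEN dst.code IS NOT NULL THEN 'UD' ELSE 'DE' END,
               src.code, src.full_desc, dst.full_desc
        FROM r AS src LEFT JOIN s AS dst
          ON src.map_name = dst.map_name AND src.code = dst.code
    ) AS cmp
    WHERE NOT (etl_chg_type IN ('UI', 'UD')
               AND (a = b OR (a IS NULL AND b IS NULL)))
    ORDER BY code, etl_chg_type
""").fetchall()
print(changes)  # [('UD', '4'), ('UI', '4'), ('IN', '5'), ('DE', '9')]
```

Each update surfaces as a UD/UI pair sharing a key, while pure inserts and deletes appear once, mirroring the example output on the next slide.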
Change Capture
- First batch is all inserts
- Subsequently rows may be updated or deleted, and new rows inserted
- Outgoing data is also present
- Useful to join UI & UD to see the effect of an update
etl_batch_id  etl_skey   etl_chg_type  etl_reject_flag  map_name       code  cond_desc  full_desc       mapped_code
102274        276360026  IN            0                Delivery Mode
102274        276360027  IN            0                Delivery Mode  1     Y          Face-to-face    1
102274        276360028  IN            0                Delivery Mode  2     Y          Telephone       2
102274        276360029  IN            0                Delivery Mode  3     Y          Telehealth      3
102274        276360030  IN            0                Delivery Mode  4     Y          Written         4
102274        276360031  IN            0                Delivery Mode  9     Y          Not Applicable  9
129484        276360030  UD            0                Delivery Mode  4     Y          Written         4
129484        276360030  UI            0                Delivery Mode  4     N          Written         4
129484        276360031  DE            0                Delivery Mode  9     Y          Not Applicable  9
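The "join UI & UD" trick mentioned above can be sketched with the rows from this table: pairing the two change types on skey shows the before/after of each update. The row dictionaries below are hand-built for illustration.

```python
# Batch 129484 from the example above: skey 276360030 had cond_desc
# change from Y to N (a UD row for the old version, UI for the new).
rows = [
    {"etl_skey": 276360030, "etl_chg_type": "UD", "cond_desc": "Y"},
    {"etl_skey": 276360030, "etl_chg_type": "UI", "cond_desc": "N"},
]

old = {r["etl_skey"]: r for r in rows if r["etl_chg_type"] == "UD"}
new = {r["etl_skey"]: r for r in rows if r["etl_chg_type"] == "UI"}

# Join UD to UI on skey: the pair gives (before, after) per updated row.
diffs = {k: (old[k]["cond_desc"], new[k]["cond_desc"])
         for k in old.keys() & new.keys()}
print(diffs)  # {276360030: ('Y', 'N')}
```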
An Aside: Naming Schemes
- I don’t use tbl_, vw_, sp_, etc.
- Names tend to be right-trimmed, so the left-most characters are precious
- Names tend to be prefix-sorted, so the left-most characters should group
- You can fairly easily work it out without it
- If I decide to replace a table with a view (it happens!) it looks silly to have tbl_ in the view name – the corollary is that “what it is” is less relevant than “how do you use it” – e.g. does it return a row set on SELECT?
- I start the name with a functional group, e.g. ip_, op_, rt_, pl_, ph_, wl_, ad_
- I use a letter to indicate the workflow role: c = hand-entered config, s = staging (truncated daily), r = repository (current records), h = history (change tracking records), f = fact, d = dimension, a = aggregation, m = materialized view (refreshed daily)
I try to name columns consistently across systems
Standardised Jobs - Merging
Surrogate Keys
- A globally unique key for all rows
- Allows me to have a single edits table
- Allows easy merging of different data sources
- Makes it easier to identify a row in queries (vs. compound or non-integer keys)
- Issued monotonically – suitable for clustered indexes
- Not used in staging
- Allocated after change capture has identified the new rows that need a new skey
- Can remain attached to the row while cardinality doesn’t change (e.g. into facts but not aggregations)
Edit Messages
Writing data quality messages means you can massage bad data to load it, without losing track of what you originally received
Capturing your understanding of the data in DQ messages helps to provide living documentation of key business concepts
Using change capture, edits can be regenerated only when the source data changes
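The change-driven regeneration idea can be sketched as a simple filter. This is an assumption about the mechanism, keyed on the skeys that change capture surfaced, not the framework's actual implementation.

```python
def edits_to_regenerate(changed_skeys, current_edits):
    """Sketch: only rows whose skey appears in today's change set need
    their edits re-evaluated; all other rows keep yesterday's messages
    (and any acknowledgements attached to them)."""
    return {skey: msg for skey, msg in current_edits.items()
            if skey in changed_skeys}

edits = {276360027: "code not in reference list",
         276360030: "description truncated",
         276360031: None}
# Change capture reported that only skey 276360030 changed today:
todo = edits_to_regenerate({276360030}, edits)
print(todo)  # {276360030: 'description truncated'}
```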
Other Logging
Scheduler logs job output and error streams to database
Jobs can log details (e.g. file name processed, date range, etc) to journal
Jobs can create log messages but mostly don’t, relying on stdout/stderr
Web Job Monitor
- Web Job Monitor is underdeveloped: it should be integrated with the scheduler and allow more job control
- Provides a web-based listing of job outcomes and logs
Web Job Monitor
- Status, run times, rows processed, etc.
- Links to a detail screen for each job (next slide)
Web Job Monitor
Edits can be acknowledged to hide them from data quality reports
Acknowledgement persists until row changes in source (as detected by change capture)
Scheduler
Uses Oddjob http://www.rgordon.co.uk/oddjob/
- Talend does not easily express non-trivial dependencies for job sequences, e.g. six independent reference tables can run in parallel, but then the transaction table load must wait for all of them to succeed
- Talend also makes you pay for a scheduler
- SQL Server Agent requires a SQL Server licence
- Task Scheduler is not sophisticated enough
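The six-reference-tables example is a classic fan-out/fan-in dependency. A minimal sketch of the shape (with a placeholder standing in for the real Talend job invocations; names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def load(table):
    # Placeholder for launching a real Talend job and waiting on it.
    return f"{table} loaded"

reference_tables = ["ref_a", "ref_b", "ref_c", "ref_d", "ref_e", "ref_f"]

# Fan out: six independent reference loads run in parallel.  pool.map
# does not return until every one has finished, so an exception in any
# load propagates here and blocks the fan-in step.
with ThreadPoolExecutor(max_workers=6) as pool:
    results = list(pool.map(load, reference_tables))

# Fan in: the transaction table load starts only after all succeeded.
transactions = load("transactions")
```

This is the kind of structure Oddjob expresses directly as nested sequential and parallel jobs, which plain Task Scheduler cannot.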
Job Stream Structure

This node is a link to the underlying configuration file DI_All.xml and needs to be running for the daily schedule to be active.
This sequence is run (and keeps running) to load the configuration and the schedule.
This job represents the daily schedule: when it is a blue dot it is “running” and will trigger.
Sequential job containing the phases.
Sequential job representing the first phase.
Sequential job within a phase representing the logical step (stage + merge bookings).
Talend jobs that perform the actual stage and merge processing.
Batch job corresponding to the master job.
Future
- I have broken every rule I have created – it would be nice to reconsider some of those
- The change capture query builder could be turned into a component; the hard part is getting the schema to load into the Talend designer
- Edit messages should be checked and reloaded if the edit message changes
- Web Job Monitor needs love