
SAS Bulk Loader of RDBMS data

Vlad Svirsky, KVtech Corporation

SAS Alliance Partner

SAS Bulk Loader of RDBMS data
Genesis and Design of the system
SAS Bulk Loader (SBL) Main Features
SBL System Design and Implementation Overview
    Figure 1: SBL System Design
SBL modules
    1. driver-list.pl module
        Figure 2: driver-list.pl module
    2. ftp-files.pl module
        Figure 3: Contents of tables-partitions file generated by ftp-files.pl module
        Figure 4: File tree associated with ftp-files.pl module
    3. gen-sas-code.pl module
        Figure 5: File tree associated with gen-sas-code.pl module
        SAS code example 1
    4. gen-sas-sessions.pl module
        Figure 6: Process Flow for a runid
    5. gen-code-cluster-create-spds-tables.pl module
Conclusion
Contact Information

Genesis and Design of the system

An enterprise maintains an extremely large Oracle enterprise data warehouse (EDW) containing hundreds of tables, many of which are partitioned because of the tables’ sizes. The EDW is refreshed with new data on a monthly basis. Also, quite often there are changes to the EDW schema (new tables added, columns added/removed from existing tables, etc.).

This enterprise also needs to easily build and maintain several SAS analytics warehouses/data marts for different analysis purposes, each of which is composed of a different subset of tables from the EDW. The regular monthly EDW refresh process is staggered over a relatively prolonged period of time, so different sets of refreshed tables are completed and become available for queries on different dates.

The above-described scenario of EDW operation forces each SAS application that uses SAS/ACCESS to Oracle to constantly verify and modify its code. Many applications – not just SAS – do the same things, usually competing for the same data. One result is a very large number of connections to the Oracle server, producing a large load on that server.

Large Data Sets, NESUG 2011


All the above induced the Oracle EDW operational staff to look for a solution that produces a dump of all the Oracle EDW tables into flat, delimited, compressed data files, and places those files on a separate ftp server for use by any application needing that data. Along with the data files, files containing the data definition language (DDL) for each table are generated. There is one DDL file per table: it corresponds either to the set of data files representing a partitioned table, or to the single data file of a non-partitioned table.

That presented an opportunity to design and develop a completely automated system that loads the above data files into SAS, featuring:

- choice of any subset of the EDW tables (dumped files) to be loaded into a specific SAS analytic warehouse
- incremental loads, since the refreshed tables arrive in the dump files on a staggered schedule
- use of SAS Scalable Performance Data Server (SAS/SPDS) storage rather than a standard SAS library because of the enormous volume of EDW data, even though the system can handle either destination
- implementation of a star schema in the SAS/SPDS library that represents any meaningful subset of the RDBMS (Oracle as an example)

SAS Bulk Loader (SBL) Main Features

SAS Bulk Loader (SBL) is a completely automated batch system for loading EDW tables’ data, dumped into flat files from any RDBMS, into a SAS data mart: either SAS/SPDS or a base SAS library. The SBL preserves the RDBMS schema’s major components (table structure, constraints, and indexes) by recreating them in SAS/SPDS or a SAS standard library.

In more detail, the main features of the SBL system include the following:

- the system is completely driven by a dynamic list of tables that will constitute a SAS/SPDS data mart for specific analytical purposes
- automated SAS code generation for loading data into SAS from RDBMS by parsing its metadata (DDL files) dumped from the RDBMS. This eliminates the need to ever manually re-write code because of changes to the schema, etc.
- automatically generates a number of concurrent SAS sessions to execute the generated SAS data step code and necessary procedures that load data into SAS
- automated workload balancing between the concurrent SAS sessions based on equal distribution of data volume for a parameterized number of spawned SAS sessions
- automated building of SAS indexes that were present in the RDBMS
- dynamic clustering of SAS/SPDS member tables corresponding to the RDBMS table partitions
- detailed reporting on data exceptions when found by SAS during load of the data
- comprehensive chain of control between RDBMS EDW tables and SAS tables
- incremental build of a SAS data mart as the RDBMS tables become available for loading into SAS
- designed for optimum performance of rebuilding a SAS data mart from RDBMS warehouse on a periodic basis


- SBL significantly reduces elapsed run time versus using the ‘SAS/Access’ ETL process
- scheduled batch runs

The SBL system reduced elapsed time from days to hours versus using ‘SAS/Access to Oracle’ extracts to build the SAS data mart, benchmarked against over 0.5 TB of data.

SBL System Design and Implementation Overview

The SBL system is written in Perl and consists of a number of modules, each implementing a specific set of functions of the SBL software. The SBL is driven by a global parameter file that is processed by each module at execution time. This file enables creation and execution of any completely customizable instance of the entire process. Thus, any number of SBL instances can be run concurrently. A sample of the parameters in the parameter file includes:

- which ftp server to connect to for data retrieval;
- paths to the lists containing tables still to be loaded into SAS, and tables already loaded into SAS;
- root directory of the file tree on which a particular instance of SBL will operate;
- etc.

In addition to the global parameter file, there are two files (“to-load-list” and “in-sas-list”) containing information on:

- tables still to be loaded
- tables already loaded

The incremental population of the SAS library is implemented by having every function in the SBL operate on the difference between the “to-load-list” and the “in-sas-list”. When the difference between the two lists is nil, the table population process is complete, and the SBL run automatically ends.
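The list-difference idea can be illustrated with a minimal sketch. It is written in Python rather than the Perl of the actual system, and it assumes each list is simply a sequence of schema-qualified table names. The function name and the MEMBER and PROVIDER tables are hypothetical; only MEDCLM_MTH_LINE appears in this paper.

```python
def tables_pending(to_load, in_sas):
    """Return the tables in to_load that are not yet in in_sas, keeping to_load order."""
    loaded = set(in_sas)
    return [table for table in to_load if table not in loaded]


# Hypothetical list contents; the real files may carry more fields per table.
to_load = ["EDW_PROD.MEDCLM_MTH_LINE", "EDW_PROD.MEMBER", "EDW_PROD.PROVIDER"]
in_sas = ["EDW_PROD.MEMBER"]

pending = tables_pending(to_load, in_sas)
run_complete = not pending  # an empty difference means the load is finished

print(pending)
print("run complete:", run_complete)
```

Each module would work only on `pending`, so a table that has already reached SAS is never processed twice.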

Figure 1: SBL System Design


SBL modules

The modules that comprise the SBL are described in the order in which they are called and executed. Each module works with a set of input files and output files in directories that are identified by the runid number. There is a set of global files in the $SBL-root directory for the entire application. The files are:

- parm-list
- to-load-list
- in-sas-list
- ftp-runid
    o the numeric value of the runid that is stored in this file is used to identify the subdirectories/file-tree that the current run of the software is using during processing
    o this numeric value is incremented for each new ftp run (new set of tables to be processed). Thus, when the process of populating a SAS data mart is completed, this value designates how many ftp runs it took to complete the process.
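A minimal sketch of how such a run counter could work is shown here in Python (the actual system is Perl). It assumes the ftp-runid file holds a single integer and that each run gets ftprun-N and sasrun-N subdirectories, as the paths in the SAS code example later in the paper suggest. The function name and the use of a temporary directory as the root are illustrative assumptions.

```python
import tempfile
from pathlib import Path


def start_new_run(root):
    """Increment the counter in <root>/ftp-runid and create that run's directories."""
    counter = Path(root) / "ftp-runid"
    # First run starts at 1; later runs add 1 to the stored value.
    runid = int(counter.read_text().strip()) + 1 if counter.exists() else 1
    counter.write_text(f"{runid}\n")
    for sub in (f"ftp-repository/ftprun-{runid}",
                f"sas-code-repository/sasrun-{runid}"):
        (Path(root) / sub).mkdir(parents=True, exist_ok=True)
    return runid


root = tempfile.mkdtemp()  # stand-in for the SBL root directory
first = start_new_run(root)
second = start_new_run(root)
print(first, second)
```

After the data mart is fully populated, the value left in the counter file is the number of ftp runs that were needed.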


1. driver-list.pl module

The “to-load-list” file can be created either by using any editor and entering the data, or by reading data from any file. The enterprise for which SBL was originally developed chose to create an additional table in the Oracle EDW that contains the information for each enterprise application as to which tables would be available for its use. That specialized table is also dumped into a flat delimited file during the monthly EDW tables dump. The driver-list.pl module was created to process the file containing that Oracle table and populate the “to-load-list” file. However, for some other instances of SBL to run, an editor was used to create the “to-load-list” file.
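The filtering step performed by a module like driver-list.pl can be sketched as follows, in Python rather than Perl. The layout of the dumped driver table is not given in the paper, so this sketch assumes two delimited fields per record (application name, then table name) and a "|" delimiter. The layout, the application names, and the function name are all hypothetical.

```python
def build_to_load_list(lines, application, delimiter="|"):
    """Collect the tables assigned to one application, de-duplicated in file order."""
    tables = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) < 2:
            continue  # skip blank or malformed records
        app, table = fields[0].strip(), fields[1].strip()
        if app == application and table not in tables:
            tables.append(table)
    return tables


# Hypothetical dump of the driver table.
dump = [
    "CLAIMS_MART|EDW_PROD.MEDCLM_MTH_LINE\n",
    "FINANCE_MART|EDW_PROD.LEDGER\n",
    "CLAIMS_MART|EDW_PROD.MEMBER\n",
    "\n",
    "CLAIMS_MART|EDW_PROD.MEMBER\n",
]
to_load = build_to_load_list(dump, "CLAIMS_MART")
print(to_load)
```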

Figure 2: driver-list.pl module

2. ftp-files.pl module

Each dump of a refreshed table from Oracle EDW results in the creation of the following group of related files:

- one trigger file indicating that dump of the entire table is complete
- one data file containing table-data per non-partitioned table, OR one data file for each partition of a partitioned table
- one file containing DDL; there will be only one file for an entire table, regardless of whether or not it is partitioned (since the data definitions are the same for all partitions in a partitioned table)
- one log file for each non-partitioned table, or one for each partition, that contains the number of records that was dumped

The ftp-files.pl module searches for all the above files within each group that represents a table. It transfers all these files and builds a ‘hash of arrays’ data structure that links the DDL file with all non-zero length data files representing every partition of a partitioned table, or one file for a non-partitioned table.
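A ‘hash of arrays’ in Perl corresponds to a dictionary of lists in Python, and the linking step can be sketched that way. The data-file name below follows the pattern visible in the SAS code example later in the paper (TABLE.PARTITION.dat.gz). The ".ddl" suffix, the naming of non-partitioned data files, the MEMBER table, and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def link_ddl_to_data(files):
    """files: iterable of (filename, size_in_bytes). Returns {ddl_file: [data_file, ...]}."""
    ddl_by_table = {}
    data_by_table = defaultdict(list)
    for name, size in files:
        table = name.split(".", 1)[0]  # text before the first dot is the table name
        if name.endswith(".ddl"):
            ddl_by_table[table] = name
        elif name.endswith(".dat.gz") and size > 0:  # zero-length data files are ignored
            data_by_table[table].append(name)
    return {ddl: sorted(data_by_table[table]) for table, ddl in ddl_by_table.items()}


transferred = [
    ("MEDCLM_MTH_LINE.ddl", 900),
    ("MEDCLM_MTH_LINE.MEDCLMLINE_P200804.dat.gz", 5000),
    ("MEDCLM_MTH_LINE.MEDCLMLINE_P200805.dat.gz", 0),
    ("MEMBER.ddl", 300),
    ("MEMBER.dat.gz", 1200),
]
links = link_ddl_to_data(transferred)
for ddl, data_files in links.items():
    print(ddl, "->", data_files)
```

Keying the structure by DDL file means the code generator can parse each table definition once and reuse it for every partition.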

Along with the population of the ftp-runid subdirectory of the FTP Repository, the ftp-files.pl module creates one file per ftp transmission in the sas-runid SAS Code Repository that is called tables-partitions.txt. The tables-partitions file links each DDL file with all its associated data file(s). It implements the hash of arrays data structure that is depicted in the figure below:

Figure 3: Contents of tables-partitions file generated by ftp-files.pl module

The figure below represents the files that are created by ftp-files.pl module for runid=1:


Figure 4: File tree associated with ftp-files.pl module

3. gen-sas-code.pl module

This module consists of many functions which perform the following:

- parses the DDL and generates a set of intermediate files:

  1) <table-or-partition-name>.attrib-stmt: contains a list of all variables and attributes that will be used in a SAS attribute statement to define all variables to be loaded into a SAS table

  2) <table-or-partition-name>.input-stmt: contains list input style of the variables and informats that constitute the construction of the SAS input statement

  3) <table-or-partition-name>-data-exceptions.txt: text file which will contain the contents of the PDV and Input Buffer dump if SAS finds invalid data

  4) <table-or-partition-name>.ora-indexes: contains Oracle DDL for defining all indexes

  5) <table-or-partition-name>.indexes-stmt: contains index create statement(s) which will be used by SAS proc datasets to create indexes if this file is non-zero length. If this file is zero length, no SAS proc datasets code will be generated.

  6) <table-or-partition-name>.sas: contains generated SAS code which consists of three components:

     - SAS data step that reads a corresponding raw data file and builds a permanent SAS table in a specified library
     - SAS proc datasets which creates the indexes for the SAS table
     - SAS macro that verifies that all the attributes that existed in the Oracle table were successfully recreated in the SAS table:
         o number of rows
         o number of columns
         o indexes

     A self-documenting program header is generated which contains:

     - SAS table name
     - number of observations
     - number of variables
     - number of indexes
     - size of table in GB
     - number of data partition files that will be created in SAS/SPDS for that table (is calculated and varies with the size of the table)
     - size of each data partition (is calculated and varies with the size of the table)
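The DDL-to-SAS translation behind the attrib-stmt and input-stmt files can be sketched as follows. This is a Python illustration, not the module's Perl code, and the type mapping is an assumption: character columns become SAS character variables of the declared width, NUMBER columns become 8-byte numerics read with a generic numeric informat, and other Oracle types (dates, for example) are left out because the paper does not describe how they are dumped. The column names are hypothetical.

```python
import re

# Matches lines such as "LAST_NAME VARCHAR2(30 BYTE)," or "MEMBER_ID NUMBER(12,2),".
COLUMN = re.compile(
    r"^\s*(\w+)\s+(VARCHAR2|CHAR|NUMBER)\s*(?:\(\s*(\d+)[^)]*\))?",
    re.IGNORECASE,
)


def ddl_to_sas_fragments(ddl_lines):
    """Return (attrib_entries, input_entries) for the columns found in ddl_lines."""
    attrib, inputs = [], []
    for line in ddl_lines:
        match = COLUMN.match(line)
        if not match:
            continue  # CREATE TABLE line, closing parenthesis, unsupported types
        name, ora_type, width = match.group(1), match.group(2).upper(), match.group(3)
        if ora_type in ("VARCHAR2", "CHAR"):
            n = int(width or 1)
            attrib.append(f"{name} length=${n}")
            inputs.append(f"{name} :${n}.")
        else:  # NUMBER: stored as an 8-byte SAS numeric
            attrib.append(f"{name} length=8")
            inputs.append(f"{name} :best32.")
    return attrib, inputs


ddl = [
    "CREATE TABLE EDW_PROD.MEMBER (",
    "  MEMBER_ID NUMBER(12),",
    "  LAST_NAME VARCHAR2(30 BYTE),",
    "  GENDER CHAR(1)",
    ")",
]
attrib_entries, input_entries = ddl_to_sas_fragments(ddl)
print("\n".join(attrib_entries))
print("\n".join(input_entries))
```

Writing the fragments to separate files, as the module does, lets the generated data step pull them in with %include, so one DDL parse serves every partition of a table.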

The figure below represents the files that are created by gen-sas-code.pl module for runid=1:

Figure 5: File tree associated with gen-sas-code.pl module


The SAS code generated by SBL for one partition (April 2008) of the partitioned table MEDCLM_MTH_LINE is included below:

SAS code example 1:

/*************************************************************************
* This SAS code was generated by:
* /home/sassch/sas-bulk-loader/gen-sas-code.pl
* SAS Data Set Name: MEDCLM_MTH_LINE_200804 Size: 4.213 GB
* Observations: 17008254 Variables: 75 Indexes: 1
* Data Partition File Size: 2.0 GB Number of DPF files: 3
*************************************************************************/
options errors=max;
filename dat_expt "/sasdata3/sas-bulk-loader/sas-code-repository/sasrun-1/MEDCLM_MTH_LINE.MEDCLMLINE_P200804-data-exceptions.txt";

filename dat_read PIPE "/usr/bin/gunzip -c /sasdata3/sas-bulk-loader/ftp-repository/ftprun-1/MEDCLM_MTH_LINE.MEDCLMLINE_P200804.dat.gz" lrecl=4096;

data spdstest.MEDCLM_MTH_LINE_200804(label="<SCHEMA.TABLE-NAME>=EDW_PROD.MEDCLM_MTH_LINE<PARTITION>=MEDCLMLINE_P200804" partsize=2048);

    attrib
        %include "/sasdata3/sas-bulk-loader/sas-code-repository/sasrun-1/MEDCLM_MTH_LINE.attrib-stmt";
    ;
    infile dat_read dlm='a1'X dsd truncover;
    input
        %include "/sasdata3/sas-bulk-loader/sas-code-repository/sasrun-1/MEDCLM_MTH_LINE.input-stmt";
    @;
    file dat_expt;
    if _error_=1 then do;
        put @15 "MEDCLM_MTH_LINE.MEDCLMLINE_P200804" " Record Number: " _N_;
        put @15 "Record Contents:" _infile_;
        put @15 "PDV Contents:" (_all_) (=);
    end;
run;

proc datasets lib=spdstest;
    modify MEDCLM_MTH_LINE_200804(asyncindex=yes);
    %include "/sasdata3/sas-bulk-loader/sas-code-repository/sasrun-1/MEDCLM_MTH_LINE.indexes-stmt";
run;

%cot(libref=spdstest,
     sas_dsname=MEDCLM_MTH_LINE_200804,
     ora_rows=17008254,
     ora_columns=75,
     ora_indexes=1,
     in_sas_list=/sasdata3/sas-bulk-loader/in-sas-list.txt,
     schema_dot_tbname=EDW_PROD.MEDCLM_MTH_LINE,
     sas_runid_repository=/sasdata3/sas-bulk-loader/sas-code-repository/sasrun-1);

As a result of any runid run, a variable number of SAS programs like the example above will be generated by SBL in $SBL_Root/SAS-Code-Repository/sasrun-<[runid-]number>.

4. gen-sas-sessions.pl module

The gen-sas-sessions.pl module combines the generated SAS programs into the number of sessions specified in the global parameter file parm-list to load the tables into the SAS/SPDS library. Because the size of each SAS table and the number of sessions to be created are both calculated and stored within the system, this module groups the SAS code into sessions so that each session processes an almost equal amount of data. Thus all the sessions running concurrently complete execution at approximately the same time, utilizing similar amounts of system resources. Additionally, this module spawns UNIX processes for running each session.
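The paper does not say which balancing algorithm the module uses. One common way to get nearly equal data volume per session is the greedy "largest first" heuristic: sort the programs by table size and always give the next one to the session with the least data so far. The Python sketch below illustrates that heuristic only. Apart from the 4.213 GB partition taken from the code example, the table names and sizes are hypothetical.

```python
import heapq


def balance_sessions(table_sizes_gb, n_sessions):
    """Assign each program to the currently lightest session, largest tables first."""
    heap = [(0.0, i) for i in range(n_sessions)]  # (data assigned so far, session index)
    sessions = [[] for _ in range(n_sessions)]
    ordered = sorted(table_sizes_gb.items(), key=lambda kv: (-kv[1], kv[0]))
    for table, size in ordered:
        load, i = heapq.heappop(heap)
        sessions[i].append(table)
        heapq.heappush(heap, (load + size, i))
    return sessions


sizes = {
    "MEDCLM_MTH_LINE_200804": 4.213,
    "MEMBER": 3.0,
    "PROVIDER": 1.5,
    "PLAN": 1.0,
}
sessions = balance_sessions(sizes, 2)
for number, programs in enumerate(sessions, start=1):
    total = sum(sizes[p] for p in programs)
    print(f"session {number}: {programs} ({total:.3f} GB)")
```

With these sizes the two sessions end up with 5.213 GB and 4.5 GB, so neither session is left idle for long while the other finishes.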

The flow of the entire process for one runid is depicted in the figure below:

Figure 6: Process Flow for a runid


5. gen-code-cluster-create-spds-tables.pl module

This module generates code to create a SAS/SPDS dynamic clustered table for every Oracle partitioned table, provided that the ‘chain-of-trust’ SAS macro has reported that each member table contains the same attributes as those of the corresponding partition in the Oracle partitioned table. Otherwise it reports the failure of the ‘chain-of-trust’ process.
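The gating rule described above (cluster a table only when every member passed verification) can be sketched as follows in Python. How the verification results are recorded is not described in the paper, so the (table, member, verified) tuples and the function name are assumptions, and the sketch stops at the decision rather than emitting any SAS/SPDS code. The 200805 member is hypothetical.

```python
def plan_clusters(members):
    """members: list of (table, member_table, verified). Returns (ready, failed) dicts."""
    by_table = {}
    for table, member, verified in members:
        by_table.setdefault(table, []).append((member, verified))

    ready, failed = {}, {}
    for table, rows in by_table.items():
        unverified = [member for member, ok in rows if not ok]
        if unverified:
            failed[table] = unverified  # report these instead of clustering
        else:
            ready[table] = [member for member, _ in rows]
    return ready, failed


results = [
    ("MEDCLM_MTH_LINE", "MEDCLM_MTH_LINE_200804", True),
    ("MEDCLM_MTH_LINE", "MEDCLM_MTH_LINE_200805", True),
    ("RXCLM_MTH_LINE", "RXCLM_MTH_LINE_200804", True),
    ("RXCLM_MTH_LINE", "RXCLM_MTH_LINE_200805", False),
]
ready, failed = plan_clusters(results)
print("cluster:", ready)
print("report failure:", failed)
```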

Conclusion

The SBL software utilizes the SAS data step’s optimized ability to read flat files, SAS procedures to build constraints/indexes, and clustering of SAS/SPDS member tables for each partitioned Oracle EDW table, thus re-creating the RDBMS star schema in a SAS/SPDS library or a standard SAS library.

Since it is a complete, dynamic code generator, no manual SAS coding is ever required regardless of how the RDBMS schema changes over time.

Contact Information:

Author: Vlad Svirsky, Principal Consultant
KVtech Corporation
247 Parkview Avenue
Unit 6T
Bronxville, NY 10708

Email: [email protected]
Phone: 914-274-8848
