Teradata Best Practices

TABLE OF CONTENTS

1. Introduction
   1.1 Purpose
   1.2 Scope of the Document
2. Tips for Teradata Developers
   2.1 Delete vs Drop a Table
   2.2 Product Joins
   2.3 Join Columns of Same Data Type
   2.4 Query Values Limitation
   2.5 Constraint on Secondary Indexes
   2.6 Pre-requisites for Table Creation
   2.7 Test on Smaller Sample Tables
   2.8 Don't Select It If You Don't Need It
   2.9 Use 'Union All' Instead of Just 'Union'
   2.10 Query Termination
   2.11 Running Jobs in Off-Peak Hours
   2.12 Understanding CPU
   2.13 Identifying CPU Usage
3. Explains Against All Queries
   3.1 Visual Explain
4. Coding Guidelines (ANSI)
   4.1 Datatypes in Join Conditions
   4.2 Cross Joins (Product Joins)
   4.3 Eliminate Unnecessary Joins
   4.4 Derived Tables
   4.5 Use Derived Tables in Updates
   4.6 Temp Tables
   4.7 Updating Most Rows in a Table
   4.8 Correlated Subqueries
   4.9 In/Not In Subqueries
   4.10 Implement Distinct as Group By
   4.11 Tester Queries
5. Coding Guidelines
   5.1 CURRENT_DATE and DATE
   5.2 ROW_NUMBER and CSUM
   5.3 ANSI RANK and Teradata RANK
   5.4 Functions & Case in Join Conditions
   5.5 Datatypes in Join Conditions
   5.6 Join Optimisation
   5.7 Index Selection
6. Statistics
7. Performance Optimization
8. Conclusion

1. Introduction

This document provides best practice techniques for Teradata development, to help maintain optimized scripts across the Teradata environment.

Purpose

The purpose of this paper is to provide a reference point for developers working within a Teradata environment. It aims to assist Teradata developers in preparing code for Unit Testing, and in anticipating the types of issues that may arise during code reviews.

It also aims to achieve consistency of build across a Teradata Warehouse by providing standards and guidelines.

Scope of the Document

This document covers standards and guidelines for Teradata build activities (development). This includes Teradata SQL scripting and the JCL pertinent to Teradata load jobs on mainframes.

2. TIPS for Teradata Developers

This section is a foundation for Teradata development, offering guidance to enhance and enrich SQL code and avoid performance bottlenecks.

Delete vs Drop a Table

When a table needs to be emptied and then reloaded, it is more efficient to delete the rows in the table than to drop the table and re-create it, as locks on the Data Dictionary are avoided.
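As a minimal sketch of the two approaches (database and table names here are hypothetical):

-- Preferred: empty the table in place, then reload; no Dictionary locks
Delete from StageDB.Daily_Sales all;

-- Avoid: drop/re-create takes restrictive locks on the Dictionary
Drop table StageDB.Daily_Sales;
Create multiset table StageDB.Daily_Sales
( Sale_Id integer not null
, Sale_Dt date
, Amt decimal(18,2)
)
Primary index (Sale_Id);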


Product Joins

Product joins occur when Teradata compares every row of the first table to every row of the second table. This process can use huge amounts of CPU and spool.

Product joins are created when a join condition is based on an inequality, or is missing entirely. The latter commonly happens when an alias has been used to identify a table but has not been used consistently throughout the SQL: Teradata then believes a reference is being made to another copy of the table, with no join condition placed on that other table.

Sometimes a product join is the appropriate option, for example when Teradata believes it needs to compare a small number of rows from one table to another. However, if the statistics on a table are incorrect, Teradata may choose a product join when it should not.

Join columns of same data type

Ideally the data types of columns should match when joining tables, because:

o The join is inefficient when a conversion is required.

o Teradata is unable to compare the demographics of columns of different data types, even though statistics may have been collected, so the resulting join plan may not be the best choice.

o Depending on the table size, it may be more efficient to load the data from one table into another so that both tables' data types match.


When a join is attempted on part of a column (e.g. via SUBSTRING), Teradata does not know the data demographics even if statistics have been collected for the column. The advice is to check the Explain output before populating the target table. If performance is a concern, the data can be loaded into a temporary table before loading into the target table.

Query Values Limitation

It is tempting to cut and paste values into SQL as shown below:

Select * from … Where column in (val1, val2, val3, …, valn)

However, this is inefficient because Teradata does not easily share this work across all of its processors. It is therefore better to insert the values into a table and code:

Where column in (select column from Tablename);

It is recommended to use this approach when there are more than 50 values in the query.

Constraint on Secondary Indexes

If a table requires a secondary index, create the index after data has been loaded into the table, so that the load process completes first.

If a large Update or Delete is being performed, remove all secondary indexes before applying the change. Once the change is complete, re-create the secondary indexes.
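A sketch of this sequence, with hypothetical table and index names:

-- Create the secondary index only after the load has completed
Create index idx_cust (Cust_Id) on StageDB.Daily_Sales;

-- For a large Update/Delete: drop the index, apply the change, re-create it
Drop index idx_cust on StageDB.Daily_Sales;
Update StageDB.Daily_Sales set Amt = 0 where Sale_Dt < date '2002-01-01';
Create index idx_cust (Cust_Id) on StageDB.Daily_Sales;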


Pre-requisites for table creation

Since the default form of table is a 'Set' table, most developers use it without analysis; however, a Volatile table is much better for temporary storage, because it is the only table type which does not take restrictive locks on the Data Dictionary.

Listed below are the 4 different table types and some of their characteristics:

Set Table: Duplicate rows not allowed. Teradata default.

Multiset Table: Duplicate rows allowed.

Volatile Table: Available as long as the session is alive. Rows persist beyond the transaction when the table has been defined as 'On Commit Preserve Rows'. Statistics cannot be collected on a volatile table.

Global Temp Table: Same concept as a Volatile table; the difference is that the definition is stored in the Data Dictionary and the data is deleted at the end of the session. Statistics can be collected.

Use of 'Create Table As' can be useful to create new tables based on existing table(s). Be aware that Teradata will create the new table using defaults, which might not be how the developer actually wishes the table to be structured, and the defaults may vary from release to release. Best practice is to explicitly state all column attributes (NOT NULL etc.) and indexes (UNIQUE).
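For illustration (names hypothetical), a volatile work table and an explicitly specified 'Create Table As' might look like this:

Create volatile multiset table work_sales
( Sale_Id integer not null
, Amt decimal(18,2)
)
Primary index (Sale_Id)
On commit preserve rows;

-- 'Create Table As' with attributes and index stated explicitly
Create table StageDB.Sales_Copy as
( select Sale_Id, Amt from StageDB.Daily_Sales )
with data
unique primary index (Sale_Id);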


Test on smaller sample tables

The amount of CPU time used to process a query depends on a number of factors, but in many cases CPU usage will be proportional to the size of the table. If (iterative) testing is conducted on a small extract of the live table, the amount of CPU used will be reduced.

Don't select it if you don't need it

Processing will be more efficient if the SQL excludes rows through a specific clause in the Where condition, rather than relying on the join condition to exclude them from the final result. For example, if some rows have all zeros in a column because they are a special case, don't rely on them failing the join condition to be excluded. In this particular case, not only is it inefficient to compare the contents of all the rows containing zeros, but they can also skew the data: instead of the work being spread across all of the Teradata processors, it is concentrated on a single processor and takes much longer to run.

Use 'Union All' instead of just 'Union'

When creating a union of two sets of rows, the default form of the statement will check for the presence of duplicate rows, which is unnecessary if duplicates are acceptable. In most situations it is known that duplicates cannot possibly exist, or, if they do exist, it is correct to select them. In those cases use 'Union All', which accepts that duplicates may exist and skips the duplicate-elimination step.
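A minimal illustration with hypothetical tables; the second form skips the duplicate-elimination step:

-- Checks for and removes duplicate rows (extra sort step)
Select Cust_Id from DB1.Sales_2006
Union
Select Cust_Id from DB1.Sales_2007;

-- No duplicate check; cheaper when duplicates are acceptable or cannot occur
Select Cust_Id from DB1.Sales_2006
Union all
Select Cust_Id from DB1.Sales_2007;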


Query Termination

If a query is not performing as expected, consider aborting it. However, killing a job which is performing large Updates or Inserts on a non-empty table can cause rollbacks, which in turn cause higher CPU consumption.

Running jobs in off-peak hours

The Teradata system is less heavily used at night and at weekends. Using that spare capacity increases the availability of resources during peak time.

Understanding CPU

To understand the scale of CPU-related problems, it is important to know the CPU resources available on the 5350 Teradata system. Assume the operational machine (NCR 5350) has 24 nodes, and each node has 2 CPUs. Therefore, in any given second there are 48 CPU seconds of processing power available to share. As the number of poorly written queries increases, inefficient SQL takes more resources, takes longer to complete, and increases congestion, impacting everyone.

To give an example of how much CPU can be used by even a simple and efficient query on a large table:

select count(*) from t_bcard_restdb.bc_history
where account_status_code = 'zz';

This resulted in a full-table scan of a table which has 667 million rows and occupies 166 GB. The query used 820 CPU seconds.

Identifying CPU usage

The amount of CPU used by each user can be found from A_Usage_base.UsageHistory, which is accessible to all users.

3. Explains Against All Queries

When developing queries, always perform an Explain on the query before you run it. Explains give a lot of information about the way the optimizer will execute a query. To perform an Explain, simply add the EXPLAIN keyword prior to your select/insert/update/delete statement and execute it.

The Explain statement is used to aid in identifying potential performance issues; it analyses the SQL and breaks it down into its low-level processing steps. Unfortunately the output can be very difficult for an untrained person to understand, but there are some key points to recognize: Confidence Levels and Product Joins.

Confidence Level

Teradata attempts to predict the number of rows which will result at each stage in the processing, and will qualify the prediction with a confidence level as shown below:

No Confidence – There are no statistics available.

Low Confidence – Statistics are difficult to use precisely.

High Confidence – The optimizer is sure of the results based on the stats available.

Explain Select * from DB1.Table1;

It will produce output like this:

1) First, we lock DB1.Table1 for access.

2) Next, we do an all-AMPs RETRIEVE step from DB1.Table1 by way of an all-rows scan with no residual conditions into Spool 1, which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 141 rows. The estimated time for this step is 0.15 seconds.

3) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.

-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.15 seconds.

Whilst this output is not all that user friendly (especially for complex queries), a number of useful things are present in it, including the total estimated time to run the query, the join methods used, and the confidence levels the optimizer has used while calculating the number of rows and elapsed time that the query steps will take.

The best way to use Explain is to compare how your changes affect the way a query will run. Do this by changing the query a little at a time and observing how your changes affect the complexity and total estimated time of the Explain.

REMEMBER TO ENSURE STATS HAVE BEEN COLLECTED ON PI, JOIN AND SELECTION COLUMNS!

If stats are not present for the required columns, you will notice a number of Low Confidence estimations in the Explain. See the statistics section of this document for further information on checking that stats have been collected, and on arranging their collection/refresh.

Visual Explain

In addition to the text explain shown above, one can also perform visual explains of queries using the Visual Explain tool. To capture a visual explain (into a special explain database), add the INSERT EXPLAIN modifier to the top of your SQL:

Insert Explain into QCD as myquery1
Select * from DB1.Table1;

The above places the Explain into the query capture database (QCD) and gives it the name myquery1, to easily distinguish it from the other explains present in the QCD database. Always give your queries names for this reason.

Next, start Visual Explain 2.0.1 and connect to the database. Once Visual Explain has started up, click on the Connect button (the leftmost one on the toolbar), or choose File | Logon, to connect to the database.

Once you are connected you will need to retrieve your plan (generated above) from the QCD database. To do this, click on the Open Plans toolbar button (3rd from left), or choose File | Open Plans from Database.

You will be presented with a dialog box. Make sure the database box reads 'QCD' (select it from the list or type it in), and type the query name you used when inserting the Explain into the Query Tag box.

Once that is done, click the 'Browse QCD' button and your query should be displayed in the Available Execution Plans dialog box. Find the one you want and click its checkbox, then press the 'Add' button, and finally click the 'Open' button.

You will then have to wait 10-30 seconds while the explain is loaded from the database and formatted onto the screen. When complete, a summary window appears with the total estimated cost, number of steps, etc., as well as a flowchart-style plan showing the various steps undertaken to run the query (practically identical to the text-based explain).

You can click on the various steps, and pop-up windows will give you more information about them. You can print queries/explains and also compare two explains to get a report of the differences. Please refer to the Visual Explain manual for more information.

More information on using the Visual Explain output is to be added to this paper in the near future, based on the new release.

4. CODING GUIDELINES (ANSI)

It helps code readability, quality, and often performance if queries are written using ANSI join syntax. This makes it easy to separate the join conditions from the row selection conditions, which becomes more important as more tables are joined and more selection conditions are intermingled with the join conditions in a where clause.

An example of ANSI join syntax is shown below:

Select tb1.Col1
     , tb2.Col2
     , tb3.Col3
     , tb4.Col4
From Table1 tb1
Inner join Table2 tb2 on (tb1.col1 = tb2.col1)
Inner join Table3 tb3 on (tb1.col1 = tb3.col1)
Inner join Table4 tb4 on (tb1.col1 = tb4.col1)
Where tb1.Col9 > 12345
And tb4.Batch_Id = 33401;

Whenever ANSI join syntax is employed to join two tables and there is a selection condition on one of the tables involved in the join, you should place that condition within the 'on' clause of the join concerned, to help the optimiser do the filtering prior to or during the join, rather than after the join (which is less efficient).

An example of this is shown below:

Untuned:

Select tb1.Col1
     , tb2.Col2
From Table1 tb1
Inner join Table2 tb2 on (tb1.col1 = tb2.col1)
Where tb2.col3 = 50;

Tuned:

Select tb1.Col1
     , tb2.Col2
From Table1 tb1
Inner join Table2 tb2 on (tb1.col1 = tb2.col1 and tb2.col3 = 50);

If both tables in a join have selection conditions, it is usually advisable to place a single selection condition in the on clause, with the other one placed in the where clause. It is preferable to have the condition in the on clause for the table with the most rows.

Functions in Join Conditions

Do not put functions (e.g. CAST, or especially CASE expressions) on table columns involved in join conditions unless it is absolutely necessary. Sometimes a derived table may be a more appropriate solution to this kind of problem.

Datatypes in Join Conditions

Joins should be performed by one or more conditions matching two columns with identical datatypes. If the datatypes differ, primary indexes will not be used, and in effect a cast will need to be performed on each value (i.e. there is a function in the join condition – see above) before the values can be joined.

Cross Joins (Product Joins)

Product joins (i.e. joining two tables without any join conditions) are to be avoided. The only exception is joining to a table or subquery which returns a single row. An example of a permissible product join would be joining the rows to be added to a TR table with the subquery that returns the current maximum ID present in the TR table. One of the benefits of using ANSI join syntax is that it makes these unconstrained joins very easy to spot, since they have no on clause.
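A sketch of the permissible single-row case described above (table and column names hypothetical):

-- Assign new IDs above the current maximum in the TR table
Insert into TgtDB.TR_Table (TR_Id, Col1)
Select mx.Max_Id + row_number() over (order by stg.Col1)
     , stg.Col1
From StageDB.Stg_Rows stg
Cross join ( select coalesce(max(TR_Id), 0) as Max_Id
             from TgtDB.TR_Table ) mx;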

Eliminate Unnecessary Joins

If you are running a sequence of queries in a job and you appear to be continually using the same join to a table to restrict rows (e.g. the join to process_batch to obtain a batch_no in staging-to-target conversions), place that value in a file (or temp table), import the value from the file, and use it as a literal in your query. This can save a great deal of time when joining to large tables, and also makes it easier to ensure that the work being performed is consistent (e.g. that the batch you started processing in step 1 is the same one you are processing in the final job step).

An example concerning batch numbers is shown below:

Bad:

select tb2.Col1, tb3.Col2, stg.Col3
from Table2 tb2, Table3 tb3, Staging_Table stg, Process_Batch pba
Where tb2.Col4 = 123
and tb2.Col1 = tb3.Col1
and stg.Col6 = tb2.Col6
and pba.batch_no = stg.batch_no
and pba.process_id = 201
and pba.batch_status_cd = 'I';

Good:

/* First step of job */
.EXPORT DATA DDNAME BATCHNO;
select batch_no (CHAR(10))
from Process_Batch
Where Process_Id = 201
and Batch_Status_Cd = 'I';
.EXPORT RESET;

/* Typical job step */
.IMPORT DATA DDNAME BATCHNO;
.REPEAT 1;
USING BATCHNO (CHAR(10))
Select tb2.Col1, tb3.Col2, stg.Col3
from Table2 tb2, Table3 tb3, Staging_Table stg
Where tb2.Col4 = 123
and tb2.Col1 = tb3.Col1
and stg.Col6 = tb2.Col6
and stg.batch_no = cast(:BATCHNO as INTEGER);

NOTE: All processing involving batch numbers in staging-to-target transformations should employ this logic.

Derived Tables


Derived tables can be a powerful technique to produce efficient queries, but they can also cause major performance problems when used in inappropriate situations. Here are some guidelines for the use of derived tables:

Never use a derived table simply to restrict the records of a large table prior to joining it to some other table. Doing this prevents the optimizer from using statistics on the table when it is subsequently joined, since the derived table is pulled into a spool file, and that spool file will not have statistics available to the optimizer when preparing downstream joins.

Do not use a derived table composed of a query which contains the same tables that are joined outside the derived table, unless you have to perform aggregation or some other operation within the derived query that cannot be performed against those tables in the base query. A permissible example would be a derived table which gets the keys of the latest records in a table (e.g. max(Txn_Date)) and is joined to the same table in the base query to get the latest records.

Use of a derived table may be appropriate when it significantly reduces the complexity or increases the readability of a query. An example is the use of a derived table in the from clause of an update statement; this is the recommended way to write an update.

The general approach should be to use temporary tables instead of derived tables if the expected dataset or the involved table(s) have more than 250K records (1000 rows/AMP * 240 AMPs).
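A sketch of the permissible max(Txn_Date) pattern mentioned above (table and column names hypothetical):

-- Derived table returns the latest key per account; join back for full rows
Select t.Acct_Id, t.Txn_Date, t.Amt
From DB1.Txn_History t
Inner join ( select Acct_Id, max(Txn_Date) as Max_Dt
             from DB1.Txn_History
             group by Acct_Id ) latest
        on t.Acct_Id = latest.Acct_Id
       and t.Txn_Date = latest.Max_Dt;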


Use Derived Tables in Updates

It can significantly reduce query complexity and improve performance and readability if updates are written with the from clause as a derived table. This is particularly useful when there are many tables involved in the query. For example, instead of:

UPDATE TB1
FROM Table1 TB1, Table2 TB2, Table3 TB3, Table4 TB4
SET TB1.COL3 = TB4.COL3
WHERE TB1.COL1 = TB2.COL1
AND TB1.COL2 = TB3.COL2
AND TB2.COL1 = TB3.COL1
AND TB2.COL1 = TB4.COL1
AND TB2.COL4 = 123;

this is preferred:

UPDATE Table1
FROM (
  SELECT TB2.COL1, TB3.COL2, TB4.COL3
  FROM Table2 TB2
  INNER JOIN Table3 TB3 ON TB2.COL1 = TB3.COL1
  INNER JOIN Table4 TB4 ON TB2.COL1 = TB4.COL1
  WHERE TB2.COL4 = 123
) XXX
SET Table1.COL3 = XXX.COL3
WHERE Table1.COL1 = XXX.COL1
AND Table1.COL2 = XXX.COL2;

Whilst the code may be longer, the select which obtains the rows to be updated can be prototyped in isolation from the update, and therefore can also be individually tuned separately from the update. Logic errors that might be hidden in an all-in-one update statement become much more visible when written as a derived query.


Temp Tables

Temporary tables can be used in transformations to hold intermediate results.


Taking into account the limitations of the current ETI version in generating updates from derived tables, a temporary table should be used for such updates (see Use Derived Tables in Updates above):

insert into tmp_tbl1
select tb2.Col1, tb3.Col2, tb4.Col3
from Table2 tb2
Inner join Table3 tb3 On (tb2.Col1 = tb3.Col1)
Inner join Table4 tb4 On (tb2.Col1 = tb4.Col1)
Where tb2.Col4 = 123;

update Table1 from tmp_tbl1
set Table1.Col3 = tmp_tbl1.Col3
where Table1.Col1 = tmp_tbl1.Col1
and Table1.Col2 = tmp_tbl1.Col2;

delete tmp_tbl1;

It should be mentioned here that both the temporary and target tables should have the same primary index. If population of the temporary table and the update of the target table can be performed within one session, then global temporary tables with the NO LOG option can be used; otherwise the temporary table rows should be deleted at the end of the job step.

Updating Most Rows in a Table

If a query updates a significant proportion of the rows in a table (especially a large one), it is likely that the query should be redesigned, since activity of this type generally performs poorly. Redesigning this kind of query as an insert into an empty temporary table from a select query often yields performance improvements of an order of magnitude or greater. Once the insert into the temporary table is complete, the original table can be swapped out and the temp table renamed as the original.
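A minimal sketch of the insert-then-swap approach (names hypothetical; RENAME TABLE is standard Teradata DDL):

-- Express the mass update as an insert-select into an empty copy of the table
Insert into DB1.Table1_New
Select Col1
     , Col2
     , case when Col3 = 'OLD' then 'NEW' else Col3 end  -- the "update"
From DB1.Table1;

-- Swap the tables once the insert is complete
Rename table DB1.Table1 to DB1.Table1_Old;
Rename table DB1.Table1_New to DB1.Table1;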

Correlated Subqueries

Correlated subqueries (i.e. subqueries that reference a column in a table from the base query), if specified on inequality conditions, are generally an extremely inefficient way to process large datasets. They perform row-by-row operations which can be very high cost and are rarely an appropriate solution. In general they should be used in conjunction with the (NOT) EXISTS predicate for performing exclusion merge joins.
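A sketch of the (NOT) EXISTS usage described above (hypothetical staging and target tables), which allows the optimizer to use an exclusion merge join rather than row-by-row evaluation:

-- Staging rows not yet present in the target
Select stg.Col1, stg.Col2
From StageDB.Stg_Table stg
Where not exists ( select 1
                   from TgtDB.Tgt_Table tgt
                   where tgt.Col1 = stg.Col1 );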

In/Not In Subqueries

Queries which do delta processing through the use of NOT IN subqueries should generally be written as left outer joins or as exclusion joins with the NOT EXISTS predicate. Non-delta processing may make use of subqueries to provide the not-in list, but this should only be done if the number of values returned by the subquery is reasonably low (e.g. less than 250K). If the expected dataset of the subquery is greater than 250K, then NOT EXISTS should be used.

Bad:

Insert into Table1 (Col1, Col2)
Select tb2.Col1, tb2.Col2
From Table2 tb2
Where tb2.Col1 not in (Select Col1 from Table1);

Good:

Insert into Table1 (Col1, Col2)
Select tb2.Col1, tb2.Col2
From Table2 tb2
Left outer join Table1 tb1 On (tb1.Col1 = tb2.Col1)
Where tb1.Col1 is null;

Queries which use IN subqueries should generally be written as inner joins, except where the subquery returns one, or very few, records.

NOTE: Only rewrite IN subqueries as inner joins if the table column in the subquery is unique. If the column(s) is non-unique, duplicate records will be introduced, corrupting the query results.

Bad:

Select tb2.Col1, tb2.Col2
From Table2 tb2
Where tb2.Col1 in (Select Col1 from Table1);

Good:

Select tb2.Col1, tb2.Col2
From Table2 tb2
Inner join Table1 tb1 On (tb1.Col1 = tb2.Col1);

Implement Distinct as Group By

Often the way Teradata performs a select distinct produces large spool files and results in a less efficient plan than using a group by to perform the same task. It is preferable to code your DISTINCT as a GROUP BY, as follows:

Less Desirable:

Select Distinct tb2.Col1, tb2.Col2
From Table2 tb2;

More Desirable:

Select tb2.Col1, tb2.Col2
From Table2 tb2
Group by tb2.Col1, tb2.Col2;


Tester Queries

Be specific when checking transformation results. When checking that the records contained within a staging table have been correctly applied to a target table (especially with large staging/target tables), do not join the tables together to check this. Joining the tables will produce large result sets (often millions of rows), and you will only ever eyeball a few of these records, so why do the join?

A much better way of performing this check is to select a sample of the records that were added or modified by the staging-to-target job from the target table. To do this, simply do a SELECT against the target table, qualifying the batch number to be the one just loaded. Add the SAMPLE modifier to your query to return a random sample of, say, 10 rows. (TIP: Use SAMPLE whenever you want to get a representative set of records from a large table.)

Select * from DB1.TABLE1
Where Load_Batch_No = 33400
SAMPLE 10;

This will return a random sample of 10 rows that should have corresponding values present in the staging table. To check the contents of the target table against the staging table, run a series of short, simple queries against the source table to see what was in the corresponding staging table records, substituting in the appropriate key values from the results of the SELECT performed above.

Select * from DB1.STG1
Where ADDR_LN_ONE = 'some data from the corresponding field in the target table'
And ADDR_LN_TWO = 'some data from the corresponding field in the target table'
And ADDR_LN_THREE = 'some data from the corresponding field in the target table';

Then check that the appropriate columns in the second query (against the staging table) match those in the first query (against the target table). This will be much faster than joining the tables – let Teradata do the join on only those records that matter!

5. CODING GUIDELINES

CURRENT_DATE and DATE

CURRENT_DATE is to be used in SQL in preference to DATE. CURRENT_DATE is ANSI compliant; whilst DATE is supported by Teradata, it is only currently supported for backwards compatibility (it may not be supported in future releases).

ROW_NUMBER and CSUM

The SQL function ROW_NUMBER is to be used in preference to CSUM. ROW_NUMBER is ANSI compliant; whilst CSUM is supported by Teradata, it is only currently supported for backwards compatibility (it may not be supported in future releases).
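For illustration (hypothetical table and column), the equivalent forms; the CSUM version is shown only to identify the legacy syntax being replaced:

-- ANSI form (preferred)
Select Cust_Id
     , row_number() over (order by Cust_Id) as Rn
From DB1.Table1;

-- Legacy Teradata form (backwards compatibility only)
Select Cust_Id
     , csum(1, Cust_Id) as Rn
From DB1.Table1;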

ANSI RANK and Teradata RANK

The SQL window function RANK is to be used in preference to the Teradata RANK function. The ANSI RANK window function is recommended by Teradata in preference to the non-ANSI Teradata RANK function; the latter is only currently supported for backwards compatibility (it may not be supported in future releases).
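Similarly (hypothetical table and columns):

-- ANSI window function (preferred)
Select Cust_Id, Sales_Amt
     , rank() over (order by Sales_Amt desc) as Sales_Rank
From DB1.Table1;

-- Legacy Teradata form (backwards compatibility only)
Select Cust_Id, Sales_Amt
     , rank(Sales_Amt desc) as Sales_Rank
From DB1.Table1;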

Functions & Case in Join Conditions

Avoid using CASE expressions and functions in table join conditions unless it is absolutely necessary. If the tables being joined contain unmatched data types, then rather than using the CAST function in the join condition, a derived table or a temporary table into which the column is first cast may be a more appropriate solution to this kind of problem.
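A sketch of the derived-table alternative (hypothetical tables, where Table2.Col1 is character and Table1.Col1 is integer):

-- Instead of: on tb1.Col1 = cast(tb2.Col1 as integer)
Select tb1.Col1, d.Col2
From DB1.Table1 tb1
Inner join ( select cast(Col1 as integer) as Col1
                  , Col2
             from DB1.Table2 ) d
        on tb1.Col1 = d.Col1;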

Datatypes in Join Conditions

Wherever possible, columns involved in join conditions should be similar in data type. For example, Character to Character comparisons (of any length, and regardless of whether they are VARCHAR or CHAR) are acceptable; Numeric to Numeric (such as INTEGER to DECIMAL) is also acceptable; likewise Date to Date, Time to Time, etc. If the columns are not similarly typed, the SQL parser will insert a CAST function. The CAST operation may or may not make logical sense for the required purpose, and additionally there will be a performance degradation (see Functions & Case in Join Conditions above).

When there is a choice, it is preferable to use numeric columns in join conditions, as character comparisons are always more CPU intensive.

Join Optimisation

When possible, the columns involved in join conditions should be the primary indices of each of the tables involved in that join. This is easiest to adhere to when using intermediary temporary tables, as it is then that the developer has maximum control of index choice. If it is not possible to choose the primary index of both tables involved, it still helps if at least one of the tables in the join has an index which best fits that join.

For example, assume we are aware the following SQL is required:

SELECT TB1.COL1
     , TB2.COL2
FROM PREFIX1.TABLE1 TB1
INNER JOIN PREFIX2.TABLE2 TB2
       ON TB1.COL1 = TB2.COL2
      AND TB1.COL3 = TB2.COL4;

We should then strive to have corresponding table definitions as follows:

CREATE SET TABLE PREFIX1.TABLE1, NO FALLBACK,
NO BEFORE JOURNAL, NO AFTER JOURNAL
( ... )
PRIMARY INDEX (COL1, COL3);

and:

CREATE SET TABLE PREFIX2.TABLE2, NO FALLBACK,
NO BEFORE JOURNAL, NO AFTER JOURNAL
( ... )
PRIMARY INDEX (COL2, COL4);

Additionally, if the indices are unique primary indices, performance will be further enhanced.

Avoid too many joins, in particular when at least one of the tables is large (e.g. history tables). It is better to break such a query up into several temporary table populations, where the number of joins can be minimised and statistics are collected along the way.

Index Selection

Index selection for the final target table will generally be driven by the expected end-user usage of the table. This may not necessarily equate to load optimisation (which should be a secondary consideration). However, the developer should always consider load optimisation when creating temporary tables.

Primary index choice for faster data loading:

1. Choose an index composed of the columns which will be used in subsequent joins (see Join Optimisation above).

2. Choose an index which gives as high a level of uniqueness as possible. This helps Teradata to better distribute the work load across the AMPs.

3. Use the following query (preferably against production data) to see how well (or how poorly) distributed a given index is:

SELECT HASHAMP(HASHBUCKET(HASHROW( column1, … , columnI, … , columnN )))
     , COUNT(*)
FROM PREFIX.TABLE
GROUP BY 1;

Where:
column1 is the first column of the chosen index,
columnI is the Ith column of the chosen index,
columnN is the Nth column of the chosen index,
PREFIX is the table prefix for the table whose index is being prototyped, and
TABLE is the table whose index is being prototyped.

A result set will be produced where the first column is an AMP identifier and the second column is how many rows from the table would be stored on that AMP given the chosen index. It is also possible to prototype a non-existent table by using the above functions on a subquery, where the subquery is the SQL which would otherwise be used to populate the non-existent table.

Use the above result to check for bad distribution (also known as skewed distribution).


6. STATISTICS

In order for the optimiser to produce a good query plan, statistics must be collected on all tables and columns involved in a query, as per the following list:

o Primary Indexes of all tables in a query

o All columns involved in join conditions

o All columns involved in row selection conditions (where clause)

When processing SQL that joins 2 or more tables, Teradata's choice of join plan is totally dependent on its knowledge of the values of the data in the columns referenced in the SQL. Statistics must be collected:

o On any column which is involved in a join condition.

o On any column which is part of the 'WHERE' condition in the SQL.

o After the data has been loaded, reloaded, or significantly updated.

If statistics are not collected or are not current, and the wrong plan is used by Teradata, then many thousands of CPU seconds can be used instead of a few hundred. The elapsed time of queries is frequently reduced from hours to minutes through judicious collection of statistics.

To determine if stats have been collected, use the help statistics command as per the example below:

help statistics DB1.TABLE1;

*** Help information returned. 11 rows.
*** Total elapsed time was 1 second.

Date     Time     Unique Values        Column Names
-------- -------- -------------------- --------------
02/07/30 02:27:15 63,677,477           TB_Id
02/07/30 02:27:20 17                   Tp_Cd
02/07/30 02:28:53 39,595,243           Cust_id
02/07/30 02:28:59 15,135               Cust_Org_Unit

If the statistics you require for your query are not listed, contact the DBA group and let them know exactly which table/column/index requires stats to be collected; they will perform the collection and get back to you.

Please remember that statistics are ESSENTIAL for the efficient execution of SQL, and this should ALWAYS be the first thing checked when developing or tuning a query.

Another example:

Help statistics PREFIX.party;

*** Help information returned. 11 rows.
*** Total elapsed time was 1 second.

Date     Time     Unique Values        Column Names
-------- -------- -------------------- --------------
02/07/30 02:27:15 63,677,477           cust_Id
02/07/30 02:27:20 17                   Tp_Cd
02/07/30 02:28:53 39,595,243           cust_id
02/07/30 02:28:59 15,135               cust_Org_Unit
02/07/30 02:29:08 182                  Carr_Cd
02/07/30 02:29:13 144                  Ctry_Cd
02/07/30 02:29:55 3                    Batch_run

If the required statistics for a given query are not listed, it may be necessary to execute an appropriate collect statistics statement (see below).

Collect Statistics specifying columns/indices ("explicit") – some examples:

COLLECT STATISTICS ON PREFIX.TABLE COLUMN Col1;
COLLECT STATISTICS ON PREFIX.TABLE COLUMN (Col1, Col2);
COLLECT STATISTICS ON PREFIX.TABLE INDEX (Col2, Col3, Col4);

This type of collect statistics statement must run against a table before the "implicit" type (below) can be executed. It not only collects the nominated statistics for the specified table, but also stores information in Teradata which allows the "implicit" collect statistics to run.

Collect Statistics without specifying columns/indices ("implicit"):

COLLECT STATISTICS ON PREFIX.TABLE;

This type of collect statistics assumes that all of the columns that were individually nominated in previous collect statistics statements are to have their statistics re-collected.


7. PERFORMANCE OPTIMIZATION

Partitioned Primary Index Advantage

A partitioned primary index (PPI) increases query efficiency by avoiding full-table scans, without the overhead and maintenance costs of secondary indexes.

For example, assume a sales data table has 5 years of sales history, and a PPI is placed on this table which partitions the data into 60 partitions (one for each month of the 5 years). If a user executes a query which examines two months of sales data from this table, then with the PPI the query only needs to read 2 partitions of the data from each AMP: only 1/30 of the table has to be read.
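A sketch of such a table (names hypothetical; RANGE_N is the standard Teradata partitioning expression):

Create table SalesDB.Sales_History
( Sale_Id integer not null
, Sale_Dt date not null
, Amt decimal(18,2)
)
Primary index (Sale_Id)
Partition by range_n (
  Sale_Dt between date '2002-01-01' and date '2006-12-31'
          each interval '1' month );

-- Reads only the two qualifying partitions on each AMP
Select sum(Amt)
From SalesDB.Sales_History
Where Sale_Dt between date '2006-11-01' and date '2006-12-31';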


Range queries can be executed on tables without secondary indexes. With previous Teradata releases, a value-ordered NUSI could be used to help increase the performance of queries that qualify ranges of rows, but NUSIs entail subtable perm space and maintenance overhead. The more partitions there are, the greater the potential benefit.

Potential Disadvantage

A query specifying a PI value, but no value for the partitioning column, must look in each partition for that value. When joining, if one of the tables is partitioned, the rows won't be ordered the same, and the task in effect becomes a set of sub-joins, one for each partition of the PPI table. The disadvantage is proportional to the number of partitions, with fewer partitions being better than more.

Access with and Without Partitioning

8. Conclusion

This paper highlights the significance of Teradata best practice methodologies, which have been evaluated to do away with the high maintenance cost of re-writing scripts. The options have been examined across the relevant factors, and the best options have been recommended.