Teradata Best Practices
TABLE OF CONTENTS

1. Introduction
   1.1 Purpose
   1.2 Scope of the Document
2. Tips for Teradata Developers
   2.1 Delete vs Drop a Table
   2.2 Product Joins
   2.3 Join Columns of Same Data Type
   2.4 Query Values Limitation
   2.5 Constraint on Secondary Indexes
   2.6 Pre-requisites for Table Creation
   2.7 Test on Smaller Sample Tables
   2.8 Don't Select It If You Don't Need It
   2.9 Use 'Union All' Instead of Just 'Union'
   2.10 Query Termination
   2.11 Running Jobs in Off Peak Hours
   2.12 Understanding CPU
   2.13 Identifying CPU Usage
3. Explains Against All Queries
   3.1 Visual Explain
4. Coding Guidelines (ANSI)
   4.1 Datatypes in Join Conditions
   4.2 Cross Joins (Product Joins)
   4.3 Eliminate Unnecessary Joins
   4.4 Derived Tables
   4.5 Use Derived Tables in Updates
   4.6 Temp Tables
   4.7 Updating Most Rows in a Table
   4.8 Correlated Subqueries
   4.9 In/Not In Subqueries
   4.10 Implement Distinct as Group By
   4.11 Tester Queries
5. Coding Process
   5.1 CURRENT_DATE and DATE
   5.2 ROW_NUMBER and CSUM
   5.3 ANSI RANK and Teradata RANK
   5.4 Functions & Case in Join Conditions
   5.5 Datatypes in Join Conditions
   5.6 Join Optimisation
   5.7 Index Selection
6. Statistics
7. Performance Optimization
8. Conclusion
1. Introduction
This document provides best practice techniques for Teradata
development, with the aim of maintaining optimized scripts
across the Teradata environment.
Purpose
The purpose of this paper is to provide a reference point for
developers working within a Teradata environment. It aims to
assist Teradata developers in preparing code for Unit Testing,
and in anticipating the types of issues that may arise during
code reviews.
It also aims to achieve consistency of build across a Teradata
Warehouse by providing standards and guidelines.
Scope of the Document
This document covers standards and guidelines for Teradata
build activities (development). This includes Teradata SQL
scripting and JCL pertinent to Teradata load jobs on
mainframes.
2. TIPS for Teradata Developers

This document is a foundation for Teradata development,
intended to enhance and enrich SQL code so as to avoid
performance bottlenecks.
Delete Vs Drop a Table.
When a table needs to be emptied, and then reloaded, it is
more efficient to Delete the rows in the table, rather than
Drop the table and re-create it, as locks on the Dictionary are
avoided.
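A minimal sketch of the preferred pattern (the table name is hypothetical):

DELETE FROM DB1.Stage_Tbl ALL;   /* rows removed, definition kept, no Dictionary locks */

rather than:

DROP TABLE DB1.Stage_Tbl;        /* both statements take restrictive Dictionary locks */
CREATE TABLE DB1.Stage_Tbl
( Col1 INTEGER NOT NULL )
PRIMARY INDEX (Col1);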
Product Joins

Product joins occur when Teradata compares every row of the
first table to every row of the second table. This process can
use huge amounts of CPU and spool.

Product joins are created when a join condition is based on an
inequality, or is missing entirely. A common cause is an 'Alias'
that has been used to identify a table but has not been used
consistently throughout the SQL. As a consequence, Teradata
believes that a reference is being made to another copy of the
table, but there is no join condition placed on that other copy.

Sometimes a 'Product Join' is the appropriate option for
Teradata to use, when Teradata believes it needs to compare a
small number of rows from one table to another. However, if
the Stats on a table are incorrect, then Teradata may choose a
'Product Join' inappropriately.
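A minimal sketch of the aliasing mistake described above (table and column names hypothetical): the alias tb1 is declared, but the where clause refers back to Table1 by name, so Teradata treats that as a separate, unjoined reference to the table and performs a product join.

Select tb1.Col1, tb2.Col2
From Table1 tb1, Table2 tb2
Where Table1.Col1 = tb2.Col1;   /* should be tb1.Col1 = tb2.Col1 */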
Join columns of same data type

Ideally the data types of columns should match when joining
tables because:

o The join is inefficient when a conversion is required.
o Teradata is unable to compare the demographics of columns
which are of different data types, even though Statistics may
have been collected; the resulting join plan may not be the
best choice.
o Depending on the table size, it may be more efficient to
load the data from one table into another so that both
tables' data types match.
When a join is attempted on part of a column (e.g. using
SUBSTRING), Teradata does not know the data demographics,
even if statistics have been collected for the column. The
advice is to check the 'Explain' output before populating the
target table. If the user is concerned about performance, the
data can be loaded into a temporary table before loading into
the target table.
Query Values Limitation

It is tempting to cut and paste values into SQL as shown
below:

Select * from …
Where column in (val1, val2, val3, …, valn);

However, this is inefficient because Teradata does not easily
share this work across all of its processors. Therefore it is
better to insert the data into a table and code:

Where column in (select column from Tablename);

It is recommended to use this approach when there are more
than 50 values in the query.
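A sketch of the recommended alternative (names hypothetical): place the values in a volatile table and reference it via a subquery.

CREATE VOLATILE TABLE val_list (val INTEGER) ON COMMIT PRESERVE ROWS;
INSERT INTO val_list VALUES (101);
INSERT INTO val_list VALUES (102);
/* ... one insert per value, or load the list with a utility ... */

SELECT * FROM DB1.Table1
WHERE column1 IN (SELECT val FROM val_list);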
Constraint on Secondary Indexes
If a table requires a Secondary index, create the index after
data has been loaded into the table to ensure that the load
process is completed.
If an Update or Delete is being performed, remove all
secondary indexes before applying the change. Once the
change is complete, re-create the secondary indexes.
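A minimal sketch of this sequence, using a hypothetical index name and column:

/* Load the table first, then create the secondary index: */
CREATE INDEX idx_status (Status_Cd) ON DB1.Table1;

/* Before a large change, drop the index; re-create it afterwards: */
DROP INDEX idx_status ON DB1.Table1;
UPDATE DB1.Table1 SET Status_Cd = 'C' WHERE Status_Cd = 'O';
CREATE INDEX idx_status (Status_Cd) ON DB1.Table1;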
Pre-requisites for table creation
The default table type is a 'Set' table, and most developers
use it without further analysis; however, a Volatile table is
much better for temporary storage, because it is the only
table type which does not take restrictive locks on the
Dictionary.
Listed below are the 4 different table types and some of their
characteristics:

Set Table: Duplicate rows not allowed. Teradata default.

Multiset Table: Duplicate rows allowed.

Volatile Table: Available as long as the session is alive. Rows
persist beyond the transaction only when the table has been
defined as 'On Commit Preserve Rows'. Statistics cannot be
collected on a volatile table.

Global Temp Table: Same concept as a Volatile table; the
difference is that the definition is stored in the Data
Dictionary and the data is deleted at the end of the session.
Statistics can be collected.
Use of 'Create Table As' can be useful to create new tables
based on existing table(s). Be aware that Teradata will create
the new table using defaults, which might not be the way the
developer actually wishes the table to be structured. The
defaults may vary from release to release. Best practice is to
explicitly state all column attributes (NOT NULL etc.) and
indexes (UNIQUE).
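Two brief sketches (names and columns hypothetical): a volatile work table, and a 'Create Table As' with attributes stated explicitly rather than left to defaults.

CREATE VOLATILE TABLE work_tbl, NO LOG
( Col1 INTEGER NOT NULL,
  Col2 DATE )
PRIMARY INDEX (Col1)
ON COMMIT PRESERVE ROWS;   /* keep rows for the life of the session */

CREATE TABLE DB1.Table1_Copy AS
( SELECT * FROM DB1.Table1 )
WITH NO DATA
UNIQUE PRIMARY INDEX (Col1);   /* index stated explicitly, not defaulted */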
Test on smaller sample tables.
The amount of CPU time used to process a query is
dependent on a number of factors, but in many cases the CPU
usage will be proportional to the size of the table. If (iterative)
testing is conducted on a small extract of the live table then
the amount of CPU will be reduced.
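One way to build such an extract (sampling fraction and names are illustrative) is to copy a small random sample of the live table into a test table:

CREATE TABLE DB1.Table1_Test AS
( SELECT * FROM DB1.Table1 SAMPLE 0.01 )   /* roughly 1% of the rows */
WITH DATA
PRIMARY INDEX (Col1);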
Don't select it if you don't need it.
Processing will be more efficient if SQL excludes rows through
a specific clause in the 'Where' condition, rather than relying
on the Join condition to exclude them from the final result.
For example if some rows have all zeros in a column because
it is a special case, don't rely on them failing the join
condition to exclude them. In this particular case not only is it
inefficient to compare the contents of all the rows containing
zeros, but they can skew the data (instead of spreading the
data across all of the Teradata processors, it is concentrated
on a single processor and takes much longer to run).
Use 'Union All' instead of just 'Union'

When creating a Union of two sets of rows, the default form of
the statement will check for and remove duplicate rows, which
is unnecessary work if duplicates are acceptable. In most
situations it is known either that duplicates cannot possibly
exist, or that if they do exist it is correct to select them. In
those cases use 'Union All', which allows duplicates and skips
the duplicate-elimination step.
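For example, the two statements below return the same rows when no duplicates exist across the two tables, but only the second avoids the duplicate-elimination (sort) step:

Select Col1 From Table1
Union
Select Col1 From Table2;

Select Col1 From Table1
Union All
Select Col1 From Table2;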
Query Termination

If a query is not performing as expected then consider
aborting it. However, killing a job which is performing large
Updates or Inserts on a non-empty table can cause rollbacks,
which in turn cause higher CPU consumption.
Running jobs in off peak hours
The Teradata system is less heavily used at night and at
weekends.
Using that spare capacity increases the availability of
resources during peak time.
Understanding CPU
To understand the scale of CPU related problems, it is
important to know the CPU resources that are available on
the 5350 Teradata system.
Assume the Operational machine (NCR 5350) has 24 nodes,
and each node has 2 CPUs. Therefore in any second there are
48 CPU seconds of processing power available to share.
As the number of poorly written queries increases, it is obvious
that inefficient SQL will take more resources, take longer to
complete, and increase the congestion, impacting everyone.
To give an example of how much CPU can be used by a
simple and efficient query on a large table:

select count(*) from t_bcard_restdb.bc_history
where account_status_code = 'zz';

This resulted in a full-table scan of a table which has 667
million rows and occupies 166 GB. The query used 820 CPU
seconds.
Identifying CPU usage
The amount of CPU used by each user can be found from
A_Usage_base.UsageHistory, which is accessible to all users.
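A sketch of such a usage query is shown below; the column names (UserName, CpuTime, LogDate) are assumptions for illustration only, so check the actual view definition on your system before using it:

SELECT UserName,
       SUM(CpuTime)               /* hypothetical column names */
FROM A_Usage_base.UsageHistory
WHERE LogDate >= DATE '2007-06-01'
GROUP BY UserName
ORDER BY 2 DESC;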
3. Explains Against All Queries

When developing queries, always perform an 'Explain' on the
query before you run it. Explains can give a lot of information
about the way the optimizer will execute a query.

To perform an 'Explain', simply add the EXPLAIN keyword prior
to your select/insert/update/delete statement and execute it.

The Explain statement is used to aid in identifying potential
performance issues. It analyses the SQL and breaks it down into
its low-level processes. Unfortunately the output can be very
difficult to understand for an untrained person, but there are
some points to recognize: Confidence Levels and Product Joins.
Confidence Level

Teradata attempts to predict the number of rows which will result
at each stage in the processing, and will qualify the prediction
with a confidence level as shown below:

No Confidence – There are no statistics available.
Low Confidence – Statistics exist but are difficult to use precisely.
High Confidence – The optimizer is sure of the results based on
the stats available.
Explain Select * from DB1.Table1;

It will produce output like this:

1) First, we lock DB1.Table1 for access.
2) Next, we do an all-AMPs RETRIEVE step from DB1.Table1 by way of
   an all-rows scan with no residual conditions into Spool 1, which
   is built locally on the AMPs. The size of Spool 1 is estimated
   with high confidence to be 141 rows. The estimated time for this
   step is 0.15 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved
   in processing the request.
-> The contents of Spool 1 are sent back to the user as the result
   of statement 1. The total estimated time is 0.15 seconds.
Whilst this output is not all that user friendly (especially for
complex queries) a number of useful things are present in the
output, including a total estimated time to run the query, the join
methods used, and the confidence levels that the optimizer has
used while calculating number of rows/elapsed time that the
query steps will take.
The best way to use 'Explain' is to compare how your changes
affect the way a query will run. Do this by changing the query a
little at a time, observing how your changes affect the
complexity and total estimated time of the 'Explain'.
REMEMBER TO ENSURE STATS HAVE BEEN COLLECTED ON PI,
JOIN AND SELECTION COLUMNS!
If stats are not present for the required columns, you will notice
a number of Low Confidence estimations in the Explain. See the
statistics section of this document for further information on
checking that stats have been collected, and arranging their
collection/refresh.
Visual Explain
In addition to the text explain shown above, one can also perform
visual explains of queries using the visual explain tool.
To capture a visual explain (into a special explain database) you
add the insert explain modifier to the top of your SQL.
Insert Explain into QCD as myquery1
Select * from DB1.Table1;
The above places the “Explain” into query capture database
(QCD) and gives it a name of myquery1 to easily distinguish it
from the other explains that are present in the QCD database.
Always give your queries names for this reason.
Next you must start Visual Explain 2.0.1 and connect to the
database. Once Visual Explain has started up click on the
connect button (leftmost one on the toolbar), or choose File|Logon
to connect to the database.
Once you are connected you will need to retrieve your plan
(generated above) from the QCD database. To do this click on
the Open Plans toolbar button (3rd from left), or File|Open Plans
from Database.
You will be presented with the following dialog box. Make sure
the database box reads ‘QCD’ (select it from the list or type it in),
and type in the query name you used when inserting the Explain
into Query Tag box.
Once that is done, click the ‘Browse QCD’ button and your query
should be displayed in the Available Execution Plans dialog box.
Find the one you want and click the checkbox for it, then press
the 'Add' button, and finally click on the 'Open' button.
You will then have to wait for 10-30 seconds while the explain is
loaded from the database and formatted onto the screen. When
complete the screen should look something like this:
You will notice that a summary window appears with a total
estimated cost and number of steps etc. as well as a flowchart
type plan with the various steps undertaken to run the query
(practically identical to the text based explain).
You can click on the various steps and pop-up windows will give
you more information about them. You can print queries/explains
etc. and also compare two explains to get a report of the
differences. Please refer to the Visual Explain manual for more
information.
More information on using the Visual Explain output is to be
added to this paper in the near future based on the new release.
4. CODING GUIDELINES (ANSI)
It helps code readability, quality, and often performance if
queries are written using ANSI join syntax. This makes it easy to
separate the join conditions from the row selection conditions,
which becomes more important as more tables are joined and
more selection conditions are intermingled with the join
conditions in a where clause.

An example of ANSI join syntax is shown below:

Select tb1.Col1, tb2.Col2, tb3.Col3, tb4.Col4
From Table1 tb1
Inner join Table2 tb2 on (tb1.col1 = tb2.col1)
Inner join Table3 tb3 on (tb1.col1 = tb3.col1)
Inner join Table4 tb4 on (tb1.col1 = tb4.col1)
Where Tb1.Col9 > 12345
And Tb4.Batch_Id = 33401;
Whenever ANSI join syntax is employed to join two tables, and
there is a selection condition on one of the tables involved in the
join, you should place that condition within the ‘on’ clause of the
join concerned to help the optimiser do the filtering prior
to/during the join, rather than after the join (which is less
efficient).
An example of this is shown below:
Untuned:

Select tb1.Col1, tb2.Col2
From Table1 tb1
Inner join Table2 tb2 on (tb1.col1 = tb2.col1)
Where Tb2.col3 = 50;

Tuned:

Select tb1.Col1, tb2.Col2
From Table1 tb1
Inner join Table2 tb2 on (tb1.col1 = tb2.col1 and Tb2.col3 = 50);
If both tables in a join have selection conditions it is usually
advisable to place a single selection condition in the on clause,
with the other one placed in the where clause. It is preferable to
have the condition in the on clause for the table with the most
rows.
Functions in Join Conditions
Do not put functions (e.g. cast or especially case statements) into
table columns involved in join conditions unless this is absolutely
necessary. Sometimes a derived table may be a more
appropriate solution to this kind of problem.
Datatypes in Join Conditions

Joins should be performed by one or more conditions matching
two columns with identical datatypes. If the datatypes differ,
primary indexes will not be used, and in effect a cast will need to
be performed on each value (i.e. there is a function in the join
condition – see above) before the value can be joined.
Cross Joins (Product Joins)

Product joins (i.e. joining two tables without any join conditions)
are to be avoided. The only exception is the case of joining to
a table or subquery which returns a single row. An example of a
permissible product join would be joining the rows to be added to
a TR table with the subquery that returns the current maximum
ID present in the TR table. One of the benefits of using the ANSI
join syntax is that it makes these unconstrained joins very easy to
spot, since they have no on clause.
Eliminate Unnecessary Joins

If you are running a sequence of queries in a job and you appear
to be continually using the same join to a table to restrict rows
(e.g. the join to process_batch to obtain a batch_no in staging to
target conversions), place that value in a file (or temp table),
import the value from the file, and use it as a literal in your query.
This can save a great deal of time when joining to large tables,
and also makes it easier to ensure that the work being performed
is consistent (e.g. that the batch you started processing in step 1
is the same one you are processing in the final job step).
An example concerning batch numbers is shown below:
Bad:

select tb2.Col1, tb3.Col2, stg.Col3
from Table2 tb2, Table3 tb3, Staging_Table stg, Process_Batch pba
Where tb2.Col4 = 123
and tb2.Col1 = tb3.Col1
and stg.Col6 = tb2.Col6
and pba.batch_no = stg.batch_no
and pba.process_id = 201
and pba.batch_status_cd = 'I';

Good:

/* First step of job */
.EXPORT DATA DDNAME BATCHNO;
select batch_no (CHAR(10))
from Process_Batch
Where Process_Id = 201
and Batch_Status_Cd = 'I';
.EXPORT RESET;

/* Typical job step */
.IMPORT DATA DDNAME BATCHNO;
.REPEAT 1;
USING BATCHNO (CHAR(10))
Select tb2.Col1, tb3.Col2, stg.Col3
from Table2 tb2, Table3 tb3, Staging_Table stg
Where tb2.Col4 = 123
and tb2.Col1 = tb3.Col1
and stg.Col6 = tb2.Col6
and stg.batch_no = cast(:BATCHNO as INTEGER);
NOTE: All processing involving batch numbers in staging to
target transformations should employ this logic.
Derived Tables
Derived tables can be a powerful technique to produce efficient
queries, but they can also cause major performance problems
when used in inappropriate situations.
Here are some guidelines for the use of derived tables:
Never use a derived table to simply restrict the records of a large
table prior to joining it to some other table. Doing this prevents
the optimizer from using statistics on the table when it is
subsequently joined to another table, since the derived table is
pulled into a spool file, and this spool file will not have statistics
available to the optimizer to prepare downstream joins.
Do not use a derived table composed of a query which contains
the same tables that are joined outside the derived table unless
you have to perform aggregation or some other operation within
the derived query that cannot be performed against those tables
in the base query.
A permissible example would be a derived table which gets the
keys of the latest records in a table (e.g. max(Txn_Date)) and is
joined to the same table in the base query to get the latest
records.
Use of a derived table may be appropriate when it significantly
reduces the complexity/increases the readability of a query.
An example is the use of a derived table in from clause of an
update statement. This is the recommended way to write an
update.
The general approach should be to use temporary tables instead
of derived tables if the expected dataset or the involved table(s)
have more than 250K records (1000 rows/AMP * 240 AMPs).
Use Derived Tables in Updates
It can significantly reduce query complexity and improve
performance and readability if updates are written with from
clause as a derived table. This is particularly useful when there
are many tables involved in the query. For example:
Instead of:
UPDATE TB1
FROM Table1 TB1, Table2 TB2, Table3 TB3, Table4 TB4
SET TB1.COL3 = TB4.COL3
WHERE TB1.COL1 = TB2.COL1
AND TB1.COL2 = TB3.COL2
AND TB2.COL1 = TB3.COL1
AND TB2.COL1 = TB4.COL1
AND TB2.COL4 = 123;
This is preferred:
UPDATE Table1
FROM (
  SELECT TB2.COL1, TB3.COL2, TB4.COL3
  FROM Table2 TB2
  INNER JOIN Table3 TB3 ON TB2.COL1 = TB3.COL1
  INNER JOIN Table4 TB4 ON TB2.COL1 = TB4.COL1
  WHERE TB2.COL4 = 123
) XXX
SET Table1.COL3 = XXX.COL3
WHERE Table1.COL1 = XXX.COL1
AND Table1.COL2 = XXX.COL2;
Whilst the code may be longer, the select which obtains the
rows to be updated can be prototyped in isolation of the update,
and therefore can also be individually tuned separate to the
update. Logic errors that might be hidden in an all-in-one
update statement become much more visible when written as a
derived query.
Temp Tables

The following three types of temporary tables can be used in
transformations in order to hold intermediate results: volatile
tables, global temporary tables, and permanent tables used as
work tables (with rows deleted at the end of the job step).
Taking into account the limitations of the current ETI version in
generating updates from derived tables, a temporary table
should be used for updates:

insert into tmp_tbl1
select tb2.Col1, tb3.Col2, tb4.Col3
from Table2 tb2
Inner join Table3 tb3 On (tb2.Col1 = tb3.Col1)
Inner join Table4 tb4 On (tb2.Col1 = tb4.Col1)
Where tb2.Col4 = 123;
Confidential Page 23 20/06/2007
update Table1 from tmp_tbl1
set Table1.Col3 = tmp_tbl1.Col3
where Table1.Col1 = tmp_tbl1.Col1
and Table1.Col2 = tmp_tbl1.Col2;
delete tmp_tbl1;
It should be mentioned here that both the temporary and target
tables should have the same primary index. If population of the
temporary table and the update of the target table can be
performed within one session, then a global temporary table with
the NO LOG option can be used; otherwise the temporary table
rows should be deleted at the end of the job step.
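A sketch of such a global temporary table for the worked example above (column types are assumptions), mirroring the target table's primary index:

CREATE GLOBAL TEMPORARY TABLE tmp_tbl1, NO LOG
( Col1 INTEGER NOT NULL,
  Col2 INTEGER NOT NULL,
  Col3 INTEGER )
PRIMARY INDEX (Col1, Col2)      /* same primary index as target Table1 */
ON COMMIT PRESERVE ROWS;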
Updating Most Rows in a Table

If a query updates a significant proportion of the rows in a table
(especially a large one) it is likely that the query should be
redesigned, since activity of this type generally performs poorly.
Redesigning this kind of query as an insert into an empty
temporary table from a select query often yields performance
improvements of an order of magnitude or greater. Once the
insert into the temporary table is complete, the original table can
be dropped and the temporary table renamed as the original.
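A sketch of the insert-and-swap approach (table names and the applied transformation are hypothetical):

INSERT INTO DB1.Table1_New
SELECT Col1, Col2,
       'NEW_VALUE'              /* the "update", applied during the copy */
FROM DB1.Table1;

DROP TABLE DB1.Table1;
RENAME TABLE DB1.Table1_New TO DB1.Table1;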
Correlated Subqueries

Correlated subqueries (i.e. subqueries that reference a column in
a table from the base query), if specified on inequality conditions,
are generally an extremely inefficient way to process large
datasets. They perform row-by-row operations which can be very
high cost and are rarely an appropriate solution. In general these
are to be used in conjunction with the (NOT) EXISTS predicate for
performing exclusion merge joins.
In/Not In Subqueries

Queries which do delta processing through the use of NOT IN
subqueries should generally be written as left outer joins, or as
exclusion joins with the NOT EXISTS predicate. Non-delta
processing may make use of subqueries to provide the not-in
list, but this should only be done if the list of values returned by
the subquery is reasonably low (e.g. less than 250K rows). If the
expected dataset of the subquery is greater than 250K then NOT
EXISTS should be used.
Bad:

Insert into Table1 (Col1, Col2)
Select Tb2.Col1, Tb2.Col2
From Table2 tb2
Where Tb2.Col1 not in (Select Col1 from Table1);

Good:

Insert into Table1 (Col1, Col2)
Select Tb2.Col1, Tb2.Col2
From Table2 tb2
Left outer join Table1 tb1 On (tb1.Col1 = tb2.Col1)
Where Tb1.Col1 is null;
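When the subquery's expected dataset exceeds 250K rows, a NOT EXISTS form of the same delta load (a sketch) would be:

Insert into Table1 (Col1, Col2)
Select Tb2.Col1, Tb2.Col2
From Table2 tb2
Where Not Exists
( Select 1 From Table1 tb1 Where tb1.Col1 = tb2.Col1 );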
Queries which use IN subqueries should generally be written as
inner joins, except where the subquery returns one, or very few
records.
NOTE: Only rewrite IN subqueries as inner joins if the table
column in the subquery is unique. If the column(s) are
non-unique, duplicate records will be introduced, corrupting the
query results.
Bad:

Select Tb2.Col1, Tb2.Col2
From Table2 tb2
Where Tb2.Col1 in (Select Col1 from Table1);

Good:

Select Tb2.Col1, Tb2.Col2
From Table2 tb2
Inner join Table1 tb1 On (tb1.Col1 = tb2.Col1);
Implement Distinct as Group By
Often the way Teradata performs a select distinct produces large
spool files and results in a less efficient plan than using a group
by to perform the same task. It is preferable to code your
DISTINCT as GROUP BY as follows:
Less Desirable:

Select Distinct Tb2.Col1, Tb2.Col2
From Table2 tb2;

More Desirable:

Select Tb2.Col1, Tb2.Col2
From Table2 tb2
Group by Tb2.Col1, Tb2.Col2;
Tester Queries
Be specific when checking transformation results. When checking
that the records contained within a staging table have been
correctly applied to a target table (especially with large
staging/target tables), do not join the tables together to check
this. Joining the tables will produce large result sets (often
millions of rows), and you will only ever eyeball a few of these
records, so why do the join?
A much better way of performing this check is to select a sample
of the records that were added/ modified by the staging to target
job from the target table. To do this simply do a SELECT against
the target table qualifying the batch number to be the one just
loaded. Add the sample modifier to your query to return a
random sample of say 10 rows (TIP: Use sample whenever you
want to get a representative set of records from a large table).
Select * from DB1.TABLE1
Where Load_Batch_No = 33400
SAMPLE 10;
This will return a random sample of 10 rows that should have
corresponding values present in the staging table. To check the
contents of the target table against the staging table, run a
series of short, simple queries against the staging (source) table
to see what was in the corresponding staging records,
substituting in the appropriate key values from the results of the
SELECT performed above.
Select * from DB1.STG1
Where ADDR_LN_ONE = 'some data from the corresponding field in the target table'
And ADDR_LN_TWO = 'some data from the corresponding field in the target table'
And ADDR_LN_THREE = 'some data from the corresponding field in the target table';
You then check that appropriate columns in the second query
(against staging table) match those in the first query (against
target table). This will be much faster than joining the tables:
you do the 'join' yourself, on only those records that matter!
5. CODING PROCESS
CURRENT_DATE and DATE
CURRENT_DATE is to be used in SQL in preference to DATE.
CURRENT_DATE is ANSI compliant, and whilst DATE is supported
by Teradata, it is only currently supported for backwards
compatibility (may not be supported in future releases).
ROW_NUMBER and CSUM
SQL function ROW_NUMBER is to be used in preference to
CSUM. ROW_NUMBER is ANSI compliant, and whilst CSUM is
supported by Teradata, it is only currently supported for
backwards compatibility (may not be supported in future
releases).
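A brief sketch of the two forms (table and columns hypothetical):

/* Deprecated Teradata-specific form: */
Select Emp_Id, CSUM(1, Salary DESC)
From Employees;

/* ANSI-compliant equivalent: */
Select Emp_Id, ROW_NUMBER() OVER (ORDER BY Salary DESC)
From Employees;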
ANSI RANK and Teradata RANK
SQL window function RANK is to be used in preference to
Teradata RANK function. The ANSI RANK window function is
recommended by Teradata in preference to the non ANSI
Teradata RANK function. The latter is only currently supported
for backwards compatibility (may not be supported in future
releases).
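A brief sketch of both forms (names hypothetical):

/* Deprecated Teradata RANK function: */
Select Emp_Id, RANK(Salary DESC)
From Employees;

/* ANSI RANK window function (preferred): */
Select Emp_Id, RANK() OVER (ORDER BY Salary DESC)
From Employees;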
Functions & Case in Join Conditions
Avoid using Case statements and functions in table join
conditions, unless it is absolutely necessary. If the tables being
joined contain unmatched data types, rather than using the
CAST function in the join condition, a derived table or a
temporary table into which the column is first cast may be a
more appropriate solution to this kind of problem.
Datatypes in Join Conditions
Wherever possible, columns involved in join conditions should
be similar in data type. For example, Character to Character
comparisons (of any length, and regardless of whether they are
VARCHAR or CHAR) is acceptable. Numeric to numeric (such as
INTEGER to DECIMAL) is also acceptable. Date to Date, Time to
Time, etc… If the columns are not similarly typed, the SQL
parser will throw in a CAST function. The CAST operation may
or may not make logical sense for the required purpose, and
additionally there will be a performance degradation (see
Functions & Case in Join Conditions above).
When there is a choice, it is preferable to use numeric columns
in join conditions, as character comparisons are always more
CPU intensive.
Join optimisation When possible, columns involved in the join conditions should be the primary indices of
each of the tables involved in that join. This is easiest to adhere to when using intermediary
temporary tables as it is then that the developer has maximum control of index choice. If it
is not possible to choose the primary index of both tables involved, it still helps if at least
one of the tables in the join has an index which best fits that join.
For example, assume we know the following SQL is required:

SELECT TB1.COL1, TB2.COL2
FROM PREFIX1.TABLE1 TB1
INNER JOIN PREFIX2.TABLE2 TB2
ON TB1.COL1 = TB2.COL2
AND TB1.COL3 = TB2.COL4;
We should then strive to have corresponding table definitions as follows:
CREATE SET TABLE PREFIX1.TABLE1, NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL
(
  ...
)
PRIMARY INDEX (COL1, COL3);
and:
CREATE SET TABLE PREFIX2.TABLE2, NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL
(
  ...
)
PRIMARY INDEX (COL2, COL4);
Additionally, if the indices are unique primary indices,
performance will be further enhanced.
Avoid too many joins, in particular when at least one of the
tables is large (e.g. history tables). It is better to break such
a query up into several temporary table populations, where
the number of joins can be minimised and statistics are
collected along the way.
Index Selection
Index selection for the final target table will generally be driven
by the expected end-user usage of the table. This may not
necessarily equate to load optimisation (which should be a
secondary consideration). However, the developer should
always consider load optimisation when creating Temporary
tables.
Primary index choice for faster data loading:
1. Choose an index composed of the columns which will be
used in subsequent joins (see Join optimisation above)
2. Choose an index which gives as high level of uniqueness as
possible. This helps Teradata to better distribute the work
load across the AMPS.
3. Use the following query (preferably against production data)
to see how well (or not) distributed a given index is:

SELECT HASHAMP(HASHBUCKET(HASHROW(column1, …, columnI, …, columnN))),
       COUNT(*)
FROM PREFIX.TABLE
GROUP BY 1;

where:
column1 is the first column of the chosen index,
columnI is the Ith column of the chosen index,
columnN is the Nth column of the chosen index,
PREFIX is the table prefix for the table whose index is being
prototyped, and
TABLE is the table whose index is being prototyped.
A result set will be produced where the first column is an AMP
identifier, and the second column is how many rows from the
table would be stored on that AMP given the chosen index. It
is also possible to prototype a non-existent table by using the
above functions on the subquery, where the subquery is the
SQL which would otherwise be used to populate the non-
existent table.
Use the above result to check for bad distribution (also
known as skewed distribution).
6. STATISTICS
In order for the optimiser to produce a good query plan statistics
must be collected on all tables and columns involved in a query
as per the following list:
Primary Indexes of all tables in a query
All columns involved in join conditions
All columns involved in row selection conditions (where
clause)
When processing SQL that joins 2 or more tables, Teradata's
choice of join plan is totally dependent on its knowledge of the
values of the data in the columns referenced in the SQL.
Statistics must be collected:
On any column which is involved in the Join condition.
On any column which is part of the 'WHERE' condition in the
SQL.
After the data has been loaded, or reloaded, or significantly
updated.
If statistics are not collected or are not current, and the wrong
plan is used by Teradata, then many thousands of CPU seconds
can be used instead of a few hundred. The elapsed time of
queries is frequently reduced from hours to minutes through
judicious collection of statistics.
To determine if stats have been collected use the help statistics
command as per the below example:
help statistics DB1.TABLE1;
*** Help information returned. 11 rows.
*** Total elapsed time was 1 second.
Date Time Unique Values Column Names
-------- -------- -------------------- ------------------------------------
02/07/30 02:27:15 63,677,477 TB_Id
02/07/30 02:27:20 17 Tp_Cd
02/07/30 02:28:53 39,595,243 Cust_id
02/07/30 02:28:59 15,135 Cust_Org_Unit
If statistics you require for your query are not listed, contact the
DBA group and let them know exactly what table/column/index
requires stats to be collected and they will perform the collection
and get back to you.
To determine if statistics have been collected use the help statistics command as per the below
example:
Help statistics PREFIX.party;
*** Help information returned. 11 rows.
*** Total elapsed time was 1 second.

Date     Time     Unique Values        Column Names
-------- -------- -------------------- ------------------------------
02/07/30 02:27:15 63,677,477           cust_Id
02/07/30 02:27:20 17                   Tp_Cd
02/07/30 02:28:53 39,595,243           cust_id
02/07/30 02:28:59 15,135               cust_Org_Unit
02/07/30 02:29:08 182                  Carr_Cd
02/07/30 02:29:13 144                  Ctry_Cd
02/07/30 02:29:55 3                    Batch_run
If required statistics for a given query are not listed, it may be
necessary to execute an appropriate collect statistics statement
(see below).
Collect Statistics specifying columns/indices ("explicit"), some examples:

COLLECT STATISTICS ON PREFIX.TABLE COLUMN Col1;
COLLECT STATISTICS ON PREFIX.TABLE COLUMN (Col1, Col2);
COLLECT STATISTICS ON PREFIX.TABLE INDEX (Col2, Col3, Col4);
This type of collect statistics statement must run against a
table before the “implicit” type (below) can be executed. It will
not only collect the nominated statistics for the specified table,
but will also store information in Teradata which allows for the
“implicit” collect statistics to run.
Collect Statistics without specifying columns/indices (“implicit”):
COLLECT STATISTICS ON PREFIX.TABLE;
This type of collect statistics assumes that all of the columns
that were individually nominated in previous collect statistics
statements are to have statistics re-collected.
Please remember that statistics are ESSENTIAL for the
efficient execution of SQL, and this should ALWAYS be
the first thing checked when developing or tuning a
query.
7. PERFORMANCE OPTIMIZATION
Partitioned Primary Index (PPI) Advantage

A PPI increases query efficiency by avoiding full table scans,
without the overhead and maintenance costs of secondary
indexes.
For example, assume a sales data table has 5 years of sales
history.
- A PPI is placed on this table which partitions the data into
60 partitions (one for each month of the 5 years).
A user executes a query which examines two months of sales
data from this table. With a PPI, this query only needs to read
2 of the 60 partitions on each AMP: only 1/30 of the table has
to be read.
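A sketch of such a table definition (names and columns hypothetical), partitioned by month across the 5 years, together with a two-month query that reads only 2 of the 60 partitions:

CREATE SET TABLE DB1.Sales_Hist, NO FALLBACK
( Sale_Id INTEGER NOT NULL,
  Sale_Date DATE NOT NULL,
  Amount DECIMAL(12,2) )
PRIMARY INDEX (Sale_Id)
PARTITION BY RANGE_N(Sale_Date BETWEEN DATE '2002-01-01'
                     AND DATE '2006-12-31'
                     EACH INTERVAL '1' MONTH);

/* Reads only the Nov and Dec 2006 partitions: */
SELECT SUM(Amount)
FROM DB1.Sales_Hist
WHERE Sale_Date BETWEEN DATE '2006-11-01' AND DATE '2006-12-31';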
Range queries can be executed on tables without secondary
indexes. With previous Teradata releases, a value-ordered NUSI
could be used to help increase the performance of queries that
qualify ranges of rows, but NUSIs entail subtable perm space
and maintenance overhead. The more partitions there are, the
greater the potential benefit.
Potential Disadvantage
A query specifying a PI value, but no value for the partitioning
column, must look in each partition for that value.
When joining, if one of the tables is partitioned, the rows won't
be ordered the same, and the task, in-effect, becomes a set of
sub-joins, one for each partition of the PPI table.
The disadvantage is proportional to the number of partitions,
with fewer partitions being better than more partitions.
Access with and Without Partitioning
8. Conclusion
This paper highlights the significance of Teradata best practice
methodologies, which have been evaluated to do away with the
high maintenance cost of re-writing scripts. The options have
been examined across the relevant factors and the best options
have been recommended.