Self Reference Teradata


Transcript of Self Reference Teradata

Page 1: Self Reference Teradata

1)what is the adv of Soft RI to normal RI ?

-> index subtable is not required

2)what if RI is defined on two populated tables ? -> violating rows are put into error table..

3)While inserting a row in child table insert fails as the value is not found in parent table.. what can be done to make the insert successful? -> make the value to null -> insert it in parent table -> get a value which is present in Parent table

5) Update fails.... what happens??? -> ins stmt will be rolled back -> update will be rolled back. -> locks will be released after processing rollback of update -> locks will not be released...

6)what is joinback?? ->join index is joined back to base table.

7) summary tables -> aggregate join index.

8)what is true abt global temp tables ? -> uses temp space of user logged in -> can use WITH DATA option -> Dic info is stored DD

9) Generate always identity col... max 100 min 1

-> cannot insert more than 100 rows.
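The identity-column behaviour in question 9 can be sketched in plain Python (not Teradata syntax; the class and names here are hypothetical): a GENERATED ALWAYS identity column with MINVALUE 1, MAXVALUE 100 and NO CYCLE runs out of numbers after 100 rows, so the 101st insert fails.

```python
# Hypothetical sketch of an identity generator with MINVALUE 1,
# MAXVALUE 100, NO CYCLE - illustrating why inserts fail once the
# value range is exhausted.
class IdentityColumn:
    def __init__(self, minvalue=1, maxvalue=100):
        self.next_value = minvalue
        self.maxvalue = maxvalue

    def generate(self):
        if self.next_value > self.maxvalue:
            # Teradata raises a numbering error once the range is used up
            raise OverflowError("identity range exhausted")
        value = self.next_value
        self.next_value += 1
        return value

ident = IdentityColumn()
rows = [ident.generate() for _ in range(100)]  # 100 inserts succeed
# ident.generate() here would raise OverflowError: no 101st row
```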

10) sel * from t1 where upi=1; sel * from t1 where upi=2;

11) T1 right join T2 left join T3

12) What are inner tables and what are outer tables?

13) Triggers ORDER clause - when is it used? -> Triggering action should be same -> Triggered stmts should be same -> WHEN condition must be same

Page 2: Self Reference Teradata

14) order -> LDM, ELDM, PDM, designing app, developing app, assurance testing...

15)denormalization comes into picture in -> LDM,ELDM,PDM,

16) match the following: CSUM, remaining sum, moving sum... 3 SQLs: sum(a1) rows 2 preceding; sum(a1) rows unbounded preceding; sum(a1) rows unbounded following...
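The three window frames in question 16 map onto ANSI windowed SUM. Here is a small sketch using Python's sqlite3 (which supports ANSI window frames; Teradata's older CSUM/MSUM syntax differs, and the table data here is made up) showing which frame produces the cumulative, moving, and remaining sums:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (a1 INTEGER)")
con.executemany("INSERT INTO t1 VALUES (?)", [(10,), (20,), (30,), (40,)])

# Cumulative sum (CSUM equivalent): ROWS UNBOUNDED PRECEDING
csum = [r[1] for r in con.execute(
    "SELECT a1, SUM(a1) OVER (ORDER BY a1 ROWS UNBOUNDED PRECEDING) "
    "FROM t1 ORDER BY a1")]
# Moving sum over the current row and the 2 before it: ROWS 2 PRECEDING
msum = [r[1] for r in con.execute(
    "SELECT a1, SUM(a1) OVER (ORDER BY a1 ROWS 2 PRECEDING) "
    "FROM t1 ORDER BY a1")]
# Remaining sum: CURRENT ROW to UNBOUNDED FOLLOWING
rsum = [r[1] for r in con.execute(
    "SELECT a1, SUM(a1) OVER (ORDER BY a1 ROWS BETWEEN CURRENT ROW "
    "AND UNBOUNDED FOLLOWING) FROM t1 ORDER BY a1")]

print(csum)  # [10, 30, 60, 100]
print(msum)  # [10, 30, 60, 90]
print(rsum)  # [100, 90, 70, 40]
```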

17)when does opt goes for nested join ?

18)what are join types ? cross join, inner join , outer join

19) sel * from dept where mgr not in (sel mgr from employee); mgr in dept is null.. what is the result of the above query??
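The NOT IN behaviour behind question 19 can be demonstrated with any SQL engine. A sqlite3 sketch (the table contents are made up) showing that when the subquery returns a NULL, NOT IN selects no rows at all:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dept (mgr INTEGER)")
con.execute("CREATE TABLE employee (mgr INTEGER)")
con.executemany("INSERT INTO dept VALUES (?)", [(1,), (None,)])
con.executemany("INSERT INTO employee VALUES (?)", [(2,), (None,)])

# x NOT IN (2, NULL) evaluates to UNKNOWN for every x (and a NULL x
# compares as UNKNOWN anyway), so no rows qualify.
rows = con.execute(
    "SELECT * FROM dept WHERE mgr NOT IN (SELECT mgr FROM employee)"
).fetchall()
print(rows)  # []
```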

20)TWB ??? automatic restart, multiple tables can be loaded, data can be read from multiple sources..

21) adv of NO CHECK option over WITH CHECK OPTION in RI??? -> index subtable is not built... -> join elimination possible (possible in both cases)

22)update cursors are allowed in -> Stored procedures -> PP2

23) locking t1 for write sel * from t1 where upi=5 will place -> write lock on table -> write lock on row hash

24) NUSI on single col stats are stored in -> Dbc.tvfields

25) which two restrict users access to db object ?

26) create procedure (a in, b inout, c out) which of the following are true??

c = a+1 b = c+1 c = a+b

Page 3: Self Reference Teradata

b = a+1

rows/NUPI ??

-> Stored procedures ->

Development Life Cycle

LDM Overview
A Logical Data Model (LDM) is a relational representation of a business enterprise. The logical data model is a collection of two-dimensional, column-and-row tables that represent the real world business enterprise.

LDM—Normalization
Normalization is a technique for placing non-key attributes (columns) into tables in order to minimize redundancy, provide flexibility, and eliminate update anomalies.

Process of designing the database. Testing the design or model. Consists of three main forms (rules) to which tables must adhere.

First Normal Form
Eliminate the practice of “repeating groups.” The relationship of the column to the Primary Key should be one to one. The PK should be unique. If there are multiples in the column, you need to reconsider your choice of Primary Key. The solution here is to find a more definitive PK and redescribe the table in this manner, so each occurrence is in its own row.

Attributes must not repeat within a table. No repeating groups. 1:1 relationship of column to PK.
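A quick sketch of the 1NF rule above (the employee/phone data is hypothetical): a row with a repeating group of phone columns becomes one row per occurrence, keyed by a more definitive PK.

```python
# Hypothetical denormalized row with a repeating group (phone1..phone3).
denormalized = {"emp_id": 42, "phone1": "555-0001",
                "phone2": "555-0002", "phone3": None}

# First Normal Form: one row per (emp_id, phone) occurrence;
# empty slots simply produce no row.
first_nf = [(denormalized["emp_id"], denormalized[col])
            for col in ("phone1", "phone2", "phone3")
            if denormalized[col] is not None]
print(first_nf)  # [(42, '555-0001'), (42, '555-0002')]
```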

The rule of Second Normal Form is not necessary in all cases. It only comes into play when a table has a multicolumn Primary Key. In this situation, focus your attention on the combination of columns in the PK. If the placement came as a result of only one column of the PK, then these columns must be removed. The result here is that you may be establishing a new entity. Remember that the data column is important and you don't want to just throw it away.

Page 4: Self Reference Teradata

An attribute must relate to the entire Primary Key, not just a portion.

Tables with a single-column Primary Key (entities) are always in Second Normal Form.

The final test for the three forms of Normalization is a test of whether or not each data column is directly related to the Primary Key of the table, or whether the column is related to another column in the table. This situation usually happens when we have included both the identification code for a set of data and its description. Sometimes we do this to keep data together, but in doing so we often lose the value of the information. For instance, if we keep the job code of the employee along with the job description, these two columns are related to each other, so that if one changes both need to change. When there are no employees with this job code, we lose the definition of the job and its description. They should be kept in their own table for reference. This situation may uncover minor entities that you have forgotten.

Attributes must relate to the Primary Key and not to each other. Cover up the PK and any No Duplicate columns; remaining columns must not describe each other.
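The job-code example above can be sketched as a 3NF decomposition (all names and data here are hypothetical): job_desc depends on job_code, not on the employee PK, so it moves to its own reference table, and the job definition survives even when no employee currently holds it.

```python
# Hypothetical 3NF decomposition: job descriptions live in their own
# reference table instead of repeating on every employee row.
employees = [("e1", "J10"), ("e2", "J10")]   # (emp_id, job_code)
jobs = {"J10": "Analyst", "J20": "Auditor"}  # job_code -> description

# Deleting every J10 employee no longer loses the J10 definition.
employees = [e for e in employees if e[1] != "J10"]
print(employees)      # []
print(jobs["J10"])    # Analyst - definition preserved
```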

Issues with Normalization

If you normalize to first normal form, the table will have many more rows and you will have to do many UNIONS or JOINS or CASE operations to get viable reports. When the data is not in the first normal form it is limited to pre-planned occurrences.

The impact of implementing second normal form is that the data is now separated and therefore you will have to do more joins to get the data in presentable form. When the data is not in second normal form, you have anomalies in the update process on the tables.

The impact of implementing third normal form is again the necessity of doing additional joins to get the data for presentation. The impact of not going to 3NF is the potential of missing business data because of the improbable location of minor entities.

Benefits of normalization
• 3NF can accurately model business relationships.
• Supports ability to ask any question (ad hoc), not just known questions.
• Supports function of the warehouse as repository of detail that is needed to offer complete analysis against any data at any time.

Page 5: Self Reference Teradata

Repeating Groups are attributes in a denormalized table that would be rows if the table were normalized.

Temporary tables are created for specific purposes, but are not a part of the Logical Model. As such, they can be denormalized. This will provide performance benefits while keeping the Logical Model pure.

Tables can be divided into sub-entity tables with fewer columns. Many SQL queries will run faster against the sub-entities. It is more difficult to do FDLs, INSERTs and DELETEs since you have to deal with several tables instead of just one. UPDATEs may also require more effort.

Teradata stores password information in encrypted form in the DBC.DBase system table. The PasswordString column from the DBC.DBase table displays encrypted passwords. The DBC.Users view displays PasswordLastModDate and PasswordLastModTime.

DBC.SysSecDefaults has all the prerequisites for the password given to a user.

GRANT/REVOKE LOGON Statements: DBC.LogonRule …..macro

User Privileges (Access Rights)
Teradata stores access rights information in the system table DBC.AccessRights.

The Extended logical data model is an “extension” to the original logical data model and is used by the physical database designer to select indexes.

Information provided by the ELDM results from user input about transactions and transaction rates.

Value Access Frequency: how often the table is accessed based on a specific value for this column (Or group of columns).

Value Access R * F (Rows * Frequency): how many rows will be accessed multiplied by the frequency of access.

Join Access Frequency: how often the table is joined to another table based on the value that appears in this column (or group of columns).

Join Access Rows: how many rows will be joined multiplied by the frequency of access.
Distinct Values: number of different values that can be found in this column.
Maximum Rows per Value: number of rows that can be found for the value occurring most often in this column. (Nulls are not included in this entry.)
Rows per Null Value: number of rows that can be found having a NULL value in this column.

Page 6: Self Reference Teradata

Typical Rows per Value: number of rows that can be found for a typical value in this column.

Change Rating: A relative rating for the frequency with which the values in this column change.

PDM Overview

Teradata's parallelism is never more apparent than when performing queries involving set manipulation.

Another way to take advantage of the parallelism, and flexibility, of Teradata, is by consolidating data on Teradata instead of preprocessing it on the client or host side prior to data loads. Load it first and process it later on Teradata!

For the Teradata RDBMS, a PDM is designed:
Based on the LDM, to retrieve and manipulate tables, rows, and columns as physical objects.
To provide optimal performance using Teradata.
Identifies index structures as access paths to the data.

– Primary Index
» Mandatory and only one allowed per table.
» Can contain up to 16 columns.
» Chosen for either access or row distribution.
» Can be Unique (UPI) or Non-Unique (NUPI).
» Query access is always a one-AMP operation.
» Can contain NULLs.
» Values can be updated.

– Secondary Index
» Optional and can be as many as 32 to a table.
» Can contain up to 16 columns each.
» Can be Unique (USI) or Non-Unique (NUSI).
» Chosen for improved access to data.
» Stored as a sub-table.
» Can be created/dropped dynamically.
» Can contain NULLs.
» Can be used to enforce uniqueness.

Need not be based on the LDM ("de-normalization"); however, conforming to these rules is strongly advised. Relationships between tables are seen only through queries on them; an FK is not required for joins. The Teradata RDBMS is extremely flexible in allowing you the freedom to perform any kind of processing at any time! When developing applications, use set manipulation when possible to maximize parallelism.

Page 7: Self Reference Teradata

LDM = Logical Data Model
ELDM = Extended Logical Data Model
PDM = Physical Data Model
ETL = Extract, Transformation, and Load

Working with the DBAs
Capturing SQL
In some cases, it may be helpful to capture SQL. You or the DBA can do this by:
Using access logging
Modifying TDP exit routines

Data Collection for the Application Developer: The Basics

Periodic Collection of Disk Summary by Workload Group
Current, Peak and Max table statistics for Perm, Spool and Temp tables, found in DBC.DiskSpace.
Note: Peak statistics, by default, accumulate from day 0, and therefore have no real value. However, if you reset peak statistics every hour, you will see variations in peak statistics from hour to hour, day to day, etc.
Periodic Collection of Table Growth
Mbytes and number of rows.
Collect User Logons per day per workload
Collect and summarize application User logon/logoff history found in DBC.LogonOff. Mapped to query throughputs for predictions based on knowing user logon increases.

PP2—What is a Preprocessor?

The Teradata Preprocessor is a precompiler for programs that access the Teradata database.

Page 8: Self Reference Teradata

The precompiler is an entity that parses application source code and replaces all embedded SQL it finds with calls that are acceptable to the native compiler for the host language.

The preprocessor is the precompiler plus the services that execute, or provide runtime support for, the compiled application. It is a runtime module that is used at program execution.

The Preprocessor logs on to Teradata, syntax checks your SQL statements, and verifies table names, columns, and access rights with the Data Dictionary.

The Preprocessor builds code in data division for data elements. It builds code in procedure division to handle passing your SQL statements to Teradata. It will comment out the source code it uses, and it outputs the COBOL source for input to the compiler.

There are two transaction modes for which a program may be precompiled: COMMIT and BTET.

Using the COMMIT keyword allows you to break your program into multiple transactions. The Begin Transaction and End Transaction statements are placed to group small transactions into larger ones by nesting them.

What is Dynamic SQL?
Dynamic SQL means the program dynamically builds SQL requests at runtime. The Preprocessor doesn't see the SQL requests until runtime.

Because static SQL has limited parameter capabilities, dynamic SQL helps overcome this limitation.

How Does CLI Work?
CLIv2 sends requests to the Teradata server, and provides the application with a response returned from the server.

CLIv2 provides support for:
Managing multiple serially-executed requests on a session.
Managing multiple simultaneous sessions to the same or different Teradata servers.
Using cooperative processing, so that the application can perform operations on the client and the Teradata server at the same time.
Communicating with two-phase commit coordinators for CICS and IMS transactions.
Generally insulating the application from details of communicating with a Teradata server.
CLI minimizes overhead and allows developers to create applications with efficient interfaces.

Page 9: Self Reference Teradata

Parser

DDL Requests are not cached because they are not considered repeatable. For example, you would not be able to repeat the same CREATE TABLE Request.

The Parser parses the SQL statement and prepares AMP steps to execute the request. It also provides a cache for the SQL and the AMP execution plan.

There are two important reasons to use Macros whenever applicable: Macros reduce parcel size, thus dramatically improving performance. Macros will increase the likelihood of matching the R-T-S cache because users won't have to re-enter their SQL.

Page 10: Self Reference Teradata

If an identical Request exists in Request-To-Steps cache:
– Call SECURITY and GNCAPPLY and pass them the memory address of the AMP steps.
– These steps do not have DATA parcel values bound into them; they are called Plastic steps.

Otherwise, the Request Parcel passes the request to the SYNTAXER. The larger the Request Parcel, the longer these steps take. Macros reduce parcel size, dramatically improving performance.

Request-To-Steps Cache Logic

The entire Request Parcel is put through the hashing algorithm to produce a 32-bit hash of the parcel. If there is an entry in R-T-S cache with the same hash value, the system must do a byte-by-byte comparison between the incoming text and the stored text to determine if a true match exists. The larger the size of the Request Parcel, the longer these steps take.
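The hash-then-compare cache lookup described above can be sketched in Python (this is a toy model, not Teradata internals; CRC-32 stands in for the actual 32-bit parcel hash):

```python
import zlib

cache = {}  # 32-bit hash -> list of (request_bytes, cached_steps)

def store(request: bytes, steps):
    cache.setdefault(zlib.crc32(request), []).append((request, steps))

def lookup(request: bytes):
    h = zlib.crc32(request)  # stand-in for the parcel hash
    for stored_text, steps in cache.get(h, []):
        if stored_text == request:  # byte-by-byte comparison on a hash hit
            return steps
    return None  # no true match: the request must be parsed

store(b"SELECT * FROM t1", "plastic-steps-1")
print(lookup(b"SELECT * FROM t1"))  # plastic-steps-1 (cache hit)
print(lookup(b"SELECT * FROM t2"))  # None (must be parsed)
```

Note how a hash collision alone is not enough for a hit: the stored text must match byte for byte, which is why larger Request Parcels make this step slower.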

Syntaxer
The Syntaxer checks the syntax of an incoming Request parcel for errors. If the syntax is correct, the Syntaxer produces an initial Parse Tree, which is then sent to the Resolver.

Resolver

The Resolver takes the initial Parse Tree from the Syntaxer and replaces all Views and Macros with their underlying text to produce the Annotated Parse Tree. It uses DD information to "resolve" View and Macro references down to table references. The DD tables shown in the diagram on the right-hand page (DBase, Access Rights, TVM, TV Fields and Indexes) are the tables that the Resolver utilizes for information when resolving DML requests.

Page 11: Self Reference Teradata

Nested Views and Macros can cause the Resolver to take substantially more time to do its job. The nesting of views (building views of views) can have a very negative impact on performance. At one site, what a user thought was a two-table join was actually a join of two views which were doing joins of other views, which were doing joins of other views, which were doing joins of base tables. When resolved down to the table level, the two “table” join was really doing a 12-table join. The data the user needed resided in a single table.

DBC.Next is a DD table that consists of a single two-column row, used to assign a globally unique numeric ID to every Database/User, Table, View and Macro. DBC.Next always contains the next value to be assigned to any of these. Think of the two columns as counters for ID values.

The DD keeps track of all SQL names and their numeric IDs. The RESOLVER uses the DD to verify names and convert them to IDs. The AMPs use the numeric IDs supplied by the RESOLVER.

DD Cache

The dictionary cache is a buffer in parsing engine memory that stores the most recently used dictionary information. These entries, which also contain statistical information used by the Optimizer, are used to convert database object names to their numeric IDs.

The statement, or request, cache stores successfully parsed SQL requests so they can be reused, thereby eliminating the need for reparsing the same request parcel. The cache is a PE buffer that stores the steps generated during the parsing of a DML statement. The statement cache is checked at the start of the parsing process, before the Syntaxer step, but after the Request parcel has been checked for format errors. If the system finds a matching cache request, the Parser bypasses the Syntaxer, Resolver, Optimizer, and Generator steps, performs a security check (if required), and proceeds to the Apply stage.

The DD Cache is part of the cache found on every PE. It stores the most recently used DD information including SQL names, their related numeric IDs and Statistics.

The DD tables that provide the information necessary to parse DML requests are:
• DBase
• TVM
• Access Rights
• TVFields
• Indexes.

Optimizer
The Optimizer analyzes the various ways an SQL Request can be executed and determines which is the most efficient. It acts upon the Annotated Parse Tree after Security has verified the permissions and generates an Optimized Parse Tree.

DD operations replace DDL statements in the Parse tree.
• The OPTIMIZER evaluates DML statements for possible access paths:

– Available indexes referenced in the WHERE clause.

Page 12: Self Reference Teradata

– Possible join plans from the WHERE clause.
– Full Table Scan possibility.

• It uses COLLECTed STATISTICS or dynamic samples to make a choice.
• It generates Serial, Parallel, Individual and Common steps.
• OPTIMIZER output passes to either the Generator or the EXPLAIN facility.
• Additional steps are needed for Check Constraints and Triggers.

Generator
The Generator acts upon the Optimized Parse Tree from the Optimizer and produces the Plastic Steps. Plastic Steps do not have data values from the DATA parcel bound in, but do have hard-coded literal values embedded in them. Plastic Steps produced by the Generator are stored in the R-T-S Cache unless a request is not cacheable.

Page 13: Self Reference Teradata

Parser Summary

Plastic Steps for Requests with DATA parcels are cached immediately.
Views and Macros cannot be nested beyond eight levels.
Nested Views and Macros take longer for initial parsing.
Multi-statement requests (including macros) generate more Parallel and Common steps.
Execution plans remain current in cache for up to four hours.
DDL “spoiling” messages may purge DD cache entries at any time.
Requests against purged entries must be reparsed and re-optimized.

Primary Index Choice Criteria

There are three Primary Index Choice Criteria: Access demographics, Distribution demographics, and Volatility. They are listed in the order that they should be considered when selecting a Primary Index.

Access demographics. Access columns are those that would appear (with a value) in a WHERE clause in an SQL statement. Choose the column most frequently used for access to maximize the number of one-AMP operations. Examples of access demographics are join access frequency and value access frequency.

Distribution demographics. The more unique the index, the better the distribution. Optimizing distribution optimizes parallel processing. In choosing a Primary Index, there is a trade-off between the issues of access and distribution. The most desirable situation is to find a PI candidate that has good access and good distribution. Many times, however, index candidates offer great access and poor distribution or vice versa. When this occurs, the physical designer must balance these two qualities to make the best choice for the index.

An important point to note here is that even distribution of data will not only speed up processing, but will also avoid premature 2644 (“Out of PERM space”) and 2646 (“Out of SPOOL space”) errors. Although the PERM space limit for a database may be defined as 50 GB, the real limit is 50 GB divided by the number of AMPs in the system. Assume that the system has 50 AMPs: even though this database may have plenty of spare space, as soon as you try to use more than 1 GB of PERM space on any single AMP, Teradata will reject the INSERT and return a 2644 error code. (The same considerations apply to SPOOL space.)
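The arithmetic above is simply the database allocation divided evenly across AMPs:

```python
# Worked example of the per-AMP PERM limit described above.
perm_limit_gb = 50   # PERM space defined for the database
amps = 50            # AMPs in the system

per_amp_limit_gb = perm_limit_gb / amps
print(per_amp_limit_gb)  # 1.0 -> exceeding 1 GB on any one AMP raises 2644
```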

Volatility: how often the data values will change. The Primary Index should not be very

Page 14: Self Reference Teradata

volatile. Any changes to Primary Index values may result in heavy I/O overhead, as the rows themselves may have to be moved from one AMP to another. Choose a column with stable data values.
Note: If you find that one AMP has higher CPU usage than other AMPs, and that data is highly skewed, this may be evidence of a poor choice of primary index.

Primary Indexes (UPI and NUPI)
A Primary Index may be different than a Primary Key.
Every table has only one Primary Index.
A Primary Index may contain null(s).
Single-value access uses ONE AMP and typically one I/O.
Unique Primary Index (UPI)
Involves a single base table row at most.
No spool file is ever required.
The system automatically enforces uniqueness on the index value.
Non-Unique Primary Index (NUPI)
May involve multiple base table rows.
A spool file is created when needed.
Duplicate values go to the same AMP and the same data block.
Only one I/O is needed if all the rows fit in a single data block.
Duplicate row check for a SET table is required if there is no USI on the table.

Multi-Column Primary Indexes
Advantage
More columns = more uniqueness.
Distinct values increase.
Rows/value decreases.
Selectivity increases.
Better distribution.
Disadvantage
More columns = less usability.
The PI can only be used when values for all PI columns are provided in the SQL statement.
Partial values cannot be hashed.

Page 15: Self Reference Teradata

Good Distribution Demographics for a Primary Index Candidate
Column distribution demographics are expressed in four ways: Distinct Values, Maximum Rows per Value, Maximum Rows NULL, and Typical Rows per Value.

Distinct Values is the total number of different values a column contains. For PI selection, the higher the Distinct Values (in comparison with the table row count), the better. Distinct Values should be greater than the number of AMPs in the system, whenever possible. We would prefer that all AMPs have rows from each table.
Maximum Rows per Value is the number of rows in the most common value for the column or columns. When selecting a PI, the lower this number is, the better the candidate. For a column or columns to qualify as a UPI, Maximum Rows per Value must be 1.
Maximum Rows NULL should be treated the same as Maximum Rows per Value when being considered as a PI candidate.
Typical Rows per Value gives you an idea of the overall distribution which the column or columns would give you. The lower this number is, the better the candidate. Like Maximum Rows per Value, Typical Rows per Value should be small enough to fit on one data block.
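The four demographics above can be computed from a candidate column with a few lines of Python (the column data is made up, and the median is only one reasonable stand-in for "typical" rows per value):

```python
from collections import Counter
from statistics import median

# Hypothetical candidate PI column, including NULLs (None).
column = ["A", "A", "A", "B", "B", "C", None, None]

non_null = [v for v in column if v is not None]
counts = Counter(non_null)                          # rows per value

distinct_values = len(counts)                       # 3
max_rows_per_value = max(counts.values())           # 3 (value 'A')
max_rows_null = column.count(None)                  # 2
typical_rows_per_value = median(counts.values())    # 2
```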

Secondary Indexes

Secondary Indexes are generally defined to provide faster set selection. The Teradata RDBMS allows up to 32 SIs per table.

Secondary Index values, like Primary Index values, are input to the Hashing Algorithm. As with Primary Indexes, the Hashing Algorithm takes the Secondary Index value and outputs a Row Hash. These Row Hash values point to a subtable which stores index rows containing the base table SI column values and Row IDs which point to the row(s) in the base table with the corresponding SI value. The Teradata RDBMS can tell whether a table is an SI subtable from the Subtable ID, which is part of the Table ID.

Subtables store the row hash of the base table secondary index value, the column values, and the Row ID to the base table rows.

Users cannot access subtables directly.

Page 16: Self Reference Teradata

Secondary Index Considerations

SIs require additional storage to hold their subtables. In the case of a Fallback table, the SI subtables are Fallback also; twice the additional storage space is required. SIs require additional I/O to maintain these subtables.

The Optimizer may choose to do a Full Table Scan rather than utilize the NUSI in two cases:
When the NUSI is not selective enough.
When no COLLECTed STATISTICS are available.

As a guideline, choose only those columns having frequent access as NUSI candidates. After the table has been loaded, create the NUSI indexes, COLLECT STATISTICS on the indexes, and then do an EXPLAIN referencing each NUSI. If the Parser chooses a Full Table Scan over using the NUSI, drop the index.

Secondary Index Subtables

This section compares and contrasts examples of Primary (UPIs and NUPIs), Unique Secondary (USIs) and Non-Unique Secondary Indexes (NUSIs).

Primary Indexes (UPIs and NUPIs)
As you have seen previously, in the case of a Primary Index, the Teradata RDBMS hashes the value and uses the Row Hash to find the desired row. This is always a one-AMP operation and is shown in the top diagram on the right-hand page.
Unique Secondary Indexes (USIs)
An index subtable contains index rows, which in turn point to base table rows matching the supplied index value. USI rows are globally hash distributed across all AMPs, and are retrieved using the same procedure for Primary Index data row retrieval. Since the USI row is hash-distributed on different columns than the Primary Index of the base table, the USI row typically lands on an AMP other

Page 17: Self Reference Teradata

than the one containing the data row. Once the USI row is located, it "points" to the corresponding data row. This requires a second access and usually involves a different AMP. In effect, a USI retrieval is like two PI retrievals:
Master Index - Cylinder Index - Index Block
Master Index - Cylinder Index - Data Block.

Non-Unique Secondary Indexes (NUSIs)
NUSIs are implemented on an AMP-local basis. Each AMP is responsible for maintaining only those NUSI subtable rows that correspond to base table rows located on that AMP. Since NUSIs allow duplicate index values and are based on different columns than the PI, data rows matching the supplied NUSI value could appear on any AMP. In a NUSI retrieval (illustrated at the bottom of the right-hand page), a message is sent to all AMPs to see if they have an index row for the supplied value. Those that do use the "pointers" in the index row to access their corresponding base table rows. Any AMP that does not have an index row for the NUSI value will not access the base table to extract rows.
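The AMP-local NUSI retrieval described above can be simulated with a toy model (this is a sketch, not Teradata internals; the AMP count, hash function, and data are all stand-ins):

```python
# Toy simulation: rows hash-distribute to AMPs on the PI; each AMP keeps
# an AMP-local NUSI subtable over its own rows, so a NUSI probe is
# broadcast to every AMP.
AMPS = 4

def amp_for(pi_value):
    # Stand-in for the hashing algorithm + hash map.
    return hash(pi_value) % AMPS

rows = [(1, "NY"), (2, "LA"), (3, "NY"), (4, "SF")]  # (pi, city)

base = {a: [] for a in range(AMPS)}   # base table rows per AMP
nusi = {a: {} for a in range(AMPS)}   # AMP-local NUSI subtable on city
for pi, city in rows:
    a = amp_for(pi)
    base[a].append((pi, city))
    nusi[a].setdefault(city, []).append(pi)

def nusi_select(city):
    # Broadcast to all AMPs; only AMPs holding an index row for the
    # value touch their base table rows.
    out = []
    for a in range(AMPS):
        for pi in nusi[a].get(city, []):
            out.extend(r for r in base[a] if r[0] == pi)
    return out

print(sorted(nusi_select("NY")))  # [(1, 'NY'), (3, 'NY')]
```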

USI Subtable General Row Layout
The layout of a USI subtable row is illustrated at the top of the right-hand page. It is composed of several sections:
The first two bytes designate the row length.
The next 8 bytes contain the Row ID of the row. Within this Row ID: 4 bytes of Row Hash, 4 bytes of Uniqueness Value.
The following 2 bytes are additional system bytes (which will be explained later).
The next section contains the SI value. This is the value that was used by the Hashing Algorithm to generate the Row Hash for this row. This section varies in length depending on the index.
Following the SI value are 8 bytes containing the Row ID of the base table row. The base table Row ID tells the system where the row corresponding to this particular USI value is located.
The last two bytes contain the reference array pointer at the bottom of the block.
The Teradata RDBMS creates one index subtable row for each base table row.

USI Subtable General Row Layout

• USI rows are distributed on Row Hash, like any other row.
• The Row Hash is based on the base table secondary index value.
• The second Row ID identifies the single base table row that carries the secondary index value (probably on a different AMP).
• There is one index subtable row for each base table row.

USI Access
The only difference between this and the three-part message used in PI access is that the Subtable ID portion of the Table ID references the USI subtable, not the data table.

Using the DSW for the Row Hash, the Communication Layer directs the message to the correct AMP which uses the Table ID and Row Hash as a logical index block identifier and the Row Hash and USI value as the logical index row identifier. If the AMP succeeds in locating the index row, it

Page 18: Self Reference Teradata

extracts the base table Row ID ("pointer"). The Subtable ID portion of the Table ID is then modified to refer to the base table and a new three-part message is put onto the Communications Layer.

Once again, the Communication Layer uses the DSW to identify the correct AMP. That AMP uses Table ID and Row Hash to locate the correct data block and then uses Row Hash and Uniqueness Value (Row ID) to locate the correct row.

NUSI Subtable General Row Layout
There are, however, two major differences:

First, NUSI entries are not distributed by the Hash Map. NUSI subtable rows are built from the base table rows found on that particular AMP and refer only to the base rows of that AMP.

Second, NUSI rows may point to more than one base table row. There can be many base table Row IDs in a NUSI row. Because NUSIs are always AMP-local to the base table rows, it is possible to have the same NUSI value represented on multiple AMPs. A NUSI subtable is just another table from the perspective of the file system.

Page 19: Self Reference Teradata

NUSI Subtable General Row Layout

• The Row Hash is based on the base table secondary index value.
• The other Row IDs identify the base table rows on this AMP that carry the Secondary Index Value.
• There are one or more subtable rows for each secondary index value on the AMP.
• The Row IDs “point” to base table rows on this AMP only.
• The maximum size of a single NUSI row is 64 KB.

Single NUSI Access (Between, Less Than, or Greater Than)

Utilize the NUSI and do a Full Table Scan (FTS) of the NUSI subtable. In this case, the Row IDs of the qualifying base table rows would be retrieved into spool. The Teradata RDBMS would use those Row IDs in spool to access the base table rows themselves.

Ignore the NUSI and do an FTS of the base table itself.
In order to make this decision, the Optimizer requires COLLECTed STATISTICS.

Note: The only way to determine for certain whether an index is being used is to utilize the EXPLAIN facility.

Dual NUSI Access

AND with Equality Conditions
– If one of the two indexes is strongly selective, the system uses it alone for access.
– If both indexes are weakly selective, but together they are strongly selective, the system does a bit-map intersection.
– If both indexes are weakly selective separately and together, the system does an FTS.
– In any case, any conditions in the SQL statement not used for access (residual conditions) become row qualifiers.

OR with Equality Conditions
– Do an FTS of the two NUSI subtables.
– Retrieve the Row IDs of qualifying base table rows into two separate spools.
– Eliminate duplicates from the two spools of Row IDs.
– Access the base rows from the resulting spool of Row IDs.

If only one of the two columns joined by the OR is indexed, the Teradata RDBMS always does an FTS of the base tables.

NUSI Bit Mapping is a process that determines common Row IDs between multiple NUSI values by a process of intersection:
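A toy illustration of the idea, with Python sets standing in for the bit maps (the Row ID values are made up):

```python
# Two weakly selective NUSIs each yield a set of candidate Row IDs;
# intersecting them leaves only the rows satisfying both ANDed
# conditions, so far fewer base table rows need to be accessed.

rows_city = {10, 11, 12, 40, 41}   # Row IDs where, say, city = 'NY'
rows_dept = {11, 40, 77, 90}       # Row IDs where, say, dept = 402

qualifying = rows_city & rows_dept  # common Row IDs -> base table access
```

Teradata performs this with real bit maps (the BMSMS step seen in EXPLAIN output); sets are used here only to show the intersection.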

When aggregation is performed on a NUSI column, the Optimizer accesses the NUSI subtable, which offers much better performance than accessing the base table rows. Better performance is achieved because there should be fewer index blocks and rows in the subtable than data blocks and rows in the base table, thus requiring less I/O.

Page 20: Self Reference Teradata

LOCKING

Explicitly declares a lock type for one or more objects.
Any lock may be upgraded.
Only a READ lock may be downgraded to an ACCESS lock.
Locks are never released or downgraded during a transaction.
The system holds the most restrictive lock.
No SQL "Release Lock" statement exists.
Locks release only at COMMIT/End Transaction, or when Rollback completes.
The Locking Modifier applies to a table, database, view, or row.
LOCK ROW locks all rows that hash to a specific value. It is a row hash lock, not a row lock.
Explicit locking creates an all-AMP (table or database) operation.

COLLECTed STATISTICS are stored in one of two Data Dictionary (DD) tables (DBC.TVFields or DBC.Indexes).

You can use the HELP STATISTICS statement to display information about current column or index statistics.

HELP STATISTICS
HELP STATISTICS returns the following information about each column or index for which statistics have been COLLECTed in a single table:
– The date the statistics were last COLLECTed or refreshed.
– The time the statistics were last COLLECTed or refreshed.
– The number of unique values for the column or index.
– The name of the column(s) that the statistics were COLLECTed on.
Use the date and time to help you determine if your statistics need to be refreshed or DROPped. The example on the right-hand page illustrates the HELP STATISTICS output for the employee table.

HELP INDEX
HELP INDEX is an SQL statement which returns information for every index in the specified table. An example of this command and the resulting BTEQ output is shown on the right-hand page. HELP INDEX returns the following information:
– Whether or not the index is unique.
– Whether the index is a PI or an SI.
– The name(s) of the column(s) which the index is based on.
– The index ID number.
– The approximate number of distinct index values.
This information is very useful in reading EXPLAIN output. Since the EXPLAIN statement only returns the index ID number, you can use the HELP INDEX statement to determine the structure of the index with that ID.

Page 21: Self Reference Teradata

Aggregation and DISTINCT Summary

Aggregation is fully parallelized.

ARSA is an algorithm that outlines a method for performing aggregation on a parallel database:
– Aggregate locally
– Redistribute local aggregation
– Sort redistributed aggregation
– Aggregate globally
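The four ARSA phases can be sketched as follows. This is a simplified illustration, not Teradata's implementation: AMPs are modeled as plain lists and the redistribution hash is Python's built-in.

```python
# Minimal sketch of ARSA for a SUM ... GROUP BY across hypothetical AMPs:
# aggregate locally, redistribute partial results by hash of the group
# key, then aggregate globally on the receiving AMP.

def arsa_sum(amps, n_amps=2):
    # 1. Aggregate locally on each AMP.
    local = []
    for rows in amps:
        agg = {}
        for key, val in rows:
            agg[key] = agg.get(key, 0) + val
        local.append(agg)
    # 2. Redistribute local aggregates by hash of the group key.
    buckets = [dict() for _ in range(n_amps)]
    for agg in local:
        for key, val in agg.items():
            b = buckets[hash(key) % n_amps]
            # 3./4. Merge (globally aggregate) partial results on arrival.
            b[key] = b.get(key, 0) + val
    result = {}
    for b in buckets:
        result.update(b)
    return result

totals = arsa_sum([[("A", 1), ("B", 2), ("A", 3)], [("B", 5)]])
```

Local aggregation shrinks each AMP's data before redistribution, which is why DISTINCT (no local phase) can win only when values are nearly unique.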

If values are nearly unique, DISTINCT may outperform a GROUP BY because there is no local aggregation.

Run EXPLAIN during application development to:

– Determine access paths
– Show locking profiles
– Validate the usage of indexes
– Show activities for triggers and join index access, as well as DDL, DML, and DCL.

. . . (Last Use) . . . A spool file is no longer needed and will be released after this step.

. . . with no residual conditions . . .
All applicable conditions have been applied to the rows.

. . . END TRANSACTION . . .
Transaction locks are released and changes are committed.

. . . eliminating duplicate rows . . .
Doing a DISTINCT operation. (Duplicate rows can only exist in spool files.)

. . . by way of the sort key in spool field1 . . .
Field1 is created to allow a tag sort.

. . . we do a SMS . . .
A Set Manipulation Step, caused by using a UNION, MINUS, or INTERSECT operator.

. . . we do a BMSMS . . .
A way of handling two or more weakly selective secondary indexes that are linked by AND in the WHERE clause. (BMSMS = Bit Map Set Manipulation Step)

. . . which is redistributed by hash code to all AMPs . . .
Redistributing each of the data rows for a table to a new AMP based on the hash value for the column(s) involved in the join.

. . . which is duplicated on all AMPs . . .
Copying all rows of a table onto each AMP in preparation for a join (e.g., a "Product Join").

. . . which is built locally on the AMPs . . .
Each vproc builds a portion of a spool file from the data found on its local disk space.

. . . by way of a traversal of index #n extracting row ids only . . .

Page 22: Self Reference Teradata

A spool file is built from the Row ID values found in a secondary index (index #n). These Row IDs will be used later for extracting rows from a table.

. . . Aggregate Intermediate Results are computed locally . . .
Aggregation requires no redistribution phase.

. . . Aggregate Intermediate Results are computed globally . . .
Both local and global aggregation phases are performed.

. . . we lock a distinct <dbname>."pseudo table" . . .
A way of preventing global deadlocks, specific to an MPP ("massively parallel processing") DBS like the Teradata database.

. . . by way of a RowHash match scan . . .
A merge join on hash values.

. . . where unknown comparison will be ignored . . .
Comparisons involving NULL values will be ignored.

. . . We lock DBC.AccessRights for write on row hash . . .
A DDL request is being executed.

. . . low-end row(s) . . .
Used by the RDBMS to estimate the size of the spool file needed to accommodate the data.

. . . low-end row(s) to high-end row(s) . . .
The high-end estimate gets propagated to subsequent steps but has no influence on choosing the plan.

. . . low-end time . . .
Used in choosing the plan (based on the low-end row estimate, cost constants, and cost functions).

. . . low-end time to high-end time . . .
The high-end estimate has no influence on choosing the plan.

MultiLoad does not have rollback capability.

With regard to Fallback writes, FastLoad performs them at the end of the task, while MultiLoad and TPump do them during processing.

End-to-End time to load may include any or all of the following:
• Receipt of source data
• Transformation and Cleansing
• Acquisition
• Target table apply
• Fallback processing

Page 23: Self Reference Teradata

• Permanent Journal processing• Secondary Index maintenance• Statistics maintenance

Transformation and Cleansing of data is performed before the data is loaded onto Teradata. The disadvantage is that this is a serial process; the more work you can perform on the Teradata side, the more you can take advantage of parallelism.

Note: An advantage of TeraBuilder-based utilities is that you can use multiple pipes.

In general, if you have to perform statistical analysis of very large data sets, you should use Teradata to do this.

Transformation and Cleansing — Where to do it?

Consider the impact on load time.
– Where is all the required data?
  » Move Teradata data to client: "Export-Transform-Load"
  » Move client data to Teradata: "Load-Transform" or "Load-Transform-Export-Reload"
– Teradata side advantage: Parallelism

Guideline:
– Simple transformations: transformation pipe to load utility
– Complex transformations: transform on Teradata
– When in doubt: measure

FAST LOAD

Fastest load rate.
Most restrictive:
– Inserts only
– Empty target table required
– Table not accessible until load is complete
Checkpoint/Restart.
No Rollback.
FastLoad ELAPSED SECONDS = Acquisition Elapsed Seconds + Apply Elapsed Seconds

Page 24: Self Reference Teradata

MultiLoad

After data acquisition begins, there is an access lock on the data and a write lock on the table. During acquisition, you can release MultiLoad and cancel the job; during the apply phase, however, you cannot release MultiLoad. MultiLoad is restartable unless you drop the restart log or error table. You can look at a work table, restart log, or error table, but you should use access locks on those tables when you do, or you may crash MultiLoad.

Some restrictions:

Table-level locks
– The Write Lock enables dirty reads, but does not enable other tasks to modify the table for the duration of the MultiLoad apply.

USIs, RIs, and Join Indexes are not allowed on the target table.

No Rollback, but uses Checkpoint/Restart.
– Acquisition: checkpoints to your specification (client side).
– Apply: checkpoints every datablock (database side).

MultiLoad on NUPI Tables

NUPI performance:
• MultiLoad with a highly non-unique NUPI can reduce performance considerably.
– Some measurements show NUPI is 7 to 9 times slower than UPI.
– MULTISET helps reduce this difference by eliminating duplicate row checking.
• But if the NUPI improves locality of reference, a NUPI MultiLoad can be faster than a UPI MultiLoad!
• A NUPI MultiLoad with few (10 or fewer) rows per value performs like a UPI MultiLoad.

MultiLoad—Maximizing Performance

Minimize acquisition time by loading multiple tables with one source (up to 5 target tables).
Go for the highest hits-per-datablock ratio:
– Do one, not multiple, MultiLoads to a single table.
– Do less frequent MultiLoads.
– Load to smaller target tables:
  » Active vs. archive table partitions
– Reduce your row size.
– Use large datablock sizes.
Consider dropping and recreating secondary indexes.
Concurrent MultiLoads maximize throughput (vs. response time).
– Two to three concurrent MultiLoads will saturate Teradata in the apply phase.

Page 25: Self Reference Teradata

Load Utilities—TPump

Processes one update at a time; not intended for the higher load rates of MultiLoad. Most compatible with mixed workloads and real-time data availability goals.
– Minimal lock contention (row-level write lock)

Checkpoint / Restartable (Client side)Rollback (Teradata side)

Resource Governors can “slow down” the loads.

TPump Updates—Fallback and Journaling Costs

Fallback: reduces throughput by 2X.
Local Journaling (Before, After): reduces throughput by 1.2X.
After Journaling: reduces throughput by 1.75X.

Cost for changing a USI value:
– The change row is sent to the AMP owning the USI row.
– Additional CPU per USI is 1.0X the CPU path of the primary table insert/delete.
  (i.e., if it takes 100 seconds to do the primary inserts/deletes, it will take an additional 100 seconds to update each USI.)

Cost for changing a NUSI value (with 1 row/value):
– The NUSI change row is applied locally.
– Additional CPU per NUSI is 0.55X the CPU path of the primary table insert/delete.
  (i.e., if it takes 100 seconds to do the primary inserts/deletes, it will take an additional 55 seconds to update each NUSI.)
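The per-index figures above fold into a quick back-of-envelope estimate. The helper function and its linear-additivity assumption are illustrative, not part of any Teradata tool:

```python
# Rough cost model from the figures above: each changed USI adds ~1.0X
# the primary insert/delete CPU path; each changed NUSI (at ~1 row per
# value) adds ~0.55X. Assumes the costs simply add, which is a
# simplification for illustration.

def index_maintenance_cost(primary_seconds, n_usi=0, n_nusi=0):
    return primary_seconds * (1.0 * n_usi + 0.55 * n_nusi)

# 100 seconds of primary work, one USI and one NUSI changed:
extra = index_maintenance_cost(100, n_usi=1, n_nusi=1)
```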

TPump—Maximizing Performance

Too few sessions and you might not meet your load window; too many and you may impact current work. You must also consider the host resources.

If there is no fallback, indexes or journaling, then only a single step will be required to perform the prime index modification statement. If fallback is defined on the table, then an additional step, or task, is required to accomplish the prime index modification statement. Similarly, additional steps are required for journaling and index updates.

If your goal is to maximize TPump throughput without hindrance from any other competing workloads, the TPump client will need to drive into Teradata at least 30 concurrent tasks per node.

Page 26: Self Reference Teradata

(If your throughput goal is less than the maximum achievable from Teradata, due to the need to share resources with competing workloads, then you will not need that many concurrent tasks.) There are many ways to achieve 30 concurrent tasks per node. When the table has 2 indexes but no fallback or journaling (i.e., 3 tasks per statement), you can use 2 concurrent multi-statement requests, each with 5 statements.

In TPump terms, the PACK factor represents the number of statements in a multi-statement request. Sessions represent the number of concurrent requests.
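The relationship between sessions, PACK factor, and tasks per statement is simple arithmetic; the helper below is hypothetical, not part of TPump:

```python
# Total concurrent tasks driven into Teradata is sessions (concurrent
# requests) x PACK (statements per request) x tasks per statement
# (one for the primary index step, plus one per index, fallback, or
# journal).

def concurrent_tasks(sessions, pack, tasks_per_statement):
    return sessions * pack * tasks_per_statement

# The worked example above: 2 sessions, PACK of 5, 3 tasks per statement.
tasks = concurrent_tasks(sessions=2, pack=5, tasks_per_statement=3)
```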

For full-table updates:
– No index maintenance is required if the index columns are not the object of the update.
For full-table inserts and deletes:
– Secondary index modifications are done a row at a time.
– It is better to drop and recreate the index unless the number of rows to update is very small (i.e., <= 0.1% of the rows being updated).

Teradata Join Plans (or strategies)
• MERGE
  – INCLUSION
  – EXCLUSION
• PRODUCT
• NESTED
• ROW HASH
• ROW ID

Relational Join Types
• INNER
• OUTER
  – LEFT
  – RIGHT
  – FULL
• SELF
• CROSS
• CARTESIAN

In a product join, every qualifying row of one table is compared to every qualifying row in the other table. Rows which match on WHERE conditions are saved.

Product joins occur when:
• The WHERE clause is missing.

Page 27: Self Reference Teradata

• A Join condition is not based on equality (NOT =, LESS THAN, GREATER THAN).• Join conditions are ORed together.• There are too few Join conditions.• A referenced table is not named in any Join condition.• Table aliases are incorrectly used.• The Optimizer determines that it is less expensive than the other Join types.

Merge Joins are commonly done when the Join condition is based on equality. They are generally more efficient than product joins because the number of row comparisons is smaller.

1. Identify the smaller table.
2. Put the qualifying data from one or both tables into spool (if necessary).
3. Move the spool rows to the AMPs based on the join column hash (if necessary).
4. Sort the spool rows by join column row hash value (if necessary).
5. Compare the rows with matching join column row hash values.

Merge joins:
• Compare matching join column row hash values for the rows.
• Cause significantly fewer comparisons than a product join.
• Require rows to be on the same AMP to be joined.
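The compare phase of these steps can be sketched as a two-cursor merge over inputs already sorted by the join key (standing in for the join column row hash). This is a simplified illustration, not Teradata's implementation:

```python
# Simplified merge join: both inputs are sorted first, then walked with
# two cursors so only rows with matching keys are ever compared -- far
# fewer comparisons than a product join.

def merge_join(left, right):
    """left, right: lists of (key, payload) tuples."""
    left, right = sorted(left), sorted(right)
    out, i = [], 0
    for lkey, lval in left:
        # Advance the right cursor past keys smaller than the current one.
        while i < len(right) and right[i][0] < lkey:
            i += 1
        # Emit one output row per matching right-side row.
        j = i
        while j < len(right) and right[j][0] == lkey:
            out.append((lkey, lval, right[j][1]))
            j += 1
    return out

pairs = merge_join([(1, "a"), (2, "b")], [(1, "x"), (1, "y"), (3, "z")])
```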

Row Redistribution—M1

It is basically the same query as the previous example, except that the join condition is now equality. Teradata copies the employee rows into spool and redistributes them on employee.dept row hash. The merge join then occurs with the rows to be joined located on the same AMPs.

M1 occurs when one of the tables is already distributed on the join column row hash; the join column is the PI of one, but not both, of the tables. Joining on very non-unique values (AMP 2) could cause "insufficient spool" errors. Collect Statistics to help the parser avoid this situation.
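The spool skew warned about above follows directly from hashing. A sketch (the AMP count and values are made up for illustration):

```python
# Redistribution sends every row with the same join value to the same
# AMP, because placement is by hash of the join column. With a very
# non-unique value, one AMP ends up holding most of the spool.

rows = ["dept1"] * 8 + ["dept2", "dept3"]   # highly skewed join values
n_amps = 4
load = [0] * n_amps
for value in rows:
    load[hash(value) % n_amps] += 1          # all "dept1" rows hit one AMP

# One AMP now carries at least the 8 "dept1" rows and may exhaust spool.
```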

Row Redistribution—M2

DUPLICATE and SORT the smaller table on all AMPs.
LOCALLY BUILD a copy of the larger table and SORT.

M1 and M2 are selected when the join column is the primary index of only one of the tables.

Row Redistribution—M3M3 will occur when the Join Column is the Primary Index of both tables.

Merge Join—Matching Primary Indexes

Note that there is no redistribution of rows or sorting, which means that Merge Join Plan M3 is being used. (Remember that M3 occurs when the join column is the primary index of both tables.) No redistribution or sorting is needed since the rows are already on the proper AMPs and in the proper order for joining.

Page 28: Self Reference Teradata

Row Hash Join

In a Row Hash Join, the smaller table is sorted into join column row hash sequence and then duplicated on all AMPs. The larger table is then processed a row at a time. For those rows that qualify for joining (WHERE), the row hash of the join column(s) is used to do a binary search of the smaller table for a match. The Parser can choose this join plan when the qualifying rows of the small table can be held AMP-memory resident.
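A minimal sketch of that lookup, with plain integers standing in for row hash values and Python's bisect doing the binary search (names and data are illustrative):

```python
# The small table is sorted by join column row hash and held memory
# resident; each qualifying large-table row binary-searches it.
import bisect

small = sorted([(3, "dept 3"), (7, "dept 7"), (9, "dept 9")])  # (hash, row)
keys = [h for h, _ in small]

def probe(row_hash):
    """Binary-search the resident small table for a matching row hash."""
    i = bisect.bisect_left(keys, row_hash)
    if i < len(keys) and keys[i] == row_hash:
        return small[i][1]
    return None   # no match; large-table row does not join

match = probe(7)
```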

Rows must be on the same AMP to be joined.

"UNION"

The UNION operator is used instead of the OR operator, which means that two separate answer sets are generated and then combined.

Join Strategies Summary

• The optimizer chooses the best join strategy based on available indexes and demographics.

Product Join• Most general form of join.• Does not sort rows.• Number of comparisons is product of number of rows in each table.• Should be avoided if possible.

Merge Join• Commonly done when join is based on equality.• Require some preparation.• Generally more efficient than a product join.• Better performance results if the left table (in the explain) is the unique (smaller) table.

Exclusion Join
• Used for finding rows that don't have a match.
• Caused by NOT IN and EXCEPT.
• NULL join values must be excluded to get a result other than NULL.
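The NULL caveat follows from SQL's three-valued logic: `x NOT IN (..., NULL)` can never evaluate to TRUE, so a single NULL in the subquery empties the whole result. A small emulation, with None standing in for NULL:

```python
# Three-valued-logic sketch of NOT IN: returns True, False, or None
# (UNKNOWN). Rows whose predicate is UNKNOWN are not returned, so one
# NULL in the value list suppresses every non-matching row.

def not_in(x, values):
    if x in values:
        return False          # definite match -> excluded
    if None in values:
        return None           # UNKNOWN: x might equal the NULL
    return True               # definite non-match -> returned

assert not_in(5, [1, 2]) is True      # normal exclusion join result
assert not_in(5, [1, None]) is None   # NULL present: row never returned
```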

Row Hash Join• Smaller table is sorted into join column row hash sequence and duplicated on all AMPs.• Can be used when rows of smaller table can be held in AMP memory.• COLLECT STATISTICS on both tables guides the parser.

Nested Join
• Very efficient.
• Doesn't always use all AMPs.
• Requires an equality value for a UPI or USI on Table1, and a join on a column of that single row to any index on Table2.

Cartesian Product Join

Page 29: Self Reference Teradata

• Unconstrained product join.• Consumes significant system resources.

Unique constraints are implemented in the Teradata database as Unique Secondary Indexes (USIs).