Skew Stuff


Transcript of Skew Stuff

What is the meaning of skewness in teradata? When do we use nullif function in teradata?

Skewness:

Skewness is a statistical term that refers to the row distribution across AMPs. If the data is highly skewed, some AMPs hold many more rows than others, i.e. the data is not evenly distributed.

This hurts performance by defeating Teradata's parallelism. Data distribution, and therefore skewness, is controlled by the choice of indexes.

When we choose the wrong primary index, the data is distributed unevenly across the AMPs: some AMPs hold more records and some fewer. This is called skewness. The percentage of skewness is called the skew factor; a skew factor of up to about 30% is generally considered acceptable.

We use NULLIF when we want to return NULL instead of a value: NULLIF(expr1, expr2) returns NULL when the two expressions are equal, and expr1 otherwise.
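The semantics of NULLIF (and Teradata's NULLIFZERO, which appears in the skew formulas later in this article) can be sketched in plain Python. This is only an illustration of the SQL behavior, not Teradata code:

```python
def nullif(a, b):
    """Mimic SQL NULLIF(a, b): NULL (None) when the two values are equal, else a."""
    return None if a == b else a

def nullifzero(a):
    """Mimic Teradata NULLIFZERO(a): NULL (None) when a is zero, else a."""
    return nullif(a, 0)

print(nullif(5, 5))    # None
print(nullif(5, 3))    # 5
print(nullifzero(0))   # None -- this is what protects the skew formula from division by zero
```

NULLIFZERO is handy exactly where this article uses it: wrapping a divisor so that a zero value yields NULL instead of a division error.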

In Teradata Administrator you can simply right-click the table and select "Space Summary".

How will you avoid skewness ?

Data or AMP skew occurs in Teradata due to uneven distribution of data across the AMPs. Often this leads to spool space errors too. To avoid skewness, try to select a Primary Index with as many unique values as possible.

PI columns like month, day, etc. have very few unique values, so during data distribution only a few AMPs hold all the data, resulting in skew. If a column (or a combination of columns) that enforces uniqueness on the table is chosen as the PI, the data distribution will be even and the data will not be skewed.
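The effect of PI cardinality can be simulated with a minimal Python sketch. Python's built-in hash() stands in for Teradata's row-hashing (HASHROW/HASHAMP), and the AMP count and row counts are made-up values; only the principle carries over:

```python
from collections import Counter

NUM_AMPS = 10
NUM_ROWS = 100_000

def rows_per_amp(pi_values):
    """Distribute rows to AMPs by hashing the PI value (stand-in for Teradata's row hash)."""
    return Counter(hash(v) % NUM_AMPS for v in pi_values)

def skew_factor(counts):
    """Skew factor as defined in this article: 100 - AVG/MAX * 100 over per-AMP row counts."""
    per_amp = [counts.get(a, 0) for a in range(NUM_AMPS)]
    return 100 - (sum(per_amp) / NUM_AMPS) / max(per_amp) * 100

# Low-cardinality PI: only 12 distinct values (think of a month column)
low_card = rows_per_amp(i % 12 for i in range(NUM_ROWS))
# High-cardinality PI: unique values (think of an order number)
high_card = rows_per_amp(range(NUM_ROWS))

print(f"low-cardinality PI skew:  {skew_factor(low_card):.1f}%")
print(f"high-cardinality PI skew: {skew_factor(high_card):.1f}%")
```

With only 12 distinct PI values feeding 10 AMPs, a couple of AMPs receive twice their share of rows while the unique PI spreads rows evenly, which is exactly the month/day problem described above.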

Typical reasons of skewness :

(1) Skewed tables: bad choice of PI, skewed data
(2) Bad execution plans (typically skewed redistributions)
(3) Bad data model (normalization, data types, PI, etc.)
(4) Missing or stale statistics
(5) Too many joins (break up the query!)
(6) Hash collisions (a load-time problem)

How to find the skew factor for a Teradata table:

When Teradata tables are not distributed evenly this can cause significant performance problems. Storage space is also not utilized properly.

For that reason identifying and monitoring tables that are not distributed evenly across AMPs is very important when working with Teradata databases.

The following query will reveal the skew factor for the SALES and CUSTOMERS tables in the current database.

SELECT DatabaseName,
       TableName,
       SUM(CurrentPerm) AS CurrentPerm,
       SUM(PeakPerm) AS PeakPerm,
       (100 - (AVG(CurrentPerm) / MAX(CurrentPerm) * 100)) AS SkewFactor
FROM DBC.TableSize
WHERE DatabaseName = DATABASE
  AND TableName IN ('SALES', 'CUSTOMERS')
GROUP BY 1, 2
ORDER BY SkewFactor DESC;

Query to find SKEW FACTOR of a particular table :

SELECT TABLENAME,
       SUM(CURRENTPERM) / (1024*1024) AS CURRENTPERM,
       (100 - (AVG(CURRENTPERM) / MAX(CURRENTPERM) * 100)) AS SKEWFACTOR
FROM DBC.TABLESIZE
WHERE DATABASENAME = AND TABLENAME =
GROUP BY 1;

(or)

SELECT DatabaseName,
       TableName,
       CAST((100 - (AVG(a.CurrentPerm) / MAX(a.CurrentPerm) * 100)) AS INTEGER) AS SkewFactor
FROM dbc.tablesize a
WHERE DatabaseName = 'DatabaseName'
  AND TableName = 'TableName'
GROUP BY 1, 2
ORDER BY 1;

FINDING SKEW FACTOR FOR TABLE LEVEL :

SELECT DatabaseName,
       TableName,
       100 * (1 - (AVG(CurrentPerm) / MAX(CurrentPerm))) AS skew_factor
FROM dbc.tablesize
WHERE DatabaseName = 'TEMP_DB'
GROUP BY 1, 2
ORDER BY 1, 2;
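The arithmetic behind these queries is easy to check by hand. Here is a Python sketch using hypothetical per-AMP CurrentPerm figures (the kind of per-vproc numbers DBC.TableSize reports) for a made-up 4-AMP system:

```python
# Hypothetical per-AMP CurrentPerm bytes for one table; one AMP holds far more than the rest
current_perm = [120_000, 118_000, 620_000, 119_000]

avg_perm = sum(current_perm) / len(current_perm)
max_perm = max(current_perm)

# Same formula as the DBC.TableSize queries: 100 - AVG/MAX * 100
skew_factor = 100 - (avg_perm / max_perm) * 100
print(f"SkewFactor = {skew_factor:.1f}%")  # far above the ~30% rule of thumb
```

A perfectly even table would have AVG equal to MAX and a skew factor of 0; here one overloaded AMP pushes the skew factor above 60%.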

To Find available and used space in Teradata (GB) :

select DB.OwnerName,
       DS.DatabaseName,
       cast(sum(maxperm)/1024/1024/1024 as decimal(7,1)) MaxPerm_g,
       cast(sum(currentperm)/1024/1024/1024 as decimal(7,1)) CurrPerm_g,
       cast(sum(maxperm)/1024/1024/1024 - sum(currentperm)/1024/1024/1024 as decimal(7,1)) FreePerm_g,
       cast(sum(currentperm)*100/nullifzero(sum(maxperm)) as decimal(5,2)) PctUsed,
       sum(maxperm) MaxPerm,
       sum(currentperm) CurrPerm,
       sum(maxperm) - sum(currentperm) FreePerm,
       DB.CommentString
from dbc.diskspace DS, dbc.databases DB
where DB.DatabaseName like '%IDR%'
  and DS.DatabaseName = DB.DatabaseName
group by 1, 2, DB.CommentString
order by 1, sum(maxperm) - sum(currentperm) desc, sum(maxperm) desc, DS.DatabaseName;

Note: To fix skew problems, you need to change the table's primary index. For an empty table, you can use ALTER TABLE:

ALTER TABLE orders MODIFY PRIMARY INDEX (ORDER_NUM);

For already populated table you might need to recreate the table.

You can use the following sequence:


Step 1. Create an empty table

CREATE TABLE orders_tmp AS (SELECT * FROM orders) WITH NO DATA;

Step 2. Alter the primary index for the new empty table

ALTER TABLE orders_tmp MODIFY PRIMARY INDEX (ORDER_NUM);

Step 3. Copy data from the skewed table

INSERT INTO orders_tmp SELECT * FROM orders;

Step 4. Remove the old table and rename the new table

RENAME TABLE orders TO orders_skewed;
RENAME TABLE orders_tmp TO orders;

You might want to keep the old table for a while and after you are sure you are not going to need it, drop it:

DROP TABLE orders_skewed;

The skew factor describes the row distribution: with a perfectly uniform distribution the skew factor is zero. The skew factor is the inverse of parallel efficiency.

Finding Skewed Tables in Teradata

Skewed Tables in Teradata :

Teradata distributes a table's data based on its Primary Index, and the Primary Index alone. If this Primary Index is not selected appropriately, it can cause performance bottlenecks.

Following are the general parameters we should consider while creating Primary Indexes:

1: Access Path
2: Volatility
3: Data Distribution

Volatility is usually not an issue in a well-designed database; that is to say, we do not expect UPDATE clauses updating the primary index itself. This usually leaves us with data distribution and access path.

Access path implies that a particular column (or set of columns) is always used in join conditions. The advantage of using these columns as the PI is that it avoids redistribution of data during joins (one of the most expensive operations for Teradata) and can therefore reduce usage of spool files.

Data distribution implies that data is evenly distributed among all AMPs of Teradata. This ensures that all AMPs do the same amount of work. With a skewed distribution, one AMP ends up doing most of the work, which can increase execution time considerably. The following queries can be used to determine table skewness.

SELECT DatabaseName,
       TableName,
       SUM(CurrentPerm) AS CurrentPerm,
       SUM(PeakPerm) AS PeakPerm,
       (100 - (AVG(CurrentPerm) / MAX(CurrentPerm) * 100)) AS SkewFactor
FROM DBC.TableSize
WHERE TableName = :TABLENAME
GROUP BY 1, 2;

However, tables which have very few rows (especially fewer rows than there are AMPs) are inherently skewed, and this cannot be changed. So we should also take the size of the table into account when considering skewness.
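A tiny worked example shows why small tables are inherently skewed. Assuming a hypothetical 10-AMP system and a 3-row table, even a perfect distribution puts at most one row on each of three AMPs:

```python
NUM_AMPS = 10
# Best possible distribution of 3 rows: 3 AMPs hold 1 row each, 7 hold none
per_amp = [1, 1, 1] + [0] * (NUM_AMPS - 3)

skew = 100 - (sum(per_amp) / NUM_AMPS) / max(per_amp) * 100
print(f"{skew:.0f}%")  # high skew factor despite a perfect distribution
```

The skew factor comes out at 70% even though nothing is wrong with the PI, which is why the skew factor alone is misleading for very small tables.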

The query can be further customized to give better details:

SELECT DATABASENAME, TABLENAME, SKEW_CATEGORY
FROM
 (SELECT DATABASENAME,
         TABLENAME,
         (MAX(CURRENTPERM) - AVG(CURRENTPERM)) / MAX(CURRENTPERM) * 100 AS SKEW_FACTOR,
         SUM(CURRENTPERM) / (1024*1024*1024) AS ACTUAL_SPACE,
         CASE
           WHEN SKEW_FACTOR > 90 THEN 'EXTREMELY SKEWED, TO BE RECTIFIED IMMEDIATELY'
           WHEN SKEW_FACTOR > 70 THEN 'HIGHLY SKEWED'
           WHEN SKEW_FACTOR > 50 THEN 'SKEWED TABLE'
           WHEN SKEW_FACTOR > 40 THEN 'SLIGHTLY SKEWED'
           WHEN SKEW_FACTOR > 30 AND ACTUAL_SPACE > .5 THEN 'SLIGHTLY SKEWED'
           ELSE 'ACCEPTABLE'
         END AS SKEW_CATEGORY
  FROM DBC.TABLESIZE
  WHERE DATABASENAME IN ('DATABASENAME')
  GROUP BY 1, 2
 ) x
ORDER BY SKEW_FACTOR DESC;

To find the distribution of data among the AMPs, we can use the hash functions as follows:

SELECT HASHAMP(HASHBUCKET(HASHROW( Column ))),COUNT(*) FROM DATABASENAME.TABLENAME GROUP BY 1;

This can further be exploited to find the distribution for a candidate primary index as follows:

SEL 'ColumnName', (MAX(CN) - AVG(CN)) / MAX(CN) * 100
FROM
 (SEL VPROC, COUNT(*) CN
  FROM DATABASENAME.TABLENAME
  RIGHT OUTER JOIN
   (SEL VPROC FROM DBC.TABLESIZE GROUP BY 1) X
  ON VPROC = HASHAMP(HASHBUCKET(HASHROW(ColumnName)))
  GROUP BY 1
 ) C
GROUP BY 1;

This query can also be used to check which singular column can be best fitted for primary index on sole criteria of data distribution.

The following generic query creates SQL statements that check the distribution for each column among the AMPs.

SEL ' SEL ' || '27'xc || COLUMNNAME || '27'xc || ' , (MAX(CN)-AVG(CN))/MAX(CN)*100
FROM ( SEL VPROC, COUNT(*) CN
FROM ' || DATABASENAME || '.' || TABLENAME || '
RIGHT OUTER JOIN (SEL VPROC FROM DBC.TABLESIZE GROUP BY 1) X
ON VPROC=HASHAMP(HASHBUCKET(HASHROW( ' || COLUMNNAME || ' )))
GROUP BY 1) C GROUP BY 1
UNION ALL ' Title
FROM DBC.COLUMNS
WHERE TABLENAME= AND DATABASENAME= ;

What does skew metric mean?

You can see the words "skewness" or "skew factor" in a lot of places regarding Teradata: documents, applications, etc. Skewed table, skewed CPU. It is something wrong, but what does it explicitly mean? How do we interpret it?

Teradata is a massively parallel system, where uniform units (AMPs) perform the same tasks on the data parcel they are responsible for. In an ideal world all AMPs share the work equally; no one must work more than the average. The reality is far colder: it is a rare situation when this equality (called "even distribution") exists.

It is obvious that uneven distribution causes poor efficiency in using the parallel infrastructure.

But how bad is the situation? That is exactly what skewness characterizes.

Let "RESOURCE" mean the amount of resource (CPU, I/O, PERM space) consumed by an AMP.

Let AMPno be the number of AMPs in the Teradata system.

Skew factor := 100 - ( AVG ( "RESOURCE" ) / NULLIFZERO ( MAX ("RESOURCE") ) * 100 )

Total[Resource] := SUM("RESOURCE")

Impact[Resource] := MAX("RESOURCE") * AMPno

Parallel Efficiency := Total[Resource] / Impact[Resource] * 100

or with some transformation:

Parallel Efficiency := 100 - Skew factor
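These definitions can be verified numerically. Below is a Python sketch using hypothetical per-AMP CPU figures for a made-up 4-AMP system; it computes both quantities independently and confirms that they sum to 100, as the transformation above states:

```python
# Hypothetical per-AMP resource consumption (e.g. CPU seconds); one AMP dominates
resource = [40.0, 42.0, 38.0, 160.0]
amp_no = len(resource)

# Skew factor := 100 - (AVG(resource) / MAX(resource) * 100)
skew_factor = 100 - (sum(resource) / amp_no) / max(resource) * 100

# Total and Impact as defined in the text
total = sum(resource)
impact = max(resource) * amp_no

# Parallel Efficiency := Total / Impact * 100
parallel_efficiency = total / impact * 100

print(f"skew factor         = {skew_factor:.2f}")
print(f"parallel efficiency = {parallel_efficiency:.2f}")
print(f"sum                 = {skew_factor + parallel_efficiency:.2f}")
```

Because Impact is just MAX times AMPno, Total/Impact equals AVG/MAX, which is why the two formulas are complements of each other.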

Analysis


Explanation of extra fields:

ParserCPUTime: Time the parser spent on producing the execution plan. This can be high if the SQL is too complex or too much random AMP sampling has to be done.

LHR/RHL: Larry Higa (inverse Larry Higa) index. An empirical index that shows the CPU vs I/O rate. By experience it should usually be around one (the exact value can differ depending on your system configuration, but it is a constant). If it is far from 1, that indicates CPU or I/O dominance, which means unbalanced resource consumption, but that is a different dimension than skew.

QueryBand: Labels that sessions use to identify themselves within the DBQL logs

QueryText: First 200 characters of the query (depending on DBQL log settings)

OK, we've listed the terrible top consumers, but what's next? We have to identify those queries. If your ETL and analytics software is configured to use QueryBand properly (this area deserves a separate post...), you can find which job or report issued that SQL; in any case, you can see the QueryText field.

If you want to get the full SQL text, select it from DBQLSQLTbl (SQL logging needs to be switched on), replacing the appropriate procid and queryid values:

select SQLTextInfo
from dbc.dbqlsqltbl
where procid= and queryid=
order by SQLRowNo asc

You will get the SQL in several records, broken up into 30K blocks; simply concatenate them. Unfortunately the SQL will be formatted very poorly; you can use PRISE Tuning Assistant to beautify and highlight it for easy reading.

System level Skewness :

We've found those bad queries, nice. But what can we say about the whole system? What is the total parallel efficiency? Can we report how much resources were wasted due to bad parallel efficiency?

The answer is: yes, we can estimate it quite closely. The exact value cannot be calculated because DBQL does not log per-AMP information for the query execution, only the most important metrics. We also cannot account for the situation when several skewed queries run at the same time but peak on different AMPs. This reduces system-level resource wasting, but it is hard to calculate; however, its probability and effect are negligible for now.

select sum(AMPCPUTime) AMPCPUTimeSum,
       sum(MaxAMPCPUTime * (hashamp() + 1)) CPUImpactSum,
       sum(TotalIOCount) TotalIOCountSum,
       sum(MaxAMPIO * (hashamp() + 1)) IOImpactSum,
       cast(100 - (AMPCPUTimeSum / CPUImpactSum) * 100 as integer) "CPUSkew%",
       cast(100 - (TotalIOCountSum / IOImpactSum) * 100 as integer) "IOSkew%"
from
/* For archived DBQL
   dbql_arch.dbqlogtbl_hst
   where logdate = '2013-12-18' (date)
     and (ampcputime > 0 or TotalIOCount > 0)
*/
/* For online DBQL */
   dbc.dbqlogtbl
   where cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
     and (ampcputime > 0 or TotalIOCount > 0)
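The aggregation this query performs is simple enough to sketch in Python. The per-query numbers below are invented DBQL-style pairs of (AMPCPUTime, MaxAMPCPUTime), and the AMP count is assumed to be 8 (in the SQL it comes from hashamp() + 1):

```python
AMPS = 8  # assumed system size; the SQL derives this from hashamp() + 1

# Hypothetical (AMPCPUTime, MaxAMPCPUTime) per query; the last one ran on a single AMP
queries = [(80.0, 12.0), (200.0, 25.0), (40.0, 40.0)]

cpu_sum = sum(total for total, _ in queries)                 # AMPCPUTimeSum
impact_sum = sum(mx * AMPS for _, mx in queries)             # CPUImpactSum

# CPUSkew% := 100 - AMPCPUTimeSum / CPUImpactSum * 100
cpu_skew_pct = 100 - cpu_sum / impact_sum * 100
print(f"CPUSkew% = {cpu_skew_pct:.0f}")
```

The "impact" of each query is what it would have cost if every AMP had worked as hard as the busiest one, so the gap between the two sums is exactly the capacity lost to skew.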

Look at the last two columns. That percent of your CPU and I/O goes to the sink...

OK, let's check how many queries accumulate 5%,10%,25%,50%,75%,90% of this loss?

Here you are (CPU version; transform for I/O accordingly):

select 'How many queries?' as "_",
       min(limit5) "TOP5%Loss",
       min(limit10) "TOP10%Loss",
       min(limit25) "TOP25%Loss",
       min(limit50) "TOP50%Loss",
       min(limit75) "TOP75%Loss",
       min(limit90) "TOP90%Loss",
       max(rnk) TotalQueries,
       sum(ResourceTotal) "TotalResource",
       sum(ResourceImpact) "ImpactResource"
from
(
  select case when ResRatio < 5.00 then null else rnk end limit5,
         case when ResRatio < 10.00 then null else rnk end limit10,
         case when ResRatio < 25.00 then null else rnk end limit25,
         case when ResRatio < 50.00 then null else rnk end limit50,
         case when ResRatio < 75.00 then null else rnk end limit75,
         case when ResRatio < 90.00 then null else rnk end limit90,
         rnk, ResourceTotal, ResourceImpact
  from
  (
    select sum(ResourceLoss) over (order by ResourceLoss desc) totalRes,
           sum(ResourceLoss) over (order by ResourceLoss desc rows unbounded preceding) subtotalRes,
           subtotalRes * 100.00 / totalRes ResRatio,
           sum(1) over (order by ResourceLoss desc rows unbounded preceding) rnk,
           ResourceTotal, ResourceImpact
    from
    (
      select AMPCPUTime ResourceTotal,
             (MaxAMPCPUTime * (hashamp() + 1)) ResourceImpact,
             ResourceImpact - ResourceTotal ResourceLoss
      /* For archived DBQL
      from dbql_arch.dbqlogtbl_hst
      where logdate = 1131207
        and ampcputime > 0
      */
      /* For online DBQL */
      from dbc.dbqlogtbl
      where cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
        and ampcputime > 0
    ) x
  ) y
) z
group by 1
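The cumulative-ranking logic of this query can be sketched in Python with hypothetical per-query loss values (loss = impact minus total, as in the innermost select):

```python
# Hypothetical per-query resource loss (ResourceImpact - ResourceTotal)
losses = [500.0, 200.0, 100.0, 60.0, 40.0, 30.0, 30.0, 20.0, 10.0, 10.0]

total_loss = sum(losses)

def queries_for(pct):
    """Smallest number of top queries whose cumulative loss reaches pct% of the total,
    mirroring the running-sum window (rows unbounded preceding) in the SQL."""
    running = 0.0
    for rank, loss in enumerate(sorted(losses, reverse=True), start=1):
        running += loss
        if running * 100.0 / total_loss >= pct:
            return rank
    return len(losses)

for pct in (5, 10, 25, 50, 75, 90):
    print(f"TOP{pct}%Loss: {queries_for(pct)} queries")
```

With a loss distribution this top-heavy, a single query accounts for half of the total waste, which is the point the article is building toward: a handful of queries dominate the loss.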

I expect you are a bit shocked now at how few queries waste how much golden resource. I think we will agree that it is worth tuning those dozen queries; you can save on the order of 100K..M USD for your company annually, am I right?