Stack It And Unpack It

description

Partitioning and Compression for Data Warehouses.

Transcript of Stack It And Unpack It

1. Stack It & Pack It

  • Partitioning And Compression For Warehouses / VLDB
  • Jeff Moss

2. Who Dunnit ?

3. Agenda

  • My background
  • Squeeze your data with data segment compression
  • Partition for success
  • Questions

4. My Background

  • Independent Consultant
  • 13 years Oracle experience
  • Blog: http://oramossoracle.blogspot.com/
  • Focused on warehousing / VLDB since 1998
  • First project
    • UK Music Sales Data Mart
    • Produces BBC Radio 1 Top 40 chart and many more
    • 2 billion row sales fact table
    • 1 TB total database size
  • Currently working with Eon UK (Powergen)
    • 4 TB Production Warehouse, 8 TB total storage
    • Oracle Product Stack

5. What Is Data Segment Compression ?

  • Compresses data by eliminating intra block repeated column values
  • Reduces the space required for a segment
    • but only if there are appropriate repeats!
  • Self contained
  • Lossless algorithm
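
As a rough illustration of the space saving, a compressed copy of a table can be built and the two segments compared. This is only a sketch: the table names SALES_FACT and SALES_FACT_COMP are hypothetical and do not come from the presentation.

```sql
-- Hypothetical names: SALES_FACT is assumed to be an existing heap table.
-- CTAS is a direct path operation, so rows are compressed as they are loaded.
CREATE TABLE sales_fact_comp
COMPRESS
AS
SELECT * FROM sales_fact;

-- Compare the space used by the original and the compressed copy.
SELECT segment_name,
       blocks,
       ROUND(bytes / 1024 / 1024) AS size_mb
FROM   user_segments
WHERE  segment_name IN ('SALES_FACT', 'SALES_FACT_COMP')
ORDER  BY segment_name;
```

How much smaller the copy is depends entirely on how many repeated values land in each block.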

6. Where Can Data Segment Compression Be Used ?

  • Can be used with a number of segment types
    • Heap & Nested Tables
    • Range or List Partitions
    • Materialized Views
  • Can't be used with
    • Subpartitions
    • Hash Partitions
    • Indexes (but they have their own key compression)
    • IOT
    • External Tables
    • Tables that are part of a Cluster
    • LOBs

7. How Does Segment Compression Work ?

  • Database Block components
    • Block Common Header (20 bytes)
    • Transaction Header (24 bytes fixed + 24 bytes per ITL)
    • Data Header (14 bytes)
    • Compressed Data Header (16 bytes - variable)
    • Table Directory (8 bytes)
    • Row Directory (2 bytes per row)
    • Symbol Table
    • Row Data Area
    • Tail (4 bytes)
  • Example rows (ID, DESCRIPTION, CONTACT TYPE, OUTCOME, FOLLOWUP)
    • 100, Call to discuss bill amount, TEL, NO, YES
    • 101, Call to discuss new product, MAIL, NO, N/A
    • 102, Call to discuss new product, TEL, YES, N/A
  • Symbol Table
    • 1 = 100, 2 = Call to discuss bill amount, 3 = TEL, 4 = NO, 5 = YES
    • 6 = 101, 7 = Call to discuss new product, 8 = MAIL, 9 = N/A
    • 10 = 102
  • Row Data Area (each row stored as symbol references)
    • Row 100: 1 2 3 4 5
    • Row 101: 6 7 8 4 9
    • Row 102: 10 7 3 5 9

8. What Affects Compression ?

  • Undisclosed Algorithm
    • I asked but support wouldn't play ball!
  • Many Factors
    • Block size
    • Anything which affects block overhead
      • Interested Transaction Lists ( INITRANS )
      • Number of columns
      • Number of rows
      • PCTFREE
    • Number of repeats ( in the block )
    • Length of column value(s)

9. Compression v Block Size

  • 200K rows, Non ASSM Uniform Local extents
  • Larger block = more chance of repeats in any given block

10. Compression v ITL

  • 10K rows, Non ASSM Uniform Local extents
  • More ITL slots = more overhead = fewer repeats

11. Compression v Number Of Columns

  • 500K rows, Non ASSM Uniform Local extents
  • Same amount of data to store
  • More columns = more overhead = fewer repeats

12. Compression v PCTFREE

  • 200K rows, Non ASSM Uniform Local extents
  • Higher PCTFREE = less space = fewer repeats

13. Compression v NDV

  • 200K rows, Non ASSM Uniform Local extents
  • Higher NDV (number of distinct values) = fewer repeats

14. Compression v Column Length

  • 80K rows, Non ASSM Uniform Local extents
  • Minimum 6 characters for compression
  • Longer Length = more compression savings

15. Compression v Ordering

  • Colocate data to maximise compression benefits
  • For maximum compression
    • Minimise the total space required by the segment
    • Identify most compressible column(s)
  • For optimal access
    • We know how the data is to be queried
    • Order the data by
      • Access path columns
      • Then the next most compressible column(s)
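
The ordering idea can be sketched with a CTAS that sorts the rows as it loads them. The table and column names here (CALL_FACT, CONTACT_DATE, DESCRIPTION) are hypothetical, and DESCRIPTION is only assumed to be the most compressible column.

```sql
-- Hypothetical names: CALL_FACT with CONTACT_DATE as the access path column
-- and DESCRIPTION assumed to be the most compressible column.
CREATE TABLE call_fact_ordered
COMPRESS
AS
SELECT *
FROM   call_fact
ORDER  BY contact_date,   -- access path column first
          description;    -- then the next most compressible column

-- A quick look at how repetitive a candidate column is:
-- fewer distinct values and longer values tend to compress better.
SELECT COUNT(*)                        AS total_rows,
       COUNT(DISTINCT description)     AS ndv_description,
       ROUND(AVG(LENGTH(description))) AS avg_len_description
FROM   call_fact;
```

Comparing the block count of the ordered copy against an unordered compressed copy shows whether the sort was worth its cost.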

  • Uniformly distributed: 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
  • Colocated: 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5

16. Get Max Compression Order Package

    PROCEDURE mgmt_p_get_max_compress_order
     Argument Name                  Type                    In/Out Default?
     ------------------------------ ----------------------- ------ --------
     P_TABLE_OWNER                  VARCHAR2                IN     DEFAULT
     P_TABLE_NAME                   VARCHAR2                IN
     P_PARTITION_NAME               VARCHAR2                IN     DEFAULT
     P_SAMPLE_SIZE                  NUMBER                  IN     DEFAULT
     P_PREFIX_COLUMN1               VARCHAR2                IN     DEFAULT
     P_PREFIX_COLUMN2               VARCHAR2                IN     DEFAULT
     P_PREFIX_COLUMN3               VARCHAR2                IN     DEFAULT

    BEGIN
      mgmt_p_get_max_compress_order(p_table_owner => 'AE_MGMT'
                                   ,p_table_name  => 'BIG_TABLE'
                                   ,p_sample_size => 10000);
    END;
    /

    Running mgmt_p_get_max_compress_order...
    ----------------------------------------------------------------------------------------------------
    Table: BIG_TABLE
    Sample Size: 10000
    Unique Run ID: 25012006232119
    ORDER BY Prefix:
    ----------------------------------------------------------------------------------------------------
    Creating MASTER Table: TEMP_MASTER_25012006232119
    Creating COLUMN Table 1: COL1
    Creating COLUMN Table 2: COL2
    Creating COLUMN Table 3: COL3
    ----------------------------------------------------------------------------------------------------
    The output below lists each column in the table and the number of blocks/rows and space
    used when the table data is ordered by only that column, or in the case where a prefix
    has been specified, where the table data is ordered by the prefix and then that column.
    From this one can determine if there is a specific ORDER BY which can be applied to to
    the data in order to maximise compression within the table whilst, in the case of a a
    prefix being present, ordering data as efficiently as possible for the most common
    access path(s).
    ----------------------------------------------------------------------------------------------------
    NAME                           COLUMN                               BLOCKS         ROWS SPACE_GB
    ============================== ============================== ============ ============ ========
    TEMP_COL_001_25012006232119    COL1                                    290        10000    .0022
    TEMP_COL_002_25012006232119    COL2                                    345        10000    .0026
    TEMP_COL_003_25012006232119    COL3                                    555        10000    .0042

17. Pros & Cons

  • Pros
    • Saves space
      • Reduces LIO / PIO
      • Speeds up backup/recovery
      • Improves query response time
    • Transparent
      • To readers
      • and writers
    • Decreases time to perform some DML
      • Deletes should be quicker
      • Bulk inserts may be quicker

18. Pros & Cons

  • Cons
    • Increases CPU load
    • Rows are only compressed by Direct Path operations
      • CTAS
      • Serial Inserts using INSERT /*+ APPEND */
      • Parallel Inserts (PDML)
      • ALTER TABLE ... MOVE
      • Direct Path SQL*Loader
    • Increases time to perform some DML
      • Bulk inserts may be slower
      • Updates are slower
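
A minimal sketch of the direct path point, assuming a compressed target table SALES_FACT and a source table SALES_STAGE with the same columns (both names hypothetical):

```sql
-- Hypothetical names: SALES_FACT (compressed target) and SALES_STAGE (source).

-- Serial direct path insert: rows are written above the high water mark
-- and are compressed as they are loaded.
INSERT /*+ APPEND */ INTO sales_fact
SELECT * FROM sales_stage;
COMMIT;

-- A conventional insert into the same table still works,
-- but the new rows are stored uncompressed.
INSERT INTO sales_fact
SELECT * FROM sales_stage;
COMMIT;

-- Rebuilding the segment compresses the rows already in the table.
ALTER TABLE sales_fact MOVE COMPRESS;
-- Indexes on a moved table become UNUSABLE and need rebuilding, e.g.
-- ALTER INDEX sales_fact_ix REBUILD;   (index name is hypothetical)
```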

19. Data Warehousing Specifics

  • Star Schema compresses better than Normalized
    • More redundant data
  • Focus on
    • Fact Tables and Summaries in Star Schema
    • Transaction tables in Normalized Schema
  • Performance Impact 1
    • Space Savings
      • Star schema: 67%
      • Normalized: 24%
    • Query Elapsed Times
      • Star schema: 16.5%
      • Normalized: 10%

1 - Table Compression in Oracle 9iR2: A Performance Analysis

20. Things To Watch Out For

  • DROP COLUMN is awkward
    • ORA-39726: Unsupported add/drop column operation on compressed tables
    • Uncompress the table and try again - still gives ORA-39726!
  • After UPDATEs data is uncompressed
    • Performance impact
    • Row migration
  • Use appropriate physical design settings
    • PCTFREE 0 - pack each block
    • Large block size - reduce overhead / increase repeats per block
    • Minimise INITRANS - reduce overhead
  • Order data for best compression / access path

21. A Funny Thing

  • Block dump trace files still show 9iR2 even in 10g releases
  • ALTER SYSTEM DUMP DATAFILE x BLOCK y;
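
To find values for x and y, one option is to take them from the ROWID of a row in the table. The sketch below assumes a table called BIG_TABLE and uses made-up file and block numbers. Note that DBMS_ROWID returns the relative file number, which usually but not always matches the absolute file number the dump command expects.

```sql
-- Hypothetical table name BIG_TABLE; any row will do to locate a block.
SELECT DBMS_ROWID.ROWID_RELATIVE_FNO(rowid) AS file_no,
       DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_no
FROM   big_table
WHERE  ROWNUM = 1;

-- Substitute the values returned above (4 and 1234 are made-up examples).
-- The dump is written to a trace file in the session's trace directory
-- (user_dump_dest in 9i / 10g).
ALTER SYSTEM DUMP DATAFILE 4 BLOCK 1234;
```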

Thanks to Julian Dyke for the block dumping information: http://www.juliandyke.com

22. What Is Partitioning ?

  • "Partitioning addresses key issues in supporting very large tables and indexes by letting you decompose them into smaller and more manageable pieces called partitions." (Oracle Database Concepts Manual, 10gR2)
  • Introduced in Oracle 8.0
  • Numerous improvements since
  • Subpartitioning adds another level of decomposition
  • Partitions and Subpartitions are logical containers

23. Partition To Tablespace Mapping

  • Partitions map to tablespaces
    • Partition can only be in One tablespace
    • Tablespace can hold many partitions
    • Highest granularity is One tablespace per partition
    • Lowest granularity is One tablespace for all the partitions
  • Tablespace volatility
    • Read / Write
    • Read Only
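
A sketch of the mapping, using monthly partitions placed in quarterly tablespaces. The partition and tablespace names follow the naming pattern on this slide. The table and its columns are hypothetical, and the tablespaces are assumed to exist already.

```sql
-- Hypothetical table and columns; tablespaces T_Q1_2005 and T_Q2_2005
-- are assumed to have been created beforehand.
CREATE TABLE sales_fact
( sales_date  DATE         NOT NULL
, customer_id NUMBER       NOT NULL
, sales       NUMBER(12,2)
)
PARTITION BY RANGE (sales_date)
( PARTITION p_jan_2005 VALUES LESS THAN (TO_DATE('01-02-2005', 'DD-MM-YYYY')) TABLESPACE t_q1_2005
, PARTITION p_feb_2005 VALUES LESS THAN (TO_DATE('01-03-2005', 'DD-MM-YYYY')) TABLESPACE t_q1_2005
, PARTITION p_mar_2005 VALUES LESS THAN (TO_DATE('01-04-2005', 'DD-MM-YYYY')) TABLESPACE t_q1_2005
, PARTITION p_apr_2005 VALUES LESS THAN (TO_DATE('01-05-2005', 'DD-MM-YYYY')) TABLESPACE t_q2_2005
);

-- Once a quarter is closed and its data no longer changes,
-- the tablespace holding it can be frozen.
ALTER TABLESPACE t_q1_2005 READ ONLY;
```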

[Diagram: monthly partitions P_JAN_2005 to P_DEC_2005 and P_JAN_2006 to P_MAR_2006 mapped to quarterly tablespaces T_Q1_2005 to T_Q4_2005 and T_Q1_2006, each tablespace either Read / Write or Read Only]

24. Read Only Tablespaces

  • Quicker checkpointing
  • Quicker backup
  • Quicker recovery
  • Reduced space use via compression
  • But
    • depends on granularity

25. Why Partition ? - Performance

  • Improved query performance
    • Pruning or elimination
    • Partition wise joins
      • Full
      • Partial
  • Selective Compression
    • By Partition
  • Selective Reorganisation
    • Index Partition REBUILD
    • Table Partition MOVE
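
One way to check that pruning is happening is to look at the execution plan. The sketch below uses the PART_TAB query from this slide and assumes the table is range partitioned by month on SALES_DATE.

```sql
-- PART_TAB is the table name used on this slide; it is assumed here
-- to be range partitioned by month on SALES_DATE.
EXPLAIN PLAN FOR
SELECT SUM(sales)
FROM   part_tab
WHERE  sales_date BETWEEN TO_DATE('01-JAN-2005', 'DD-MON-YYYY')
                      AND TO_DATE('30-JUN-2005', 'DD-MON-YYYY');

-- The Pstart / Pstop columns of the plan show which partitions are touched.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

If pruning works, Pstart and Pstop cover only the January to June partitions rather than the whole table.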

    SELECT SUM(sales)
    FROM part_tab
    WHERE sales_date BETWEEN '01-JAN-2005' AND '30-JUN-2005'

[Diagram: Sales Fact Table with monthly partitions JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC]

* Oracle 10gR2 Data Warehousing Manual

26. Why Partition ? - Manageability

  • Archiving
    • Use a rolling window approach
    • ALTER TABLE ... ADD/SPLIT/DROP PARTITION
  • Easier ETL Processing
    • Build a new dataset in a staging table
    • Add indexes and constraints
    • Collect statistics
    • Then swap the staging table for a partition on the target
      • ALTER TABLE ... EXCHANGE PARTITION
  • Easier Maintenance
    • Table partition move, e.g. to compress data
    • Local Index partition rebuild
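
The ETL steps listed above could look roughly like this. All object names (SALES_FACT, SALES_STAGE, SALES_LOAD, SALES_LOAD_IX, P_APR_2005) are hypothetical. The sketch assumes the staging table has the same columns as the target and that the target has a matching local index on CUSTOMER_ID.

```sql
-- Hypothetical names: SALES_FACT (partitioned target), SALES_STAGE (source),
-- SALES_LOAD (staging table) and partition P_APR_2005.

-- 1. Build the new dataset in a staging table shaped like the target.
CREATE TABLE sales_load
AS
SELECT *
FROM   sales_stage
WHERE  sales_date >= TO_DATE('01-04-2005', 'DD-MM-YYYY')
AND    sales_date <  TO_DATE('01-05-2005', 'DD-MM-YYYY');

-- 2. Add indexes matching the local indexes on the target.
CREATE INDEX sales_load_ix ON sales_load (customer_id);

-- 3. Collect statistics on the staging table.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER,
                                tabname => 'SALES_LOAD',
                                cascade => TRUE);
END;
/

-- 4. Swap the staging table for the (empty) target partition.
--    WITHOUT VALIDATION skips the check that every row belongs in the
--    partition, so it is only safe when the load itself guarantees that.
ALTER TABLE sales_fact
  EXCHANGE PARTITION p_apr_2005 WITH TABLE sales_load
  INCLUDING INDEXES
  WITHOUT VALIDATION;
```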

27. Why Partition ? - Scalability

  • Partition size is generally consistent and predictable
    • Assuming an appropriate partitioning key is used
    • and data has an even distribution across the key
  • Read only approach
    • Scalable backups - read only tablespaces are ignored
    • so partitions in those tablespaces are ignored
  • Pruning allows consistent query performance

28. Why Partition ? - Availability

  • Offline data impact minimised
    • depending on granularity
    • Quicker recovery
    • Pruned data not missed
    • EXCHANGE PARTITION
      • Allows offline build
      • Quick swap over

29. Fact Table Partitioning: Transaction Date v Load Date

  • Partition by Load Date
    • Easier ETL Processing
      • Each load deals with only 1 partition
    • No use to end user queries!
      • Can't prune - full scans!
  • Partition by Transaction Date
    • Harder ETL Processing
      • But still uses EXCHANGE PARTITION
    • Useful to end user queries
      • Allows full pruning capability

Partitioned by Load Date:

    Partition   Tran Date     Customer     Load Date
    January     07-JAN-2005   Customer 1   09-JAN-2005
    January     15-JAN-2005   Customer 2   17-JAN-2005
    February    22-JAN-2005   Customer 3   01-FEB-2005
    February    02-FEB-2005   Customer 4   05-FEB-2005
    February    26-FEB-2005   Customer 5   28-FEB-2005
    March       06-MAR-2005   Customer 2   07-MAR-2005
    March       12-MAR-2005   Customer 3   15-MAR-2005
    April       21-JAN-2005   Customer 7   04-APR-2005
    April       09-APR-2005   Customer 9   10-APR-2005

Partitioned by Transaction Date:

    Partition   Tran Date     Customer     Load Date
    January     07-JAN-2005   Customer 1   09-JAN-2005
    January     15-JAN-2005   Customer 2   17-JAN-2005
    January     21-JAN-2005   Customer 7   04-APR-2005
    January     22-JAN-2005   Customer 3   01-FEB-2005
    February    02-FEB-2005   Customer 4   05-FEB-2005
    February    26-FEB-2005   Customer 5   28-FEB-2005
    March       06-MAR-2005   Customer 2   07-MAR-2005
    March       12-MAR-2005   Customer 3   15-MAR-2005
    April       09-APR-2005   Customer 9   10-APR-2005

30. Watch out for

  • Partition exchange and table statistics 1
    • Partition stats updated
    • but Global stats are NOT!
    • Affects queries accessing multiple partitions
    • Solution
      • Gather stats on staging table prior to EXCHANGE
      • Partition exchange
      • Gather stats on partitioned table using GLOBAL
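
The last step of the solution might be written as below. The table name SALES_FACT is hypothetical. GRANULARITY => 'GLOBAL' asks DBMS_STATS for table-level statistics only, leaving the partition statistics that came in with the exchange untouched.

```sql
BEGIN
  -- Hypothetical table name SALES_FACT, owned by the current user.
  -- After the exchange, refresh only the table-level (global) statistics.
  DBMS_STATS.GATHER_TABLE_STATS(ownname     => USER,
                                tabname     => 'SALES_FACT',
                                granularity => 'GLOBAL',
                                cascade     => FALSE);
END;
/
```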

1 - Jonathan Lewis: Cost-Based Oracle Fundamentals, Chapter 2

31. Partitioning Feature: Characteristic Reason Matrix

  • Features: Partition Truncation, Exchange Partition, Archiving, Pruning (Partition Elimination), Partition wise joins, Parallel DML, Local Indexes, Read Only Partitions
  • Characteristics: Availability, Scalability, Manageability, Performance

32. Questions ?

33. References: Papers

  • Table Compression in Oracle 9iR2: A Performance Analysis
  • Table Compression in Oracle 9iR2: An Oracle White Paper
  • Scaling To Infinity, Partitioning In Oracle Data Warehouses, Tim Gorman
  • Decision Speed: Table Compression In Action

34. References: Online Presentation / Code

  • http://www.oramoss.demon.co.uk/presentations/stackitandpackit.ppt
  • http://www.oramoss.demon.co.uk/Code/mgmt_p_get_max_compression_order.prc
  • http://www.oramoss.demon.co.uk/Code/test_dml_performance_delete.sql
  • http://www.oramoss.demon.co.uk/Code/test_dml_performance_insert.sql
  • http://www.oramoss.demon.co.uk/Code/test_dml_performance_update.sql
  • http://www.oramoss.demon.co.uk/Code/test_block_size_compression.sql
  • http://www.oramoss.demon.co.uk/Code/test_column_length_compression.sql
  • http://www.oramoss.demon.co.uk/Code/test_itl_compression.sql
  • http://www.oramoss.demon.co.uk/Code/test_ndv_compression.sql
  • http://www.oramoss.demon.co.uk/Code/test_num_cols_compression.sql
  • http://www.oramoss.demon.co.uk/Code/test_pctfree_compression.sql