DWDM Unit 2 Part 2 by Jithender Tulasi


Transcript of DWDM Unit 2 Part 2 by Jithender Tulasi

Page 1: DWDM Unit 2 Part 2 by Jithender Tulasi

DWDM Unit-2 Part-2

Data Cube Computation & Data Generalization

By

Jithender Tulasi

Page 2: DWDM Unit 2 Part 2 by Jithender Tulasi

• Data generalization is a process that abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels.

• Data warehousing and OLAP perform data generalization by summarizing data at varying levels of abstraction.

Page 3: DWDM Unit 2 Part 2 by Jithender Tulasi

1. Efficient Methods for Data Cube Computation: Data cube computation is an essential task in data warehouse implementation.

1.1: A Road Map for Materialization of Different Kinds of Cubes:

• Data cubes facilitate the on-line analytical processing of multidimensional data.

Page 4: DWDM Unit 2 Part 2 by Jithender Tulasi

1.1: Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Shell Cube

• Figure below shows a 3-D data cube for the dimensions A, B, and C, and an aggregate measure, M.

• A data cube is a lattice of cuboids.

• Each cuboid represents a group-by.

• ABC is the base cuboid, containing all three of the dimensions. Here, the aggregate measure, M, is computed for each possible combination of the three dimensions.
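
To make the lattice of cuboids concrete, here is a minimal Python sketch (not part of the original slides): it enumerates every group-by of a tiny, made-up fact table over dimensions A, B, and C and sums the measure M for each cell, producing all 2^3 = 8 cuboids.

from itertools import combinations
from collections import defaultdict

# Toy fact table: (A, B, C, M) -- the values are hypothetical.
facts = [
    ("a1", "b1", "c1", 5),
    ("a1", "b2", "c1", 3),
    ("a2", "b1", "c2", 7),
]

dims = ("A", "B", "C")

def compute_full_cube(rows):
    """Aggregate measure M for every cuboid (group-by) in the lattice."""
    cube = {}
    # A 3-D cube has 2^3 = 8 cuboids: ABC, AB, AC, BC, A, B, C, and the apex (all).
    for k in range(len(dims), -1, -1):
        for group in combinations(range(len(dims)), k):
            cells = defaultdict(int)
            for row in rows:
                # Dimensions not in the group-by are generalized to '*'.
                key = tuple(row[i] if i in group else "*" for i in range(len(dims)))
                cells[key] += row[3]
            cube[tuple(dims[i] for i in group) or ("all",)] = dict(cells)
    return cube

for cuboid, cells in compute_full_cube(facts).items():
    print(cuboid, cells)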

Page 5: DWDM Unit 2 Part 2 by Jithender Tulasi
Page 6: DWDM Unit 2 Part 2 by Jithender Tulasi

• The base cuboid is the least generalized of all of the cuboids in the data cube.

• The most generalized cuboid is the apex cuboid, commonly represented as all.

• It contains one value—it aggregates measure M for all of the tuples stored in the base cuboid.

• To drill down in the data cube, we move from the apex cuboid, downward in the lattice.

• To roll up, we move from the base cuboid, upward.

• A cell in the base cuboid is a base cell.

• A cell from a non-base cuboid is an aggregate cell.

Page 7: DWDM Unit 2 Part 2 by Jithender Tulasi

Example: Base and aggregate cells.

• Consider a data cube with the dimensions month, city, and customer group, and the measure price.

• (Jan, *, *, 2800) and (*, Toronto, *, 1200) are 1-D cells.

• (Jan, *, Business, 150) is a 2-D cell.

• (Jan, Toronto, Business, 45) is a 3-D cell.

• Here, all base cells are 3-D, whereas 1-D and 2-D cells are aggregate cells.

Page 8: DWDM Unit 2 Part 2 by Jithender Tulasi

Example: Ancestor and descendant cells.

• The 1-D cell a = (Jan, *, *, 2800) and the 2-D cell b = (Jan, *, Business, 150) are ancestors of the 3-D cell c = (Jan, Toronto, Business, 45).

• c is a descendant of both a and b.

• b is a parent of c, and c is a child of b.

• In order to ensure fast on-line analytical processing, it is sometimes desirable to pre-compute the full cube (i.e., all the cells of all of the cuboids for a given data cube).

Page 9: DWDM Unit 2 Part 2 by Jithender Tulasi

• A data cube of n dimensions contains 2^n cuboids.

• Individual cuboids may be stored on secondary storage and accessed when necessary.

• Partial materialization of data cubes offers an interesting trade-off between storage space and response time for OLAP.

• In many cells of a cuboid, the measure value will be zero.

• When the product of the cardinalities of the dimensions in a cuboid is large relative to the number of nonzero-valued tuples stored in the cuboid, we say that the cuboid is sparse.

Page 10: DWDM Unit 2 Part 2 by Jithender Tulasi

• If a cube contains many sparse cuboids, we say that the cube is sparse.

• In such cases, it is often useful to materialize only those cube cells whose measure value is above some minimum threshold.

• The cells that cannot pass the threshold are likely to be too trivial to warrant further analysis.

• Such partially materialized cubes are known as iceberg cubes.

• The minimum threshold is called the minimum support threshold, or minimum support (min_sup).

• An iceberg cube can be specified with an SQL query, as shown in the following example.
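
The SQL form of the iceberg query did not survive in this transcript. The following Python sketch (table, column names, and data are assumptions for illustration) captures the same idea: group by the chosen dimensions and keep only the cells whose aggregate count reaches min_sup, which is what an SQL GROUP BY ... HAVING count(*) >= min_sup clause would express.

from collections import Counter

# Hypothetical task-relevant tuples: (month, city, customer_group).
sales = [
    ("Jan", "Toronto", "Business"),
    ("Jan", "Toronto", "Business"),
    ("Jan", "Vancouver", "Education"),
    ("Feb", "Toronto", "Business"),
]

def iceberg_cuboid(rows, min_sup):
    """Keep only the (month, city, customer_group) cells with count >= min_sup.

    Roughly the same idea as the SQL form (table and column names are assumed):
        SELECT month, city, customer_group, COUNT(*)
        FROM salesInfo
        GROUP BY month, city, customer_group
        HAVING COUNT(*) >= min_sup
    """
    counts = Counter(rows)
    return {cell: cnt for cell, cnt in counts.items() if cnt >= min_sup}

print(iceberg_cuboid(sales, min_sup=2))
# {('Jan', 'Toronto', 'Business'): 2}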

Page 11: DWDM Unit 2 Part 2 by Jithender Tulasi

• A cell, c, is a closed cell if there exists no cell, d, such that d is a specialization (descendant) of cell c and d has the same measure value as c.

• A closed cube is a data cube consisting of only closed cells.

Page 12: DWDM Unit 2 Part 2 by Jithender Tulasi

• For example, "(a1, *, *, *) : 20" can be derived from "(a1, a2, *, *) : 20", because the former is a generalized (nonclosed) cell of the latter.
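
A minimal sketch, assuming cells are written as tuples with '*' for generalized dimensions, of the closedness test described above: a cell is closed only if no descendant (specialization) carries the same measure value. The cells and counts are made up.

def is_descendant(d, c):
    """d is a descendant (specialization) of c: it agrees with c on every
    dimension where c is instantiated, and it is not identical to c."""
    return d != c and all(cv == "*" or cv == dv for cv, dv in zip(c, d))

def is_closed(cell, measure, cells):
    """cell is closed if no descendant in `cells` carries the same measure value."""
    return not any(
        is_descendant(d, cell) and m == measure for d, m in cells.items()
    )

# Toy 4-D cube cells (hypothetical values).
cells = {
    ("a1", "*", "*", "*"): 20,
    ("a1", "a2", "*", "*"): 20,
    ("a1", "a2", "b3", "*"): 20,
}
print(is_closed(("a1", "*", "*", "*"), 20, cells))    # False: a descendant has the same count
print(is_closed(("a1", "a2", "b3", "*"), 20, cells))  # True within this toy set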

Page 13: DWDM Unit 2 Part 2 by Jithender Tulasi

1.1.1: General Strategies for Cube Computation:

• In general, there are two basic data structures used for storing cuboids.

1. Relational tables are used as the basic data structure for the implementation of relational OLAP (ROLAP).

2. Multidimensional arrays are used as the basic data structure in multidimensional OLAP (MOLAP).

• Although ROLAP and MOLAP may each explore different cube computation techniques, some optimization “tricks” can be shared among the different data representations.

Page 14: DWDM Unit 2 Part 2 by Jithender Tulasi

• The following are general optimization techniques for the efficient computation of data cubes.

1. Optimization Technique 1: Sorting, hashing, and grouping.

2. Optimization Technique 2: Simultaneous aggregation and caching intermediate results.

3. Optimization Technique 3: Aggregation from the smallest child, when there exist multiple child cuboids.

Page 15: DWDM Unit 2 Part 2 by Jithender Tulasi

4. Optimization Technique 4: The Apriori pruning method can be explored to compute iceberg cubes efficiently.

• The Apriori property, in the context of data cubes, states the following: if a given cell does not satisfy minimum support, then no descendant (i.e., more specialized or detailed version) of the cell will satisfy minimum support either.

• This property can be used to substantially reduce the computation of iceberg cubes.
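
A small illustration (not from the slides) of why the Apriori property holds for antimonotonic measures such as count: specializing a cell can only shrink the set of matching tuples, so a cell that fails min_sup can be pruned together with all of its descendants. The tuples and min_sup are hypothetical.

# Hypothetical tuples over the dimensions (month, city, customer_group).
rows = [("Jan", "Toronto", "Business"),
        ("Jan", "Toronto", "Education"),
        ("Feb", "Toronto", "Business")]

def count(cell):
    """Count the tuples matching a cell; '*' matches any value."""
    return sum(all(c == "*" or c == v for c, v in zip(cell, r)) for r in rows)

ancestor = ("Jan", "*", "*")                   # a more generalized cell
descendant = ("Jan", "Toronto", "Business")    # one of its specializations

# Counts can only shrink (or stay equal) as a cell is specialized, so if the
# ancestor already fails min_sup, every descendant fails it too and can be pruned.
assert count(descendant) <= count(ancestor)
min_sup = 3
if count(ancestor) < min_sup:
    print("prune: no descendant of", ancestor, "can reach min_sup")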

Page 16: DWDM Unit 2 Part 2 by Jithender Tulasi

1.2: Multiway Array Aggregation for Full Cube Computation:

• The Multiway Array Aggregation (MultiWay) method computes a full data cube by using a multidimensional array as its basic data structure.

• It is a typical MOLAP approach that uses direct array addressing, where dimension values are accessed via the position or index of their corresponding array locations.

• Hence, MultiWay cannot perform any value-based reordering.

• A different approach is developed for the array-based cube construction, as follows:

Page 17: DWDM Unit 2 Part 2 by Jithender Tulasi

1. Partition the array into chunks. A chunk is a subcube that is small enough to fit into the memory available for cube computation.

• Chunking is a method for dividing an n-dimensional array into small n-dimensional chunks, where each chunk is stored as an object on disk.

• The chunks are compressed so as to remove wasted space resulting from empty array cells (i.e., cells that do not contain any valid data, whose cell count is zero).

• For instance, "chunkID + offset" can be used as a cell addressing mechanism to compress a sparse array structure and to support efficient lookup of cells within a chunk. Such a compression technique is powerful enough to handle sparse cubes, both on disk and in memory.
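
A minimal sketch (not from the slides) of the "chunkID + offset" addressing idea: each nonempty cell of a chunked 3-D array is stored under its (chunk_id, offset) pair, so empty cells take no space. The array size, chunk size, and values are assumptions for the example.

# A 4x4x4 array split into 2x2x2 chunks (8 chunks in total); values are hypothetical.
DIM = (4, 4, 4)
CHUNK = (2, 2, 2)

def chunk_id_and_offset(i, j, k):
    """Map a global cell (i, j, k) to its chunk id and its offset inside that chunk."""
    ci, cj, ck = i // CHUNK[0], j // CHUNK[1], k // CHUNK[2]
    chunks_per_dim = tuple(d // c for d, c in zip(DIM, CHUNK))
    chunk_id = (ci * chunks_per_dim[1] + cj) * chunks_per_dim[2] + ck
    oi, oj, ok = i % CHUNK[0], j % CHUNK[1], k % CHUNK[2]
    offset = (oi * CHUNK[1] + oj) * CHUNK[2] + ok
    return chunk_id, offset

# Sparse storage: only nonzero cells are kept, addressed by (chunk_id, offset).
sparse = {}
for (i, j, k), value in {(0, 0, 0): 5, (3, 3, 3): 7}.items():
    sparse[chunk_id_and_offset(i, j, k)] = value
print(sparse)  # {(0, 0): 5, (7, 7): 7}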

Page 18: DWDM Unit 2 Part 2 by Jithender Tulasi

• Because this chunking technique involves “overlapping” some of the aggregation computations, it is referred to as multiway array aggregation.

• It performs simultaneous aggregation, i.e., it computes aggregations simultaneously on multiple dimensions.

Page 19: DWDM Unit 2 Part 2 by Jithender Tulasi

Example: Multiway array cube computation.

• Consider a 3-D data array containing the three dimensions A, B, and C. The 3-D array is partitioned into small, memory-based chunks. In this example, the array is partitioned into 64 chunks as shown in Figure 4.3. Dimension A is organized into four equal-sized partitions, a0, a1, a2, and a3. Dimensions B and C are similarly organized into four partitions each. Chunks 1, 2, ..., 64 correspond to the subcubes a0b0c0, a1b0c0, ..., a3b3c3, respectively.

• Suppose that the cardinalities of the dimensions A, B, and C are 40, 400, and 4000, respectively. Thus, the size of the array for each dimension, A, B, and C, is also 40, 400, and 4000, respectively, and the size of each partition in A, B, and C is 10, 100, and 1000, respectively.

• Full materialization of the corresponding data cube involves the computation of all of the cuboids defining this cube. The resulting full cube consists of the following cuboids:

Page 20: DWDM Unit 2 Part 2 by Jithender Tulasi

• The base cuboid, denoted by ABC (from which all of the other cuboids are directly or indirectly computed).

• The three 2-D cuboids: AB, AC, and BC.

• The three 1-D cuboids: A, B, and C.

• The 0-D (apex) cuboid, denoted by all.
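
The memory analysis of this example was not preserved in the transcript. As a hedged illustration of why the scan order matters, the following arithmetic assumes the chunks are scanned in the order 1 through 64 (a varying fastest, then b, then c): only one BC-plane chunk, one row of AC-plane chunks, and the whole (smallest) AB plane must then be held in memory at the same time.

# Cardinalities of A, B, C and the number of partitions per dimension, from the example.
card = {"A": 40, "B": 400, "C": 4000}
parts = 4
chunk = {d: card[d] // parts for d in card}   # partition sizes 10, 100, 1000

# Minimum memory (in cells) for one pass over chunks 1..64 in the assumed order:
ab_plane = card["A"] * card["B"]      # the whole AB plane stays in memory:  16,000
ac_row   = card["A"] * chunk["C"]     # one row of AC-plane chunks:          40,000
bc_chunk = chunk["B"] * chunk["C"]    # a single BC-plane chunk at a time:  100,000
print(ab_plane + ac_row + bc_chunk)   # 156,000 cells in total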
Page 21: DWDM Unit 2 Part 2 by Jithender Tulasi

1.3: BUC: Computing Iceberg Cubes from the Apex Cuboid Downward:

• BUC is an algorithm for the computation of sparse and iceberg cubes.

• Unlike MultiWay, BUC constructs the cube from the apex cuboid toward the base cuboid.

• This allows BUC to share data partitioning costs.

• This order of processing also allows BUC to prune during construction, using the Apriori property.

• BUC stands for "Bottom-Up Construction."
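
A minimal BUC-style sketch (an illustration of the idea, not the published algorithm): starting from the apex, the current partition is aggregated, then split on each remaining dimension, and only partitions whose count reaches min_sup are recursed into, so partitioning work is shared and descendants of failing cells are never examined. The data and min_sup are made up.

def buc(partition, dims_from, cell, min_sup, out):
    """Aggregate `partition`, then split it on each dimension >= dims_from and recurse."""
    if len(partition) < min_sup:
        return  # Apriori pruning: this cell (and all of its descendants) fail min_sup.
    out[tuple(cell)] = len(partition)
    for d in range(dims_from, len(cell)):
        # Partition the current data on dimension d; each recursive call only
        # ever sees the tuples of its own partition (shared partitioning cost).
        groups = {}
        for row in partition:
            groups.setdefault(row[d], []).append(row)
        for value, rows in sorted(groups.items()):
            cell[d] = value
            buc(rows, d + 1, cell, min_sup, out)
        cell[d] = "*"

# Hypothetical 3-D tuples (A, B, C); the measure is count.
data = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b2", "c1")]
result = {}
buc(data, 0, ["*", "*", "*"], min_sup=2, out=result)
print(result)
# {('*','*','*'): 3, ('a1','*','*'): 2, ('a1','b1','*'): 2, ('*','b1','*'): 2, ('*','*','c1'): 2}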

Page 22: DWDM Unit 2 Part 2 by Jithender Tulasi

• It consolidates the notions of drill-down (where we can move from a highly aggregated cell to lower, more detailed cells) and roll-up (where we can move from detailed, low-level cells to higher level, more aggregated cells).

Page 23: DWDM Unit 2 Part 2 by Jithender Tulasi
Page 24: DWDM Unit 2 Part 2 by Jithender Tulasi

Snapshot of BUC partitioning given an example 4-D data set.

Page 25: DWDM Unit 2 Part 2 by Jithender Tulasi

1.4: Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure:

• Star-Cubing is an algorithm for computing iceberg cubes.

• Star-Cubing combines the strengths of the other methods: it integrates top-down and bottom-up cube computation, and it explores both multidimensional aggregation and Apriori-like pruning.

• It operates from a data structure called a star-tree, which performs lossless data compression, thereby reducing the computation time and memory requirements.

• Star-Cubing’s approach is illustrated below for the computation of a 4-D data cube.

Page 26: DWDM Unit 2 Part 2 by Jithender Tulasi

Star-Cubing: Bottom-up computation with top-down expansion of shared dimensions.

Page 27: DWDM Unit 2 Part 2 by Jithender Tulasi

• To explain how the Star-Cubing algorithm works, we need a few more concepts: cuboid trees, star-nodes, and star-trees.

• We use trees to represent individual cuboids.

• The figure below shows a fragment of the cuboid tree of the base cuboid, ABCD.

• Each level in the tree represents a dimension, and each node represents an attribute value.

• Each node has four fields: the attribute value, aggregate value, pointer(s) to possible descendant(s), and pointer to a possible sibling.

• Tuples in the cuboid are inserted one by one into the tree.

Page 28: DWDM Unit 2 Part 2 by Jithender Tulasi

A fragment of the base cuboid tree.

Page 29: DWDM Unit 2 Part 2 by Jithender Tulasi

• If the one-dimensional aggregate of an attribute value does not satisfy the iceberg condition, the value is useless for computing iceberg cells; such a value is replaced by * in the tree, and the corresponding node is called a star-node.

• A cuboid tree that is compressed using star-nodes is called a star-tree. A star-table is constructed for each star-tree. The following is an example of star-tree construction.

• Example: Star-tree construction. A base cuboid table is shown in Table 4.1. There are 5 tuples and 4 dimensions. The cardinalities of dimensions A, B, C, and D are 2, 4, 4, and 4, respectively. The one-dimensional aggregates for all attributes are shown in Table 4.2. Suppose min_sup = 2 in the iceberg condition. Clearly, only the attribute values a1, a2, b1, c3, and d4 satisfy the condition. All the other values are below the threshold and thus become star-nodes. By collapsing star-nodes, the reduced base table is Table 4.3.
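
A minimal Python sketch of the star-table step (the tuples are hypothetical but chosen to be consistent with the example): count the one-dimensional aggregate of every attribute value, treat values below min_sup as star-nodes, and collapse them to '*' to obtain the reduced base table.

from collections import Counter

# Hypothetical base cuboid tuples over dimensions A, B, C, D (5 tuples, as in the example).
base = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b1", "c4", "d3"),
    ("a1", "b2", "c2", "d2"),
    ("a2", "b3", "c3", "d4"),
    ("a2", "b4", "c3", "d4"),
]
min_sup = 2

# Star-table: any attribute value whose 1-D aggregate count is below min_sup is a star-node.
counts = Counter(v for row in base for v in row)
star_table = {v: (c if c >= min_sup else "*") for v, c in counts.items()}
print(star_table)

# Collapse star-nodes to '*' to get the reduced base table used to build the star-tree.
reduced = [tuple(v if counts[v] >= min_sup else "*" for v in row) for row in base]
for row in reduced:
    print(row)
# ('a1', 'b1', '*', '*')
# ('a1', 'b1', '*', '*')
# ('a1', '*', '*', '*')
# ('a2', '*', 'c3', 'd4')
# ('a2', '*', 'c3', 'd4')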

Page 30: DWDM Unit 2 Part 2 by Jithender Tulasi
Page 31: DWDM Unit 2 Part 2 by Jithender Tulasi
Page 32: DWDM Unit 2 Part 2 by Jithender Tulasi
Page 33: DWDM Unit 2 Part 2 by Jithender Tulasi

Aggregation Stage One: Processing of the left-most branch of BaseTree.

Page 34: DWDM Unit 2 Part 2 by Jithender Tulasi

2. Further Development of Data Cube and OLAP Technology:

2.1: Discovery-Driven Exploration of Data Cubes:

• A data cube may have a large number of cuboids, and each cuboid may contain a large number of (aggregate) cells.

• With such an overwhelmingly large space, it becomes a burden for users to even just browse a cube.

• Tools need to be developed to assist users in intelligently exploring the huge aggregated space of a data cube.

Page 35: DWDM Unit 2 Part 2 by Jithender Tulasi

• Discovery-driven exploration is such a cube exploration approach.

• In discovery driven exploration, precomputed measures indicating data exceptions are used to guide the user in the data analysis process, at all levels of aggregation.

• An exception is a data cube cell value that is significantly different from the value anticipated, based on a statistical model.

• For example, if the analysis of item-sales data reveals an increase in sales in December in comparison to all other months, this may seem like an exception in the time dimension.

Page 36: DWDM Unit 2 Part 2 by Jithender Tulasi

• Three measures are used as exception indicators to help identify data anomalies. These measures indicate the degree of surprise that the quantity in a cell holds, with respect to its expected value. They are as follows:

1. SelfExp: This indicates the degree of surprise of the cell value, relative to other cells at the same level of aggregation.

2. InExp: This indicates the degree of surprise somewhere beneath the cell, if we were to drill down from it.

3. PathExp: This indicates the degree of surprise for each drill-down path from the cell.
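
The published approach fits a statistical model to the cube and standardizes residuals at each aggregation level; purely as a rough illustration of the "degree of surprise" idea (not the actual SelfExp computation), the sketch below flags a cell whose value lies far from the mean of the cells at the same level. All numbers are invented.

from statistics import mean, stdev

# Hypothetical monthly sales values at one level of aggregation.
monthly_sales = {"Jan": 100, "Feb": 103, "Mar": 98, "Apr": 101,
                 "May": 99, "Jun": 102, "Jul": 100, "Aug": 97,
                 "Sep": 101, "Oct": 99, "Nov": 104, "Dec": 160}

values = list(monthly_sales.values())
mu, sigma = mean(values), stdev(values)

# A crude "degree of surprise": how many standard deviations a cell is from the mean.
surprise = {m: abs(v - mu) / sigma for m, v in monthly_sales.items()}
exceptions = {m: round(s, 2) for m, s in surprise.items() if s > 2.0}
print(exceptions)  # only Dec stands out in this toy data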

Page 37: DWDM Unit 2 Part 2 by Jithender Tulasi

• The InExp values can be used to indicate exceptions at lower levels that are not visible at the current level.

• Example: Discovery-driven exploration of a data cube. Suppose that you would like to analyze the monthly sales at AllElectronics as a percentage difference from the previous month. The dimensions involved are item, time, and region. You begin by studying the data aggregated over all items and sales regions for each month, as shown on the next slide.

Page 38: DWDM Unit 2 Part 2 by Jithender Tulasi

Change in sales over time.

Page 39: DWDM Unit 2 Part 2 by Jithender Tulasi
Page 40: DWDM Unit 2 Part 2 by Jithender Tulasi

3. Attribute-Oriented Induction—An Alternative Method for Data Generalization and Concept Description:

• Data generalization summarizes data by replacing relatively low-level values with higher-level concepts.

• Concept description is a form of data generalization.

• A concept typically refers to a collection of data such as frequent buyers, graduate students, and so on.

Page 41: DWDM Unit 2 Part 2 by Jithender Tulasi

• It is sometimes called class description, when the concept to be described refers to a class of objects.

• Characterization provides a summarization of the given collection of data, while concept or class comparison (also known as discrimination) provides descriptions comparing two or more collections of data.

• Attribute-oriented induction is an alternative method for concept description; it works for complex types of data and relies on a data-driven generalization process.

Page 42: DWDM Unit 2 Part 2 by Jithender Tulasi

3.1: Attribute-Oriented Induction for Data Characterization:

• The attribute-oriented induction (AOI) approach to concept description was first proposed in 1989.

• The data cube approach is based on materialized views of the data, which typically have been pre-computed in a data warehouse. In general, it performs off-line aggregation before an OLAP or data mining query is submitted for processing.

• On the other hand, the attribute-oriented induction approach is basically a query-oriented, generalization-based, on-line data analysis technique.

Page 43: DWDM Unit 2 Part 2 by Jithender Tulasi

• The general idea of attribute-oriented induction is to first collect the task-relevant data using a database query and then perform generalization based on the examination of the number of distinct values of each attribute in the relevant set of data.

• The generalization is performed by either attribute removal or attribute generalization.

• Aggregation is performed by merging identical generalized tuples and accumulating their respective counts. This reduces the size of the generalized data set.

• The resulting generalized relation can be mapped into different forms for presentation to the user, such as charts or rules.

Page 44: DWDM Unit 2 Part 2 by Jithender Tulasi

• The following examples illustrate the process of attribute-oriented induction.

• Example: A data mining query for characterization. Suppose that a user would like to describe the general characteristics of graduate students in the Big University database, given the attributes name, gender, major, birth place, birth date, residence, phone, and gpa (grade point average).

Page 45: DWDM Unit 2 Part 2 by Jithender Tulasi

A data mining query for this characterization can be expressed in the data mining query language, DMQL, as follows:

Page 46: DWDM Unit 2 Part 2 by Jithender Tulasi

1. Data focusing should be performed before attribute-oriented induction.

2. The data are collected based on the information provided in the data mining query.

Page 47: DWDM Unit 2 Part 2 by Jithender Tulasi

• The data mining query presented above is transformed into the following relational query for the collection of the task-relevant set of data:

Page 48: DWDM Unit 2 Part 2 by Jithender Tulasi

• Attribute-oriented induction. Here we show how attribute-oriented induction is performed on the initial working relation of Table 4.12. For each attribute of the relation, the generalization proceeds as follows:

1. name: Since there are a large number of distinct values for name and there is no generalization operation defined on it, this attribute is removed.

2. gender: Since there are only two distinct values for gender, this attribute is retained and no generalization is performed on it.

3. major: Suppose that a concept hierarchy has been defined that allows the attribute major to be generalized to the values {arts & science, engineering, business}.

Page 49: DWDM Unit 2 Part 2 by Jithender Tulasi

4. birth place: This attribute has a large number of distinct values; therefore, we would like to generalize it. The remaining attributes are handled in the same way, by examining their numbers of distinct values.
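
A minimal sketch (not from the slides) of attribute-oriented induction on a toy student relation: name is removed, gender is kept, major is generalized through a small, assumed concept hierarchy, and identical generalized tuples are merged while their counts are accumulated.

from collections import Counter

# Hypothetical initial working relation: (name, gender, major).
students = [
    ("J. Smith", "M", "CS"),
    ("K. Lee", "M", "CS"),
    ("L. Chan", "F", "CS"),
    ("M. Wong", "F", "Physics"),
    ("A. Kumar", "M", "Commerce"),
]

# Assumed concept hierarchy for major.
major_hierarchy = {"CS": "engineering", "Physics": "arts & science", "Commerce": "business"}

def generalize(row):
    name, gender, major = row
    # Attribute removal: name has many distinct values and no generalization operator.
    # Attribute retention: gender has only two distinct values, so it is kept as-is.
    # Attribute generalization: major is climbed up its concept hierarchy.
    return (gender, major_hierarchy[major])

# Merge identical generalized tuples and accumulate their counts.
prime_relation = Counter(generalize(r) for r in students)
for tup, count in prime_relation.items():
    print(tup, "count =", count)
# ('M', 'engineering') count = 2
# ('F', 'engineering') count = 1
# ('F', 'arts & science') count = 1
# ('M', 'business') count = 1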

Page 50: DWDM Unit 2 Part 2 by Jithender Tulasi

Presentation of the Derived Generalization:

• The descriptions can be presented to the user in a number of different ways.

• Generalized descriptions resulting from attribute-oriented induction are most commonly displayed in the form of a generalized relation.

Example: Generalized relation (table).

• Suppose that attribute-oriented induction was performed on a sales relation of the AllElectronics database, resulting in the generalized description of Table 4.14 for sales in 2004. The description is shown in the form of a generalized relation.

Page 51: DWDM Unit 2 Part 2 by Jithender Tulasi

Descriptions can also be visualized in the form of cross-tabulations, or crosstabs.
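
A minimal sketch (not from the slides) that turns a generalized relation of (location, item, count) rows into a crosstab with row and column totals; the rows are hypothetical.

# Hypothetical generalized relation: (location, item, count).
rows = [
    ("Asia", "TV", 15),
    ("Asia", "computer", 120),
    ("Europe", "TV", 12),
    ("Europe", "computer", 150),
]

locations = sorted({loc for loc, _, _ in rows})
items = sorted({item for _, item, _ in rows})
table = {(loc, item): cnt for loc, item, cnt in rows}

# Print the crosstab with row and column totals.
print("\t".join(["location"] + items + ["total"]))
for loc in locations:
    counts = [table.get((loc, item), 0) for item in items]
    print("\t".join([loc] + [str(c) for c in counts] + [str(sum(counts))]))
col_totals = [sum(table.get((loc, item), 0) for loc in locations) for item in items]
print("\t".join(["total"] + [str(c) for c in col_totals] + [str(sum(col_totals))]))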

Page 52: DWDM Unit 2 Part 2 by Jithender Tulasi

• Generalized data can be presented graphically, using bar charts, pie charts, and curves.

• Visualization with graphs is popular in data analysis. Such graphs and curves can represent 2-D or 3-D data.

• Example: Bar chart and pie chart. The sales data of the crosstab shown on the next slide can be transformed into a bar chart representation and a pie chart representation.

Page 53: DWDM Unit 2 Part 2 by Jithender Tulasi
Page 54: DWDM Unit 2 Part 2 by Jithender Tulasi
Page 55: DWDM Unit 2 Part 2 by Jithender Tulasi

• The size of a cell (displayed as a tiny cube) represents the count of the corresponding cell, while the brightness of the cell can be used to represent another measure of the cell, such as sum (sales).

• A generalized relation may also be represented in the form of logic rules.

• A logic rule that is associated with quantitative information is called a quantitative rule.
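
The rule itself did not survive in the transcript. In the standard treatment, each disjunct of a quantitative characteristic rule carries a t-weight, the fraction of the target-class count that the disjunct covers; the sketch below computes t-weights from a hypothetical crosstab row for computer items.

# Hypothetical counts of computer sales (the target class) per location,
# e.g. one row of the crosstab from the earlier slides.
computer_sales = {"Asia": 120, "Europe": 150, "North America": 230}

total = sum(computer_sales.values())

# t-weight of each disjunct: the percentage of target-class tuples it covers
# (a standard way to quantify a characteristic rule; not spelled out in these slides).
t_weights = {loc: round(100 * cnt / total, 2) for loc, cnt in computer_sales.items()}
print(t_weights)
# {'Asia': 24.0, 'Europe': 30.0, 'North America': 46.0}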

Page 56: DWDM Unit 2 Part 2 by Jithender Tulasi
Page 57: DWDM Unit 2 Part 2 by Jithender Tulasi

Example: Quantitative characteristic rule.

• The crosstab can be transformed into logic rule form. Let the target class be the set of computer items. The corresponding characteristic rule, in logic form, is:


Thank You