Attribute Oriented Induction

Class Concept Description: Characterization and Discrimination

Data entries can be associated with classes or concepts. For example, in the AllElectronics store, classes

of items for sale include computers and printers, and concepts of customers include bigSpenders and

budgetSpenders.

It can be useful to describe individual classes and concepts in summarized, concise, and yet precise

terms. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions

can be derived using

1. Data Characterization/Generalization/Summarization

2. Data Discrimination

3. Both data characterization and discrimination.


Data characterization is a summarization of the general characteristics or features of a target class of

data. The data corresponding to the user-specified class are typically collected by a database query.

For example, to study the characteristics of software products whose sales increased by 10% in the last

year, the data related to such products can be collected by executing an SQL query. There are several

methods for effective data summarization and characterization.

Several methods to achieve Data Characterization

I. Simple data summaries based on statistical measures and plots

II. The data cube-based OLAP roll-up operation to perform user-controlled data summarization

along a specified dimension.

III. An attribute-oriented induction technique (without step-by-step user interaction)

The output of data characterization can be presented in various forms. Examples include pie charts, bar

charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.

The resulting descriptions can also be presented as generalized relations or in rule form (called

characteristic rules).


Data discrimination is a comparison of the general features of target class data objects with the general

features of objects from one or more contrasting classes. The target and contrasting classes can be

specified by the user, and the corresponding data objects retrieved through database queries.

For example, the user may like to compare the general features of software products whose sales

increased by 10% in the last year with those whose sales decreased by at least 30% during the same

period.

Discrimination descriptions should include comparative measures that help distinguish between the target

and contrasting classes. Discrimination descriptions expressed in rule form are referred to as

discriminant rules.

Attribute Oriented Induction for Data Characterization

Proposed in 1989 (KDD '89 workshop).

Not confined to categorical data or to particular measures.

How is it done?

o Collect the task-relevant data (the initial relation) using a relational database query

o Perform generalization by attribute removal or attribute generalization.

o Apply aggregation by merging identical, generalized tuples and accumulating their

respective counts.

o Interactive presentation with users.
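The steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the original algorithm's implementation: the toy tuples, the `MAJOR_HIERARCHY` mapping, and the function name `generalize_and_merge` are all hypothetical.

```python
from collections import Counter

# Hypothetical initial relation: (name, major) tuples, as if returned by a query.
initial_relation = [
    ("Jim", "CS"),
    ("Scott", "CS"),
    ("Laura", "Physics"),
    ("Anna", "History"),
]

# Hypothetical concept hierarchy mapping each major to a higher-level concept.
MAJOR_HIERARCHY = {"CS": "Science", "Physics": "Science", "History": "Arts"}


def generalize_and_merge(rows):
    """Remove `name`, generalize `major`, and merge identical tuples with counts."""
    generalized = [(MAJOR_HIERARCHY[major],) for _name, major in rows]
    # Counter merges identical generalized tuples and accumulates their counts.
    return Counter(generalized)


result = generalize_and_merge(initial_relation)
for row, count in sorted(result.items()):
    print(row, count)
```

Here `name` is dropped because it has many distinct values and no hierarchy, while `major` is climbed one level before identical tuples are merged.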

Basic Principles of Attribute-Oriented Induction

Data focusing: select the task-relevant data, including dimensions; the result is the initial relation.

Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no

generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes.

Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of

generalization operators on A, then select an operator and generalize A.

Attribute-threshold control: the maximum number of distinct values allowed for an attribute, typically 2-8, either user-specified or a default.

Generalized relation threshold control: controls the size of the final relation or rule set.
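The removal and generalization principles can be read as a per-attribute decision. The sketch below assumes a hypothetical helper `plan_attribute` and a boolean flag `has_hierarchy` that stands in for "a generalization operator exists"; it does not cover case (2), where higher-level concepts are expressed through other attributes.

```python
def plan_attribute(values, threshold, has_hierarchy):
    """Return 'keep', 'generalize', or 'remove' for one attribute.

    values: the attribute's values in the initial relation.
    threshold: attribute threshold (maximum number of distinct values allowed).
    has_hierarchy: assumed flag, True if a generalization operator exists.
    """
    distinct = len(set(values))
    if distinct <= threshold:
        return "keep"  # few enough distinct values already
    # Too many distinct values: generalize if possible, otherwise remove.
    return "generalize" if has_hierarchy else "remove"


# Hypothetical column values for illustration.
names = ["Jim", "Scott", "Laura", "Anna"]
majors = ["CS", "CS", "Physics", "History"]
genders = ["M", "M", "F", "F"]

plan = {
    "name": plan_attribute(names, threshold=2, has_hierarchy=False),
    "major": plan_attribute(majors, threshold=2, has_hierarchy=True),
    "gender": plan_attribute(genders, threshold=2, has_hierarchy=False),
}
```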

Example:

DMQL: Describe general characteristics of graduate students in the Big-University database

use Big_University_DB

mine characteristics as “Science_Students”

in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa

from student

where status in “graduate”

Corresponding SQL statement:

Select name, gender, major, birth_place, birth_date, residence, phone#, gpa

from student

where status in ('Msc', 'MBA', 'PhD')

Class Characterization: An Example

Initial Relation

Prime Generalized Relation

Presentation of Generalized Results

Generalized relation:

o Relations where some or all attributes are generalized, with counts or other aggregation

values accumulated.

Cross tabulation:

o Mapping results into cross tabulation form (similar to contingency tables).
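As an illustration of the cross tabulation form, a two-attribute generalized relation can be pivoted into a table with row and column totals. The counts below are made up for the example, and `to_crosstab` is a hypothetical helper name; the sketch also assumes no attribute value is literally called "Total".

```python
from collections import defaultdict

# Hypothetical generalized relation: (gender, birth_region, count).
generalized = [
    ("M", "Canada", 16),
    ("M", "Foreign", 14),
    ("F", "Canada", 10),
    ("F", "Foreign", 22),
]


def to_crosstab(rows):
    """Pivot (row_value, column_value, count) tuples into a nested dict with totals."""
    table = defaultdict(lambda: defaultdict(int))
    for row_key, col_key, count in rows:
        table[row_key][col_key] += count
        table[row_key]["Total"] += count      # row total
        table["Total"][col_key] += count      # column total
        table["Total"]["Total"] += count      # grand total
    return {r: dict(cols) for r, cols in table.items()}


crosstab = to_crosstab(generalized)
```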

Visualization techniques:

o Pie charts, bar charts, curves, cubes, and other visual forms.

Quantitative characteristic rules:

o Mapping generalized results into characteristic rules with quantitative information

associated with them.

t-weight:

o An interestingness measure that describes the typicality of

each disjunct in the rule, or equivalently

each tuple in the corresponding generalized relation:

$$t\_weight = \frac{count(q_a)}{\sum_{i=1}^{n} count(q_i)}$$

n: the number of tuples for the target class in the generalized relation

q_1, ..., q_n: the tuples for the target class in the generalized relation

q_a: one of q_1, ..., q_n
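To make the t-weight concrete: it is the count of one generalized tuple divided by the total count of all target-class tuples in the generalized relation. The following is a small sketch with made-up counts; `t_weights` is a hypothetical helper name.

```python
def t_weights(counts):
    """Map each generalized tuple to count(q_a) / sum of count(q_i) over all tuples."""
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


# Hypothetical counts of target-class tuples per generalized value.
birth_region_counts = {"Canada": 16, "Foreign": 14}
weights = t_weights(birth_region_counts)
```

The weights sum to 1 across the disjuncts of a rule, so each one can be read as the share of the target class covered by that disjunct.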

Presentation—Generalized Relation

Presentation—Crosstab

Implementation by Cube Technology

Construct a data cube on-the-fly for the given data mining query

o Facilitate efficient drill-down analysis

o May increase the response time

o A balanced solution: precomputation of “subprime” relation

Use a predefined & precomputed data cube

o Construct a data cube beforehand

o Facilitate not only the attribute-oriented induction, but also attribute relevance analysis,

dicing, slicing, roll-up and drill-down

o Cost of cube computation and the nontrivial storage overhead

Characterization vs. OLAP

Similarity:

o Presentation of data summarization at multiple levels of abstraction.

o Interactive drilling, pivoting, slicing and dicing.

Differences:

o Automated desired level allocation.

o Dimension relevance analysis and ranking when there are many relevant dimensions.

o Sophisticated typing on dimensions and measures.

o Analytical characterization: data dispersion analysis.

Analytical Characterization/Attribute Relevance Analysis

In practice, data contain many attributes, but not all of them are important, so we have to find the important

attributes for analysis.

The following decisions are required:

Which dimensions should be included?

How high a level of generalization should be used?

Should the process be automatic or interactive?

The goal is to reduce the number of attributes so that patterns are easier to understand.

There are various ways to achieve this:

Statistical method for preprocessing data

o Filter out irrelevant or weakly relevant attributes

o Retain or rank the relevant attributes

Relevance related to dimensions and levels

Analytical characterization, analytical comparison

Procedure for Attribute Relevance Analysis

Data Collection

Analytical Generalization

o Use information gain analysis (e.g., entropy or other measures) to identify highly relevant

dimensions and levels.

Relevance Analysis

o Sort and select the most relevant dimensions and levels.

Attribute-oriented Induction for class description

o On selected dimension/level

OLAP operations (e.g. drilling, slicing) on relevance rules
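The relevance analysis step can be sketched with information gain as the measure. The rows, attribute names, and function names below are assumptions for illustration only; a real system would work on the generalized candidate relation for the target and contrasting classes.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attr, class_attr):
    """Entropy of the class minus the expected entropy after splitting on attr."""
    labels = [r[class_attr] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[class_attr])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical candidate relation with a class attribute `status`.
rows = [
    {"major": "Science", "gender": "M", "status": "graduate"},
    {"major": "Science", "gender": "F", "status": "graduate"},
    {"major": "Business", "gender": "M", "status": "undergraduate"},
    {"major": "Business", "gender": "F", "status": "undergraduate"},
]

# Sort the attributes from most to least relevant.
ranked = sorted(
    ["major", "gender"],
    key=lambda a: information_gain(rows, a, "status"),
    reverse=True,
)
```

In this toy data `major` separates the classes perfectly while `gender` carries no information, so `major` is ranked first and `gender` would be a candidate for filtering out.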

A quantitative relevance measure determines the classifying power of an attribute within a set of data.

Methods

Information gain (ID3)

Gain ratio (C4.5)

Gini index

χ² contingency table statistics

Uncertainty coefficient
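Two of the listed measures are simple enough to sketch directly: the Gini index of a set of class labels, and the Pearson χ² (chi-square) statistic of a contingency table. The function names and inputs are hypothetical, and the χ² sketch assumes every row and column total is nonzero.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows).

    Assumes all row and column totals are nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


balanced = gini(["a", "a", "b", "b"])          # evenly mixed classes
independent = chi_square([[10, 10], [10, 10]])  # attribute unrelated to class
dependent = chi_square([[20, 0], [0, 20]])      # attribute determines class
```

A larger χ² value or a larger reduction in the Gini index after splitting on an attribute suggests that the attribute is more relevant to the class.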
