Attribute Oriented Induction
Class Concept Description: Characterization and Discrimination
Data entries can be associated with classes or concepts. For example, in the AllElectronics store, classes
of items for sale include computers and printers, and concepts of customers include bigSpenders and
budgetSpenders.
It can be useful to describe individual classes and concepts in summarized, concise, and yet precise
terms. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions
can be derived using
1. Data Characterization/Generalization/Summarization
2. Data Discrimination,
3. Both data characterization and discrimination.
1. Data Characterization/Generalization/Summarization
Data characterization is a summarization of the general characteristics or features of a target class of
data. The data corresponding to the user-specified class are typically collected by a database query.
For example, to study the characteristics of software products whose sales increased by 10% in the last
year, the data related to such products can be collected by executing an SQL query. There are several
methods for effective data summarization and characterization.
Several methods to achieve Data Characterization
I. Simple data summaries based on statistical measures and plots
II. The data cube-based OLAP roll-up operation to perform user-controlled data summarization
along a specified dimension.
III. An attribute-oriented induction technique (without step-by-step user interaction)
The output of data characterization can be presented in various forms. Examples include pie charts, bar
charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
The resulting descriptions can also be presented as generalized relations or in rule form (called
characteristic rules).
2. Data Discrimination
Data discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes. The target and contrasting classes can be
specified by the user, and the corresponding data objects retrieved through database queries.
For example, the user may like to compare the general features of software products whose sales
increased by 10% in the last year with those whose sales decreased by at least 30% during the same
period.
Discrimination descriptions should include comparative measures that help distinguish between the target
and contrasting classes. Discrimination descriptions expressed in rule form are referred to as
discriminant rules.
Attribute Oriented Induction for Data Characterization
Proposed in 1989 (KDD '89 workshop).
Not confined to categorical data nor particular measures.
How is it done?
o Collect the task-relevant data (initial relation) using a relational database query.
o Perform generalization by attribute removal or attribute generalization.
o Apply aggregation by merging identical, generalized tuples and accumulating their
respective counts.
o Interactive presentation with users.
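The generalize-then-merge steps above can be sketched in a few lines of Python. This is a minimal illustration, not the original algorithm's implementation: the concept hierarchy `MAJOR_TO_FIELD`, the GPA cut-offs in `gpa_level`, the function name `induce`, and the sample rows are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical concept hierarchy for "major" (illustrative values only).
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science", "Finance": "Business"}


def gpa_level(gpa):
    """Map a numeric GPA to a coarse label (the cut-offs are assumptions)."""
    if gpa >= 3.5:
        return "excellent"
    if gpa >= 3.0:
        return "very_good"
    return "good"


def induce(initial_relation):
    """Generalize each tuple, then merge identical tuples and count them."""
    counts = Counter()
    for row in initial_relation:
        # "name" is dropped (attribute removal); the rest are generalized.
        generalized = (
            row["gender"],
            MAJOR_TO_FIELD.get(row["major"], "Other"),
            gpa_level(row["gpa"]),
        )
        counts[generalized] += 1
    return counts


# Made-up initial relation standing in for the result of the database query.
students = [
    {"name": "A", "gender": "M", "major": "CS", "gpa": 3.67},
    {"name": "B", "gender": "M", "major": "CS", "gpa": 3.70},
    {"name": "C", "gender": "F", "major": "Physics", "gpa": 3.83},
    {"name": "D", "gender": "F", "major": "Finance", "gpa": 3.10},
]

for generalized_tuple, count in induce(students).items():
    print(generalized_tuple, count)
```

Each distinct generalized tuple appears once in the result, together with the number of original tuples it covers.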
Basic Principles of Attribute-Oriented Induction
Data focusing: select the task-relevant data, including dimensions; the result is the initial relation.
Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no
generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes.
Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of
generalization operators on A, then select an operator and generalize A.
Attribute-threshold control: limits the number of distinct values allowed per attribute, typically 2-8; user-specified or default.
Generalized relation threshold control: controls the size of the final relation/rule.
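As a rough sketch of how the removal and generalization principles interact with the attribute threshold, the decision for a single attribute might look like the following. The function name `plan_attribute` and its parameters are hypothetical, chosen only for this illustration.

```python
def plan_attribute(values, threshold, hierarchy=None, covered_by_other=False):
    """Decide what to do with one attribute of the initial relation.

    values           -- the attribute's values in the initial relation
    threshold        -- attribute threshold (maximum distinct values allowed)
    hierarchy        -- a generalization operator / concept hierarchy, if any
    covered_by_other -- True if higher-level concepts of this attribute are
                        already expressed by another attribute
    """
    distinct = len(set(values))
    if distinct <= threshold:
        return "keep"        # few enough distinct values already
    if hierarchy is None or covered_by_other:
        return "remove"      # attribute removal
    return "generalize"      # attribute generalization


print(plan_attribute(["Jim", "Scott", "Laura"], threshold=2))
print(plan_attribute(["CS", "Physics", "Finance"], threshold=2,
                     hierarchy={"CS": "Science"}))
```

In a full implementation, generalization would be repeated up the hierarchy until the number of distinct values falls within the threshold; this sketch shows only the per-attribute decision.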
Example:
DMQL: Describe general characteristics of graduate students in the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in “graduate”
Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in ('Msc', 'MBA', 'PhD')
Class Characterization: An Example
Initial Relation
Prime Generalized Relation
Presentation of Generalized Results
Generalized relation:
o Relations where some or all attributes are generalized, with counts or other aggregation
values accumulated.
Cross tabulation:
o Mapping results into cross tabulation form (similar to contingency tables).
Visualization techniques:
o Pie charts, bar charts, curves, cubes, and other visual forms.
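To illustrate the cross tabulation form, the sketch below turns the counts of a two-attribute generalized relation into a crosstab with row and column totals. The function name `crosstab` and the sample counts are made up for the example.

```python
def crosstab(counts):
    """Build a crosstab from {(row_value, col_value): count}."""
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    # Missing combinations are filled with a count of 0.
    table = {r: {c: counts.get((r, c), 0) for c in cols} for r in rows}
    for r in rows:
        table[r]["total"] = sum(table[r][c] for c in cols)
    table["total"] = {c: sum(table[r][c] for r in rows) for c in cols}
    table["total"]["total"] = sum(counts.values())
    return table


# Made-up generalized relation: (gender, field) -> count
generalized = {("M", "Science"): 2, ("F", "Science"): 1, ("F", "Business"): 3}
for row_label, cells in crosstab(generalized).items():
    print(row_label, cells)
```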
Quantitative characteristic rules:
o Mapping generalized result into characteristic rules with quantitative information
associated with it, e.g.,
t-weight:
o Interestingness measure that describes the typicality of
each disjunct in the rule
each tuple in the corresponding generalized relation
$t\_weight = \mathrm{count}(q_a) / \sum_{i=1}^{n} \mathrm{count}(q_i)$
n: number of tuples for the target class in the generalized relation
q1 … qn: tuples for the target class in the generalized relation
qa is in q1 … qn
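A small sketch of the t-weight computation, assuming the usual definition in which the t-weight of a generalized tuple is its count divided by the total count of all target-class tuples in the generalized relation. The function name `t_weights` and the sample counts are illustrative only.

```python
def t_weights(generalized_counts):
    """Return the t-weight of every generalized tuple.

    generalized_counts maps each generalized tuple q_i to count(q_i).
    """
    total = sum(generalized_counts.values())
    return {q: count / total for q, count in generalized_counts.items()}


# Made-up generalized relation for the target class.
counts = {("M", "Science"): 3, ("F", "Science"): 1}
for q, weight in t_weights(counts).items():
    print(q, f"{weight:.0%}")
```

The t-weights of all tuples in the relation sum to 1, so each can be read as the share of the target class that the tuple covers.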
Presentation—Generalized Relation
Presentation—Crosstab
Implementation by Cube Technology
Construct a data cube on-the-fly for the given data mining query
o Facilitate efficient drill-down analysis
o May increase the response time
o A balanced solution: precomputation of “subprime” relation
Use a predefined & precomputed data cube
o Construct a data cube beforehand
o Facilitate not only the attribute-oriented induction, but also attribute relevance analysis,
dicing, slicing, roll-up and drill-down
o Cost of cube computation and the nontrivial storage overhead
Characterization vs. OLAP
Similarity:
o Presentation of data summarization at multiple levels of abstraction.
o Interactive drilling, pivoting, slicing and dicing.
Differences:
o Automated desired level allocation.
o Dimension relevance analysis and ranking when there are many relevant dimensions.
o Sophisticated typing on dimensions and measures.
o Analytical characterization: data dispersion analysis.
Analytical Characterization/Attribute Relevance Analysis
In reality there are many attributes in the data, but not all of them are important. So, we have to find the important
attributes for analysis.
The following decisions are required:
Which dimensions should be included?
How high a level of generalization?
Automatic vs. interactive
Reduce the number of attributes; easy-to-understand patterns
There are various ways to achieve this, such as:
Statistical method for preprocessing data
o Filter out irrelevant or weakly relevant attributes
o Retain or rank the relevant attributes
Relevance related to dimensions and levels
Analytical characterization, analytical comparison
Procedure for Attribute Relevance Analysis
Data Collection
Analytical Generalization
o Use information gain analysis (e.g., entropy or other measures) to identify highly relevant
dimensions and levels.
Relevance Analysis
o Sort and select the most relevant dimensions and levels.
Attribute-oriented Induction for class description
o On selected dimension/level
OLAP operations (e.g. drilling, slicing) on relevance rules
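The analytical generalization and relevance analysis steps can be sketched with an entropy-based information gain. This is a minimal example under assumptions: the helper names (`entropy`, `information_gain`, `rank_attributes`), the `"class"` key, and the sample rows are hypothetical, and a real system would also apply a relevance threshold.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, class_key="class"):
    """Entropy of the class minus the expected entropy after splitting."""
    labels = [row[class_key] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[class_key])
    remainder = sum(len(p) / len(rows) * entropy(p)
                    for p in partitions.values())
    return entropy(labels) - remainder


def rank_attributes(rows, attributes, class_key="class"):
    """Sort attributes from most to least relevant."""
    return sorted(attributes,
                  key=lambda a: information_gain(rows, a, class_key),
                  reverse=True)


# Made-up data: target class "graduate" vs. contrasting class "undergraduate".
rows = [
    {"major": "Science", "gender": "M", "class": "graduate"},
    {"major": "Science", "gender": "F", "class": "graduate"},
    {"major": "Business", "gender": "M", "class": "undergraduate"},
    {"major": "Business", "gender": "F", "class": "undergraduate"},
]
print(rank_attributes(rows, ["gender", "major"]))
```

In this toy data, major separates the two classes perfectly while gender carries no information, so major is ranked first.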
Quantitative relevance measure determines the classifying power of an attribute within a set of data.
Methods
Information gain (ID3)
Gain ratio (C4.5)
Gini index
χ² contingency table statistics
Uncertainty coefficient
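As one example of the measures listed above, the Gini index can be used in the same way as information gain: an attribute is more relevant the more it reduces class impurity. The sketch below assumes the standard definition of Gini impurity; the function names and the `"class"` key are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_reduction(rows, attribute, class_key="class"):
    """Impurity of the class minus the weighted impurity after splitting."""
    labels = [row[class_key] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[class_key])
    weighted = sum(len(p) / len(rows) * gini(p) for p in partitions.values())
    return gini(labels) - weighted
```

A reduction close to zero suggests the attribute is weakly relevant and is a candidate for filtering out before induction.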