An Introduction

Student Name: Riaz Ahmad
Program: MSIT (2005-2007)
Subject: Data Warehouse & Data Mining


Terms Concerned with My Research

- KDD process
- Data warehouse
- Materialized or indexed view
- Data mining
- Data mining techniques
- Classification
- Decision tree
- ID3 algorithm
- Objective
- Conclusion

KDD Process

[Figure: the KDD process pipeline — Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]


Steps of a KDD Process

1. Learning the application domain: relevant prior knowledge and the goals of the application.
2. Creating a target data set: data selection.
3. Data cleaning and preprocessing (may take 60% of the effort!).
4. Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation.
5. Choosing the data mining function: summarization, classification, regression, association, clustering.
6. Choosing the mining algorithm(s).
7. Data mining: search for patterns of interest.
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of the discovered knowledge.


Data Warehouse

A data warehouse is a subject-oriented, integrated (consolidated), non-volatile (read-only), time-variant collection of data designed to support management DSS needs.

Any read-only collection of accumulated historical data is called a data warehouse.

A data warehouse is a database specifically structured for query and analysis.

A data warehouse typically contains data representing the business history of an organization.


Materialized or Indexed View

A materialized view is a special type of summary table that is constructed by aggregating one or more columns of data from a single table, or from a series of tables that are joined together.

Materialized views can dramatically improve query performance and significantly decrease the load on the system.

They are needed more in a data warehouse environment than in an OLTP environment: you need them in huge databases, not in a table that has 3 records.


Data Mining

Data mining (DM) is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic, and the patterns discovered must be meaningful in that they lead to some advantage.

More simply, data mining refers to extracting or mining "knowledge" from large amounts of data.

There are many other terms for it, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.

Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use.


Data Mining Tasks or Techniques

- Classification
- Regression
- Segmentation
- Association
- Forecasting
- Text Analysis
- Advanced Data Exploration

Among these data mining techniques or tasks, we select only classification for this research work.


Classification

Classification is a form of data analysis that can be used to extract models describing important data classes or to predict future data trends.

Given a number of pre-specified classes, examine a new object, record, or individual and assign it, based on a model, to one of these classes.

Examples:

- Which credit applicants are low, medium, or high risk?
- Which hotel customers are likely or unlikely to return?
- Which residents are likely or unlikely to vote?


Classification Techniques

- Decision Tree based Methods
- Rule-based Methods
- Memory based Reasoning
- Neural Networks
- Genetic Algorithms
- Bayesian Networks

Among these classification techniques, we select only the decision tree.


Decision Tree

A decision tree is a flow-chart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes. The topmost node in a tree is the root node.

Equivalently, a decision tree is a tree in which each branch node represents a choice between a number of alternatives and each leaf node represents a decision. Decision trees are commonly used for gaining information for the purpose of decision-making. A decision tree starts with a root node on which users take actions; from this node, each node is split recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible decision scenario and its outcome.


Decision Tree


Generating Classification Rules from a Decision Tree

IF age = "<30" AND student = no THEN buys computer = no
IF age = "<30" AND student = yes THEN buys computer = yes
IF age = "30-40" THEN buys computer = yes
IF age = ">40" AND credit rating = excellent THEN buys computer = yes
IF age = ">40" AND credit rating = fair THEN buys computer = no
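The rules above can be transcribed directly into code. Here is a minimal Python sketch; the function name, argument names, and string encodings of the attribute values are my assumptions, not part of the original slides:

```python
def buys_computer(age, student=None, credit_rating=None):
    """Classify a customer with the rules read off the decision tree.

    `age` is one of "<30", "30-40", ">40"; `student` is "yes"/"no";
    `credit_rating` is "excellent"/"fair" (encodings assumed here).
    """
    if age == "<30":
        return "yes" if student == "yes" else "no"
    if age == "30-40":
        return "yes"
    if age == ">40":
        return "yes" if credit_rating == "excellent" else "no"
    raise ValueError(f"unknown age band: {age!r}")

# Example: a student under 30 is predicted to buy a computer.
print(buys_computer("<30", student="yes"))  # yes
```

Note how each root-to-leaf path of the tree becomes exactly one IF/THEN rule, which is why the rule set and the tree are equivalent.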


ID3 Algorithm

Originator of the ID3 Algorithm: ID3 and its successors were developed by Ross Quinlan, who devised the algorithm in the 1970s.

Implementation of ID3 Algorithm

ID3(Learning Set S, Attribute Set A, Attribute Values V) returns a decision tree:

Begin
  Load the learning set and create the decision tree root node 'rootNode';
  add learning set S to rootNode as its subset.
  Compute Entropy(rootNode.subset).
  If Entropy(rootNode.subset) == 0, then rootNode.subset consists of records that
    all share the same value of the categorical attribute; return a leaf node
    labelled with that attribute value.
  If Entropy(rootNode.subset) != 0, then compute the information gain of each
    attribute not yet used in splitting, and find the attribute A with the
    maximum Gain(S, A). Create child nodes of rootNode, one per value of A, and
    add them to rootNode in the decision tree.
  For each child of rootNode, apply ID3(S, A, V) recursively until a node with
    entropy 0 (a leaf node) is reached.
End ID3.
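The pseudocode above can be sketched in Python as follows. This is an illustrative minimal implementation, not the author's code: the dictionary-based record layout and the tuple encoding of internal nodes are my assumptions.

```python
import math
from collections import Counter

def entropy(rows, target):
    """Entropy of the class-label distribution of `target` in `rows`."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(rows, attr, target):
    """Information gain of splitting `rows` on attribute `attr`."""
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attrs, target):
    """Return a class label (leaf) or (attribute, {value: subtree}) node."""
    if entropy(rows, target) == 0:           # pure subset -> leaf
        return rows[0][target]
    if not attrs:                            # nothing left to split on -> majority leaf
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    branches = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        branches[value] = id3(subset, [a for a in attrs if a != best], target)
    return (best, branches)
```

The recursion mirrors the pseudocode: a zero-entropy subset becomes a leaf; otherwise the maximum-gain attribute splits the data and ID3 is applied to each child.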


Mathematical Formulae

The following mathematical formulae are used for the calculation of entropy and gain.

Entropy Equation: Entropy(S) = -Σᵢ pᵢ log₂ pᵢ, where pᵢ is the relative frequency of class i in S.

Information Gain Equation: Gain(S, A) = Entropy(S) - Σᵥ (|Sᵥ| / |S|) × Entropy(Sᵥ), summed over the values v of attribute A, where Sᵥ is the subset of S with A = v.


Classification Experiments

First we take a training dataset (S) for classification purposes.


Step (1): Entropy of the Original Dataset

The entropy calculation for the dataset is shown below.

First, count the records with the No class value: there are five (5), while there are nine (9) records with the Yes class value. The total number of records is fourteen (14).

Relative frequency of the No class: 5/14. Relative frequency of the Yes class: 9/14.

The entropy of dataset S is calculated with the entropy formula above:

Entropy(5, 9) = -5/14 log₂ 5/14 - 9/14 log₂ 9/14 = 0.9403
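This arithmetic can be checked with a couple of lines of Python (the variable names are mine):

```python
import math

# Entropy of the original dataset: 5 "No" records and 9 "Yes" records out of 14.
p_no, p_yes = 5 / 14, 9 / 14
entropy_s = -p_no * math.log2(p_no) - p_yes * math.log2(p_yes)
print(round(entropy_s, 4))  # 0.9403
```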


Step (2): Calculate the Gain of Each Input Attribute in the Dataset

The following information is required for the gain calculation of the outlook attribute.

To calculate an attribute's gain, first check the number of values the attribute takes; the dataset S is then partitioned on each of those values. The outlook attribute has three values (rain, overcast, and sunny), so three subsets of S are possible on the basis of the outlook attribute values:

I. The first subset (S1) contains five (5) records, for the rain value of outlook.
II. The second subset (S2) contains four (4) records, for the overcast value of outlook.
III. The third subset (S3) contains five (5) records, for the sunny value.

Proportionality measure for S1: 5/14. Proportionality measure for S2: 4/14. Proportionality measure for S3: 5/14.


Step (2): Calculate the Gain of Each Input Attribute in the Dataset (continued)

The following steps are required for the calculation of the gain value of each attribute in the original dataset (S):

1. Calculate the entropy of each subset (S1, S2, S3).
2. Calculate the attribute entropy (outlook).
3. Calculate the gain of the attribute (outlook).


Calculate the Entropy of Each Subset (S1, S2, S3)

The first subset S1 has three (3) Yes-class and two (2) No-class records, five (5) records in total:

Entropy(3, 2) = -3/5 log₂ 3/5 - 2/5 log₂ 2/5 = 0.971

The second subset S2 has four (4) Yes-class records and no No-class records, four (4) records in total:

Entropy(4, 0) = -4/4 log₂ 4/4 = 0

The third subset S3 has three (3) No-class and two (2) Yes-class records, five (5) records in total:

Entropy(3, 2) = -3/5 log₂ 3/5 - 2/5 log₂ 2/5 = 0.971


Calculate the Entropy of the Outlook Attribute

The following formula is used for the calculation:

Entropy(S1, S2, S3) = |S1|/|S| × Entropy(S1) + |S2|/|S| × Entropy(S2) + |S3|/|S| × Entropy(S3)

Entropy(5, 4, 5), i.e. Entropy(outlook) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.694


Calculate the Gain Value of the Outlook Attribute

The following formula is used for the calculation:

Gain(S, outlook) = Entropy(S) - Entropy(outlook)

Gain(S, outlook) = 0.9403 - 0.694 = 0.2463
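The whole chain of calculations can be reproduced in a short Python sketch (the helper function is mine). Note that exact arithmetic gives a gain of about 0.2467; the 0.2463 on the slide comes from first rounding the intermediates to 0.9403 and 0.694.

```python
import math

def entropy(*counts):
    """Entropy of a class distribution given as raw counts (zero counts ignored)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Subset entropies for the outlook attribute (counts taken from the slides).
e_rain     = entropy(3, 2)   # 3 yes, 2 no  -> ~0.971
e_overcast = entropy(4, 0)   # 4 yes, 0 no  -> 0
e_sunny    = entropy(2, 3)   # 2 yes, 3 no  -> ~0.971

# Weighted attribute entropy and the resulting information gain.
e_outlook = 5/14 * e_rain + 4/14 * e_overcast + 5/14 * e_sunny
gain_outlook = entropy(5, 9) - e_outlook
print(round(e_outlook, 3), round(gain_outlook, 4))  # 0.694 0.2467
```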

The above three steps are repeated for the three remaining input attributes. The following tables contain the gains of the attributes for the original set, the rain subset, and the sunny subset.


Attributes Along with Gain Information (original set, rain and sunny subsets)


Step (3): Select the Maximum-Gain Attribute for the Classification of Dataset (S)

In the table above, the attribute with the maximum gain value is outlook; its gain of 0.2463 is the highest value. The dataset can therefore be split into three different subsets on the basis of the outlook attribute values: rain, overcast, and sunny. Repeating the classification process on each subset generates the complete decision tree, or equivalently the classification rules.


Decision Tree


Decision Tree developed in my work


Objectives

1. To integrate the decision tree with the data warehouse or database.
2. To reduce the time needed to construct the decision tree at the root node.

The computational process for constructing the tree is highly complex and recursive in nature: it repeatedly calculates various values, i.e. the entropy of the dataset and the entropy and gain values of each input attribute. Here, I have pre-computed the results required at least for the selection and classification of the root node.

There is no change in the intelligence approach; only the required values are stored, instead of being calculated at run time in memory. This integration gives a jump start to the construction of the classification model, enhancing the overall efficiency of the model.
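The pre-computation idea can be sketched as follows. This is a minimal illustration of the principle, with my own function names and data layout, not the author's actual schema: the root-node statistics are computed once when the data is loaded and then persisted, so that choosing the root split becomes a cheap lookup instead of a full recomputation.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def precompute_root_gains(rows, attrs, target):
    """Compute, once at data-load time, the gain of every attribute over the
    full training set -- the statistics needed to pick the root split."""
    base = entropy([r[target] for r in rows])
    stats = {}
    for a in attrs:
        remainder = 0.0
        for v in {r[a] for r in rows}:
            subset = [r[target] for r in rows if r[a] == v]
            remainder += len(subset) / len(rows) * entropy(subset)
        stats[a] = base - remainder
    return stats  # persist this table alongside the materialized view

# Later, at tree-construction time, the root attribute is a cheap lookup:
#   root = max(stats, key=stats.get)
```

In the thesis setting these values would live in a table maintained next to the materialized view holding the training data, refreshed whenever new data is loaded.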


Pre-Calculated Structure


Conclusion

Classification algorithms are memory resident and calculate various statistical values at runtime. Storing these statistical values, even just those needed to select the best attribute at the root node, greatly improves the performance of the classification algorithms.

A materialized view holds the input training dataset, while the statistical values are stored in a dependent table. This table is updated according to the chosen refresh policy; modern data warehouses offer many methods for updating a materialized view.

Each time a new target class is introduced or new data is loaded, the table containing the statistical values is updated accordingly. The accuracy of the algorithm is in no way affected, in neither a positive nor a negative direction. The significant improvement is in efficiency, in the selection of the root-level attribute.