Zhongzhi Shi, Markus Stumptner, Yalei Hao, Gerald Quirchmayr

Advanced Computing Seminar: Data Mining and Its Industrial Applications — Chapter 4 — Inductive Learning
Knowledge and Software Engineering Lab, Advanced Computing Research Centre
School of Computer and Information Science, University of South Australia

Page 1

Advanced Computing Seminar: Data Mining and Its Industrial Applications
— Chapter 4 — Inductive Learning

Zhongzhi Shi, Markus Stumptner, Yalei Hao, Gerald Quirchmayr
Knowledge and Software Engineering Lab, Advanced Computing Research Centre
School of Computer and Information Science, University of South Australia

Page 2

Outline

Introduction
Machine learning
Version space and bias
Decision tree learning
Ripper algorithm
Summary

Page 3

Basic Concepts

Data: stored on some medium in a certain format

Information: meaning assigned to concrete data

Knowledge: refined and generalized from information

Page 4

Why Data Mining?

Rich data, poor knowledge: data must be turned into knowledge before it can support decision making (Data → Knowledge → Decision Making).

Typical forms of mined knowledge: patterns, trends, concepts, relations, models, associations, rules, sequences.

Application areas: e-commerce, resource distribution, trade, business intelligence, e-science, finance, economics, government, postal services, population, life cycle.

Page 5

Data Mining vs Knowledge Discovery

Data mining: extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.

Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Page 6

Data Mining: A KDD Process

Data mining is the core of the knowledge discovery process:

Databases → Data Cleaning → Data Integration → Data Warehouse → Selection (task-relevant data) → Data Mining → Pattern Evaluation

Page 7

Data Warehouse Process

Phases (from the diagram): Organization Readiness Assessment; Business Strategy Definition; Data Warehouse Architecture Definition; Data Warehouse Infrastructure Design; Design and Build; Data Exploitation; Implementation.

• Metadata management • Data access • Systems integration

Page 8

Macro Picture

Data Mining Approach to Data Warehouse Design

Desired star schema: metadata gathered per field.
• Attribute: width, type, NULL allowed, name, key
• Numeric fields: maximum, minimum, average, standard deviation
• Text fields: number of spaces, numerals used, average length

Output: designed star schema and mapping rules.

Page 9

Detailed picture (architecture diagram). Components: several information sources (InfoSource 1 … n), an Extractor, a Similarity Calculator, an Attribute Classifier, an Integrator, and a Translator; the desired star schema is the input, and the designed star schema together with mapping rules is the output.

Page 10

Knowledge Representation

Production system
Frame
Semantic networks
First-order logic
Ontology

Page 11

Production System

Rules have the form: IF (conditions) THEN (conclusions)

Example: IF (animal has wings) AND (animal can fly) THEN (animal is a bird)
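To make the rule format concrete, here is a minimal forward-chaining sketch in Python (my own illustration, not part of the original slides; the facts and the single rule are hypothetical):

# Minimal forward-chaining sketch for IF-THEN production rules.
# Facts are strings; a rule fires when all of its conditions are present.

rules = [
    {"if": {"has wings", "can fly"}, "then": "is a bird"},   # illustrative rule
]

def forward_chain(facts, rules):
    """Repeatedly fire rules whose conditions hold, adding their conclusions."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule["if"] <= facts and rule["then"] not in facts:
                facts.add(rule["then"])
                changed = True
    return facts

print(forward_chain({"has wings", "can fly"}, rules))
# -> {'has wings', 'can fly', 'is a bird'}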

Page 12

Production System

MYCIN

<rule>       = IF <antecedent> THEN <action> (ELSE <action>)
<antecedent> = AND <condition>
<condition>  = OR <condition> | <predicate> <associative-triple>
<associative-triple> = <attribute> <object> <value>
<action>     = <consequent> | <procedure>
<consequent> = <associative-triple> <certainty-factor>

Page 13

Frame Structure

FRAME: FRAME-NAME
  SLOT-NAME-1: ASPECT-11 ASPECT-VALUE-11
               ASPECT-12 ASPECT-VALUE-12
               ...
               ASPECT-1m ASPECT-VALUE-1m
  ......
  SLOT-NAME-n: ASPECT-n1 ASPECT-VALUE-n1
               ASPECT-n2 ASPECT-VALUE-n2
               ...
               ASPECT-nm ASPECT-VALUE-nm

Page 14

Semantic Networks

Nodes represent objects; arcs represent relationships.

Page 15

First Order Logic

Student(John)   Teacher(Markus)
Grandfather(x,z) :- Father(x,y), Father(y,z)
IF (animal has wings) AND (animal can fly) THEN (animal is a bird)

Page 16

Ontology

Semantic Web: ontology, OWL, ontology schema, description logic

Page 17

Outline

Introduction
Machine learning
Version space and bias
Decision tree learning
Ripper algorithm
Summary

Page 18

The Essence of Learning

Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently and more effectively the next time. [Simon 1983]

Machine learning is the study of how to make machines acquire new knowledge, new skills, and reorganize existing knowledge.

Page 19

The Essence of Learning

The environment supplies the source information to the learning system. The level and quality of the information will significantly affect the learning strategy.

(Diagram: the Environment supplies information to the Learning Element, which updates the Knowledge Base used by the Performance Element; feedback from performance returns to the learning element.)

Page 20

The Essence of Learning

The environment is the information source: databases, text, Web pages, images, video, spatial data.

Page 21

The Essence of Learning

The learning element uses this information to make improvements in an explicit knowledge base, and the performance element uses the knowledge base to perform its task.

Learning strategies: inductive learning, analogical learning, explanation-based learning, genetic algorithms, neural networks.

Page 22

Paradigms for Machine Learning

The inductive paradigm. The most widely studied method for symbolic learning is inducing a general concept description from a sequence of instances of the concept and known counterexamples. The task is to build a concept description from which all the previous positive instances can be rederived by universal instantiation, but none of the previous negative instances can be rederived by the same process.

The analogical paradigm. Analogical reasoning is a strategy of inference that allows the transfer of knowledge from a known area into another area with similar properties.

Page 23

Paradigms for Machine Learning

The analytic paradigm. These methods attempt to formulate a generalization after analyzing a few instances in terms of the system's knowledge. Mainly deductive rather than inductive mechanisms are used for such learning.

The genetic paradigm. Genetic algorithms have been inspired by a direct analogy to mutations in biological reproduction and Darwinian natural selection. In principle, genetic algorithms encode a parallel search through concept space, with each process attempting coarse-grain hill climbing.

The connectionist paradigm. Connectionist learning systems are also called "neural networks". Connectionist learning consists of readjusting weights in a fixed-topology network via specific learning algorithms.

Page 24

The Essence of Learning

The knowledge base contains predefined concepts, domain constraints, heuristic rules, and so on.

Issues: knowledge representation, knowledge consistency, knowledge redundancy.

Page 25

The Essence of Learning

The performance element. The learning element is trying to improve the action of the performance element. The performance element applies knowledge to solve problems and evaluate the learning effects.

Page 26

On Concept. The term "concept" is a universal notion that reflects general, abstract, and essential features. For example, "triangle", "animal", and "computer" are all concepts. Horse, tiger, bird, and so on are called examples (instances) of the concept "animal".

A concept has two aspects, intension and extension. Intension: the set of attributes that reflect the essential features of the concept. Extension: the set of examples that satisfy the definition of the concept.

Further examples: fruit, student.

Page 27

Concept Description. In general, a concept can be described by the concept name and a list of attribute-value pairs, that is:

(Concept-name (Attribute1 Value1) (Attribute2 Value2) … (AttributeN ValueN))

In addition, a concept description can be represented in first-order logic: each attribute is a predicate, and the concept name and attribute values can be viewed as arguments, so the description is a formula of predicate calculus.

Page 28

Attribute Types

Nominal attribute: takes on a finite, unordered set of mutually exclusive values.
Linear attribute
Structured attribute

Page 29

Attribute Types: Nominal attribute — takes on a finite, unordered set of mutually exclusive values.

Examples: • Color: red, green, blue • Traffic: airline, railway, ship

Page 30

Attribute Types: Linear attribute — values form an ordered set.

Examples: • Age: 1, 2, …, 100 • Temperature: 20, 21, … • Distance: 1 km, 2 km, …

Page 31

Attribute Types: Structured attribute — values are organized in a hierarchy (e.g. a tree).

Example:
  computer
    hardware
      CPU (computing, control)
      memory
    software

Page 32

Inductive Learning: from particular examples to a general conclusion, principle, or rule.

apple: eat; tomato: eat; banana: eat; … ⇒ fruit: eat

Page 33

Inductive Learning. Given: • Premise statements: facts, specific observations, and intermediate generalizations that provide information about some objects, phenomena, processes, and so on. • Tentative inductive assertion: an a priori hypothesis held about the objects in the premise statements. • Background knowledge: general and domain-specific concepts for interpreting the premises, and inference rules relevant to the task of inference.

Find: an inductive assertion (hypothesis) that strongly or weakly implies the premise statements in the context of the background knowledge and satisfies the preference criterion.

Page 34

Inductive Learning — simplest form: learn a function from examples.

f is the target function; an example is a pair (x, f(x)).

Problem: find a hypothesis h such that h ≈ f, given a training set of examples.

(This is a highly simplified model of real learning: it ignores prior knowledge and assumes the examples are given.)

Page 35

Inductive Learning Method: construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples), e.g. by curve fitting (see figure).

Page 36 (curve fitting, continued; figure only)

Page 37 (curve fitting, continued; figure only)

Page 38 (curve fitting, continued; figure only)

Page 39

Best-Hypothesis Search: on a positive example, generalize the hypothesis; on a negative example, specialize it. Drawbacks: must re-check previous examples and may need to backtrack.

Page 40

Outline

Introduction
Machine learning
Version space and bias
Decision tree learning
Ripper algorithm
Summary

Page 41

Hypothesis Space: the space of candidate concept descriptions.

Extension: the set of examples predicted to be satisfied by a hypothesis.

Bias: any preference for one hypothesis over another.

Page 42

Training Examples for EnjoySport

Sky    Temp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Warm  Normal    Strong  Warm   Same      Yes
Sunny  Warm  High      Strong  Warm   Same      Yes
Rainy  Cold  High      Strong  Warm   Change    No
Sunny  Warm  High      Strong  Cool   Change    Yes

What is the general concept?

Page 43

The more_general_than_or_equal_to relation

Definition: let hj and hk be boolean-valued functions defined over X. Then hj is more_general_than_or_equal_to hk (written hj ≥g hk) iff

(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]

In our case the most general hypothesis - that every day is a positive example - is represented by ⟨?, ?, ?, ?, ?, ?⟩, and the most specific possible hypothesis - that no day is a positive example - is represented by ⟨∅, ∅, ∅, ∅, ∅, ∅⟩.
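For conjunctive attribute hypotheses like the EnjoySport ones, the relation can be checked attribute by attribute. A small illustrative sketch (my own, with "?" standing for "any value" and None for the empty constraint ∅):

# '?' matches any value; None (the empty constraint) matches nothing.
def matches(h, x):
    """True if hypothesis h covers instance x (both are tuples of attribute values)."""
    return None not in h and all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    """hj >=_g hk: every attribute constraint of hj is at least as permissive as hk's."""
    return all(a == "?" or a == b or b is None for a, b in zip(hj, hk))

h_general  = ("?", "?", "?", "?", "?", "?")
h_specific = ("Sunny", "Warm", "?", "Strong", "?", "?")
print(more_general_or_equal(h_general, h_specific))   # True
print(more_general_or_equal(h_specific, h_general))   # False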

Page 44

Example of the Ordering of Hypotheses

Page 45

Version Space Search

Page 46

Version Space Example

Page 47

Representing Version Space

The general boundary G of version space VS(H,E) is the set of its maximally general members.

The specific boundary S of version space VS(H,E) is the set of its maximally specific members.

Every member of the version space lies between these boundaries:

VS(H,E) = { h ∈ H | (∃s ∈ S)(∃g ∈ G) (g ≥g h ≥g s) }

where x ≥g y means x is more general than or equal to y.

Page 48

Candidate-elimination algorithm

1. Initialize H to be the whole space. Thus the G set contains only the null (maximally general) description, and the S set contains only the description consistent with the first observed positive training instance.

2. For each subsequent instance i:
   IF i is a positive instance THEN
     Retain in G only those generalizations that match i.
     Generalize the elements of S as little as possible, so that they match i.

Page 49

Candidate-elimination algorithm

   ELSE IF i is a negative instance THEN
     Retain in S only those generalizations that do not match i.
     Specialize the elements of G as little as possible, so that they do not match i.

3. Repeat step 2 until G = S and this is a singleton set. When this occurs, H has collapsed to a single concept.

4. Output H.
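A compact, runnable sketch of the same loop for conjunctive hypotheses over nominal attributes, using the EnjoySport data from the earlier slide (an illustration only; it omits the removal of redundant members within S and G):

# Candidate elimination for conjunctive hypotheses over nominal attributes.
# A hypothesis is a tuple of constraints: an attribute value, '?' (any), or None (none).

DOMAINS = [  # possible values per attribute, taken from the EnjoySport table
    ("Sunny", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
    ("Strong",), ("Warm", "Cool"), ("Same", "Change"),
]

def matches(h, x):
    return None not in h and all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    return all(a == "?" or a == b or b is None for a, b in zip(hj, hk))

def min_generalize(s, x):
    """Minimally generalize specific hypothesis s so that it matches x."""
    return tuple(v if c is None else (c if c == v else "?") for c, v in zip(s, x))

def min_specializations(g, x):
    """Minimally specialize general hypothesis g so that it no longer matches x."""
    out = []
    for i, c in enumerate(g):
        if c == "?":
            for v in DOMAINS[i]:
                if v != x[i]:
                    out.append(g[:i] + (v,) + g[i + 1:])
    return out

def candidate_elimination(examples):
    n = len(examples[0][0])
    S = [(None,) * n]
    G = [("?",) * n]
    for x, positive in examples:
        if positive:
            G = [g for g in G if matches(g, x)]
            S = [min_generalize(s, x) if not matches(s, x) else s for s in S]
            S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
        else:
            S = [s for s in S if not matches(s, x)]
            new_G = []
            for g in G:
                if matches(g, x):
                    new_G.extend(h for h in min_specializations(g, x)
                                 if any(more_general_or_equal(h, s) for s in S))
                else:
                    new_G.append(g)
            G = new_G
    return S, G

data = [  # (instance, EnjoySport?)
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(data)
print("S =", S)   # expected: [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print("G =", G)   # expected: [('Sunny', '?', ...), ('?', 'Warm', ...)]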

Page 50

Converging Boundaries of the G and S sets

Page 51

Example Trace (1)

Page 52

Example Trace (2)

Page 53

Example Trace (3)

Page 54

Example Trace (4)

Page 55

How to Classify New Instances?

A new instance i is classified as positive if every hypothesis in the current version space classifies it as positive. Efficient test: i satisfies every member of S.

A new instance i is classified as negative if every hypothesis in the current version space classifies it as negative. Efficient test: i satisfies none of the members of G.
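Continuing the version-space sketch above (again an illustration, reusing matches, S, and G from the candidate-elimination code):

def classify(x, S, G):
    """Classify instance x using the boundary sets S and G of a version space."""
    if all(matches(s, x) for s in S):          # covered by every maximally specific hypothesis
        return "positive"
    if not any(matches(g, x) for g in G):      # covered by no maximally general hypothesis
        return "negative"
    return "ambiguous"                         # hypotheses disagree; a vote could be reported instead

print(classify(("Sunny", "Warm", "Normal", "Strong", "Cool", "Change"), S, G))  # positive
print(classify(("Rainy", "Cold", "Normal", "Light", "Warm", "Same"), S, G))     # negative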

Page 56

New Instances to be Classified

A: ⟨Sunny, Warm, Normal, Strong, Cool, Change⟩ — YES (satisfies all of S)
B: ⟨Rainy, Cold, Normal, Light, Warm, Same⟩ — NO (satisfies none of G)
C: ⟨Sunny, Warm, Normal, Light, Warm, Same⟩ — ambiguous (P_pos = 3/6)
D: ⟨Sunny, Cold, Normal, Strong, Warm, Same⟩ — ambiguous (P_pos = 2/6)

Page 57

Remarks on Version Space and Candidate-Elimination

The algorithm outputs the set of all hypotheses consistent with the training examples, provided that (i) there are no errors in the training data and (ii) there is some hypothesis in H that correctly describes the target concept. The target concept is exactly learned when the S and G boundary sets converge to a single identical hypothesis.

Applications: learning regularities in chemical mass spectroscopy; learning control rules for heuristic search.

Page 58

Drawbacks of Version Space

Assumes consistent training data; it is noise-sensitive.

Comment: though not practical for most real-world learning problems, version spaces provide a good deal of insight into the logical structure of hypothesis space.

Page 59

Version-Space Merging

(Diagram: two version spaces VS1 and VS2, with boundaries S1/G1 and S2/G2, are intersected to give the merged version space VS1∩2 with boundaries S1∩2 and G1∩2.)

Page 60

Version-Space Merging

Conceptually: each new piece of information defines a new version space, and the result is their intersection.

Practically: supports parallel processing and copes with ambiguous or inconsistent data and background domain theories (merge VS1 … VSn into VSM).

Page 61

IVSM Examples

Attribute hierarchies used in the examples:
  shape: any-shape → { polyhedron → { cube, pyramid, octoploid }, spheroid }
  size:  any-size → { large, small }

Page 62

IVSM Examples (table): for each training example, the table lists the instance-specific S and G boundaries and the resulting merged S and G, written as abbreviated [size, shape] hypotheses such as [S,C] (small cube), [L,?] (large, any shape), and [?,Po] (any size, polyhedron).

Page 63

Bias

Definition: any basis for choosing one generalization over another; any factor that influences the definition or selection of inductive hypotheses.

Representational bias: the language, its implementation, and the primitive terms.

Procedural (algorithmic) bias: the order of traversal of the states in the space defined by a representational bias.

Page 64

Bias

(Diagram components: Program, Training set, Search knowledge, Bias, Training examples, Hypothesis.)

Page 65

Bias Selection & Evaluation

Real-world domains have potentially hundreds of features and sources of data

Why is bias selection important? improve the predictive accuracy of the learner improve performance goals

Selection: static vs. dynamic Evaluation: basis for bias selection

online and empirical vs. offline and analytical

Page 66

Multi-Tiered Bias System

Bias shifting: bias selection occurs again after learning has begun; useful when the knowledge for bias selection is not available prior to learning but can be gathered during learning.

Multi-tiered bias: make embedded biases explicit; reduce the cost of system and knowledge engineering; flexible system design and conceptual simplicity.

Characterize learning as search within multiple tiers!

Page 67

Multi-Tiered Bias Search Space

(Diagram: the hypothesis space H is embedded in a representational bias space L(H) and a procedural bias space P(L(H)); these are in turn embedded in representational and procedural meta-bias spaces.)

Page 68

Outline

Introduction
Machine learning
Version space and bias
Decision tree learning
Ripper algorithm
Summary

Page 69

Decision Tree Learning: a brief history

1966  Hunt, Marin, Stone: CLS
1983  Quinlan: ID3
1986  Schlimmer, Fisher: ID4 (incremental learning)
1988  Utgoff: ID5
1993  Quinlan: C4.5, C5

Page 70

Play Tennis: training examples

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
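For the entropy, information-gain, and ID3 sketches that follow, the same table can be written as a small Python data set (an illustration; the names play_tennis and ATTRIBUTES are my own):

# PlayTennis training data (Quinlan / Mitchell), one dict per example.
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]

play_tennis = [
    {"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
]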

Page 71

CLS learning algorithm

Decision tree: each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node assigns a classification.

Decision trees are inherently disjunctive, since each branch leaving a decision node corresponds to a separate disjunctive case; they can therefore represent disjunctive concepts.

Page 72

CLS learning algorithm

The CLS algorithm starts with an empty decision tree and gradually refines it, by adding decision nodes, until the tree correctly classifies all the training instances. The algorithm operates over a set of training instances, C, as follows:

1. If all instances in C are positive, create a YES node and halt. If all instances in C are negative, create a NO node and halt.
2. Otherwise, select (using some heuristic criterion) an attribute A with values v1, …, vn and create a decision node for it.
3. Partition the training instances in C into subsets C1, …, Cn according to the values of A.
4. Apply the algorithm recursively to each of the sets Ci.

Page 73

ID3 Approach

ID3 builds a decision tree from training objects with known class labels and uses it to classify test objects; it ranks attributes with an information-gain measure.

The aim is a tree of minimal height, i.e. the least number of tests needed to classify an object.

Page 74

Decision Tree Representation

Internal node: test on some property (attribute); branch: an attribute value; leaf node: a classification.

Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances, e.g. for PlayTennis = Yes:

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)

Page 75

Decision Tree Example

Page 76

Appropriate problems for decision Trees

Instances are represented by attribute-value pairs

Target function has discrete output values Disjunctive hypothesis may be required Possibly noisy training data

data may contain errors data may contain missing attribute

values

Page 77

Learning of Decision Trees: Top-Down Induction of Decision Trees

Algorithm: the ID3 learning algorithm (Quinlan, 1986).
If all examples in E belong to the same class Cj, then label the leaf with Cj; else
  select the "best" decision attribute A with values v1, v2, …, vn for the next node,
  divide the training set E into E1, …, En according to the values v1, …, vn,
  recursively build subtrees T1, …, Tn for E1, …, En,
  and generate the decision tree T from them.

Page 78

Entropy

S: a sample of training examples; p+ (p-) is the proportion of positive (negative) examples in S.

Entropy(S) is the expected number of bits needed to encode the classification of an arbitrary member of S. Information theory: an optimal-length code assigns -log2 p bits to a message having probability p, so the expected number of bits to encode "+" or "-" for a random member of S is

Entropy(S) = - p+ log2 p+ - p- log2 p-

Generally, for c different classes:

Entropy(S) = - Σ (i = 1..c) pi log2 pi
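A short sketch of the computation (my own illustration, reusing the play_tennis list defined after the training table):

from collections import Counter
from math import log2

def entropy(examples, target="PlayTennis"):
    """Entropy of the class distribution in a list of example dicts."""
    counts = Counter(ex[target] for ex in examples)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

print(round(entropy(play_tennis), 3))   # 9+/5- examples -> about 0.940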

Page 79

Entropy

The entropy function relative to a boolean classification, as the proportion of positive examples varies between 0 and 1

entropy as a measure of impurity in a collection of examples

Page 80

Information Gain Search Heuristic

Gain(S, A): the expected reduction in entropy caused by partitioning the examples of S according to attribute A; a measure of the effectiveness of an attribute in classifying the training data.

Values(A): the possible values of attribute A. Sv: the subset of S for which attribute A has value v.

Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)

The best attribute has maximal Gain(S, A); the aim is to minimise the number of tests needed for classification.
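The gain itself, as a sketch built on the entropy function above (illustrative code, not c4.5's implementation):

def information_gain(examples, attribute, target="PlayTennis"):
    """Gain(S, A) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv)."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

for a in ATTRIBUTES:
    print(a, round(information_gain(play_tennis, a), 3))
# expected: Outlook 0.246, Temperature 0.029, Humidity 0.151, Wind 0.048 (as on the next slide)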

Page 81

Play Tennis: Information Gain

Values(Wind) = {Weak, Strong}
S = [9+, 5-], Entropy(S) = 0.940
S_Weak = [6+, 2-], Entropy(S_Weak) = 0.811
S_Strong = [3+, 3-], Entropy(S_Strong) = 1.0

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
             = 0.940 - (8/14) 0.811 - (6/14) 1.0 = 0.048

Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Temperature) = 0.029

Page 82

Entropy and Information Gain

S contains si tuples of class Ci, for i = 1, …, m.

Information (entropy) required to classify an arbitrary tuple:

I(s1, s2, …, sm) = - Σ (i = 1..m) (si / s) log2 (si / s)

Entropy of attribute A with values {a1, a2, …, av}:

E(A) = Σ (j = 1..v) ((s1j + … + smj) / s) · I(s1j, …, smj)

Information gained by branching on attribute A:

Gain(A) = I(s1, s2, …, sm) - E(A)

Page 83

The ID3 Algorithm

function ID3 (R: a set of non-categorical attributes,
              C: the categorical attribute,
              S: a training set) returns a decision tree;
begin
  If S is empty, return a single node with value Failure;
  If S consists of records all with the same value for the categorical
    attribute, return a single node with that value;
  If R is empty, then return a single node with as value the most frequent
    of the values of the categorical attribute found in records of S
    [note that then there will be errors, that is, records that will be
    improperly classified];

Page 84

The ID3 Algorithm (continued)

  Let D be the attribute with largest Gain(D, S) among the attributes in R;
  Let {dj | j = 1, 2, …, m} be the values of attribute D;
  Let {Sj | j = 1, 2, …, m} be the subsets of S consisting respectively of
    records with value dj for attribute D;
  Return a tree with root labeled D and arcs labeled d1, d2, …, dm going
    respectively to the trees ID3(R-{D}, C, S1), ID3(R-{D}, C, S2), …,
    ID3(R-{D}, C, Sm);
end ID3;
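The same recursion as a runnable Python sketch (an illustration, not Quinlan's code; it reuses play_tennis, ATTRIBUTES, and information_gain from the earlier sketches and represents a tree as nested dicts):

from collections import Counter

def id3(examples, attributes, target="PlayTennis"):
    """Return a decision tree: either a class label (leaf) or {attribute: {value: subtree}}."""
    if not examples:
        return "Failure"
    classes = [ex[target] for ex in examples]
    if len(set(classes)) == 1:                      # all records share one class
        return classes[0]
    if not attributes:                              # no attributes left: majority class
        return Counter(classes).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)
    return tree

print(id3(play_tennis, ATTRIBUTES))
# Expected root: Outlook, with Overcast -> Yes, Sunny -> split on Humidity, Rain -> split on Wind.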

Page 85

C4.5

C4.5 is a program that creates a decision tree based on a set of labeled input data. This decision tree can then be tested against unseen labeled test data to quantify how well it generalizes.

The software for C4.5 can be obtained with Quinlan's book; a wide variety of training and test data is available, some provided by Quinlan.

Quinlan, J. R. now works at the RuleQuest Research company; See5/C5.0 has been designed to operate on large databases and incorporates innovations such as boosting.

Page 86

C4.5 is a software extension of the basic ID3 algorithm designed by Quinlan to address the following issues not dealt with by ID3:

• Avoiding overfitting the data: determining how deeply to grow a decision tree; reduced-error pruning; rule post-pruning
• Handling continuous attributes (e.g., temperature)
• Choosing an appropriate attribute-selection measure
• Handling training data with missing attribute values
• Handling attributes with differing costs
• Improving computational efficiency

Page 87

Running c4.5 (on cunix.columbia.edu):

~amr2104/c4.5/bin/c4.5 -u -f filestem

c4.5 expects to find three files: filestem.names, filestem.data, filestem.test

Page 88

File Format: .names

The file begins with a comma-separated list of classes ending with a period, followed by a blank line. E.g.: >50K, <=50K.

The remaining lines have the following format (note the end-of-line period):
attribute: {ignore | discrete n | continuous | a comma-separated list of values}.

Page 89

Example: census.names

>50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, etc.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, etc.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, etc.
occupation: Tech-support, Craft-repair, Other-service, Sales, etc.
relationship: Wife, Own-child, Husband, Not-in-family, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, etc.

Page 90

File Format: .data, .test

Each line in these data files is a comma-separated list of attribute values ending with a class label followed by a period. The attributes must be in the same order as described in the .names file. Unavailable values can be entered as '?'.

When creating test sets, make sure that you remove those data points from the training data.

Page 91

Example: adult.test

25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.
38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.
28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.
44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States, >50K.
18, ?, 103497, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K.
34, Private, 198693, 10th, 6, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K.
29, ?, 227026, HS-grad, 9, Never-married, ?, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K.
63, Self-emp-not-inc, 104626, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3103, 0, 32, United-States, >50K.
24, Private, 369667, Some-college, 10, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K.
55, Private, 104996, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 10, United-States, <=50K.
65, Private, 184454, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 6418, 0, 40, United-States, >50K.
36, Federal-gov, 212465, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K.

Page 92

c4.5 Output The decision tree proper.

(weighted training examples/weighted training error)

Tables of training error and testing error Confusion matrix

You'll want to pipe the output of c4.5 to a text file for later viewing, e.g.: c4.5 -u -f filestem > filestem.results

Page 93

Example output

capital-gain > 6849 : >50K (203.0/6.2)
|   capital-gain <= 6849 :
|   |   capital-gain > 6514 : <=50K (7.0/1.3)
|   |   capital-gain <= 6514 :
|   |   |   marital-status = Married-civ-spouse: >50K (18.0/1.3)
|   |   |   marital-status = Divorced: <=50K (2.0/1.0)
|   |   |   marital-status = Never-married: >50K (0.0)
|   |   |   marital-status = Separated: >50K (0.0)
|   |   |   marital-status = Widowed: >50K (0.0)
|   |   |   marital-status = Married-spouse-absent: >50K (0.0)
|   |   |   marital-status = Married-AF-spouse: >50K (0.0)

Tree saved

Page 94

Example output

Evaluation on training data (4660 items):

        Before Pruning        After Pruning
        Size   Errors         Size   Errors       Estimate
        1692   366 ( 7.9%)      92   659 (14.1%)  (16.0%)   <<

Evaluation on test data (2376 items):

        Before Pruning        After Pruning
        Size   Errors         Size   Errors       Estimate
        1692   421 (17.7%)      92   354 (14.9%)  (16.0%)   <<

Confusion matrix:
  (a)    (b)   <- classified as
 ----   ----
  328    251   (a): class >50K
  103   1694   (b): class <=50K

Page 95

k-fold Cross Validation

Start with one large data set and, using a script, randomly divide it into k sets. At each iteration, use k-1 sets to train the decision tree and the remaining set to test the model. Repeat this k times and take the average testing error.

The average error describes how well the learning algorithm can be expected to perform on this data set.
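A minimal sketch of the procedure (my own illustration; train_and_test stands in for whatever learner and error measure are actually used):

import random

def k_fold_error(data, k, train_and_test):
    """Average test error over k folds; train_and_test(train, test) returns an error rate."""
    data = data[:]                                    # copy so the caller's order is untouched
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]            # k roughly equal parts
    errors = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        errors.append(train_and_test(train, test))
    return sum(errors) / k

# Example (hypothetical learner): 10-fold CV over the play_tennis data.
# avg_err = k_fold_error(play_tennis, 10, my_train_and_test)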

Page 96

Outline

Introduction
Machine learning
Version space and bias
Decision tree learning
Ripper algorithm
Summary

Page 97

Inductive Learning

Inductive learning is "learning from examples": an inductive learning unit maps training examples to decision rules.

Training examples:  data-case 1 : decision i1,  data-case 2 : decision i2,  …,  data-case n : decision in
Decision rules:     pattern 1 → decision j1,  pattern 2 → decision j2,  …,  pattern n → decision jn

Page 98

Ripper (Repeated Incremental Pruning to Produce Error Reduction) is a rule-learning algorithm proposed by Cohen in 1995. Ripper consists of two phases: the first determines an initial rule set, and the second performs post-processing rule optimization.

Page 99

Ripper is a separate-and-conquer rule-learning algorithm. First the training data are divided into a growing set and a pruning set. The algorithm then generates a rule set in a greedy fashion, one rule at a time. While generating a rule, Ripper searches for the most valuable rule for the current growing set in a rule space that can be defined in BNF. Immediately after a rule is extracted on the growing set, it is pruned on the pruning set. After pruning, the examples covered by that rule in the training set (growing and pruning sets) are deleted. The remaining training data are re-partitioned after each rule is learned, in order to help stabilize any problems caused by a "bad split". This process is repeated until the termination conditions are satisfied.

Page 100

Ripper

procedure Rule_Generating(Pos, Neg)
begin
  Ruleset := {}
  while Pos ≠ {} do
    /* grow and prune a new rule */
    split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
    Rule := GrowRule(GrowPos, GrowNeg)
    Rule := PruneRule(Rule, PrunePos, PruneNeg)
    if the termination conditions are satisfied then
      return Ruleset
    else
      add Rule to Ruleset
      remove examples covered by Rule from (Pos, Neg)
    endif
  endwhile
  return Ruleset
end
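The same loop as a Python skeleton (an illustration only; grow_rule, prune_rule, covers, and should_stop are stand-ins for Ripper's actual growing, pruning, coverage, and description-length tests described on the surrounding slides):

import random

def rule_generating(pos, neg, grow_rule, prune_rule, covers, should_stop, grow_frac=2/3):
    """Separate-and-conquer loop: grow a rule on the growing set, prune it on the
    pruning set, then remove the examples it covers and repeat while positives remain."""
    ruleset = []
    while pos:
        random.shuffle(pos); random.shuffle(neg)          # re-partition each iteration
        cut_p, cut_n = int(len(pos) * grow_frac), int(len(neg) * grow_frac)
        rule = grow_rule(pos[:cut_p], neg[:cut_n])
        rule = prune_rule(rule, pos[cut_p:], neg[cut_n:])
        if should_stop(ruleset, rule, pos, neg):          # e.g. description-length check
            break
        ruleset.append(rule)
        pos = [x for x in pos if not covers(rule, x)]
        neg = [x for x in neg if not covers(rule, x)]
    return ruleset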

Page 101

Ripper

After each rule is added to the rule set, the total description length (an integer) of the rule set is computed; the description length measures the complexity and accuracy of a rule set. The termination conditions are satisfied when no positive examples are left, or when the description length of the current rule set exceeds the user-specified threshold.

Page 102

Ripper: post-processing rule optimization

Ripper uses post-pruning techniques to optimize the rule set; this optimization is performed on any remaining positive examples. Re-optimizing the resulting rule set once more is called RIPPER2, and the general case of re-optimizing k times is called RIPPERk.

Page 103

Outline

Introduction
Machine learning
Version space and bias
Decision tree learning
Ripper algorithm
Summary

Page 104

Summary

Inductive learning is an important approach for data mining.
Version spaces can be used to explain generalization and specialization.
ID3 and C4.5 learn decision trees.
Ripper algorithms generate efficient rules.

Page 105

References

Zhongzhi Shi. Principles of Machine Learning. International Academic Publishers, 1992.
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
Zhongzhi Shi. Knowledge Discovery. Tsinghua University Press, 2002.
H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1998.
R. S. Michalski. A theory and methodology of inductive learning. In Michalski et al., editors, Machine Learning: An Artificial Intelligence Approach, Vol. 1, Morgan Kaufmann, 1983.
T. M. Mitchell. Version spaces: A candidate elimination approach to rule learning. IJCAI'77, Cambridge, MA.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.