National Centre for Agricultural Economics and Policy Research (NCAP), New Delhi
Rajni [email protected]
Expert System: Major Constraint
• Expert systems are automated systems that provide knowledge like human experts and are useful where expert advice is hard to obtain
• Developing an expert system is challenging because it is too hard to build a complete rule base that can work like an expert:
– Non-availability of experts
– Scarcity of their time
– Cost of their time
DM: An aid for development of Expert System
• DM: Extraction of potentially useful, meaningful and novel patterns from the data using suitable algorithms
• Data mining tools and techniques can be used to extract rules automatically
• This reduces the time required for interaction with domain experts
• Rules should never be used without validation by experts
Introduction
Importance of Rules
• Rules can be extracted directly from decision tables
• Or through DM techniques:
– DT (Decision Trees)
– RS (Rough Sets)
– Fuzzy logic
– ANN (Artificial Neural Networks)
Decision Tree
• Used to represent a sequence of decisions to arrive at a particular result
[Figure: an example decision tree branching on Colour (White / Red) with Y/N outcomes]
Decision Tree (ID3)
• A decision tree is a set of nodes and leaves, where each node tests the value of an attribute and branches on all its possible values
• ID3 is the first and most popular decision-tree building algorithm
• A statistical property called information gain is used to decide which attribute is best for the node under consideration
• Greedy algorithm: all remaining attributes are retained for computation until the examples reach a leaf node, but the attribute used at a parent node is eliminated from consideration in its subtree
CLS (Concept Learning System)
Step 1:
– If all instances in C are positive, then create a YES node and halt.
– If all instances in C are negative, create a NO node and halt.
– Otherwise select a feature F with values v1, ..., vn and create a decision node.
Step 2: Partition the training instances in C into subsets C1, C2, ..., Cn according to the values of F.
Step 3: Apply the algorithm recursively to each of the sets Ci.
CLS contd.
Note that the trainer (the expert) decides which feature to select, as in the sketch below.
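To make the steps concrete, here is a minimal Python sketch of the CLS recursion. The data layout (a list of dicts with a boolean "label" key) and the expert-supplied choose_feature callback are illustrative assumptions, not part of the original algorithm description.

def cls(instances, choose_feature):
    """Build a tree: a 'YES'/'NO' leaf, or a (feature, {value: subtree}) node."""
    labels = {inst["label"] for inst in instances}
    if labels == {True}:
        return "YES"                      # Step 1: all instances positive
    if labels == {False}:
        return "NO"                       # Step 1: all instances negative
    feature = choose_feature(instances)   # the expert selects feature F
    subsets = {}
    for inst in instances:                # Step 2: partition C by values of F
        subsets.setdefault(inst[feature], []).append(inst)
    return (feature, {v: cls(sub, choose_feature)   # Step 3: recurse on each Ci
                      for v, sub in subsets.items()})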
ID3 Algorithm
ID3 improves on CLS by adding a feature selection heuristic. ID3 searches through the attributes of the training instances and extracts the attribute that best separates the given examples. If the attribute perfectly classifies the training sets then ID3 stops; otherwise it recursively operates on the n partitioned subsets (where n is the number of possible values of the attribute) to get their "best" attribute.
The algorithm uses a greedy search: it picks the best attribute and never looks back to reconsider earlier choices.
Data Requirements for ID3
• Attribute-value description: the same attributes must describe each example and have a fixed number of values
• Predefined classes: an example's classes must be defined in advance; they are not learned by ID3
• Discrete classes: classes must be sharply delineated
• Sufficient examples: enough training cases are needed to distinguish valid patterns from chance occurrences
Attribute Selection in ID3
How does ID3 decide which attribute is the best?
– Information gain, a statistical property, is used
• Information gain measures how well a given attribute separates training examples into targeted classes
• The attribute with the highest information gain is selected
• Entropy measures the amount of information (impurity) in a set of examples
Entropy
Given a collection S of examples with c classes (outcomes):
Entropy(S) = Σ −p(I) log2 p(I)
where p(I) is the proportion of S belonging to class I, and the sum Σ runs over all c classes. Log2 is log base 2. Note that S is not an attribute but the entire sample set.
Example 1: Entropy
If S is a collection of 14 examples with 9 YES and 5 NO examples then
Entropy(S) = - (9/14) Log2 (9/14) - (5/14) Log2 (5/14) = 0.940
Notice that entropy is 0 if all members of S belong to the same class (the data is perfectly classified). For a two-class problem, entropy ranges from 0 ("perfectly classified") to 1 ("totally random").
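A small Python helper reproduces this calculation; the counts-based interface is an illustrative choice of my own, not from the source.

from math import log2

def entropy(class_counts):
    """Entropy of a sample, given the number of examples in each class."""
    total = sum(class_counts)
    return -sum((n / total) * log2(n / total) for n in class_counts if n > 0)

print(round(entropy([9, 5]), 3))   # 0.94, matching Example 1's 0.940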
Information Gain
Gain(S, A), the information gain of example set S on attribute A, is defined as
Gain(S, A) = Entropy(S) − Σ ((|Sv| / |S|) * Entropy(Sv))
where:
– the sum Σ runs over each value v of all possible values of attribute A
– Sv = subset of S for which attribute A has value v
– |Sv| = number of elements in Sv
– |S| = number of elements in S
Example 2: Information Gain
Suppose S is a set of 14 examples in which one of the attributes is wind speed. The values of Wind can be Weak or Strong. The classification of these 14 examples is 9 YES and 5 NO. For attribute Wind, suppose there are 8 occurrences of Wind = Weak and 6 occurrences of Wind = Strong. For Wind = Weak, 6 of the examples are YES and 2 are NO. For Wind = Strong, 3 are YES and 3 are NO. Therefore
Entropy(Sweak) = − (6/8)*log2(6/8) − (2/8)*log2(2/8) = 0.811
Entropy(Sstrong) = − (3/6)*log2(3/6) − (3/6)*log2(3/6) = 1.00
Gain(S, Wind) = Entropy(S) − (8/14)*Entropy(Sweak) − (6/14)*Entropy(Sstrong)
             = 0.940 − (8/14)*0.811 − (6/14)*1.00 = 0.048
For each attribute, the gain is calculated and the highest gain is used in the decision node.
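Reusing the entropy helper from the previous sketch, the gain computation follows the formula directly; the list-of-class-counts interface is again an illustrative assumption.

def gain(class_counts, partitions):
    """Information gain: entropy of S minus the weighted entropy of its subsets Sv."""
    total = sum(class_counts)
    weighted = sum(sum(part) / total * entropy(part) for part in partitions)
    return entropy(class_counts) - weighted

# Example 2: S has 9 YES / 5 NO; Wind=Weak gives 6/2, Wind=Strong gives 3/3
print(round(gain([9, 5], [[6, 2], [3, 3]]), 3))   # 0.048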
Example 3: ID3
Suppose we want ID3 to decide whether the weather is amenable to playing baseball. Over the course of 2 weeks, data is collected to help ID3 build a decision tree (see table 1). The target classification is "should we play baseball?", which can be yes or no.
The weather attributes are outlook, temperature, humidity, and wind speed. They can have the following values:
outlook = { sunny, overcast, rain }
temperature = {hot, mild, cool }
humidity = { high, normal }
wind = {weak, strong }
Day  Outlook   Temperature  Humidity  Wind    Play ball
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
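Putting entropy, gain, and the greedy recursion together, here is a compact, self-contained Python sketch of ID3 run on table 1. The (attribute, {value: subtree}) tree encoding and the column names are illustrative choices of my own; entropy is restated over raw labels so the block runs on its own.

from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def id3(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # all examples in one class: leaf
        return labels[0]
    if not attrs:                        # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    def gain(a):
        parts = {}
        for r in rows:
            parts.setdefault(r[a], []).append(r[target])
        return entropy(labels) - sum(len(p) / len(rows) * entropy(p)
                                     for p in parts.values())
    best = max(attrs, key=gain)          # greedy choice, never reconsidered
    rest = [a for a in attrs if a != best]
    return (best, {v: id3([r for r in rows if r[best] == v], rest, target)
                   for v in sorted({r[best] for r in rows})})

cols = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
data = [dict(zip(cols, row.split())) for row in [
    "Sunny Hot High Weak No",          "Sunny Hot High Strong No",
    "Overcast Hot High Weak Yes",      "Rain Mild High Weak Yes",
    "Rain Cool Normal Weak Yes",       "Rain Cool Normal Strong No",
    "Overcast Cool Normal Strong Yes", "Sunny Mild High Weak No",
    "Sunny Cool Normal Weak Yes",      "Rain Mild Normal Weak Yes",
    "Sunny Mild Normal Strong Yes",    "Overcast Mild High Strong Yes",
    "Overcast Hot Normal Weak Yes",    "Rain Mild High Strong No"]]

print(id3(data, cols[:-1], "Play"))
# Outlook comes out as the root (highest gain); Overcast is a pure Yes leaf,
# while the Sunny and Rain branches split further on Humidity and Wind.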
Real Example: Summary of Dataset
• NSSO carried out a nationwide survey on cultivation practices as a part of its 54th round, from Jan-June 1998
• The enquiry was carried out in the rural areas of India through a household survey, to ascertain the problems, the utilization of facilities, and the level of adoption in farms of different sizes in the country
• Focus on the spread of improved agricultural technology
• Data is extracted for the Haryana State
Dataset Summary
• 39 independent variables from Haryana
• 1 dependent variable (ifpesticide) with values 1, 2
– 1: adoption of PIF
– 2: non-adoption of PIF
• 36 nominal, 4 real-valued attributes
• 1832 cases
– 48% adopting PIF
– 52% not adopting
Why Haryana?
• Reasonable dataset size (1832 cases)
• The class variable (ifpesticide) is somewhat uniformly distributed in Haryana
Distribution of Class Variable (% of cases adopting PIF, by State/UT)
Lakshadweep   0    Arunachal    35
Mizoram       1    Bihar        40
Sikkim        2    Maharashtra  43
Dadra         8    A&N          43
Nagaland     11    Haryana      48
Himachal     14    Goa          52
Meghalaya    16    Karnataka    55
Kerala       20    Tripura      58
UP           23    Gujarat      65
Rajasthan    24    Punjab       71
MP           27    Daman & Diu  74
J&K          29    Andhra       79
Manipur      31    TN           80
Assam        32    WB           82
Orissa       32    Pondicherry  88
Delhi        33    Chandigarh  100
Variables in Haryana-Farmers Dataset (NSSO, 54th round)
# Attribute Name Value set # Attribute Name Value set
1 SnoCropGroup {1,2,3,4,5} 21 jointforest {1,2}
2 CropCode {1,2,3,4,5,6,7,8,9,11,99} 22 managetanks {1,2}
3 Season {1,2} 23 iftreepatta {1,2}
4 NumAreaSown {1,2,3,4,5,6} 24 iftimberright {1,2}
5 iftractor {1,2,3} 25 howoftenright {1,2,9}
6 ifepump {1,2,3} 26 schhhdcpr {1,2,9}
7 ifoilpump {1,2,3} 27 anymemberpreventcpr {1,2}
8 ifmanure {1,2,3} 28 LandWithRightOfSale {1,2,3,4,5}
9 iffertilizers {1,2,3} 29 TotLandOwnedNum {1,2,3,4,5}
10 ifimproveseed {0,1,2,3} 30 LandPossessNum {1,2,3,4,5}
11 seedtype {1,2,3,4,9} 31 NASNum {1,2,3,4,5}
12 typemanure {1,2,3,9} 32 ifsoiltested {1,2}
13 ifhired {1,2,9} 33 ifsoilRecfollow {1,2,9}
14 ifirrigate {1,2} 34 ifhhdtubewell {1,2}
15 ifhiredirri {1,2,9} 35 iftubewellunused {1,2,9}
16 ifweedicide {1,2} 36 irrigatesource {1,2,9}
17 ifunusdelectricpump {1,2,9} 37 ifdieselpump {1,2}
18 harvested {1,2,3} 38 ifunuseddieselpump {1,2,9}
19 woodpurpose {1,2,3,4,5,9} 39 ifelectricpump {1,2}
20 iflivestock {1,2} 40 ifpesticide {1,2}
Hypothetical Farmer Dataset
ID Crop Seedtype AreaSown Iffertilizer Ifpesticide
X1 wheat 2 small no yes
X2 wheat 3 medium yes no
X3 wheat 1 medium yes no
X4 rice 1 medium no yes
X5 fodder 2 large no yes
X6 rice 3 large no no
X7 rice 2 large no no
X8 wheat 1 small yes no
Decision Tree (DT) using RDT Architecture for Hypothetical Farmer Data
[Decision tree, redrawn as text (reconstructed from the dataset above):]
Seedtype = 1 → Crop: rice → yes; fodder → ?; wheat → no
Seedtype = 2 → Crop: rice → no; fodder → yes; wheat → yes
Seedtype = 3 → no
• The tree is simple to comprehend
• The tree can be easily traversed to generate rules, as in the sketch below
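A minimal sketch of that traversal, assuming the (attribute, {value: subtree}) tree encoding used in the ID3 example earlier; the function name is illustrative.

def tree_to_rules(tree, conditions=()):
    """Walk a (attribute, {value: subtree}) tree, emitting one rule per leaf."""
    if not isinstance(tree, tuple):               # leaf: a class label
        antecedent = " and ".join(f"{a}={v}" for a, v in conditions)
        return [f"If {antecedent} then class={tree}"]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules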
Description of the codes of the attributes in the Approximate Core based RDT model
Attribute name # Code specifications
Ifweedicide   16  yes-1, no-2
CropCode       2  paddy-1, wheat-2, other cereals-3, pulses-4, oil seeds-5, mixed crop-6, sugarcane-7, vegetables-8, fodder-9, fruits & nuts-10, other cash crops-11, others-99
Iffertilizer   9  entirely-1, partly-2, none-3
Ifpesticide   40  yes-1, no-2
Approximate Core based RDT model for Farmers Adopting Plant Protection Chemicals
[Decision tree, redrawn as text (reconstructed from the rules below; codes as in the legend):]
Ifweedicide = 1 (yes) → pesticide = 1
Ifweedicide = 2 (no) → CropCode:
  1 (paddy) → 1
  2, 3, 4, 5, 6, 7, 9, 99 (wheat, other cereals, pulses, oil seeds, mixed crop, sugarcane, fodder, others) → 2
  8 (vegetables) → Iffertilizer: 1 → 1; 2 → 2; 3 → 1
  11 (other cash crops) → Iffertilizer: 1 → 1; 2 → 1; 3 → 2
Ifweedicide: yes-1, no-2
Ifpesticide: yes-1, no-2
CropCode: paddy-1, wheat-2, other cereals-3, pulses-4, oil seeds-5, mixed crop-6, sugarcane-7, vegetables-8, fodder-9, fruits & nuts-10, other cash crops-11, others-99
Iffertilizer: entirely-1, partly-2, none-3
Decision Rules for Adopters
1. If weedicide=yes then pesticide=yes
2. If weedicide=no and crop=paddy then pesticide=yes
3. If weedicide=no and crop=vegetables and fertilizer=entirely then pesticide=yes
4. If weedicide=no and crop=vegetables and fertilizer=none then pesticide=yes
5. If weedicide=no and crop=other cash crops and fertilizer=entirely then pesticide=yes
6. If weedicide=no and crop=other cash crops and fertilizer=partly then pesticide=yes
Decision Rules for Non-adopters
1. If weedicide=no and crop=(wheat, other cereals, pulses, oil seeds, mixed crop, sugarcane, fodder, others) then pesticide=no
2. If weedicide=no and crop=vegetables and fertilizer=partly then pesticide=no
3. If weedicide=no and crop=other cash crops and fertilizer=none then pesticide=no
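For illustration, these extracted rules can be packaged as a simple classifier; the function name and the spelled-out string codes are assumptions for readability, not from the source.

def predict_pesticide(weedicide, crop, fertilizer):
    """Apply the extracted decision rules; returns 'yes' (adopter) or 'no'."""
    if weedicide == "yes":
        return "yes"                    # adopter rule 1
    if crop == "paddy":
        return "yes"                    # adopter rule 2
    if crop == "vegetables":            # adopter rules 3-4, non-adopter rule 2
        return "yes" if fertilizer in ("entirely", "none") else "no"
    if crop == "other cash crops":      # adopter rules 5-6, non-adopter rule 3
        return "yes" if fertilizer in ("entirely", "partly") else "no"
    return "no"                         # non-adopter rule 1: all other crops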
A case study of forewarning Powdery Mildew of Mango (PWM) disease
YEAR  T811   H811   T812   H812   T813   H813   T814   H814   STATUS
1987 28.20 91.25 28.48 88.60 29.50 85.50 30.14 82.86 1
1988 32.05 75.75 31.64 80.60 31.27 77.83 30.66 79.57 0
1989 26.60 86.25 26.28 87.20 26.47 87.33 26.31 89.14 0
1990 27.50 91.25 28.12 91.20 28.17 92.00 28.43 91.00 1
1991 28.43 87.00 28.70 86.20 29.00 83.50 29.57 80.57 0
1992 30.12 69.23 30.45 68.58 30.80 68.31 31.25 67.82 1
1993 30.50 61.75 30.48 61.13 30.37 60.56 30.33 61.76 0
1994 30.45 89.25 30.56 85.80 30.63 83.17 30.71 81.14 1
1995 28.63 61.38 29.10 61.20 29.58 61.17 30.71 61.57 0
1996 31.63 60.33 31.90 60.87 32.67 60.89 33.07 59.76 1
1997 32.13 71.00 32.20 69.40 31.67 69.00 31.50 68.29 0
2000 29.00 78.33 29.23 78.60 29.36 78.83 29.52 79.14 0
Dataset: PWM status for 1987-1997 and 2000, with the corresponding temperature and humidity conditions
Techniques for Rule-based Learning
• DM techniques: DT, RS, Fuzzy, ANN
• Or the hybrids like RDT
Selection of Algorithms
• Accuracy: fraction of test instances which are predicted correctly
• Interpretability: ability to understand the model
• Complexity:
– No. of conditions in the model
– No. of rules
– No. of variables
• Selection of an algorithm involves a trade-off among these parameters
ID TRAIN TEST
1987-94 1987-94 1995, 1996, 1997, 2000
1987-95 1987-95 1996, 1997, 2000
1987-96 1987-96 1997, 2000
1987-97 1987-97 2000
Train and Test Data
The prediction model for PWM epidemic, as obtained using the CJP algorithm with 1987-97 data as the training dataset
[Decision tree, redrawn as text: root H811 (<=87 / >87); the <=87 branch splits on T814 (<=30.7 / >30.7), matching the rules below]
Rules
• If (H811 > 87) then Status = 1
• If (H811 <= 87) and (T814 <= 30.71) then Status = 0
• If (H811 <= 87) and (T814 > 30.71) then Status = 1
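The three rules collapse into a tiny classifier; a minimal sketch, with illustrative function and argument names.

def pwm_status(h811, t814):
    """CJP rules from the slide: 1 = PWM epidemic expected, 0 = not expected."""
    if h811 > 87:
        return 1                          # rule 1
    return 0 if t814 <= 30.71 else 1      # rules 2 and 3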
Conclusions
• Rule induction can be very useful for the development of expert systems
• Strong linkages and collaboration are required among expert system developers, domain experts, and data mining experts