Data Mining Chapter 2 Input: Concepts, Instances, and Attributes


Kirk Scott


What Software Are We Using?

• The book uses a software package named Weka

• Part III of the book, chapters 10-17, is about Weka specifically

• These chapters will not be covered in this course

• When you do the project you will gain experience (i.e., teach yourself) with Weka

• The book chapters are your reference if you can’t figure Weka out from the GUI


• Weka

• From Wikipedia, the free encyclopedia


• The Weka or woodhen (Gallirallus australis) is a flightless bird species of the rail family. It is endemic to New Zealand, where four subspecies are recognized. Weka are sturdy brown birds, about the size of a chicken. As omnivores, they feed mainly on invertebrates and fruit. Weka usually lay eggs between August and January; both sexes help to incubate.


• Behaviour

• …

• Where the Weka is relatively common, their furtive curiosity leads them to search around houses and camps for food scraps, or anything unfamiliar and transportable.[2]


Chapter 2 Sections

• 2.1 What’s a Concept?

• 2.2 What’s in an Example?

• 2.3 What’s in an Attribute?

• 2.4 Preparing the Input


2.1 What’s a Concept?


• The book speaks in terms of instances, attributes, and concepts

• Instances and attributes are the same ideas that come up in databases (db)

• A concept is the “thing” to be mined from the data, which is somehow inherent in the data


• Data mining was defined earlier as finding a structural representation

• Essentially the same idea is now expressed as finding a concept description


• Note the “known” unknowns and “unknown” unknowns aspect of this

• You may have a concept you want to test—sort of like a hypothesis from the natural sciences

• More generally, you may simply want to find out whether any concepts are present—and what they might be


Concept Description

• The concept description needs to be:

• Intelligible
– It can be understood, discussed, disputed

• Operational
– It can be applied to actual examples


Types of Discovery

• Classification
– Prediction

• Clustering
– Outliers

• Association

• Each of these is a concept

• Successful accomplishment of these for a data set is a concept description


Classification, Association Rules, and Clustering


Classification

• Examples thus far: Weather, contact lenses, iris, labor contracts

• All were essentially classification problems

• In general, the assumption is that classes are mutually exclusive

• In complicated problems, data sets may be classified in multiple ways

• This means individual instances can be “multi-labeled”


Supervised Learning

• Classification learning is supervised

• There is a training set

• A structural representation is derived by examining a set of instances where the classification is known

• How do you test the accuracy of the structural representation?

• Apply the results to another data set with known classifications


Numerical Prediction

• This is a variation on classification

• Given n attribute values, determine the (n + 1)st attribute value

• Recall the CPU performance problem

• It would be a simple matter to dream up sample data where the weather data predicted how long you would play rather than a simple yes or no

• (The book does so)


Association Rules

• In any given data set there can be many association rules

• A rule takes the form “if attribute x then attribute y” (x → y)

• Considering pairwise associations of n attributes, the total number of rules may approach n(n – 1) / 2

• There may also be rules with multiple x in the protasis (also add apodosis to your vocabulary…)


Support and Confidence

• The book hasn’t yet introduced the terms support and confidence, but it discusses these ideas

• The terms will come up later in the book

• The ideas and the terms for them will be introduced now in the overheads


Support for Association Rules

• Let an association rule X → y, where X = (x1, x2, …, xi), be given in a data set with m instances

• The support for X → y is the count of the number of instances where the combination of x values, X, occurs in the data set, divided by m


• In other words, the association rule may be interesting if it occurs frequently enough

• Put negatively, if that combination of x values occurs infrequently, is there reason to believe that there is any kind of rule at all, or that it would be of much use in practice?


Confidence for Association Rules

• Confidence here is based on the statistical use of the term

• The confidence for X → y is the count of the number of occurrences in the data set where this relationship holds true divided by the number of occurrences of X overall

• The book describes this idea as accuracy


• In other words, the association is interesting the more likely it is that X does determine y

• Put negatively, consider the case where X determines y less than half the time

• Would that even be a rule?

• In that case, for a two-valued y, X → !y would be a rule

• In general, the lower the confidence, the less useful the rule (as a predictor, for example)
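The two measures can be computed directly from a table of instances. The following is a minimal sketch that uses the definitions from these overheads (support counts occurrences of X, confidence counts how often y holds when X occurs). The function name and the toy instances are made up for illustration and do not come from the book.

```python
def support_and_confidence(instances, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    Support: fraction of all instances in which the antecedent X occurs.
    Confidence: fraction of those instances in which y also holds.
    """
    def matches(instance, conditions):
        return all(instance.get(attr) == value for attr, value in conditions.items())

    m = len(instances)
    with_x = [inst for inst in instances if matches(inst, antecedent)]
    with_x_and_y = [inst for inst in with_x if matches(inst, consequent)]
    support = len(with_x) / m if m else 0.0
    confidence = len(with_x_and_y) / len(with_x) if with_x else 0.0
    return support, confidence


# Hypothetical toy instances, made up for illustration only.
toy = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "sunny", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
]
rule_support, rule_confidence = support_and_confidence(
    toy, {"outlook": "sunny"}, {"play": "no"}
)
```

Here X (outlook = sunny) occurs in 3 of the 4 instances, and y (play = no) holds in 2 of those 3.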


Clustering

• Given a data set without predefined classes, is it possible to determine classes that the instances fall into?

• Having determined the classes, can you then classify future instances into them?

• Outliers are instances that you can definitely say do not fall into any of the classes


Levels of Hypothesis

• Note the levels of hypothesis in data mining

• Clustering is a good example

• High level hypothesis:

• We hypothesize, generally, that data set instances may fall into recognizable groups

• We don’t know what those groups might be


• We apply one or more data mining techniques to see whether they will identify useful clusters

• Clusters may or may not be discovered


• Low level hypothesis:

• In theory, we might have hypothesized in advance what the clusters might be

• But instead, we rely on data mining to determine what they are, if they exist at all

• Both the fact that clusters exist, and what they are, are concept descriptions produced by data mining


2.2 What’s in an Example?


• The authors are trying to present some important ideas

• In case their presentation isn’t clear, I present it here in a slightly different way

• The basic premise goes back to this question:

• What form does a data set have to be in in order to apply data mining techniques to it?


Data Sets Should Be Tabular

• The simple answer based on the examples presented so far:

• The data has to be in tabular form, instances with attributes

• In fact, the data mining view is that data sets fall into one table only

• The remainder of the discussion will revolve around questions related to normalization in db


Not All Data is Naturally Tabular

• Some data is not most naturally represented in tabular form

• Consider OO db’s, where the natural representation is tree-like

• One question that will have to be answered:

• How should such a representation be converted to tabular form that is amenable to data mining?


Correctly Normalized Data May Fall into Multiple Tables

• You might also have data which naturally falls into >1 table

• Or, you might have data that has been normalized into >1 table

• How do you make it conform to the single table model (instances with attributes) for data mining?


• Tree-like data and multi-table data may be related questions

• It would not be surprising to find that a conversion of a tree to a table resulted in >1 table


Denormalization

• The situation goes against the grain of correct database design

• The classification, association, and clustering you intend to do may cross db entity boundaries

• The fact that you want to do mining on a single tabular representation of the data means you have to denormalize


• In short, you combine multiple tables back into one table

• The end result is the monstrosity that is railed against in normalization theory:

• The monolithic, one-table db


The Book’s Family Examples

• Family relationships are typically viewed in tree-like form

• The book considers a family tree and the relationship “is a sister of”

• The factors for inferring sisterhood:

• Two people, one female

• The same parents (or at least one common parent) for both people


Two People (Possibly Siblings) in the Same Row in the Same Table

• For the purposes of data mining, you can’t maintain the data in a tree-like form

• One tabular representation:

• Put two people in each row of a table

• Record all possible pairs

• Have a final column in the table that says whether the first person of the pair is a sister of the second person of the pair


• Obviously, from a db point of view, this is kind of ugly

• The table essentially consists of the Cartesian product of a Person table with itself, with a classification field added

• On the other hand, this might be a practical approach for data mining
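The pair table can be sketched in a few lines. This is only an illustration under assumed data: the `people` dictionary, its names, and the `is_sister_of` helper are hypothetical and are not taken from the book's family tree.

```python
from itertools import product

# Hypothetical Person table: name -> (gender, mother, father).
people = {
    "Ann": ("female", "Mary", "Tom"),
    "Bob": ("male", "Mary", "Tom"),
    "Cat": ("female", "Sue", "Ray"),
}

def is_sister_of(first, second):
    """True if `first` is female and shares at least one parent with `second`."""
    if first == second:
        return False
    gender, mother, father = people[first]
    _, other_mother, other_father = people[second]
    return gender == "female" and (mother == other_mother or father == other_father)

# One row per ordered pair: the Cartesian product of the table with itself,
# plus the classification column.
pair_table = [
    (a, b, "yes" if is_sister_of(a, b) else "no")
    for a, b in product(people, repeat=2)
]
```

With 3 people the table already has 9 rows, only one of which is a "yes". This shows how quickly the Cartesian product grows and how unbalanced the classes become.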


The Closed World Assumption

• Recall that according to normalization, a truly one-to-one relationship can be stored in a single table

• In theory, you might have a case where you have all rows and only the rows where the classification was yes, “is a sister of”

• Then any other pairing that might crop up could by default be regarded as “no”


• This kind of case is known as the “closed world assumption” in data mining

• It is largely theoretical

• Unfortunately, it hardly ever happens that you have a problem where this kind of simplifying assumption applies in practice

• You have to deal with all cases


Two People (Parent-Child) in the Same Row in the Same Table

• This refers to the Mother-Child scenario from db, where you have one Person table

• Each row contains a field indicating gender and fields indicating parents

• This is enough information to determine sisterhood

• You don’t have a Cartesian product, but you do have an analytical problem sorting through the data to find sisterhood


Association Rules and the Connection with Normalization

• There is a problem with denormalized data mining which is completely analogous to the normalization problem

• Suppose you have two people in the same instance (the same row) each with their attributes

• By definition, you will have stray dependencies

• The Person identifiers determine the attribute values


• What would happen if you mined for associations?

• The algorithm would find the perfectly true, but already known associations between the pk identifiers of the people and their attribute fields

• This is not helpful

• It’s a waste of effort


Recursive Relationships

• Recall the monarch and product-assembly examples from db

• These give tables in recursive relationships with themselves or others

• How do you deal with the question of “is descended from” when there is a potentially unlimited sequence of ancestors?

• Recall that simple SQL didn’t support recursive queries
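To make the idea of a recursive relationship concrete, here is a small sketch of “is descended from” written as a recursive function. The `parent_of` table and its names are hypothetical, and one parent per person is assumed purely to keep the example short.

```python
# Hypothetical parent links: child -> parent (one parent per person for brevity).
parent_of = {"Carol": "Beth", "Beth": "Alice", "Dan": "Alice"}

def is_descended_from(person, ancestor):
    """Recursive rule: a person descends from an ancestor if their parent is
    the ancestor, or if their parent descends from the ancestor."""
    parent = parent_of.get(person)
    if parent is None:
        return False
    return parent == ancestor or is_descended_from(parent, ancestor)
```

The function calls itself once per generation, so it handles a chain of ancestors of any length. A single non-recursive query cannot do this.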


• In the data mining case, you would need recursive rules

• Mining recursive rules is a step beyond classification, association, etc.

• The good news is that this topic will not be covered further

• It’s simply of interest to know that the same question that arises in db also arises in data mining


One-to-Many Relationships

• A denormalized table might be the result of joining two tables in a pk-fk relationship

• Concretely, the Mother-Child situation is an example of this

• We are interested in doing classification

• But don’t be confused:

• In this scenario, we are trying to classify mothers, but not classify them into groups of one, namely the mother of a certain set of children


• The goal is to classify the “one” side of the relationship contained in the table, the mothers, on some other characteristic

• In this case you have multiple instances in the table which are not independent

• If a given mother has more than one child, she will appear more than once in the table

• In data mining this is called a multi-instance situation


• The multiple instances belonging to one classification together form one example of the (data mining) concept under consideration in such a problem

• Data mining algorithms have been developed to handle cases like these

• They will be presented with the other algorithms later


Summary of 2.2

• The fundamental practical idea here is that data sets have to be manipulated into a form that’s suitable for mining

• This is the input side of data mining

• The reality is that denormalized tables may be required

• Data mining can be facetiously referred to as file mining since the required form does not necessarily agree with db theory


• The situation can be restated in this way:

• If you’re starting with multiple, normalized tables, assemble them into a monolithic table using a join query

• Then mine this

• Existing data mining algorithms have been devised to handle the fact that they work with denormalized data
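The join step can be sketched with Python's built-in `sqlite3` module. The Mother and Child tables, their columns, and the rows are assumptions invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Mother (motherid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Child (childid INTEGER PRIMARY KEY, name TEXT,
                        motherid INTEGER REFERENCES Mother(motherid));
    INSERT INTO Mother VALUES (1, 'Ann'), (2, 'Bea');
    INSERT INTO Child VALUES (10, 'Cal', 1), (11, 'Dee', 1), (12, 'Eve', 2);
""")

# Join the pk-fk pair into one denormalized table: one row per child,
# with the mother's fields repeated on every row.
flat_rows = conn.execute("""
    SELECT Mother.motherid, Mother.name, Child.childid, Child.name
    FROM Mother JOIN Child ON Mother.motherid = Child.motherid
    ORDER BY Mother.motherid, Child.childid
""").fetchall()
conn.close()
```

The mother with two children appears on two rows of `flat_rows`. That repetition is the price of the single-table form, and it is the kind of dependence between rows that mining algorithms have to tolerate.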


• This leads to an open question:

• Would it be possible to develop a data mining system that could encompass >1 table, crawling through the pk-fk relationships like a query, finding classifications, associations, and clusters?


2.3 What’s in an Attribute?


• This subsection falls into two parts:

• 1. Some ideas that go back to db design and normalization questions

• 2. Some ideas having to do with data type


Design and Normalization

• You could include different kinds (subtypes) of entities in the same table

• To make this work you would have to include all of the fields of all of the kinds of entities

• The fields that didn’t apply to a particular instance would be null

• The book uses transportation vehicles as an example: ships and trucks


• You could also have fields in a table that depend on each other (generally a sign of poor design)

• The book gives married T/F plus spouse’s name as attributes in the same table as an example

• Again, you can handle F with a null value for spouse’s name


Data Types

• The simplest distinction is numeric vs. categorical

• Some synonyms for categorical: symbolic, nominal, enumerated, discrete

• There are also two-valued variables known as Boolean or dichotomy


Spectrum of Data Types

• 1. Nominal

• 2. Ordinal

• 3. Interval

• 4. Ratio

• Explanations follow


• 1. Nominal attributes are unordered, unmeasurable named categories

• Example: sunny, overcast, rainy

• Logically there is an order here, but you can’t say things like sunny < overcast, overcast < rainy


• 2. Ordinal attributes are named categories that can be put into a logical order but which have no intrinsic numeric value and no defined distance between them

• Example: hot, mild, cool

• This kind of attribute does support < and > comparison


• 3. Interval attributes are numeric values where the distance between them makes sense

• Example: Time expressed in years

• This kind of attribute supports subtraction but other numeric operations do not typically make sense


• 4. Ratio attributes are real or continuous (or possibly integer) numeric values on a scale with a natural 0 point

• Example: Physical distance

• This kind of attribute supports all numeric operations


• In principle, data mining has to handle all possible types of data

• In practice, applied systems typically have some useful subset of the type distinctions given above

• You adapt your data to the types provided

• This is like using the data types provided in a db management system


2.4 Preparing the Input


• In practice, preparing the data can take more time and effort than doing the mining

• Real data tends to be low in quality

• Think data integrity and completeness

• “Cleaning” the data before mining it pays off

• Data needs to be in the format required by whatever mining software you’re using


Gathering the Data Together

• In a large organization, different departments may manage their own data

• Global level data mining will require integration of data from multiple databases

• If you’re lucky, the organization has already created a unified archive, a data warehouse

• Interesting mining may also require integrating external data into the data set


Aggregation

• It may be necessary to aggregate data in order to mine it successfully

• You may be interested in data at a level above that of parameters occurring in individual instances

• In other words, you may need to aggregate data from corresponding fields of multiple instances in order for it to be useful


• The type of aggregation is important

• Remember the aggregation operators in db: COUNT, SUM, AVERAGE, etc.

• The level of aggregation is important

• Remember GROUP BY in db

• Do you aggregate all instances, or is it useful to do it by subsets of some sort?
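Both choices, the type of aggregation and the level, show up in even a tiny example. The sketch below imitates a GROUP BY in plain Python. The `sales` rows and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales instances: (region, amount).
sales = [("north", 10.0), ("north", 30.0), ("south", 5.0)]

def aggregate_by_group(rows):
    """Compute count, sum, and average of amount for each region,
    in the spirit of an SQL GROUP BY region."""
    groups = defaultdict(list)
    for region, amount in rows:
        groups[region].append(amount)
    return {
        region: {"count": len(v), "sum": sum(v), "average": sum(v) / len(v)}
        for region, v in groups.items()
    }

by_region = aggregate_by_group(sales)                               # aggregate by subset
overall_average = sum(amount for _, amount in sales) / len(sales)   # aggregate everything
```

The overall average (15.0) hides the difference between the regional averages (20.0 and 5.0), which is why the level of aggregation matters.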


ARFF (Format)

• The regular format for data mining files in Weka is ARFF = Attribute-Relation File Format

• XRFF is the XML version of the format

• In ARFF:

• % marks a comment

• @ marks file descriptor information: relation, attributes, and data


• For a categorical attribute, the set of possible values is given in braces

• Categorical values containing spaces have to be put in quotation marks

• Numeric attributes are simply identified as numeric


• Instances in the file are given line by line

• They are separated by new lines

• Attribute values in instances are separated by commas

• Missing values (nulls) are indicated with a question mark


• In an ARFF file, a classification attribute, if there is one, is treated no differently than any others

• The format is equally suited to classification, association, or cluster mining

• Figure 2.2, on the following overhead, shows the weather data set in ARFF
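As a rough illustration of the ARFF layout described above, here is a small sketch that builds ARFF text from Python data. The `to_arff` helper is hypothetical and is not part of Weka. The two sample rows are made up (one includes a missing value) and are not a copy of Figure 2.2. Quoting of values that contain spaces is deliberately left out.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text. `attributes` is a list of (name, kind) pairs where
    kind is either the string "numeric" or a list of nominal values."""
    lines = [f"% {relation} data (illustrative)", f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: possible values listed in braces.
            lines.append(f"@attribute {name} {{{', '.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        # None stands for a missing value, written as a question mark.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines)


weather_attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("humidity", "numeric"),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]
sample_rows = [
    ("sunny", 85, 85, "FALSE", "no"),
    ("overcast", 83, None, "FALSE", "yes"),
]
arff_text = to_arff("weather", weather_attributes, sample_rows)
```

Printing `arff_text` shows the comment line, the `@relation` line, one `@attribute` line per attribute, and then the comma-separated instances after `@data`. Note that `play` is declared like any other attribute.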


Weka Has Three Additional Attribute Types

• String = the moral equivalent of VARCHAR in db

• Date = the equivalent of DATE in db

• Relational

• It is not immediately apparent what this means

• It will be covered next


Relational-Valued Attributes

• The book gives an example which is OK, but it’s not necessarily presented in the clearest way possible

• My plan is to first give some explanatory background

• Then explain the book’s example in a slightly different order than it does


Relational Background

• Recall that multi-valued problems can be viewed as mining the result of a 1-m join

• In preparing a data set for mining, this is what a relational-valued attribute is:

• It is an attribute that can contain or consist of multiple instances of the same kind of set of values, where these sets belong together for some reason


• In a 1-m, pk-fk join, the multiple sets are the rows of the many table which belong together because they share the same fk value

• In case this general overview isn’t clear, the idea can be illustrated with mothers and children


Mothers and Children

• Suppose you ran this query:

• SELECT *
• FROM Mother, Child
• WHERE Mother.motherid = Child.motherid
• ORDER BY Mother.motherid

• Children of the same mother would be grouped together in the result


• In data mining, it is possible that you would want to elicit information about children in general

• You might also want to elicit information that generally held for children of the same mother

• Conceptually, you would be mining information about siblinghood


• This is where relational-valued attributes come in

• From a relational point of view, the representation is wrong

• First normal form says you have flat files with no repeating groups

• But for data mining purposes, in ARFF format, you want the repeating groups


Explaining the Book’s Example

• The book adapts the weather/play a game data to a multi-valued example

• The new twist is this: Games extend over 2 days, not just one

• Each day is still a single instance

• But for each game, there are two of these instances which belong together


• The instances in the rows of the table representing game information will be multi-valued

• Each game will contain two days’ worth of weather data

• These two days’ worth of data are the contents of one relational-valued attribute in the overall data set


• Note that in general, relational attributes, multi-valued attributes, are not limited to 2 sets of values

• This is just an artifact of the book’s example, where games last exactly 2 days


• The book also uses terminology which could be clearer

• In their new weather table they name the relational attribute “bag”

• It would probably have been clearer if they had named the attribute game_days or something like that


• There are three major attributes in the book’s table:

• bag_ID (id for sets of days belonging to games)

• bag (multi-valued relational attribute containing days belonging to games, grouped by game)

• play (the classification to play, yes or no)


• The relational valued bag attribute has 4 (familiar) attributes describing the multi-valued instances (of day):

• outlook

• temperature

• humidity

• windy


• In the body of the ARFF table, the multi-valued entries are structured in this way:

• The data for the multiple days that belong together for a single bag_ID is enclosed in quotation marks

• Within the quotation marks, the individual sets of day data are separated by “\n”, the new line character


• The book’s ARFF table is shown in Figure 2.3 on the following overhead


• In summary, this is what a relational attribute is:

• It is a way of describing a denormalized data construct, where in the data set, for a 1-m relationship, the many entities are grouped together according to the one entity they match with

• ARFF has a straightforward syntax for recording data in this way


Sparse Data

• Some data sets are sparse

• In this context the book (Weka) means occurrences of 0’s for numerical values, not nulls

• If the data set is sufficiently sparse, rather than listing everything, a row can be economically expressed by showing only the values present


• In ARFF, the attributes for a row can be identified by number starting with 0

• The id number is followed by the value, and the id/value pairs are separated by commas

• The values for a row are enclosed in braces


• For example, this is the representation of a row with 11 attributes (numbered 0 through 10) where 8 of them are 0

• {1 X, 6 Y, 10 "Class A"}

• This technique doesn’t work for nulls

• You still have to include each attribute with a ? representing null
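The conversion from a dense row to the sparse form can be sketched as follows. The helper name `to_sparse_arff_row` is hypothetical and is not a Weka function. The dense row is chosen so that the output matches the example above.

```python
def to_sparse_arff_row(values):
    """Encode a dense row as a sparse ARFF row: keep only nonzero entries,
    each written as '<0-based index> <value>'."""
    def fmt(value):
        text = str(value)
        # Values containing spaces have to be quoted.
        return f'"{text}"' if " " in text else text

    pairs = [f"{i} {fmt(v)}" for i, v in enumerate(values) if v != 0]
    return "{" + ", ".join(pairs) + "}"


dense = [0, "X", 0, 0, 0, 0, "Y", 0, 0, 0, "Class A"]
sparse = to_sparse_arff_row(dense)
```

Only zeros are dropped. A missing value stored as the string "?" is not equal to 0, so it would still be written out explicitly, consistent with the point that the technique does not cover nulls.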


Attribute Types

• The bottom line is that ARFF only has two fundamental types: nominal and numeric

• String attributes are effectively nominal

• Date attributes are effectively numeric

• (Recall the discussions of stuff like this in db)

• The rest of this subsection has to do with numeric types in particular


Numerics as Ordinals

• The important point is this:

• Numerics in ARFF simply look like numbers

• But different data mining algorithms treat numeric values differently

• One algorithm may treat numerics as ordinals, generating rules based on <, =, > comparisons


Numerics as Ratio Values

• Another algorithm may treat numerics as ratio values

• Recall that all arithmetic operations are defined in this case

• The algorithm may normalize ratio values


Normalization

• Normalization means putting values into a range, most commonly the range 0 to 1

• A simple approach for positive values: Divide any given data value by the maximum present

• Another simple approach for positive values: Subtract the minimum from the data value and divide by (max – min)
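The two approaches can be written out as a short sketch. The function names are made up for illustration, and the handling of a constant column is an assumption, since the overheads do not cover that case.

```python
def normalize_by_max(values):
    """Divide each (positive) value by the maximum present."""
    largest = max(values)
    return [v / largest for v in values]

def normalize_min_max(values):
    """Subtract the minimum and divide by (max - min), giving values in 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a constant column maps to 0
    return [(v - lo) / (hi - lo) for v in values]
```

For the values 2, 4, 8 the first approach gives 0.25, 0.5, 1.0, while the second stretches the data so that the minimum lands exactly on 0 and the maximum on 1.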


Standardization

• Values can also be statistically standardized

• Each data point is converted using this approach:

• x_standardized = (x – μ) / σ

• This puts the values into a distribution where the mean is 0 and the standard deviation is 1
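A minimal sketch of the conversion, using Python's `statistics` module. Using the population standard deviation (`pstdev`) rather than the sample version is an assumption; the overheads do not say which is meant.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Convert each x to (x - mean) / standard deviation.

    Assumes the values are not all identical (sigma would be 0).
    """
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation; an assumption
    return [(x - mu) / sigma for x in values]

z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

The sample values have mean 5 and standard deviation 2, so the first value, 2, becomes -1.5. The resulting `z_scores` have mean 0 and standard deviation 1.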


Distance as an Example of Ratio Values

• Consider the calculation of distance in 2-dimensional space for example

• Calculating the square root of the sum of the squares of the differences of the coordinates involves using arithmetic operators other than subtraction

• Normalization may be used in a situation like this


• Given some (x, y) space, suppose x is in the range 0 to 10 and y is in the range 0 to 100

• Do you normalize both x and y before calculating distances or not?

• Another way of stating this is, do x and y make corresponding contributions to the measure of distance between two data points or not?
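The effect of the choice can be seen in a small sketch. The two points are hypothetical, and min-max scaling is used as one possible normalization.

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def scale(point, ranges):
    """Min-max scale each coordinate using its (low, high) range."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(point, ranges))

ranges = [(0, 10), (0, 100)]   # x in 0 to 10, y in 0 to 100
a, b = (1, 50), (9, 60)        # hypothetical points
raw_distance = euclidean(a, b)
scaled_distance = euclidean(scale(a, ranges), scale(b, ranges))
```

In the raw calculation the y difference (10 units, only a tenth of its range) contributes more to the distance than the x difference (8 units, most of its range). After scaling, the x difference dominates.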


Nominal Attributes and Distance

• This is a crude measure of distance for nominal attributes:

• If two instances have the same value for that attribute, the distance between them, measured on that attribute, is 0

• If two instances have different values for that attribute, the distance between them is 1
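This 0/1 measure is short enough to write out. Summing the per-attribute distances to compare whole instances is an assumption added here as one simple way to combine attributes; the function names are hypothetical.

```python
def nominal_distance(value_a, value_b):
    """0 if the two nominal values are the same, 1 if they differ."""
    return 0 if value_a == value_b else 1

def instance_distance(a, b):
    """Sum the per-attribute 0/1 distances over two all-nominal instances.
    Summing is one simple choice for combining attributes, assumed here."""
    return sum(nominal_distance(x, y) for x, y in zip(a, b))
```

For example, `instance_distance(("sunny", "hot", "high"), ("rainy", "hot", "normal"))` counts two differing attributes.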


• There are cases where nominal attributes can be reverse engineered back to numerics

• One example from the book: Zip codes and geographic location coordinates

• Recall that zip codes came up in db in a similar way as determinants of geographic locations


Nominal vs. Numeric

• Just like in db, the assertion is made that an id “number” field should be TEXT

• In data mining there may be attributes containing numeric digits which are simply nominal fields and should be mined as such


• Finally, some algorithms support nominals but not ordinals

• In the contact lens data, young < pre-presbyopic < presbyopic

• If this ordinal relationship is not recognized, a complete and correct set of rules can still be mined

• However, a complete and correct set of rules about 1/3 as large can be mined in a system that recognizes the relationship


Missing Values

• This is essentially a discussion of nulls

• The only new element consists of two related questions:

• Can you infer anything from the absence of values?

• Would it be possible to meaningfully code why values are absent and mine something from this?


Inaccurate Values

• This is essentially a discussion of data integrity

• Both data miners and regular db users have to cope with faulty data one way or the other


• The authors say this is especially important when mining

• It’s especially important if the data mining algorithm ascribes more significance to an attribute than a regular user does

• In general, the relative importance or significance of different attributes in n-dimensional space is a big question in data mining


A Whimper Rather than a Bang

• This set of overheads is about to end

• Remember that the overall topic was input into data mining

• This included questions both about the content and the form of the data

• The next chapter will discuss the output of data mining: how the knowledge or structure is represented by different algorithms


The End