Input: Concepts, Instances, Attributes
seem5470/lecture/Input-2017.pdf


Terminology
Components of the input:
● Concepts: kinds of things that can be learned
  ◆ aim: intelligible and operational concept description
● Instances: the individual, independent examples of a concept
  ◆ note: more complicated forms of input are possible
● Attributes: measuring aspects of an instance
  ◆ we will focus on nominal and numeric ones

Kinds of Learning
● Concept: thing to be learned
● Concept description: output of learning scheme
● Different kinds of learning:
  ◆ Classification learning
    ● predicting a discrete class
    ● numeric prediction
  ◆ Clustering
    ● grouping similar instances into clusters
  ◆ Association learning
    ● detecting associations between attributes

What’s in an example?
● Instance: specific type of example
● Thing to be classified, associated, or clustered
● Individual, independent example of target concept
● Characterized by a predetermined set of attributes
● Input to learning scheme:
  ◆ set of instances/dataset
  ◆ represented as a single relation/flat file
  ◆ rather restricted form of input
  ◆ no relationships between objects

What’s in an attribute?
● Each instance is described by a fixed predefined set of “features”/“attributes”
● Number of attributes may vary in practice
● Example of a dataset:

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                Hard
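The "single relation/flat file" form of input can be sketched in a few lines. This is a minimal illustration, not a prescribed format: the CSV layout, the lower-case column names, and the helper `load_instances` are assumptions made for the example, and the four rows are the ones read from the example dataset on this slide.

```python
import csv
import io

# Flat-file (single relation) form: one line per instance, one column per
# attribute. The CSV layout and column names are assumptions for illustration.
FLAT_FILE = """age,spectacle_prescription,astigmatism,tear_production_rate,recommended_lenses
young,myope,no,reduced,none
young,myope,no,normal,soft
young,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,hard
"""


def load_instances(text):
    """Parse the flat file into (attribute names, list of instances)."""
    reader = csv.DictReader(io.StringIO(text))
    attributes = reader.fieldnames
    instances = []
    for row in reader:
        # DictReader stores surplus fields under the key None and fills
        # missing fields with None, so both cases are rejected here.
        if None in row or None in row.values():
            raise ValueError(f"row does not match the attribute set: {row}")
        instances.append(dict(row))
    return attributes, instances


attributes, instances = load_instances(FLAT_FILE)
print(attributes)
print(len(instances), "instances")
```

Each instance ends up as one mapping from attribute name to value, which is the "fixed predefined set of attributes" view of the data; relationships between instances cannot be expressed in this form.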

Classification learning
● Classification learning is supervised
● Scheme is provided with actual outcome
● Outcome is called the class of the example
● When the class type is numeric, it becomes a numeric prediction task.
● Measure success on fresh data for which class labels are known (test data)


Flower Classification: Iris Data Set
• Learn to distinguish 3 different kinds of iris flower:
  • setosa, versicolor, and virginica
• Extract 4 features:
  • sepal length, sepal width, petal length, petal width
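Supervised classification and "measure success on fresh data" can be sketched on a handful of Iris instances. The slides do not name a learning scheme, so a 1-nearest-neighbour classifier is swapped in here purely as an assumption. The train/test split is arbitrary and far too small for a real evaluation, and the names `TRAIN`, `TEST`, `predict` and `accuracy` are hypothetical.

```python
import math

# Each instance: (sepal length, sepal width, petal length, petal width), class.
# Three instances are used for training and three are held out as test data.
TRAIN = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
]
TEST = [
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def predict(features, train=TRAIN):
    """Return the class of the closest training instance (1-nearest neighbour)."""
    nearest = min(train, key=lambda item: math.dist(item[0], features))
    return nearest[1]


def accuracy(test=TEST, train=TRAIN):
    """Fraction of test instances whose predicted class equals the known class."""
    correct = sum(predict(features, train) == label for features, label in test)
    return correct / len(test)


print(accuracy())
```

The class labels of the test instances are known but are only used for scoring, never for training.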

Flower Classification: Iris Data Set
• Scatter plot: exploratory data analysis

Clustering
● Finding groups of items that are similar
● Clustering is unsupervised
● The class of an example is not known
● Success sometimes measured subjectively

     Sepal length  Sepal width  Petal length  Petal width  Type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica
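Grouping similar instances without using the class can be sketched with the six Iris instances of this slide, Type column dropped. The slides do not name a clustering algorithm, so plain k-means is used here as an assumption. The choice of k = 3 and of instances 1, 51 and 101 as starting centroids is arbitrary, and `kmeans` is a hypothetical helper.

```python
import math

# The six instances (sepal length, sepal width, petal length, petal width);
# the class column is deliberately left out because clustering is unsupervised.
POINTS = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4),
    (6.4, 3.2, 4.5, 1.5),
    (6.3, 3.3, 6.0, 2.5),
    (5.8, 2.7, 5.1, 1.9),
]


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until assignments stop changing."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


labels, centroids = kmeans(POINTS, [POINTS[0], POINTS[2], POINTS[4]])
print(labels)
```

The cluster numbers carry no meaning of their own; whether the grouping is a good one still has to be judged, which is why success is sometimes measured subjectively.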

More on attributes
● Each instance is described by a fixed predefined set of “features”/“attributes”
● But: number of attributes may vary in practice
● Related problem: existence of an attribute may depend on the value of another one
● Possible attribute types (“levels of measurement”):
  ◆ Nominal and ordinal

Nominal quantities
● Values are distinct symbols
● Values themselves serve only as labels or names
● Nominal comes from the Latin word for name
● Example:
  ◆ attribute “outlook” from weather data
  ◆ values: “sunny”, “overcast”, and “rainy”
● No relation is implied among nominal values (no ordering or distance measure)
● Only equality tests can be performed
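The restriction to equality tests can be made concrete with the “outlook” attribute. This is a small sketch under assumptions: the helper names `check` and `same_outlook` and the list of observations are hypothetical, chosen only to show which operations remain meaningful.

```python
from collections import Counter

# The nominal attribute "outlook": an unordered set of distinct symbols.
OUTLOOK = ("sunny", "overcast", "rainy")


def check(value):
    """Reject anything that is not one of the declared symbols."""
    if value not in OUTLOOK:
        raise ValueError(f"unknown outlook value: {value!r}")
    return value


def same_outlook(a, b):
    """Equality is the only comparison defined for nominal values."""
    return check(a) == check(b)


# Hypothetical observations; counting is fine because it relies on equality only.
observations = ["sunny", "rainy", "overcast", "rainy"]
counts = Counter(check(v) for v in observations)

print(same_outlook("sunny", "rainy"))
print(counts.most_common(1))
```

Nothing in the sketch compares values with `<` or subtracts them, because no ordering or distance is defined for a nominal attribute.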

Ordinal quantities
● Impose order on values
● But: no distance between values defined
● Example:
  ◆ attribute “temperature” in weather data
  ◆ values: “hot” > “mild” > “cool”
● Note: addition and subtraction don’t make sense
● Example rule:
  temperature < hot → play = yes
● Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)
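The example rule can be sketched by giving the “temperature” values an explicit order while exposing no arithmetic on them. The names `TEMPERATURE_ORDER`, `less_than` and `rule_fires` are hypothetical; the list positions only encode the order and are not meant as distances.

```python
# Ordinal attribute "temperature": the list order encodes cool < mild < hot.
# The positions are ranks only; differences between them have no meaning.
TEMPERATURE_ORDER = ["cool", "mild", "hot"]


def less_than(a, b):
    """Order comparison, the extra operation ordinal values allow."""
    return TEMPERATURE_ORDER.index(a) < TEMPERATURE_ORDER.index(b)


def rule_fires(temperature):
    """Condition of the example rule: temperature < hot -> play = yes."""
    return less_than(temperature, "hot")


for value in TEMPERATURE_ORDER:
    print(value, "-> play = yes" if rule_fires(value) else "-> rule does not apply")
```

The rule says nothing about the case temperature = hot, so the sketch reports that the rule does not apply instead of predicting "no".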

Attribute Types (Other terminology)
● Most schemes accommodate just two levels of measurement: nominal and ordinal
● Nominal attributes are also called “categorical”, “enumerated”, or “discrete”
● Special case: dichotomy (“Boolean” attribute)
● Sometimes ordinal attributes include “numeric” or “continuous” attributes
● But: “continuous” implies mathematical continuity

Inaccurate values
● Reason: data has not been collected for mining it
● Result: errors and omissions that don’t affect original purpose of data (e.g. age of customer)
● Typographical errors in nominal attributes: values need to be checked for consistency
● Typographical and measurement errors in numeric attributes: outliers need to be identified
● Errors may be deliberate (e.g. wrong zip codes)
● Other problems: duplicates, stale data
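Both checks can be sketched briefly. The slide does not say how outliers should be identified, so the common 1.5 × IQR rule is used here as an assumption. The sample values (misspelled outlooks, customer ages) and the helper names are hypothetical.

```python
import statistics


def inconsistent_values(values, allowed):
    """Distinct values that are not among the allowed nominal symbols."""
    return sorted(set(values) - set(allowed))


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the quartiles (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical nominal column containing two typographical variants.
outlook = ["sunny", "overcast", "rainy", "suny", "Rainy"]
print(inconsistent_values(outlook, {"sunny", "overcast", "rainy"}))

# Hypothetical numeric column with one implausible age.
ages = [21, 25, 30, 34, 38, 41, 45, 230]
print(iqr_outliers(ages))
```

Flagged values still need a human decision: a flagged age may be a typing error, a measurement error, or a deliberate wrong entry.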

Noise in Data
● In general, errors in data are called noise
● A dataset that contains such errors is called noisy data
● Missing values and inaccurate values are examples of noise
● Practical machine learning algorithms should be able to handle noise
  ◆ If the amount of data is large, a small amount of noise does not affect the true underlying patterns
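The last point can be illustrated with synthetic data. Everything here is an assumption made for the sketch: the attribute values "a" and "b", the 1000 instances, the 5% of corrupted class labels, and the very simple "majority class per attribute value" learner.

```python
from collections import Counter, defaultdict


def majority_class_per_value(instances):
    """For each attribute value, return the class seen most often with it."""
    counts = defaultdict(Counter)
    for value, label in instances:
        counts[value][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def flip(label):
    """Corrupt a class label."""
    return "no" if label == "yes" else "yes"


# Synthetic, noise-free data: value "a" always has class "yes", "b" always "no".
clean = [("a", "yes")] * 500 + [("b", "no")] * 500

# Deterministic noise: corrupt the class of every 20th instance (5% of the data).
noisy = [
    (value, flip(label)) if i % 20 == 0 else (value, label)
    for i, (value, label) in enumerate(clean)
]

print(majority_class_per_value(clean))
print(majority_class_per_value(noisy))
```

With 25 corrupted labels against 475 correct ones per attribute value, the learned pattern is the same for the clean and the noisy data; with only a handful of instances the same amount of noise could easily change it.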