Data Mining – Input: Concepts, instances, attributes Chapter 2.

Data Mining – Input: Concepts, instances, attributes

Chapter 2

Concept

• Thing to be learned

– Ignore any philosophy about what a concept is

– Need description that is

• Intelligible – can be understood, and thus can be argued / discussed as to its validity by humans

• Operational – it can be applied to future examples

• How the concept is expressed is the “concept description”

• Concept may differ based on different styles of learning … classification, association, clustering, numeric prediction …

Styles of Learning

• Classification – learn way of “classifying” unseen examples – put them in the correct category

• Association – learn any association between attributes

• Clustering – seek groups of examples that belong together, without pre-classification

• Numeric prediction – prediction of numeric quantity instead of category

Classification

• “Supervised” – learning scheme is provided correct classification/class/category for “training” data

• Success is measured by trying out what is learned on independent, previously unseen “test” data (withholding the category/class until checking the program’s answer)
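
The hold-out idea on this slide can be sketched as follows. This is only an illustration under stated assumptions: the `majority_class` "learner", the `holdout_accuracy` helper, the toy data, and the 7/3 split are all hypothetical and do not come from the chapter.

```python
from collections import Counter

def majority_class(train_labels):
    """Toy 'learner': always predict the most common training class."""
    return Counter(train_labels).most_common(1)[0][0]

def holdout_accuracy(instances, n_train):
    """Train on the first n_train instances, score on the withheld rest."""
    train, test = instances[:n_train], instances[n_train:]
    prediction = majority_class([label for _, label in train])
    # The test labels are consulted only here, after the model is fixed.
    correct = sum(1 for _, label in test if label == prediction)
    return correct / len(test)

# (outlook, class) pairs; the values are illustrative only.
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
        ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
        ("sunny", "yes"), ("rainy", "yes")]
accuracy = holdout_accuracy(data, n_train=7)
```

The point of the sketch is the separation: the class labels of the last three instances play no part in training and are used only to check the predictions.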

Supervision

• Classification and numeric prediction are “supervised”

• Association and Clustering are “unsupervised”

Inputs – What’s in an Example?

• Input is a set of instances (records/examples)

• Each instance has a set of values for pre-determined attributes (like a record in a DB)

• I.e. input is like a single DB table, or “flat file”

– There may be things we’d like to learn that don’t fit into this simple structure, but current technology is largely limited to handling simple input

– You may find it useful sometimes to “denormalize” a DB: do a JOIN of two or more tables to produce a flat file (just make sure you don’t simply re-learn the primary or foreign keys!)
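
A minimal sketch of denormalizing with a JOIN, using Python's standard `sqlite3` module. The `customer` and `orders` tables, their columns, and the rows are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Denormalize: one flat row per order, carrying the customer's attributes.
# The key columns are deliberately left out of the SELECT list so that a
# learning scheme cannot simply memorize them.
flat_rows = conn.execute("""
    SELECT c.region, o.amount
    FROM orders AS o
    JOIN customer AS c ON c.cust_id = o.cust_id
    ORDER BY o.order_id
""").fetchall()
conn.close()
```

Each tuple in `flat_rows` is one instance of the flat file; customer attributes are repeated on every order row, which is the usual price of denormalization.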

Attributes

• Flat file format means that all examples are expected to have values for the same attributes

– Some attributes may be irrelevant for some examples

– The relevance of some attributes may depend on the value of another attribute

– Usual workaround: irrelevant attributes get a special “irrelevant” value

Kinds of Attributes

• Binary/boolean – two-valued; e.g. Resident Student?

• Nominal/categorical/enumerated/discrete – multiple-valued, unordered; e.g. Major

• Ordinal – ordered, but no sense of distance between values

– e.g. Fr, So, Jr, Sr

– e.g. Household Income: 1 = < 15K, 2 = 15-20K, 3 = 20-25K, 4 = 25-30K, 5 = 30-40K, 6 = 40-50K, 7 = > 50K

• Interval – ordered, and distance is measurable; e.g. birth year

• Ratio – an actual measurement with a defined zero point, such that we can say one value is double, triple, or ½ of another; e.g. GPA

Kinds of Attributes

• Many algorithms cannot handle all of those different types of attributes

• One approach:

– Treat binary and nominal as nominal

– Treat ordinal, interval, and ratio as “numeric”

• Requires coding ordinals such as Fr, So etc as numbers
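
Coding an ordinal attribute as numbers might look like the sketch below. The specific codes 1 to 4 and the names `CLASS_YEAR_CODES` and `encode_ordinal` are assumptions for illustration; any strictly increasing codes would preserve the order.

```python
# Rank order of the class-year values (Fr, So, Jr, Sr); the codes are assumed.
CLASS_YEAR_CODES = {"Fr": 1, "So": 2, "Jr": 3, "Sr": 4}

def encode_ordinal(values, codes):
    """Map ordinal labels to integers that preserve their order."""
    return [codes[v] for v in values]

encoded = encode_ordinal(["So", "Fr", "Sr", "Jr"], CLASS_YEAR_CODES)
```

Note that equally spaced codes imply equal distances between categories, which an ordinal attribute does not guarantee; a scheme that measures distance will take that spacing at face value.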

Preparing the Data

• Preparing the data “usually consumes the bulk of the effort invested in the entire data mining process”

• Real data is frequently low quality

• Data Cleaning is frequently necessary and time consuming

Preparing the Data

• Integrating data from multiple sources

– E.g. data from different departments – marketing, sales, billing, customer service

– E.g. sometimes outside data is valuable – economic conditions, weather data

• Challenges – different coding conventions, different time periods, different aggregations, different keys, different kinds of errors

• Point of intersection with Data Warehousing – this work needs to be done for BOTH!

• May need to iterate to get it right

Preparing the Data

• Standard format – any tool needs data to be in some standard format

• Weka tool requires data to be in ARFF format

ARFF Format

• Lines beginning with % are comments

• File starts with name of the relation

• Attributes are defined

– Nominal attributes are followed by the set of values

– Numeric attributes list the keyword “numeric”

– No identification of class to be predicted – flexible

• Beginning of data is flagged with @data

• Data itself is comma delimited (easily created from Access or Excel)

• Missing values are represented with a ?

Figure 2.2 ARFF file for the weather data.

% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }

@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
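
To make the format rules concrete, here is a rough sketch of a reader for files like the one in Figure 2.2. It is not Weka's parser: `parse_arff` is a hypothetical helper that handles only the features described on the ARFF slide (comments, @relation, @attribute, @data, and ? for missing values), and the `sample` text, including its missing value, is made up for the example.

```python
def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows).

    Handles only % comments, @relation, @attribute with 'numeric' or a
    {...} value set, @data, and '?' for missing values.
    """
    relation, attributes, rows, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue                      # skip blank lines and comments
        lower = line.lower()
        if in_data:
            cells = [c.strip() for c in line.split(",")]
            row = []
            for (name, kind), cell in zip(attributes, cells):
                if cell == "?":
                    row.append(None)      # missing value
                elif kind == "numeric":
                    row.append(float(cell))
                else:
                    row.append(cell)      # nominal value kept as text
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

sample = """% tiny made-up excerpt in the style of the weather data
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute play? { yes, no }
@data
sunny, 85, no
overcast, ?, yes
"""
relation, attributes, rows = parse_arff(sample)
```

Because the file does not mark which attribute is the class, the caller decides afterwards which column to predict.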

Data Preparation

• You need to understand machine learning schemes before using them for data mining

– Some schemes treat numerics as ordinals and only compare < > =

– Others treat numerics as ratios and perform distance and other measurements

• If distance measurements are to be made, avoid the scheme if the dataset contains ordinals whose codes distort distances (e.g. the income brackets earlier)

• Distance between nominals is frequently all or nothing (0 or 1)

• If scheme only deals with nominals, any numerics need to be converted to nominals (e.g. age converted to young, mid, old) (some info is lost)

• If dataset has nominals that are coded as integers, don’t confuse the scheme by marking them numeric
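
Two of the points above, sketched in Python under stated assumptions: converting a numeric attribute to a nominal one, and the all-or-nothing distance between nominal values. The cut points 29 and 59 and the function names are hypothetical; the slide gives only the labels young, mid, and old.

```python
def discretize_age(age, young_max=29, mid_max=59):
    """Convert a numeric age to a nominal label; the cut points are assumed."""
    if age <= young_max:
        return "young"
    if age <= mid_max:
        return "mid"
    return "old"

def nominal_distance(a, b):
    """All-or-nothing distance between two nominal values."""
    return 0 if a == b else 1

labels = [discretize_age(a) for a in (18, 29, 30, 45, 72)]
gap = nominal_distance("sunny", "rainy")
```

Ages 29 and 30 land in different bins while 30 and 45 share one, which shows the information lost in the conversion.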

Normalization

• Some schemes require all numeric attributes to be on a similar scale – thus normalize or standardize (different term than DB normalization)

• One normalization approach:

Norm val = (val – min value for attribute) / (max value for attribute – min value for attribute)

• One standardization approach:

Stand val = (val – mean) / SD
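
Both formulas can be sketched in a few lines of Python. The function names are hypothetical, the handling of a constant attribute is a choice made for this example, and the use of the population SD is an assumption, since the slide does not say which SD is meant.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values to [0, 1] using (val - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:               # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to zero mean and unit SD using (val - mean) / SD.

    Uses the population SD (pstdev); this is an assumption.
    """
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

temperature = [85, 80, 83, 70, 68, 65, 64]   # first seven values in Figure 2.2
normalized = min_max_normalize(temperature)
standardized = standardize(temperature)
```

Min-max normalization is sensitive to a single extreme value, because the minimum and maximum set the whole scale; standardization is less so, but its output is not confined to a fixed range.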

Missing Values

• In real datasets, missing values are frequently coded with an implausible placeholder value (e.g. –1, 999999)

• Sometimes different types of missing values are distinguished – unknown vs unrecorded vs not applicable vs …

• Missing values may have meaning

– E.g. income may be left blank more often by people whose income is particularly high or low

– E.g. in diagnosis, a particular test may not need to be done for a particular case

– Get a data-knowledgeable person involved

• Most machine learning schemes assume that a missing value is not particularly meaningful

– If meaningful, need to let the scheme know …
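
One possible way to handle placeholder codes, sketched under assumptions: the sentinel set reuses the two example codes from the slide, and `decode_missing` and the income values are hypothetical. The parallel flag is one way to "let the scheme know" when missingness itself may carry meaning.

```python
SENTINELS = {-1, 999999}   # example codes from the slide; real data will differ

def decode_missing(values, sentinels=SENTINELS):
    """Replace sentinel codes with None and return a parallel missing flag."""
    cleaned = [None if v in sentinels else v for v in values]
    was_missing = [v is None for v in cleaned]
    return cleaned, was_missing

income = [42000, -1, 58000, 999999, 31000]   # made-up values
cleaned, was_missing = decode_missing(income)
```

The flag list can be added to the dataset as an extra binary attribute, so a scheme can learn from the fact that a value was absent.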

Inaccurate Values

• Errors and omissions may be more important to mining algorithms than to source system

• Misspelling of nominal attribute values may suggest incorrect possible values

• Typos or incorrect measurement may yield numeric outliers

– Find via graphing / involve data-knowledgeable person

• Duplicate records – confuse scheme by giving heavier weight to the duplicated records

• Deliberate mis-entry occurs (e.g. supermarket checkout entering own bonus card)
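
Two of these checks are easy to automate in a first pass. The sketch below is illustrative only: `duplicate_records`, `rare_values`, the `max_count` threshold, and the sample rows (including the misspelling "suny") are all made up for the example.

```python
from collections import Counter

def duplicate_records(rows):
    """Return each record that occurs more than once, with its count."""
    counts = Counter(tuple(r) for r in rows)
    return {rec: n for rec, n in counts.items() if n > 1}

def rare_values(values, max_count=1):
    """Nominal values seen at most max_count times: possible misspellings."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n <= max_count)

rows = [("sunny", "no"), ("rainy", "yes"), ("sunny", "no"), ("suny", "no")]
dupes = duplicate_records(rows)
suspects = rare_values([r[0] for r in rows])
```

A rare value is only a candidate for review, not proof of an error: here the legitimate value "rainy" is flagged alongside "suny", so a data-knowledgeable person still has to decide.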

Data Age

• We are frequently using data to predict the future

• At some point, the world / business has changed enough that the data is no longer appropriate for that purpose

Getting to Know Your Data

• Several points above reflect this need

• Graphic display of data can help find problems (e.g. outliers, large numbers of unknown values coded as 9999, typos in nominals)

• Domain-knowledgeable people are valuable – they can explain anomalies, missing values, coding schemes

• Data cleaning is extremely important

– At least look at some records to see what is going on

– “Time spent looking at your data is always time well spent”

End Chapter 2