Data Mining – Input: Concepts, instances, attributes Chapter 2.


Concept

• Thing to be learned

– Ignore any philosophy about what a concept is

– Need a description that is

• Intelligible – humans can understand it, and can therefore discuss and dispute its validity

• Operational – it can be applied to future examples

• How the concept is expressed is the “concept description”

• Concept may differ based on different styles of learning … classification, association, clustering, numeric prediction …

Styles of Learning

• Classification – learn way of “classifying” unseen examples – put them in the correct category

• Association – learn any association between attributes

• Clustering – seek groups of examples that belong together, without pre-classification

• Numeric prediction – prediction of numeric quantity instead of category

Classification

• “Supervised” – learning scheme is provided correct classification/class/category for “training” data

• Success is measured by trying out what is learned on independent, previously unseen “test” data (withholding the category/class until checking the program’s answer)

Supervision

• Classification and numeric prediction are “supervised”

• Association and Clustering are “unsupervised”

Inputs – What’s in an Example?

• Input is a set of instances (records/examples)

• Instance has set of values for pre-determined attributes (like a record in a DB)

• I.e. input is like a single DB table, or “flat file”

– There may be things we’d like to learn that don’t fit into this simple structure – but current technology is largely only up to handling simple input

– You may find it useful sometimes to “denormalize” a DB – do a JOIN of two or more tables to produce a flat file (just make sure you don’t simply re-learn the primary or foreign keys!)
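The denormalizing JOIN described above can be sketched with Python's built-in sqlite3 module. The tables `customer` and `purchase` and their columns are hypothetical, invented only for this illustration. The key columns are left out of the SELECT so that a learning scheme cannot simply re-learn them.

```python
import sqlite3

# Hypothetical two-table database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE purchase (purchase_id INTEGER PRIMARY KEY,
                           cust_id INTEGER REFERENCES customer(cust_id),
                           amount REAL);
    INSERT INTO customer VALUES (1, 'north'), (2, 'south');
    INSERT INTO purchase VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# JOIN the tables into one flat table: one row per purchase.
# cust_id and purchase_id are deliberately not selected.
flat = conn.execute("""
    SELECT c.region, p.amount
    FROM purchase AS p
    JOIN customer AS c ON c.cust_id = p.cust_id
    ORDER BY p.purchase_id
""").fetchall()
conn.close()

for row in flat:
    print(row)
```

Each row of `flat` is one instance with the same attributes, which is the flat-file shape the learning schemes expect.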

Attributes

• Flat file format means that all examples are expected to have values for the same attributes

– Some attributes may be irrelevant for some examples

– Some attributes’ relevance may depend on the value of another attribute

– Usual workaround – irrelevant attributes have a special “irrelevant” value

Kinds of attributes

• Binary/boolean – two valued; e.g. Resident Student?

• Nominal/categorical/enumerated/discrete – multiple valued, unordered; e.g. Major

• Ordinal – ordered, but no sense of distance between values

– e.g. Fr, So, Jr, Sr

– e.g. Household Income: 1 – < 15K, 2 – 15-20K, 3 – 20-25K, 4 – 25-30K, 5 – 30-40K, 6 – 40-50K, 7 – > 50K

• Interval – ordered, distance is measurable; e.g. birth year

• Ratio – an actual measurement with a defined zero point, such that we could say that one value is double another, or triple, or ½; e.g. GPA

Kinds of Attributes

• Many algorithms cannot handle all of those different types of attributes

• One approach –

– Treat binary and nominal as nominal

– Treat ordinal, interval, and ratio as “numeric”

• Requires coding ordinals such as Fr, So etc as numbers

Preparing the Data

• Preparing the data “usually consumes the bulk of the effort invested in the entire data mining process”

• Real data is frequently low quality

• Data Cleaning is frequently necessary and time consuming

Preparing the Data

• Integrating data from multiple sources

– E.g. data from different departments – marketing, sales, billing, customer service

– E.g. sometimes outside data is valuable – economic conditions, weather data

• Challenges – different coding conventions, different time periods, different aggregations, different keys, different kinds of errors

• Point of intersection with Data Warehousing – this work needs to be done for BOTH!

• May need to iterate to get right

Preparing the Data

• Standard format – any tool needs data to be in some standard format

• Weka tool requires data to be in ARFF format

ARFF Format

• Lines beginning with % are comments

• File starts with name of the relation

• Attributes are defined

– Nominal attributes are followed by the set of values

– Numeric attributes list the keyword “numeric”

– No identification of class to be predicted – flexible

• Beginning of data is flagged with @data

• Data itself is comma delimited (easily created from Access or Excel)

• Missing values are represented with a ?

Figure 2.2 ARFF file for the weather data.

% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }

@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
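To make the ARFF rules concrete, here is a minimal reader sketch in Python. It is not Weka's parser: it assumes only the features listed above (% comments, @relation, nominal and numeric @attribute lines, @data, comma-delimited rows, ? for missing) and ignores quoting, sparse rows, and other attribute types. The function name `parse_arff` and the sample text are made up for this illustration; the ? in the second sample row was added to show a missing value.

```python
def parse_arff(text):
    """Parse a simple ARFF document into (relation, attributes, rows).

    attributes is a list of (name, kind) where kind is "numeric"
    or a list of allowed nominal values.
    """
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if not in_data:
            lower = line.lower()
            if lower.startswith("@relation"):
                relation = line.split(None, 1)[1]
            elif lower.startswith("@attribute"):
                _, name, spec = line.split(None, 2)
                if spec.startswith("{"):
                    values = [v.strip() for v in spec.strip("{}").split(",")]
                    attributes.append((name, values))
                else:
                    attributes.append((name, spec.lower()))
            elif lower.startswith("@data"):
                in_data = True
            continue
        cells = [c.strip() for c in line.split(",")]
        row = []
        for (name, kind), cell in zip(attributes, cells):
            if cell == "?":
                row.append(None)         # missing value
            elif kind == "numeric":
                row.append(float(cell))
            else:
                row.append(cell)         # nominal value kept as text
        rows.append(row)
    return relation, attributes, rows

SAMPLE = """\
% a cut-down weather file for illustration
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }
@data
sunny, 85, 85, false, no
overcast, 83, ?, false, yes
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), rows)
```

Note that no attribute is marked as the class here, which mirrors the "flexible" point above: the caller decides which column to predict.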

Data Preparation

• You need to understand machine learning schemes before using them for data mining

– Some schemes treat numerics as ordinals and only compare < > =

– Others treat numerics as ratios and perform distance and other measurements

• If distance measurements are to be made, avoid scheme if datasets contain ordinals that distort distances (e.g. income example earlier)

• Distance between nominals is frequently all or nothing (0 or 1)

• If scheme only deals with nominals, any numerics need to be converted to nominals (e.g. age converted to young, mid, old) (some info is lost)

• If dataset has nominals that are coded as integers, don’t confuse the scheme by marking them numeric
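Two of the points above can be sketched in a few lines of Python: the all-or-nothing distance between nominal values, and converting a numeric attribute to a nominal one. The cut points 30 and 60 and both function names are assumptions chosen for illustration; real cut points should come from the data or from someone who knows it.

```python
def nominal_distance(a, b):
    """All-or-nothing distance: 0 if the values match, 1 otherwise."""
    return 0 if a == b else 1

def discretize_age(age, young_below=30, old_from=60):
    """Convert a numeric age to a nominal label.

    The exact age is lost; only the bin survives.
    """
    if age < young_below:
        return "young"
    if age < old_from:
        return "mid"
    return "old"

labels = [discretize_age(a) for a in (22, 45, 71)]
print(labels)
print(nominal_distance("sunny", "rainy"), nominal_distance("sunny", "sunny"))
```

After discretizing, ages 31 and 59 both become "mid", which is the information loss the slide mentions.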

Normalization

• Some schemes require all numeric attributes to be on a similar scale – thus normalize or standardize (not the same thing as DB normalization)

• One normalization approach:

Norm val = (val – min value for attribute) / (max value for attribute – min value for attribute)

• One standardization approach:

Stand val = (val – mean) / SD
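Both formulas can be sketched in Python as follows. The function names are made up for this illustration. The slide does not say whether SD is the sample or population standard deviation; this sketch assumes the sample SD (`statistics.stdev`), and `statistics.pstdev` could be swapped in for the population version. Mapping a constant attribute to all zeros is also an assumption, made to avoid dividing by zero.

```python
from statistics import mean, stdev

def min_max_normalize(values):
    """Norm val = (val - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # assumption: constant attribute maps to 0
    return [(v - lo) / span for v in values]

def standardize(values):
    """Stand val = (val - mean) / SD, using the sample SD (assumption).

    Needs at least two values, because stdev does.
    """
    m = mean(values)
    sd = stdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # assumption: constant attribute maps to 0
    return [(v - m) / sd for v in values]

temps = [64, 72, 85]  # sample values for illustration
normalized = min_max_normalize(temps)
standardized = standardize(temps)
print(normalized)
print(standardized)
```

Normalized values always run from 0 to 1, while standardized values have mean 0 and SD 1 and are less squeezed by a single extreme value.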

Missing Values

• In real datasets, missing values are frequently coded with a weird value (e.g. –1, 999999)

• Sometimes different types of missing values are distinguished – unknown vs unrecorded vs not applicable vs …

• Missing values may have meaning –

– e.g. income may be left blank more often by people whose income is particularly high or low

– e.g. in diagnosis, a particular test may not need to be done for a particular case

– Get data-knowledgeable person involved

• Most machine learning schemes assume that missing value is not particularly meaningful

– If meaningful, need to let scheme know …
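A minimal sketch of recoding "weird" missing-value codes before mining. The set of codes (-1 and 999999, taken from the example above) and the function names are assumptions; which codes really mean "missing" in a given dataset has to be confirmed with someone who knows the data.

```python
# Assumed sentinel codes; confirm these for each real attribute.
MISSING_CODES = {-1, 999999}

def recode_missing(values, missing_codes=MISSING_CODES):
    """Replace sentinel codes with None so they are not treated as numbers."""
    return [None if v in missing_codes else v for v in values]

def to_arff_cell(value):
    """Write None as the ARFF missing-value marker '?'."""
    return "?" if value is None else str(value)

incomes = recode_missing([52000, -1, 999999, 31000])
line = ", ".join(to_arff_cell(v) for v in incomes)
print(incomes)
print(line)
```

If a missing value is itself meaningful, one option is to keep a separate yes/no attribute recording whether the value was missing, rather than only blanking it out.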

Inaccurate Values

• Errors and omissions may be more important to mining algorithms than to source system

• Misspelling of nominal attribute values may suggest incorrect possible values

• Typos or incorrect measurement may yield numeric outliers

– Find via graphing / involve data-knowledgeable person

• Duplicate records – confuse scheme by giving heavier weight to the duplicated instances

• Deliberate mis-entry occurs (e.g. supermarket checkout entering own bonus card)
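Two of the checks above can be sketched in Python: dropping exact duplicate records, and listing rarely used nominal values that may be misspellings. Both function names and the threshold `max_count=1` are assumptions for this illustration. A rare value is only a suspect, and a duplicate is not always an error, so the output is a list of things to review rather than an automatic fix.

```python
from collections import Counter

def drop_exact_duplicates(rows):
    """Keep the first copy of each record, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def rare_values(column, max_count=1):
    """Return nominal values seen at most max_count times (possible typos)."""
    counts = Counter(column)
    return sorted(v for v, n in counts.items() if n <= max_count)

records = [["sunny", 85], ["rainy", 70], ["sunny", 85]]
deduped = drop_exact_duplicates(records)
suspects = rare_values(["sunny", "sunny", "suny", "rainy", "rainy"])
print(deduped)
print(suspects)
```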

Data Age

• We are frequently using data to predict the future

• At some point, the world / business has changed enough that the data is no longer appropriate for that

Getting to Know Your Data

• Several points above reflect this need

• Graphic display of data can help find problems (e.g. outliers, large numbers of unknown values (e.g. 9999), typos of nominals)

• Domain knowledgeable people are valuable – explain anomalies, missing values, coding schemes.

• Data cleaning is extremely important.

– At least look at some records to see what is going on

– “Time spent looking at your data is always time well spent”
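When a graphing tool is not at hand, a quick per-column summary is one way to start looking at the data. The sketch below is an assumption-laden illustration, not a standard routine: the function name `profile_column` and the choice of statistics are made up here. A suspicious code such as 9999 tends to show up both as the maximum and among the most common values.

```python
from collections import Counter

def profile_column(values, top=3):
    """Summarize one attribute: size, missing count, common values, range."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "most_common": Counter(present).most_common(top),
    }
    # Report min and max only when every present value is a number.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["min"] = min(present)
        summary["max"] = max(present)
    return summary

humidity = [70, 9999, 65, 9999, None]  # made-up values for illustration
report = profile_column(humidity)
print(report)
```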

End Chapter 2
