Data preprocessing before classification

In Kennedy et al.: “Solving data mining problems”

Outline

• Ch.7 Collecting data

• Ch.8 Preparing data

• Ch.9 Data preprocessing

Ch.7 Collecting data

Collecting data

• Collecting “example patterns”
– Inputs (vectors of independent variables)
– Outputs (vectors of dependent variables)

• More data is better

• Begin with an elementary set of data

Collecting data

• Choose an appropriate sampling rate for time-series data.

• Make sure the data measurement units are consistent.

• Keep non-essential variables, even those not in the input vector

• Make sure no major structural (systemic) changes have occurred during collection.

Collecting data

• How much data is enough?
– Train and test using a subset of the data
– If performance does not increase when the full data set is used, the data is enough (see the sketch after this slide)
– There are statistical validation methods (Ch.11)

• Using simulated data
– When it is difficult to collect (sufficient) real data
– Simulated data must be:
• Realistic
• Representative
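A minimal sketch of the subset test, assuming a hypothetical k-nearest-neighbor classifier on synthetic scikit-learn data: train on growing fractions of the training set and watch whether held-out accuracy is still improving when the full set is used.

```python
# Train on growing fractions of the data; if accuracy has flattened
# before 100%, more data of the same kind is unlikely to help.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for frac in (0.25, 0.5, 0.75, 1.0):
    n = int(frac * len(X_train))
    model = KNeighborsClassifier().fit(X_train[:n], y_train[:n])
    print(f"{frac:4.0%} of training data -> accuracy "
          f"{model.score(X_test, y_test):.3f}")
```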

Ch.8 Preparing data

Preparing data

• Handling
– Missing data
– Categorical data
– Inconsistent data and outliers

Missing data

• Discard incomplete example patterns

• Manually enter a reasonable, probable, or expected value

• Use a statistic generated from the example patterns that do have a value for that variable
– Mean, mode

• Encode missing values explicitly by creating new indicator variables (see the sketch after this list)

• Generate a predictive model to predict each missing data value
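A minimal sketch of three of these options with pandas; the DataFrame and its columns "age" (numeric) and "color" (categorical) are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, np.nan, 31, 40, np.nan],
                   "color": ["red", "blue", None, "red", "red"]})

# Option 1: discard incomplete example patterns.
complete = df.dropna()

# Option 2: fill with a statistic generated from the patterns that do
# have a value: mean for numeric data, mode for categorical data.
filled = df.copy()
filled["age"] = filled["age"].fillna(df["age"].mean())
filled["color"] = filled["color"].fillna(df["color"].mode()[0])

# Option 3: encode missingness explicitly as a new indicator variable.
filled["age_missing"] = df["age"].isna().astype(int)
print(filled)
```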

Categorical data

• Ordinal:
– Convert to a numerical representation in a straightforward manner
– “Low”, “medium”, “high” => 0, 1, 2

• Nominal:
– “One of n” representation
– Encode the input variable as n different binary inputs, when there are n distinct categories (see the sketch below)
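A minimal sketch of both encodings, assuming hypothetical columns "size" (ordinal) and "color" (nominal) in a pandas DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"size": ["low", "high", "medium", "low"],
                   "color": ["red", "blue", "green", "red"]})

# Ordinal: map the ordered categories straight to numbers.
df["size_num"] = df["size"].map({"low": 0, "medium": 1, "high": 2})

# Nominal: "one of n" encoding, one binary input per distinct category.
one_of_n = pd.get_dummies(df["color"], prefix="color")
print(df.join(one_of_n))
```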

Further processing of “one-of-n”

• When n is too large, reduce the number of inputs in the new encoding:
– Manually

– PCA-based reduction
• Reduce the one-of-n representation to a one-of-m representation, where m is less than n (see the sketch below)

– Eigenvalue-based reduction

– Output variable-based reduction
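A minimal sketch of one way to do PCA-based reduction, applied to a hypothetical category column; note the m principal-component scores are continuous inputs rather than strictly binary ones.

```python
import pandas as pd
from sklearn.decomposition import PCA

codes = pd.Series(["a", "b", "c", "d", "a", "b", "c", "d", "a", "b"])
one_of_n = pd.get_dummies(codes)           # n = 4 binary inputs

m = 2                                      # target number of inputs, m < n
reduced = PCA(n_components=m).fit_transform(one_of_n)
print(reduced.shape)                       # (10, 2): n inputs -> m inputs
```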

Inconsistent data and outliers

• Removing erroneous data

• Identifying inconsistent data
– Thresholding, filtering

• Outliers
– Data points that lie outside of the normal region of interest in the input space, which may be:
• Unusual situations that are “correct”
• Misleading or incorrect measurements

Outliers

• Ways to spot outliers
– Plot: box plot, histogram…
– Number of S.D. from the mean

• Handling outliers
– Remove them
• Assumption: the region of the input space where the outliers reside is not of concern
– “Winsorize” them
• Convert the values of outliers into the values of the upper or lower thresholds (see the sketch below)

• Outliers can always be reintroduced into a satisfactory model later to study the changes in the performance of the model.
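A minimal sketch of spotting outliers by their distance from the mean in standard deviations and then winsorizing them; the data and the 2-S.D. cutoff are illustrative choices.

```python
import numpy as np

x = np.array([10.0, 9.8, 10.2, 9.9, 10.1, 10.0, 9.7, 10.3, 50.0])
mean, sd = x.mean(), x.std()
lower, upper = mean - 2 * sd, mean + 2 * sd

print("outliers:", x[(x < lower) | (x > upper)])   # -> [50.]
x_winsorized = np.clip(x, lower, upper)            # clip to the thresholds
print(x_winsorized)
```

Note that the extreme value inflates the mean and S.D. themselves, which is one reason the slide also recommends visual checks such as box plots and histograms.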

Ch.9 Data preprocessing

Reasons to preprocess data

• Reducing noise

• Enhancing the signal

• Reducing input space

• Feature extraction

• Normalizing data

• Modifying prior probabilities (specific for classification)

Reducing input space

• Principal component analysis (PCA)
– Identify an m-dimensional subspace of the n-dimensional input space
– The original n variables are reduced to m variables that are mutually orthogonal (uncorrelated)

• Eliminating correlated input variables
– Identify highly correlated input variables by:
• Statistical correlation tests
• Visual inspection of graphed data variables
• Seeing if a data variable can be modeled using one or more others (both reduction routes are sketched below)
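A minimal sketch of both reduction routes on synthetic data: PCA projection onto an m-dimensional subspace, and dropping one variable of each highly correlated pair. The column names and the 0.95 correlation threshold are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": 2 * a + rng.normal(scale=0.01, size=200),  # near copy of a
                   "c": rng.normal(size=200)})

# PCA: project the n = 3 original variables onto m = 2 orthogonal components.
components = PCA(n_components=2).fit_transform(df)

# Correlation test: flag pairs with |r| above the threshold, drop one of each.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("dropping correlated inputs:", to_drop)   # -> ['b']
```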

Reducing input space

• Combining non-correlated input variables

• Sensitivity analysis
– If variations of a particular input variable cause large changes in the estimation model output, the variable is very significant
– Sensitivity analysis prunes input variables based on information provided by both input and output data (see the sketch below)
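A minimal sketch of one-at-a-time sensitivity analysis; the fitted model and the 1-S.D. perturbation size are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(scale=0.1, size=200)
model = LinearRegression().fit(X, y)

for i in range(X.shape[1]):
    X_pert = X.copy()
    X_pert[:, i] += X[:, i].std()        # perturb one input by 1 S.D.
    delta = np.abs(model.predict(X_pert) - model.predict(X)).mean()
    print(f"input {i}: mean output change {delta:.3f}")
# Inputs whose perturbation barely moves the output (here input 1) are
# candidates for pruning.
```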

Normalizing data

• Not “transform to a normal distribution”

• For models that perform better with normalized inputs
– Non-parametric algorithms implicitly assume distances in different directions carry the same weight (e.g. K-nearest neighbor, “KNN”)
– Backpropagation (BP) and multilayer perceptron (MLP) models often perform better if all inputs and outputs are normalized

• Avoiding numerical problems

Types of normalization

• Min-max normalization
– It preserves all relationships among the data values exactly
– It would compress the normal range if extreme values or outliers exist

• Z-score normalization

• Sigmoidal normalization (all three are sketched below)
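A minimal sketch of the three normalizations on one numeric array; the sigmoidal form shown here (the logistic function applied to the z-scores) is one common variant.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])   # note the extreme value

# Min-max: rescale to [0, 1]; the outlier compresses the normal range.
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score: center on the mean, scale by the standard deviation.
z_score = (x - x.mean()) / x.std()

# Sigmoidal: squash the z-scores smoothly into (0, 1).
sigmoidal = 1.0 / (1.0 + np.exp(-z_score))

for name, v in (("min-max", min_max), ("z-score", z_score),
                ("sigmoidal", sigmoidal)):
    print(name, np.round(v, 3))
```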

Other considerations

• Preprocess according to the characteristics of the specific classifier being used for modeling
– E.g. CHAID uses categorical data directly

• Input variables produce the best modeling accuracy when exhibiting a uniform or Gaussian distribution

• Add expert knowledge when preprocessing data

Get prepared and then go!